In previous articles we have shown where and how SQL Server R Services can be used, how to plot and visualize data, and what the benefits are. This article tackles a topic that many in the SQL community will find even more interesting: how a database administrator can profit from R Services on the SQL Server instances they administer on a daily, weekly, or monthly basis. Some of the approaches might be unusual, others quite obvious, but it is applying them consistently that really brings the benefit.
While the previous two articles focused on the methodology for analyzing sales data, the focus here is not so much on methodology or algorithms but on the tasks a DBA may face. I will not go into a discussion of what DBA tasks are, because they differ from DBA to DBA, and work at a particular company might include or exclude particular duties. Instead, I will focus on tasks that every enterprise needs to monitor, keep in operation, and act on when particular events are triggered or alarms are raised.
The typical and much-needed tasks that require attention are:
- database growth and the potential lack of free disk space
- queries that take a high amount of resources to execute
- statistics that can be gathered around these items.
Therefore, we will focus on these tasks and bring in R and statistics to give the DBA additional insight. Before exploring this, there are a couple of ground rules to point out: what a baseline is, and how a baseline is created and measured. Thresholds are also important, as they act as triggers for particular events: automatic ticket creation, escalation to different people, stopping processes, or any other kind of business rule. Besides the baseline, there are always thresholds for benchmarks and monitoring. These concepts matter because they usually provide very good ways of handling exceptions or unwanted events in your system.
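As a minimal illustration of such a threshold rule (the table dbo.DiskFreeSpace_Log and the 10 GB limit are illustrative assumptions, not part of the demos below), a scheduled check could look like this:
-- Hypothetical threshold check: raise an alert when the last measured free space
-- drops below 10 GB. Table and column names are illustrative only.
DECLARE @FreeSpaceMB INT;
SELECT TOP (1) @FreeSpaceMB = FreeSpaceMB
FROM dbo.DiskFreeSpace_Log
ORDER BY MeasureTime DESC;

IF (@FreeSpaceMB < 10240)
    RAISERROR (N'Free disk space below threshold: %d MB left.', 16, 1, @FreeSpaceMB);
The error raised here could just as well open a ticket, send an e-mail to an operator, or stop a process, depending on your business rules.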
Having said that, in this article I will focus on the following cases where R Services can be used with SQL Server:
- Disk space usage
- Clustering queries
- Plotting a heatmap of executed queries based on query statistics
Let me know your feedback, so we can also cover the following three topics in the next article: parameter sniffing, stale statistics, and heap tables.
All of these cases use the R language for statistical analysis, data visualization, and predictive analysis.
Disk space usage
Authors, database administrators, developers, MVPs, and others have written about storage, volumes of data, and monitoring capacity over time. Systems have requirements and businesses have needs that must be met for applications to work normally. I am sure we have all found ourselves in the "DBA by accident" situation, when disk usage went sky-high during the night and nobody knew why. This is why it is important to have a baseline captured. A baseline is like a control group that tells you whether a situation in your test group is unusual or normal. In IT, the baseline is normally the reference state, measured and recorded at a planned time or in a planned environment. With it, the DBA can spot unplanned and unwanted activities or unusual system behavior, especially when comparing them to the baseline. Raising a red flag this way results in fewer false positives.
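To make the comparison against a baseline concrete, here is a minimal sketch, assuming a hypothetical table dbo.Baseline_DiskUsage(UsedSpaceKB, MeasureTime) filled by a collection job; the three-standard-deviations rule is only one possible convention:
-- Hypothetical baseline comparison: flag the latest measurement when it deviates
-- more than three standard deviations from the baseline mean.
-- dbo.Baseline_DiskUsage is an illustrative table, not part of the demo below.
DECLARE @avg FLOAT, @sd FLOAT, @latest FLOAT;

SELECT @avg = AVG(CAST(UsedSpaceKB AS FLOAT))
      ,@sd  = STDEV(CAST(UsedSpaceKB AS FLOAT))
FROM dbo.Baseline_DiskUsage;

SELECT TOP (1) @latest = UsedSpaceKB
FROM dbo.Baseline_DiskUsage
ORDER BY MeasureTime DESC;

IF (@latest > @avg + 3 * @sd)
    PRINT N'Unusual growth compared to the baseline - investigate before raising a red flag.';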
In addition, we always come across disk size problems. To address this, we will demonstrate database growth, store the measurements, and then run predictions against the collected data so that, in the end, the DBA can predict when to expect disk space problems.
To illustrate this scenario, I will create a small database of 8 MB with no possibility of growth, and two tables: DataPack_Info_SMALL, which will serve as the baseline, and DataPack_Info_LARGE, a so-called everyday log where everything is stored to catch unexpected cases or undesired behavior.
First, create a database:
USE [master];
GO
CREATE DATABASE FixSizeDB
 CONTAINMENT = NONE
 ON PRIMARY
 ( NAME = N'FixSizeDB', FILENAME = N'C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\MSSQL\DATA\FixSizeDB.mdf', SIZE = 8192KB, FILEGROWTH = 0)
 LOG ON
 ( NAME = N'FixSizeDB_log', FILENAME = N'C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\MSSQL\DATA\FixSizeDB_log.ldf', SIZE = 8192KB, FILEGROWTH = 0)
GO
ALTER DATABASE [FixSizeDB] SET COMPATIBILITY_LEVEL = 130
GO
ALTER DATABASE [FixSizeDB] SET RECOVERY SIMPLE
GO
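The two logging tables used throughout this section are not defined in the article; a plausible sketch of their definitions, with column names taken from the INSERT statements below and data types that are my assumption, is:
-- Assumed definitions of the two logging tables used in this section.
-- Column names follow the INSERT ... SELECT statements later in the article;
-- the data types are a guess and may differ from the author's originals.
CREATE TABLE dbo.DataPack_Info_SMALL
(
 TableName NVARCHAR(128)
 ,SchemaName NVARCHAR(128)
 ,RowCounts BIGINT
 ,TotalSpaceKB BIGINT
 ,UsedSpaceKB BIGINT
 ,UnusedSpaceKB BIGINT
 ,TimeMeasure DATETIME
);
GO
CREATE TABLE dbo.DataPack_Info_LARGE
(
 TableName NVARCHAR(128)
 ,SchemaName NVARCHAR(128)
 ,RowCounts BIGINT
 ,TotalSpaceKB BIGINT
 ,UsedSpaceKB BIGINT
 ,UnusedSpaceKB BIGINT
 ,TimeMeasure DATETIME
 ,Operation VARCHAR(20)
 ,NofRowsOperation INT
);
GO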
We will use the following table for generating inserts and, later, deletes:
CREATE TABLE DataPack
(
DataPackID BIGINT IDENTITY NOT NULL
,col1 VARCHAR(1000) NOT NULL
,col2 VARCHAR(1000) NOT NULL
)
Populating the DataPack table will be done with the following simple WHILE loop:
DECLARE @i INT = 1;
BEGIN TRAN
WHILE @i <= 1000
BEGIN
INSERT dbo.DataPack(col1, col2)
SELECT
REPLICATE('A',200)
,REPLICATE('B',300);
SET @i = @i + 1;
END
COMMIT;
GO
Instead of using the sp_spaceused procedure, we will capture disk space changes with the following query:
SELECT
t.NAME AS TableName
,s.Name AS SchemaName
,p.rows AS RowCounts
,SUM(a.total_pages) * 8 AS TotalSpaceKB
,SUM(a.used_pages) * 8 AS UsedSpaceKB
,(SUM(a.total_pages) - SUM(a.used_pages)) * 8 AS UnusedSpaceKB
FROM
sys.tables t
INNER JOIN sys.indexes AS i
ON t.OBJECT_ID = i.object_id
INNER JOIN sys.partitions AS p
ON i.object_id = p.OBJECT_ID
AND i.index_id = p.index_id
INNER JOIN sys.allocation_units AS a
ON p.partition_id = a.container_id
LEFT OUTER JOIN sys.schemas AS s
ON t.schema_id = s.schema_id
WHERE
t.NAME NOT LIKE 'dt%'
AND t.is_ms_shipped = 0
AND i.OBJECT_ID > 255
AND t.Name = 'DataPack'
GROUP BY t.Name, s.Name, p.Rows
The log table will be filled along with the DataPack table in order to gather immediate changes in disk space:
DECLARE @nof_steps INT = 0
WHILE @nof_steps < 15
BEGIN
BEGIN TRAN
-- insert some data
DECLARE @i INT = 1;
WHILE @i <= 1000 -- step is 100 rows
BEGIN
INSERT dbo.DataPack(col1, col2)
SELECT
REPLICATE('A',FLOOR(RAND()*200))
,REPLICATE('B',FLOOR(RAND()*300));
SET @i = @i + 1;
END
-- run statistics on table
INSERT INTO dbo.DataPack_Info_SMALL
SELECT
t.NAME AS TableName
,s.Name AS SchemaName
,p.rows AS RowCounts
,SUM(a.total_pages) * 8 AS TotalSpaceKB
,SUM(a.used_pages) * 8 AS UsedSpaceKB
,(SUM(a.total_pages) - SUM(a.used_pages)) * 8 AS UnusedSpaceKB
,GETDATE() AS TimeMeasure
FROM
sys.tables AS t
INNER JOIN sys.indexes AS i
ON t.OBJECT_ID = i.object_id
INNER JOIN sys.partitions AS p
ON i.object_id = p.OBJECT_ID
AND i.index_id = p.index_id
INNER JOIN sys.allocation_units AS a
ON p.partition_id = a.container_id
LEFT OUTER JOIN sys.schemas AS s
ON t.schema_id = s.schema_id
WHERE
t.NAME NOT LIKE 'dt%'
AND t.is_ms_shipped = 0
AND t.name = 'DataPack'
AND i.OBJECT_ID > 255
GROUP BY t.Name, s.Name, p.Rows
WAITFOR DELAY '00:00:02'
COMMIT;
END
This will serve as our baseline when comparing the results. When querying the table DataPack_Info_SMALL, the results are the following:
We can use R to check how the columns RowCounts and UsedSpaceKB correlate. To do this, we call the external procedure sp_execute_external_script:
DECLARE @RScript nvarchar(max)
SET @RScript = N'
library(Hmisc)
mydata <- InputDataSet
all_sub <- mydata[2:3]
c <- cor(all_sub, use="complete.obs", method="pearson")
t <- rcorr(as.matrix(all_sub), type="pearson") # rcorr also returns p-values; not used further here
c <- data.frame(c)
OutputDataSet <- c'
DECLARE @SQLScript nvarchar(max)
SET @SQLScript = N'SELECT
TableName
,RowCounts
,UsedSpaceKB
,TimeMeasure
FROM DataPack_Info_SMALL'
EXECUTE sp_execute_external_script
@language = N'R'
,@script = @RScript
,@input_data_1 = @SQLScript
WITH result SETS ((RowCounts VARCHAR(100)
,UsedSpaceKB VARCHAR(100)));
GO
As a result, we get a strong, positive correlation between the columns RowCounts and UsedSpaceKB.
This can easily be interpreted: when the value of RowCounts goes up, the value of UsedSpaceKB also goes up, which is the only logical explanation. It would be rather strange to see a negative correlation here. Now we will simulate random deletes and inserts and observe similar behavior with the following code:
DECLARE @nof_steps INT = 0
WHILE @nof_steps < 15
BEGIN
BEGIN TRAN
-- insert some data
DECLARE @i INT = 1;
DECLARE @insertedRows INT = 0;
DECLARE @deletedRows INT = 0;
DECLARE @Rand DECIMAL(10,2) = RAND()*10
IF @Rand < 5
BEGIN
WHILE @i <= 1000 -- step is 100 rows
BEGIN
INSERT dbo.DataPack(col1, col2)
SELECT
REPLICATE('A',FLOOR(RAND()*200)) -- pages are filling up differently
,REPLICATE('B',FLOOR(RAND()*300));
SET @i = @i + 1;
END
SET @insertedRows = 1000
END
IF @Rand >= 5
BEGIN
SET @deletedRows = (SELECT COUNT(*) FROM dbo.DataPack WHERE DataPackID % 3 = 0)
DELETE FROM dbo.DataPack
WHERE
DataPackID % 3 = 0 OR DataPackID % 5 = 0
END
-- run statistics on table
INSERT INTO dbo.DataPack_Info_LARGE
SELECT
t.NAME AS TableName
,s.Name AS SchemaName
,p.rows AS RowCounts
,SUM(a.total_pages) * 8 AS TotalSpaceKB
,SUM(a.used_pages) * 8 AS UsedSpaceKB
,(SUM(a.total_pages) - SUM(a.used_pages)) * 8 AS UnusedSpaceKB
,GETDATE() AS TimeMeasure
,CASE WHEN @Rand < 5 THEN 'Insert'
WHEN @Rand >= 5 THEN 'Delete'
ELSE 'meeeh' END AS Operation
,CASE WHEN @Rand < 5 THEN @insertedRows
WHEN @Rand >= 5 THEN @deletedRows
ELSE 0 END AS NofRowsOperation
FROM
sys.tables AS t
INNER JOIN sys.indexes AS i
ON t.OBJECT_ID = i.object_id
INNER JOIN sys.partitions AS p
ON i.object_id = p.OBJECT_ID
AND i.index_id = p.index_id
INNER JOIN sys.allocation_units AS a
ON p.partition_id = a.container_id
LEFT OUTER JOIN sys.schemas AS s
ON t.schema_id = s.schema_id
WHERE
t.NAME NOT LIKE 'dt%'
AND t.is_ms_shipped = 0
AND t.name = 'DataPack'
AND i.OBJECT_ID > 255
GROUP BY t.Name, s.Name, p.Rows
WAITFOR DELAY '00:00:01'
COMMIT;
END
We have also added a DELETE statement, as well as the affected row counts, so that the demo is not so straightforward.
Now we calculate the correlation coefficient again:
DECLARE @RScript nvarchar(max)
SET @RScript = N'
library(Hmisc)
mydata <- InputDataSet
all_sub <- mydata[2:3]
c <- cor(all_sub, use="complete.obs", method="pearson")
c <- data.frame(c)
OutputDataSet <- c'
DECLARE @SQLScript nvarchar(max)
SET @SQLScript = N'SELECT
TableName
,RowCounts
,UsedSpaceKB
,TimeMeasure
FROM DataPack_Info_LARGE'
EXECUTE sp_execute_external_script
@language = N'R'
,@script = @RScript
,@input_data_1 = @SQLScript
WITH result SETS ( (
RowCounts VARCHAR(100)
,UsedSpaceKB VARCHAR(100)
));
GO
We can see that we still get a very strong, positive correlation:
We will now compare our LARGE test with the baseline by running the same correlation calculation on both datasets: first on the baseline (DataPack_Info_SMALL) and then on our test table (DataPack_Info_LARGE):
DECLARE @RScript1 nvarchar(max)
SET @RScript1 = N'
library(Hmisc)
mydata <- InputDataSet
all_sub <- mydata[4:5]
c <- cor(all_sub, use="complete.obs", method="pearson")
c <- data.frame(c)
OutputDataSet <- c'
DECLARE @SQLScript1 nvarchar(max)
SET @SQLScript1 = N'SELECT
TableName
,RowCounts
,TimeMeasure
,UsedSpaceKB
,UnusedSpaceKB
FROM DataPack_Info_SMALL
WHERE RowCounts <> 0'
EXECUTE sp_execute_external_script
@language = N'R'
,@script = @RScript1
,@input_data_1 = @SQLScript1
WITH result SETS ( (
UsedSpaceKB VARCHAR(100)
,UnusedSpaceKB VARCHAR(100)
));
DECLARE @RScript2 nvarchar(max)
SET @RScript2 = N'
library(Hmisc)
mydata <- InputDataSet
all_sub <- mydata[4:5]
c <- cor(all_sub, use="complete.obs", method="pearson")
c <- data.frame(c)
OutputDataSet <- c'
DECLARE @SQLScript2 nvarchar(max)
SET @SQLScript2 = N'SELECT
TableName
,RowCounts
,TimeMeasure
,UsedSpaceKB
,UnusedSpaceKB
FROM DataPack_Info_LARGE
WHERE NofRowsOperation <> 0
AND RowCounts <> 0'
EXECUTE sp_execute_external_script
@language = N'R'
,@script = @RScript2
,@input_data_1 = @SQLScript2
WITH result SETS ( (
UsedSpaceKB VARCHAR(100)
,UnusedSpaceKB VARCHAR(100)
)
);
GO
Here are the results:
The results are very interesting. The baseline shows practically no correlation between UsedSpaceKB and UnusedSpaceKB (-0.049), whereas our test shows a negative correlation more than twice as strong (-0.109). A couple of words on this: it shows that UsedSpaceKB is negatively correlated with UnusedSpaceKB, which is still far too weak to draw any concrete conclusions, but it does show how a slight change in workload can make a difference in a simple correlation.
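The Hmisc package loaded in these scripts also provides rcorr(), which returns p-values alongside the coefficients; a minimal sketch (reusing the same SELECT as above) to check whether such a weak correlation is statistically distinguishable from zero could look like this:
DECLARE @RScript nvarchar(max)
SET @RScript = N'
 library(Hmisc)
 mydata <- InputDataSet
 all_sub <- mydata[4:5]
 # rcorr returns the coefficients (r), the number of observations (n) and the p-values (P)
 t <- rcorr(as.matrix(all_sub), type="pearson")
 OutputDataSet <- data.frame(r = t$r[1,2], n = t$n[1,2], p = t$P[1,2])'
DECLARE @SQLScript nvarchar(max)
SET @SQLScript = N'SELECT
 TableName
 ,RowCounts
 ,TimeMeasure
 ,UsedSpaceKB
 ,UnusedSpaceKB
 FROM DataPack_Info_LARGE
 WHERE NofRowsOperation <> 0
 AND RowCounts <> 0'
EXECUTE sp_execute_external_script
 @language = N'R'
 ,@script = @RScript
 ,@input_data_1 = @SQLScript
WITH RESULT SETS ((r FLOAT, n INT, p FLOAT));
GO
A large p-value here would tell you the -0.109 could easily be noise, which is another reason not to over-interpret it.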
You can gather disk space usage information with T-SQL, with PowerShell, by implementing a .NET assembly, by creating a SQL Server Agent job, or in any other way. The important part, and the biggest advantage of using R on the collected data, is that you will no longer only monitor and react to past data, but will also be able to predict what is going to happen.
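If you would rather schedule the collection step than run it by hand, a SQL Server Agent job is one option; the job, step, and schedule names below are illustrative only, and the @command is a placeholder for the collection query shown earlier:
-- Illustrative SQL Server Agent job that collects the disk usage snapshot once per day.
-- Names and schedule are assumptions; put the collection query into @command.
USE msdb;
GO
EXEC dbo.sp_add_job @job_name = N'Collect_DataPack_DiskUsage';
EXEC dbo.sp_add_jobstep
     @job_name = N'Collect_DataPack_DiskUsage'
    ,@step_name = N'Insert snapshot'
    ,@subsystem = N'TSQL'
    ,@database_name = N'FixSizeDB'
    ,@command = N'-- the INSERT INTO dbo.DataPack_Info_SMALL ... collection query from above goes here';
EXEC dbo.sp_add_schedule
     @schedule_name = N'Daily_0600'
    ,@freq_type = 4            -- daily
    ,@freq_interval = 1
    ,@active_start_time = 060000;
EXEC dbo.sp_attach_schedule
     @job_name = N'Collect_DataPack_DiskUsage'
    ,@schedule_name = N'Daily_0600';
EXEC dbo.sp_add_jobserver @job_name = N'Collect_DataPack_DiskUsage';
GO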
Let's go a step further and take the following query and dataset from the sample we created:
SELECT
 TableName
 ,Operation
 ,NofRowsOperation
 ,UsedSpaceKB
 ,UnusedSpaceKB
FROM dbo.DataPack_Info_LARGE
We will predict the size of UsedSpaceKB based on this historical data. Our input will be TableName, Operation, and NofRowsOperation, for a given value to predict on. I will be using a generalized linear model (the rxGlm algorithm) for predicting UsedSpaceKB.
Before you all jump in and say this is absurd and cannot be done because of caching, page splits, indexes, stale statistics, and many other factors, I would point out that all of this information can be added to the algorithm to make the prediction even better. Since my queries are very simple INSERT and DELETE statements, you should also know what kind of queries you are basing your predictions on. In addition, such an approach can also be useful for code testing, unit testing, and stress testing before deployment.
With the following R code we can start creating predictions:
-- GLM prediction
DECLARE @SQL_input AS NVARCHAR(MAX)
SET @SQL_input = N'SELECT
TableName
,CASE WHEN Operation = ''Insert'' THEN 1 ELSE 0 END AS Operation
,NofRowsOperation
,UsedSpaceKB
,UnusedSpaceKB
FROM dbo.DataPack_Info_LARGE
WHERE
NofRowsOperation <> 0';
DECLARE @R_code AS NVARCHAR(MAX)
SET @R_code = N'library(RevoScaleR)
library(dplyr)
DPLogR <- rxGlm(UsedSpaceKB ~ Operation + NofRowsOperation + UnusedSpaceKB, data = DataPack_info, family = Gamma)
df_predict <- data.frame(TableName=("DataPack"), Operation=(1), NofRowsOperation=(451), UnusedSpaceKB=(20))
predictions <- rxPredict(modelObject = DPLogR, data = df_predict, outData = NULL,
predVarNames = "UsedSpaceKB", type = "response",checkFactorLevels=FALSE);
OutputDataSet <- predictions'
EXEC sys.sp_execute_external_script
@language = N'R'
,@script = @R_code
,@input_data_1 = @SQL_input
,@input_data_1_name = N'DataPack_info'
WITH RESULT SETS ((
UsedSpaceKB_predict INT
));
GO
With this code, I predict the size of UsedSpaceKB based on the following data:
As presented in R, the code is:
df_predict <- data.frame(TableName=("DataPack"), Operation=(1), NofRowsOperation=(451), UnusedSpaceKB=(20))
Of course, there are a couple of things to clear up. The R code wrapped in sp_execute_external_script would work much better as a stored procedure with input parameters for these columns: TableName, Operation, NofRowsOperation, and UnusedSpaceKB. Furthermore, to avoid unnecessary computation time for model building, it is common practice to store a serialized model in a SQL table and simply deserialize it when running predictions. Finally, since this was just a demo, make sure the numbers used in predictions make sense; as seen in our example, UsedSpaceKB would be predicted much better as an absolute value per operation rather than as a cumulative value, with the cumulative value calculated only later.
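A minimal sketch of that serialize-and-store idea, assuming a helper table dbo.R_Models that is not part of the original demo (the output-parameter pattern is my own addition on top of sp_execute_external_script):
-- Illustrative storage table for serialized R models.
CREATE TABLE dbo.R_Models (ModelName NVARCHAR(100), Model VARBINARY(MAX));
GO
-- Train once and store the serialized model as varbinary(max).
DECLARE @model VARBINARY(MAX);
EXEC sys.sp_execute_external_script
  @language = N'R'
 ,@script = N'
    library(RevoScaleR)
    DPLogR <- rxGlm(UsedSpaceKB ~ Operation + NofRowsOperation + UnusedSpaceKB, data = DataPack_info, family = Gamma)
    # serialize() returns a raw vector, which maps to varbinary(max)
    model <- as.raw(serialize(DPLogR, connection = NULL))'
 ,@input_data_1 = N'SELECT
       CASE WHEN Operation = ''Insert'' THEN 1 ELSE 0 END AS Operation
      ,NofRowsOperation, UsedSpaceKB, UnusedSpaceKB
     FROM dbo.DataPack_Info_LARGE WHERE NofRowsOperation <> 0'
 ,@input_data_1_name = N'DataPack_info'
 ,@params = N'@model VARBINARY(MAX) OUTPUT'
 ,@model = @model OUTPUT;
INSERT INTO dbo.R_Models (ModelName, Model) VALUES (N'DPLogR', @model);
GO
-- Later, deserialize the stored model and predict without re-training.
DECLARE @stored_model VARBINARY(MAX) = (SELECT Model FROM dbo.R_Models WHERE ModelName = N'DPLogR');
EXEC sys.sp_execute_external_script
  @language = N'R'
 ,@script = N'
    library(RevoScaleR)
    DPLogR <- unserialize(as.raw(model))
    df_predict <- data.frame(Operation = 1, NofRowsOperation = 451, UnusedSpaceKB = 20)
    OutputDataSet <- rxPredict(DPLogR, data = df_predict, predVarNames = "UsedSpaceKB", type = "response")'
 ,@params = N'@model VARBINARY(MAX)'
 ,@model = @stored_model
WITH RESULT SETS ((UsedSpaceKB_predict FLOAT));
GO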
To sum up this rather long demo, let's create a procedure and run a couple of predictions to get a feeling for how this works. The stored procedure is:
CREATE PROCEDURE Predict_UsedSpace
(
@TableName NVARCHAR(100)
,@Operation CHAR(1) -- 1 = Insert; 0 = Delete
,@NofRowsOperation NVARCHAR(10)
,@UnusedSpaceKB NVARCHAR(10)
)
AS
DECLARE @SQL_input AS NVARCHAR(MAX)
SET @SQL_input = N'SELECT
TableName
,CASE WHEN Operation = ''Insert'' THEN 1 ELSE 0 END AS Operation
,NofRowsOperation
,UsedSpaceKB
,UnusedSpaceKB
FROM dbo.DataPack_Info_LARGE
WHERE
NofRowsOperation <> 0';
DECLARE @R_code AS NVARCHAR(MAX)
SET @R_code = N'library(RevoScaleR)
DPLogR <- rxGlm(UsedSpaceKB ~ Operation + NofRowsOperation + UnusedSpaceKB, data = DataPack_info, family = Gamma)
df_predict <- data.frame(TableName=("'+@TableName+'"), Operation=('+@Operation+'),
NofRowsOperation=('+@NofRowsOperation+'), UnusedSpaceKB=('+@UnusedSpaceKB+'))
predictions <- rxPredict(modelObject = DPLogR, data = df_predict, outData = NULL, predVarNames = "UsedSpaceKB", type = "response",checkFactorLevels=FALSE);
OutputDataSet <- predictions'
EXEC sys.sp_execute_external_script
@language = N'R'
,@script = @R_code
,@input_data_1 = @SQL_input
,@input_data_1_name = N'DataPack_info'
WITH RESULT SETS ((
UsedSpaceKB_predict INT
));
GO
Now, running the procedure twice in a row:
EXECUTE Predict_UsedSpace
@TableName = 'DataPack'
,@Operation = 1
,@NofRowsOperation = 120
,@UnusedSpaceKB = 2;
GO
EXECUTE Predict_UsedSpace
@TableName = 'DataPack'
,@Operation = 1
,@NofRowsOperation = 500
,@UnusedSpaceKB = 12;
GO
We get these results:
Both used-space predictions are based on our demo data. Of course, some baseline statistics could also be included to make the predictions even better. And with every model, we also need to test how good its predictions actually are.
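A rough way to do that, as an in-sample sanity check rather than a proper validation with a holdout sample, is to predict back onto the collected rows and look at the average and maximum absolute error (my own sketch):
-- Rough in-sample error check for the rxGlm model; a real test would use a holdout sample.
DECLARE @R_check NVARCHAR(MAX) = N'
 library(RevoScaleR)
 DPLogR <- rxGlm(UsedSpaceKB ~ Operation + NofRowsOperation + UnusedSpaceKB, data = DataPack_info, family = Gamma)
 pred <- rxPredict(DPLogR, data = DataPack_info, type = "response")
 err  <- abs(DataPack_info$UsedSpaceKB - pred[,1])
 OutputDataSet <- data.frame(MeanAbsErrorKB = mean(err), MaxAbsErrorKB = max(err))';
EXEC sys.sp_execute_external_script
  @language = N'R'
 ,@script = @R_check
 ,@input_data_1 = N'SELECT
       CASE WHEN Operation = ''Insert'' THEN 1 ELSE 0 END AS Operation
      ,NofRowsOperation, UsedSpaceKB, UnusedSpaceKB
     FROM dbo.DataPack_Info_LARGE WHERE NofRowsOperation <> 0'
 ,@input_data_1_name = N'DataPack_info'
WITH RESULT SETS ((MeanAbsErrorKB FLOAT, MaxAbsErrorKB FLOAT));
GO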
Clustering queries
In previous articles, we have talked about clustering sales data. Because the visualization technique is relatively easy to read and makes it quick to find the queries that might consume a larger amount of resources, I have decided to add clustering here. Clustering executed queries helps detect where problems are piling up.
For this demo, I will use the WideWorldImportersDW database and create some random reads and IO. First, we will clean the cache:
-- clean buffer
DECLARE @dbid INTEGER
SELECT @dbid = [dbid]
FROM master..sysdatabases
WHERE name = 'WideWorldImportersDW'
DBCC FLUSHPROCINDB (@dbid);
GO
and then create some random load:
USE WideWorldImportersDW;
GO
-- Arbitrary query:
SELECT cu.[Customer Key] AS CustomerKey, cu.Customer,
ci.[City Key] AS CityKey, ci.City,
ci.[State Province] AS StateProvince, ci.[Sales Territory] AS SalesTeritory,
d.Date, d.[Calendar Month Label] AS CalendarMonth,
d.[Calendar Year] AS CalendarYear,
s.[Stock Item Key] AS StockItemKey, s.[Stock Item] AS Product, s.Color,
e.[Employee Key] AS EmployeeKey, e.Employee,
f.Quantity, f.[Total Excluding Tax] AS TotalAmount, f.Profit
FROM Fact.Sale AS f
INNER JOIN Dimension.Customer AS cu
ON f.[Customer Key] = cu.[Customer Key]
INNER JOIN Dimension.City AS ci
ON f.[City Key] = ci.[City Key]
INNER JOIN Dimension.[Stock Item] AS s
ON f.[Stock Item Key] = s.[Stock Item Key]
INNER JOIN Dimension.Employee AS e
ON f.[Salesperson Key] = e.[Employee Key]
INNER JOIN Dimension.Date AS d
ON f.[Delivery Date Key] = d.Date;
GO 3
-- FactSales Query
SELECT * FROM Fact.Sale
GO 4
-- Person Query
SELECT * FROM [Dimension].[Customer]
WHERE [Buying Group] <> 'Tailspin Toys'
OR [WWI Customer ID] > 500
ORDER BY [Customer],[Bill To Customer]
GO 4
-- Purchase Query
SELECT *
FROM [Fact].[Order] AS o
INNER JOIN [Fact].[Purchase] AS p
ON o.[Order Key] = p.[WWI Purchase Order ID]
GO 3
And we will collect query statistics for this purpose:
SELECT
(total_logical_reads + total_logical_writes) AS total_logical_io
,(total_logical_reads / execution_count) AS avg_logical_reads
,(total_logical_writes / execution_count) AS avg_logical_writes
,(total_physical_reads / execution_count) AS avg_phys_reads
,substring(st.text,(qs.statement_start_offset / 2) + 1, ((CASE qs.statement_end_offset
WHEN - 1 THEN datalength(st.text)
ELSE qs.statement_end_offset END - qs.statement_start_offset) / 2) + 1) AS statement_text
,*
INTO query_stats_LOG_2
FROM
sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY
total_logical_io DESC
Once we have the data collected, we can use it in the R script:
CREATE PROCEDURE [dbo].[SP_Query_Stats_Cluster]
AS
DECLARE @RScript nvarchar(max)
SET @RScript = N'
library(cluster)
All <- InputDataSet
image_file <- tempfile()
jpeg(filename = image_file, width = 500, height = 500)
d <- dist(All, method = "euclidean")
fit <- hclust(d, method="ward.D")
plot(fit,xlab=" ", ylab=NULL, main=NULL, sub=" ")
groups <- cutree(fit, k=3)
rect.hclust(fit, k=3, border="DarkRed")
dev.off()
OutputDataSet <- data.frame(data=readBin(file(image_file, "rb"), what=raw(), n=1e6))'
DECLARE @SQLScript nvarchar(max)
SET @SQLScript = N'
SELECT
[total_logical_io]
,[avg_logical_reads]
,[avg_phys_reads]
,execution_count
,[total_physical_reads]
,[total_elapsed_time]
,total_dop
,[text]
,ROW_NUMBER() OVER (order by (select 1)) as ID
FROM query_stats_LOG_2
WHERE Number is null';
EXECUTE sp_execute_external_script
@language = N'R',
@script = @RScript,
@input_data_1 = @SQLScript
WITH RESULT SETS ((Plot varbinary(max)));
GO
Running this procedure generates a graph, returned as a binary (JPEG) image.
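One way to get at the image (my own addition, not shown in the original) is to capture the returned varbinary in a table and export or render it from there:
-- Capture the plot produced by the procedure; the table dbo.R_Plots is illustrative.
CREATE TABLE dbo.R_Plots (Plot VARBINARY(MAX));
INSERT INTO dbo.R_Plots (Plot)
EXECUTE [dbo].[SP_Query_Stats_Cluster];
GO
-- The varbinary can then be exported to a .jpg with bcp (queryout plus a format file)
-- or rendered directly in a reporting tool such as SSRS or Power BI.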
This graph will tell you that the queries group together as:
- Group 1: queries 5, 20, 8, and 33
- Group 2: queries 12 and 25
- Group 3: all the others
We could rename the queries beforehand and give them proper names, but when running a larger number of queries this might be somewhat time-consuming. On my test run, it looks something like this:
Now let's query the Group 1 and Group 2 queries from our log table:
SELECT
*
FROM
(
SELECT
[total_logical_io]
,[avg_logical_reads]
,[avg_phys_reads]
,execution_count
,[total_physical_reads]
,[total_elapsed_time]
,total_dop
,[text]
,ROW_NUMBER() OVER (order by (select 1)) as ID
FROM query_stats_LOG_2
WHERE Number is null
) AS x
WHERE x.ID IN (5,20,8,33,12,25)
You can further analyze where your problems are and which queries cause resource, stall, or performance issues. Of course, you also have to be very careful about what kind of statistics and data you feed the clustering algorithm. This will certainly affect how the grouping is formed, since all these statistics act as attributes and we group based on the similarity of these attributes.
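Because hierarchical clustering works on Euclidean distances, attributes on very different scales (total_logical_io versus total_dop, for example) can dominate the grouping. A variant of the script above, my own sketch, standardizes the numeric columns with scale() before computing distances and leaves the query text out of the distance calculation:
DECLARE @RScript nvarchar(max)
SET @RScript = N'
 library(cluster)
 All <- InputDataSet
 num_scaled <- scale(All[, 1:7])     # standardize the numeric statistics (zero mean, unit variance)
 image_file <- tempfile()
 jpeg(filename = image_file, width = 500, height = 500)
 d <- dist(num_scaled, method = "euclidean")
 fit <- hclust(d, method = "ward.D")
 plot(fit, xlab = " ", ylab = NULL, main = NULL, sub = " ")
 rect.hclust(fit, k = 3, border = "DarkRed")
 dev.off()
 OutputDataSet <- data.frame(data = readBin(file(image_file, "rb"), what = raw(), n = 1e6))'
DECLARE @SQLScript nvarchar(max)
SET @SQLScript = N'
 SELECT
  [total_logical_io]
 ,[avg_logical_reads]
 ,[avg_phys_reads]
 ,execution_count
 ,[total_physical_reads]
 ,[total_elapsed_time]
 ,total_dop
 ,[text]
 ,ROW_NUMBER() OVER (order by (select 1)) as ID
 FROM query_stats_LOG_2
 WHERE Number is null';
EXECUTE sp_execute_external_script
 @language = N'R',
 @script = @RScript,
 @input_data_1 = @SQLScript
WITH RESULT SETS ((Plot varbinary(max)));
GO
Whether you scale or not is a modeling decision; the point is that the choice changes which queries end up grouped together.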
Plotting a Heatmap of Executed Queries
Clustering helped us quickly group queries that belong together based on selected attributes. With a heatmap, we get an overview of how those attributes relate across the observations. Since a heatmap is a great way to present multiple attributes over several different observations (in our case, several queries), I have decided to add it to this article.
To start, we will prepare another demo:
USE AdventureWorks;
GO
--- Turn on the query store
ALTER DATABASE AdventureWorks SET QUERY_STORE = ON;
GO
We will take advantage of the Query Store and the information it collects. We will execute the following queries to generate demo statistics in the Query Store:
-- run bunch of queries :-)
SELECT * FROM AdventureWorks.[Production].[ScrapReason];
SELECT * FROM AdventureWorks.[HumanResources].[Shift];
SELECT * FROM AdventureWorks.[Production].[ProductCategory];
SELECT * FROM AdventureWorks.[Purchasing].[ShipMethod];
SELECT * FROM AdventureWorks.[Production].[ProductCostHistory];
SELECT * FROM AdventureWorks.[Production].[ProductDescription];
SELECT * FROM AdventureWorks.[Sales].[ShoppingCartItem];
SELECT * FROM AdventureWorks.[Production].[ProductDocument];
SELECT * FROM AdventureWorks.[dbo].[DatabaseLog];
SELECT * FROM AdventureWorks.[Production].[ProductInventory];
SELECT * FROM AdventureWorks.[Sales].[SpecialOffer];
SELECT * FROM AdventureWorks.[Production].[ProductListPriceHistory];
SELECT * FROM AdventureWorks.[Person].[Address];
SELECT * FROM AdventureWorks.[Sales].[SpecialOfferProduct];
SELECT * FROM AdventureWorks.[Production].[ProductModel];
SELECT * FROM AdventureWorks.[Person].[AddressType];
SELECT * FROM AdventureWorks.[Person].[StateProvince];
SELECT * FROM AdventureWorks.[Production].[ProductModelIllustration];
SELECT * FROM AdventureWorks.[Production].[ProductModelProductDescriptionCulture];
SELECT * FROM AdventureWorks.[Production].[BillOfMaterials];
SELECT * FROM AdventureWorks.[Sales].[Store];
SELECT * FROM AdventureWorks.[Production].[ProductPhoto];
SELECT * FROM AdventureWorks.[Production].[ProductProductPhoto];
SELECT * FROM AdventureWorks.[Production].[TransactionHistory];
SELECT * FROM AdventureWorks.[Production].[ProductReview];
SELECT * FROM AdventureWorks.[Person].[BusinessEntity];
SELECT * FROM AdventureWorks.[Production].[TransactionHistoryArchive];
SELECT * FROM AdventureWorks.[Production].[ProductSubcategory];
SELECT * FROM AdventureWorks.[Person].[BusinessEntityAddress];
SELECT * FROM AdventureWorks.[Purchasing].[ProductVendor];
SELECT * FROM AdventureWorks.[Person].[BusinessEntityContact];
SELECT * FROM AdventureWorks.[Production].[UnitMeasure];
SELECT * FROM AdventureWorks.[Purchasing].[Vendor];
SELECT * FROM AdventureWorks.[Person].[ContactType];
SELECT * FROM AdventureWorks.[Sales].[CountryRegionCurrency];
SELECT * FROM AdventureWorks.[Person].[CountryRegion];
SELECT * FROM AdventureWorks.[Production].[WorkOrder];
SELECT * FROM AdventureWorks.[Purchasing].[PurchaseOrderDetail];
SELECT * FROM AdventureWorks.[Sales].[CreditCard];
SELECT * FROM AdventureWorks.[Production].[Culture];
SELECT * FROM AdventureWorks.[Production].[WorkOrderRouting];
SELECT * FROM AdventureWorks.[Sales].[Currency];
SELECT * FROM AdventureWorks.[Purchasing].[PurchaseOrderHeader];
SELECT * FROM AdventureWorks.[Sales].[CurrencyRate];
SELECT * FROM AdventureWorks.[Sales].[Customer];
SELECT * FROM AdventureWorks.[HumanResources].[Department];
SELECT * FROM AdventureWorks.[Production].[Document];
SELECT * FROM AdventureWorks.[Sales].[SalesOrderDetail];
SELECT * FROM AdventureWorks.[Person].[EmailAddress];
SELECT * FROM AdventureWorks.[HumanResources].[Employee];
SELECT * FROM AdventureWorks.[Sales].[SalesOrderHeader];
SELECT * FROM AdventureWorks.[HumanResources].[EmployeeDepartmentHistory];
SELECT * FROM AdventureWorks.[HumanResources].[EmployeePayHistory];
SELECT * FROM AdventureWorks.[Sales].[SalesOrderHeaderSalesReason];
SELECT * FROM AdventureWorks.[Sales].[SalesPerson];
SELECT * FROM AdventureWorks.[Production].[Illustration];
SELECT * FROM AdventureWorks.[HumanResources].[JobCandidate];
SELECT * FROM AdventureWorks.[Production].[Location];
SELECT * FROM AdventureWorks.[Person].[Password];
SELECT * FROM AdventureWorks.[dbo].[Orders];
SELECT * FROM AdventureWorks.[Sales].[SalesPersonQuotaHistory];
SELECT * FROM AdventureWorks.[Person].[Person];
SELECT * FROM AdventureWorks.[Sales].[SalesReason];
SELECT * FROM AdventureWorks.[Sales].[SalesTaxRate];
SELECT * FROM AdventureWorks.[Sales].[PersonCreditCard];
SELECT * FROM AdventureWorks.[Person].[PersonPhone];
SELECT * FROM AdventureWorks.[Sales].[SalesTerritory];
SELECT * FROM AdventureWorks.[Person].[PhoneNumberType];
SELECT * FROM AdventureWorks.[Production].[Product];
SELECT * FROM AdventureWorks.[Sales].[SalesTerritoryHistory];
Information on these executions will be returned with the following query:
-- Collect the data from Query Store
SELECT
qsq.*
,query_sql_text
INTO QS_Query_stats_bck
FROM sys.query_store_query AS qsq
JOIN sys.query_store_query_text AS qsqt
ON qsq.query_text_id = qsqt.query_text_id
WHERE
Query_id >= 41
ORDER BY
qsq.query_id
Once we have this, we can query the statistics available in the Query Store with the following:
-- for R table
SELECT
LEFT(query_sql_text,70) AS Query_Name
,last_compile_batch_offset_start
,last_compile_batch_offset_end
,count_compiles
,avg_compile_duration
,avg_bind_duration
,avg_bind_cpu_time
,avg_optimize_duration
,avg_optimize_cpu_time
,avg_compile_memory_kb
FROM QS_Query_stats_bck
WHERE
LEFT(query_sql_text,70) LIKE 'SELECT * FROM AdventureWorks.%'
ORDER BY Query_Name
As you can see, I have cleaned up the output by taking only the queries relating to AdventureWorks tables. In addition, I have chosen only the Query Store statistics concerning durations. With all this data, we simply run this R script:
DECLARE @SQL_heat NVARCHAR(MAX)
SET @SQL_heat = 'SELECT
 LEFT(query_sql_text,70) AS Query_Name
 ,last_compile_batch_offset_start
 ,last_compile_batch_offset_end
 ,count_compiles
 ,avg_compile_duration
 ,avg_bind_duration
 ,avg_bind_cpu_time
 ,avg_optimize_duration
 ,avg_optimize_cpu_time
 ,avg_compile_memory_kb
FROM QS_Query_stats_bck
WHERE
 LEFT(query_sql_text,70) LIKE ''SELECT * FROM AdventureWorks.%''
ORDER BY Query_Name';
DECLARE @RScript NVARCHAR(MAX)
SET @RScript = N'
 library(d3heatmap)
 All <- InputDataSet
 image_file <- tempfile()
 jpeg(filename = image_file, width = 500, height = 500)
 # sort data
 All <- All[order(All$avg_compile_duration),]
 #row_names
 row.names(All) <- All$Query_Name
 All <- All[,2:10]
 All_QS_matrix <- data.matrix(All)
 heatmap(All_QS_matrix, Rowv=NA, Colv=NA, col = heat.colors(256), scale="column", margins=c(5,10))
 dev.off()
 OutputDataSet <- data.frame(data=readBin(file(image_file, "rb"), what=raw(), n=1e6))'
EXECUTE sp_execute_external_script
 @language = N'R',
 @script = @RScript,
 @input_data_1 = @SQL_heat
WITH RESULT SETS ((Plot VARBINARY(MAX)));
GO
And again, we get a nice presentation of the attributes (statistics from the Query Store) across all the specified queries. Now we can see which queries are similar to or different from one another, or stand out on a particular attribute. By ordering the rows on a particular attribute beforehand, you can predefine the view and make it easier to read later.
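If you literally want an overview of how the duration attributes correlate with one another, a small variation, again my own sketch on top of the same QS_Query_stats_bck data, plots the correlation matrix itself as a heatmap:
DECLARE @RScript NVARCHAR(MAX)
SET @RScript = N'
 All <- InputDataSet
 image_file <- tempfile()
 jpeg(filename = image_file, width = 500, height = 500)
 # correlation matrix of the duration statistics, plotted as a symmetric heatmap
 corr_matrix <- cor(All, use = "complete.obs", method = "pearson")
 heatmap(corr_matrix, symm = TRUE, col = heat.colors(256), margins = c(10, 10))
 dev.off()
 OutputDataSet <- data.frame(data = readBin(file(image_file, "rb"), what = raw(), n = 1e6))'
DECLARE @SQL_corr NVARCHAR(MAX)
SET @SQL_corr = N'SELECT
 avg_compile_duration
 ,avg_bind_duration
 ,avg_bind_cpu_time
 ,avg_optimize_duration
 ,avg_optimize_cpu_time
 ,avg_compile_memory_kb
 FROM QS_Query_stats_bck
 WHERE LEFT(query_sql_text,70) LIKE ''SELECT * FROM AdventureWorks.%''';
EXECUTE sp_execute_external_script
 @language = N'R',
 @script = @RScript,
 @input_data_1 = @SQL_corr
WITH RESULT SETS ((Plot VARBINARY(MAX)));
GO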
And do not forget to turn off the Query Store if you no longer need it:
ALTER DATABASE [AdventureWorks] SET QUERY_STORE = OFF;
Conclusion
Using SQL Server R for any kind of DBA task – as we have shown here – is not always about hard-core statistics or predictive analytics; it can also be the simple statistical understanding that underlies the connections and relations between queries, gathered statistics, indexes, and other attributes. But we have also seen that predicting events which are usually only monitored can be a huge advantage for a DBA and a very welcome feature for core systems.
With R integrated into SQL Server, such daily, weekly, or monthly tasks can be automated to an extent that was not possible before, giving new insight to the DBA and to the people responsible for system maintenance.
Author: Tomaz Kastrun (tomaz.kastrun@gmail.com)
Twitter: @tomaz_tsql