SQLServerCentral Article

Using SQL Server and R Services for analyzing DBA Tasks


In previous articles we have shown where and how SQL Server R Services can be used, how to plot and visualize data, and what the benefits are. This article tackles a topic that many in the SQL community will find even more interesting: how a database administrator can profit from using R Services on the SQL Server instances they administer on a daily, weekly, or monthly basis. Some approaches might be unusual, some rather obvious, but it is putting them into practice that really brings the benefit.

While my previous two articles focused on the methodology for analyzing sales data, the focus of this article is not so much on methodology or algorithms but on tasks that a DBA may face. I will not go into a discussion of what the DBA tasks are, because they differ from DBA to DBA; work at a particular company might include or exclude particular tasks. Rather, I will focus on tasks that every enterprise needs to monitor and keep in operation, and for which it needs to know when and what action to take when particular events are triggered or alarms are raised.

So the typical and much needed tasks that need attention are:

  • database growth and potential free disk space
  • queries that take a high amount of resources to execute
  • statistics that can be gathered around these items.

Therefore, we will focus on these tasks, bringing in R and statistics to give the DBA additional insight. Before exploring this, there are a couple of ground rules to point out: what a baseline is, and how a baseline is created and measured. Thresholds are also important, as they act as triggers for particular events: automatic ticket creation, escalation to different people, stopping processes, or any other kind of business rule. Besides the baseline, there are always thresholds for benchmarking and monitoring. These approaches matter because they usually provide very good ways of handling exceptions or unwanted events in your system.

Having said that, in this article I will focus on the following cases where R Services can be used with SQL Server:

  • Disk space usage
  • Clustering queries
  • Plotting a heatmap of executed queries based on query statistics

Let me know your feedback, so we can also cover the following three topics in the next article: parameter sniffing, stale statistics, and heap tables.

Each of these cases will use the R language for statistical analysis, data visualization, and predictive analysis.

Disk space usage

Authors, database administrators, developers, MVPs, and others have written about storage, data volumes, and monitoring capacity over time. Systems have requirements and businesses have needs that must be met for an application to work normally. I am sure we have all found ourselves in an "accidental DBA" situation where disk usage went sky high during the night and nobody knew why. This is why it is important to have a baseline captured. A baseline is like a control group that tells you whether the situation in your test group is unusual or normal. In IT, the baseline is normally the reference state, measured and recorded at a planned time or in a planned environment. With it, the DBA can spot unplanned and unwanted activity or unusual system behavior, especially when comparing against the baseline. Raising a red flag in this manner results in fewer false positives.

In addition, we always come across disk size problems. Building on this, we will demo database growth, store the collected measurements, and then run predictions against them so that, in the end, the DBA can predict when disk space problems are to be expected.

To illustrate this scenario, I will create a small database of 8 MB with no possibility of growth. I will create two log tables: one will serve as a baseline, DataPack_Info_SMALL, and one will serve as a so-called everyday log, where everything is stored so that unexpected cases or undesired behavior can be examined. The latter will persist in the table DataPack_Info_LARGE.

First, create a database:

USE [master];
GO
CREATE DATABASE FixSizeDB
 CONTAINMENT = NONE
 ON  PRIMARY
( NAME = N'FixSizeDB', FILENAME = N'C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\MSSQL\DATA\FixSizeDB.mdf' , 
SIZE = 8192KB , FILEGROWTH = 0)
 LOG ON
( NAME = N'FixSizeDB_log', FILENAME = N'C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\MSSQL\DATA\FixSizeDB_log.ldf',
SIZE = 8192KB , FILEGROWTH = 0)
GO
ALTER DATABASE [FixSizeDB] SET COMPATIBILITY_LEVEL = 130
GO
ALTER DATABASE [FixSizeDB] SET RECOVERY SIMPLE
GO

We will use the following table for generating inserts and, later, deletes:

CREATE TABLE DataPack
       (
         DataPackID BIGINT IDENTITY NOT NULL
        ,col1 VARCHAR(1000) NOT NULL
        ,col2 VARCHAR(1000) NOT NULL
        )
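
The two log tables referenced below, DataPack_Info_SMALL and DataPack_Info_LARGE, are not created by any statement shown in this article. A minimal sketch of their DDL, with column names taken from the INSERT statements that follow and column types assumed, might look like this:

-- Assumed DDL for the two log tables (not shown in the original demo);
-- column types are inferred from the INSERT statements used later
CREATE TABLE dbo.DataPack_Info_SMALL
       (
         TableName NVARCHAR(128)
        ,SchemaName NVARCHAR(128)
        ,RowCounts BIGINT
        ,TotalSpaceKB BIGINT
        ,UsedSpaceKB BIGINT
        ,UnusedSpaceKB BIGINT
        ,TimeMeasure DATETIME
        );
GO
CREATE TABLE dbo.DataPack_Info_LARGE
       (
         TableName NVARCHAR(128)
        ,SchemaName NVARCHAR(128)
        ,RowCounts BIGINT
        ,TotalSpaceKB BIGINT
        ,UsedSpaceKB BIGINT
        ,UnusedSpaceKB BIGINT
        ,TimeMeasure DATETIME
        ,Operation VARCHAR(10)
        ,NofRowsOperation INT
        );
GO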

Populating the DataPack table will be done with the following simple WHILE loop:

DECLARE @i INT = 1;
BEGIN TRAN
       WHILE @i <= 1000
              BEGIN
                     INSERT dbo.DataPack(col1, col2)
                           SELECT
                                    REPLICATE('A',200)
                                   ,REPLICATE('B',300);
                     SET @i = @i + 1;
              END
COMMIT;
GO

Instead of using the sp_spaceused procedure, we will capture disk space changes with the following query:

SELECT
    t.NAME AS TableName
    ,s.Name AS SchemaName
    ,p.rows AS RowCounts
    ,SUM(a.total_pages) * 8 AS TotalSpaceKB
    ,SUM(a.used_pages) * 8 AS UsedSpaceKB
    ,(SUM(a.total_pages) - SUM(a.used_pages)) * 8 AS UnusedSpaceKB
FROM
    sys.tables t
INNER JOIN sys.indexes AS i
       ON t.OBJECT_ID = i.object_id
INNER JOIN sys.partitions AS p
       ON i.object_id = p.OBJECT_ID
       AND i.index_id = p.index_id
INNER JOIN sys.allocation_units AS a
       ON p.partition_id = a.container_id
LEFT OUTER JOIN sys.schemas AS s
       ON t.schema_id = s.schema_id
WHERE
            t.NAME NOT LIKE 'dt%'
    AND t.is_ms_shipped = 0
    AND i.OBJECT_ID > 255
    AND t.Name = 'DataPack'
GROUP BY t.Name, s.Name, p.Rows

The baseline log table (DataPack_Info_SMALL) will be filled along with the DataPack table in order to capture immediate changes in disk space:

DECLARE @nof_steps INT = 0
WHILE @nof_steps < 15
BEGIN
       BEGIN TRAN
              -- insert some data
              DECLARE @i INT = 1;
               WHILE @i <= 1000 -- 1,000 rows per step
                                  BEGIN
                                         INSERT dbo.DataPack(col1, col2)
                                                SELECT
                                                         REPLICATE('A',FLOOR(RAND()*200))
                                                        ,REPLICATE('B',FLOOR(RAND()*300));
                                         SET @i = @i + 1;
                                  END
              -- run statistics on table
              INSERT INTO dbo.DataPack_Info_SMALL
              SELECT
                     t.NAME AS TableName
                     ,s.Name AS SchemaName
                     ,p.rows AS RowCounts
                     ,SUM(a.total_pages) * 8 AS TotalSpaceKB
                     ,SUM(a.used_pages) * 8 AS UsedSpaceKB
                     ,(SUM(a.total_pages) - SUM(a.used_pages)) * 8 AS UnusedSpaceKB
                     ,GETDATE() AS TimeMeasure
              FROM 
                           sys.tables AS t
                     INNER JOIN sys.indexes AS i
                     ON t.OBJECT_ID = i.object_id
                     INNER JOIN sys.partitions AS p
                     ON i.object_id = p.OBJECT_ID
                     AND i.index_id = p.index_id
                     INNER JOIN sys.allocation_units AS a
                     ON p.partition_id = a.container_id
                     LEFT OUTER JOIN sys.schemas AS s
                     ON t.schema_id = s.schema_id
              WHERE
                            t.NAME NOT LIKE 'dt%'
                     AND t.is_ms_shipped = 0
                     AND t.name = 'DataPack'
                     AND i.OBJECT_ID > 255
              GROUP BY t.Name, s.Name, p.Rows
              WAITFOR DELAY '00:00:02'
              SET @nof_steps = @nof_steps + 1
       COMMIT;
END

This will serve as our baseline when comparing the results. When querying the table DataPack_Info_SMALL, the results are the following:

We can use R to check how the columns RowCounts and UsedSpaceKB correlate. To do this, we will use the external procedure sp_execute_external_script:

DECLARE @RScript nvarchar(max)
SET @RScript = N'
                            library(Hmisc)     
                            mydata <- InputDataSet
                            all_sub <- mydata[2:3]
                            c <- cor(all_sub, use="complete.obs", method="pearson")
                            t <- rcorr(as.matrix(all_sub), type="pearson")
                            c <- data.frame(c)
                            OutputDataSet <- c'
DECLARE @SQLScript nvarchar(max)
SET @SQLScript = N'SELECT
                                          TableName
                                         ,RowCounts
                                         ,UsedSpaceKB
                                         ,TimeMeasure
                                         FROM DataPack_Info_SMALL'
EXECUTE sp_execute_external_script
        @language = N'R'
       ,@script = @RScript
       ,@input_data_1 = @SQLScript
       WITH result SETS ((RowCounts VARCHAR(100)
                         ,UsedSpaceKB  VARCHAR(100)));
GO

As a result we get a strong and positive correlation between columns RowCounts and UsedSpaceKB.

This can easily be interpreted: when the value of RowCounts goes up, the value of UsedSpaceKB also goes up. This is the only logical explanation; it would be somewhat strange to see a negative correlation. Now we will simulate random deletes alongside the inserts and observe the behavior with the following code:

DECLARE @nof_steps INT = 0
WHILE @nof_steps < 15
BEGIN
       BEGIN TRAN
              -- insert some data
              DECLARE @i INT = 1;
              DECLARE @insertedRows INT = 0;
              DECLARE @deletedRows INT = 0;
              DECLARE @Rand DECIMAL(10,2) = RAND()*10
              IF @Rand < 5
                BEGIN
                                  WHILE @i <= 1000 -- 1,000 rows per step
                                                       BEGIN
                                                              INSERT dbo.DataPack(col1, col2)
                                                                     SELECT
                                                                             REPLICATE('A',FLOOR(RAND()*200))  -- pages are filling up differently
                                                                            ,REPLICATE('B',FLOOR(RAND()*300));
                                                               SET @i = @i + 1;
                                                       END
                                  SET @insertedRows = 1000                                
                     END
               IF @Rand  >= 5
                     BEGIN                                             
                                  SET @deletedRows = (SELECT COUNT(*) FROM dbo.DataPack WHERE DataPackID % 3 = 0)
                                  DELETE FROM dbo.DataPack
                                                WHERE
                                  DataPackID % 3 = 0 OR DataPackID % 5 = 0
                     END
              -- run statistics on table
              INSERT INTO dbo.DataPack_Info_LARGE
              SELECT
                     t.NAME AS TableName
                     ,s.Name AS SchemaName
                     ,p.rows AS RowCounts
                     ,SUM(a.total_pages) * 8 AS TotalSpaceKB
                     ,SUM(a.used_pages) * 8 AS UsedSpaceKB
                     ,(SUM(a.total_pages) - SUM(a.used_pages)) * 8 AS UnusedSpaceKB
                     ,GETDATE() AS TimeMeasure
                     ,CASE WHEN @Rand < 5 THEN 'Insert'
                             WHEN @Rand >= 5 THEN 'Delete'
                             ELSE 'meeeh' END AS Operation
                     ,CASE WHEN @Rand < 5 THEN @insertedRows
                             WHEN @Rand >= 5 THEN @deletedRows
                             ELSE 0 END AS NofRowsOperation
              FROM 
                           sys.tables AS t
                     INNER JOIN sys.indexes AS i
                     ON t.OBJECT_ID = i.object_id
                     INNER JOIN sys.partitions AS p
                     ON i.object_id = p.OBJECT_ID
                     AND i.index_id = p.index_id
                     INNER JOIN sys.allocation_units AS a
                     ON p.partition_id = a.container_id
                     LEFT OUTER JOIN sys.schemas AS s
                     ON t.schema_id = s.schema_id
              WHERE
                            t.NAME NOT LIKE 'dt%'
                     AND t.is_ms_shipped = 0
                     AND t.name = 'DataPack'
                     AND i.OBJECT_ID > 255
              GROUP BY t.Name, s.Name, p.Rows
              WAITFOR DELAY '00:00:01'
              SET @nof_steps = @nof_steps + 1
       COMMIT;
END

We have also added a DELETE statement, as well as tracking of the affected row counts, so that the demo is not so straightforward.

Next, we calculate the correlation coefficient again:

DECLARE @RScript nvarchar(max)
SET @RScript = N'
                            library(Hmisc)     
                            mydata <- InputDataSet
                            all_sub <- mydata[2:3]
                            c <- cor(all_sub, use="complete.obs", method="pearson")
                            c <- data.frame(c)
                            OutputDataSet <- c'
DECLARE @SQLScript nvarchar(max)
SET @SQLScript = N'SELECT
                                          TableName
                                         ,RowCounts
                                         ,UsedSpaceKB
                                         ,TimeMeasure
                                         FROM DataPack_Info_LARGE'
EXECUTE sp_execute_external_script
        @language = N'R'
       ,@script = @RScript
       ,@input_data_1 = @SQLScript
       WITH result SETS ( (
                                          RowCounts VARCHAR(100)
                                         ,UsedSpaceKB  VARCHAR(100)
                         )); 
GO

We can see that we still get a very strong, positive correlation:

We will now compare our LARGE test with the baseline by computing the same correlation coefficient on two different datasets: first on our baseline (DataPack_Info_SMALL), and second on our test table (DataPack_Info_LARGE):

DECLARE @RScript1 nvarchar(max)
SET @RScript1 = N'
                            library(Hmisc)     
                            mydata <- InputDataSet
                            all_sub <- mydata[4:5]
                            c <- cor(all_sub, use="complete.obs", method="pearson")
                            c <- data.frame(c)
                            OutputDataSet <- c'
DECLARE @SQLScript1 nvarchar(max)
SET @SQLScript1 = N'SELECT
                                          TableName
                                         ,RowCounts
                                         ,TimeMeasure
                                         ,UsedSpaceKB 
                                         ,UnusedSpaceKB
                                         FROM DataPack_Info_SMALL
                                         WHERE RowCounts <> 0'
EXECUTE sp_execute_external_script
        @language = N'R'
       ,@script = @RScript1
       ,@input_data_1 = @SQLScript1
       WITH result SETS ( (
                                          UsedSpaceKB VARCHAR(100)
                                         ,UnusedSpaceKB  VARCHAR(100)
                                         ));

DECLARE @RScript2 nvarchar(max)
SET @RScript2 = N'
                            library(Hmisc)     
                            mydata <- InputDataSet
                            all_sub <- mydata[4:5]
                            c <- cor(all_sub, use="complete.obs", method="pearson")
                            c <- data.frame(c)
                            OutputDataSet <- c'
DECLARE @SQLScript2 nvarchar(max)
SET @SQLScript2 = N'SELECT
                                          TableName
                                         ,RowCounts
                                         ,TimeMeasure
                                         ,UsedSpaceKB 
                                         ,UnusedSpaceKB
                                         FROM DataPack_Info_LARGE
                                         WHERE NofRowsOperation <> 0
                                         AND RowCounts <> 0'
EXECUTE sp_execute_external_script
        @language = N'R'
       ,@script = @RScript2
       ,@input_data_1 = @SQLScript2
       WITH result SETS ( (
                                          UsedSpaceKB VARCHAR(100)
                                         ,UnusedSpaceKB  VARCHAR(100)
                                         )
                                   );
GO

Here are the results:

The results are very interesting. The baseline shows practically no correlation between UsedSpaceKB and UnusedSpaceKB (-0.049), whereas our test shows a negative correlation that is roughly twice as strong (-0.109). A couple of words on this: it shows that UsedSpaceKB is negatively correlated with UnusedSpaceKB. The value is still too small to draw any concrete conclusions, but it demonstrates how a slight change in workload can show up as a difference in a simple correlation.

You can gather disk space usage information with T-SQL, with PowerShell, by implementing a .NET assembly, by creating a SQL Server Agent job, or in any other way. The important part, and the biggest advantage, is that by using R on the collected data you are no longer only monitoring and reacting to past data, but are also able to predict what will happen.
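
One possible way to capture volume-level free space, for example, is the sys.dm_os_volume_stats dynamic management function; this short query (not part of the demo tables above) returns total and free space for every volume hosting a database file:

-- Free space on the volumes hosting database files
SELECT DISTINCT
     vs.volume_mount_point
    ,vs.total_bytes / 1048576 AS total_MB
    ,vs.available_bytes / 1048576 AS free_MB
FROM sys.master_files AS mf
CROSS APPLY sys.dm_os_volume_stats(mf.database_id, mf.file_id) AS vs;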

Let's go a step further and work with the following query and dataset taken from the sample we created:

SELECT
  TableName
 ,Operation
 ,NofRowsOperation
 ,UsedSpaceKB
 ,UnusedSpaceKB
FROM dbo.DataPack_Info_LARGE

We will predict the size of UsedSpaceKB based on historical data. Our input will be TableName, Operation, and NofRowsOperation for a given number of rows to predict on. I will be using a generalized linear model (GLM) algorithm to predict UsedSpaceKB.

Before you all jump in and say this is absurd and cannot be done because of caching, page splits, indexes, stale statistics, and many other parameters, I would point out that all this information can be added to the algorithm and make the prediction even better. Since my queries are very simple INSERT and DELETE statements, you should also know what kind of queries you are predicting for. In addition, such an approach can also be useful for code testing, unit testing, stress testing before deployment, etc.

With the following R code we can start creating predictions:

-- GLM prediction
DECLARE @SQL_input AS NVARCHAR(MAX)
SET @SQL_input = N'SELECT
                                  TableName
                                  ,CASE WHEN Operation = ''Insert'' THEN 1 ELSE 0 END AS Operation
                                  ,NofRowsOperation
                                  ,UsedSpaceKB
                                  ,UnusedSpaceKB
                                   FROM dbo.DataPack_Info_LARGE
                                   WHERE
                                         NofRowsOperation <> 0';
DECLARE @R_code AS NVARCHAR(MAX)
SET @R_code = N'library(RevoScaleR)
                library(dplyr)
                DPLogR <- rxGlm(UsedSpaceKB ~ Operation + NofRowsOperation + UnusedSpaceKB, data = DataPack_info, family = Gamma)
                df_predict <- data.frame(TableName=("DataPack"), Operation=(1), NofRowsOperation=(451), UnusedSpaceKB=(20))
                predictions <- rxPredict(modelObject = DPLogR, data = df_predict, outData = NULL,  
                                predVarNames = "UsedSpaceKB", type = "response",checkFactorLevels=FALSE);
                OutputDataSet <- predictions'
EXEC sys.sp_execute_external_script
     @language = N'R'
    ,@script = @R_code
    ,@input_data_1 = @SQL_input
    ,@input_data_1_name = N'DataPack_info'
       WITH RESULT SETS ((
                         UsedSpaceKB_predict INT
                         ));
GO

With this code, I predict the size of UsedSpaceKB based on the following data:

As presented in R, the code is:

df_predict <- data.frame(TableName=("DataPack"), Operation=(1), NofRowsOperation=(451), UnusedSpaceKB=(20))

Of course, a couple of things to clear up. The preceding R code wrapped in sp_execute_external_script would work much better as a stored procedure with input parameters for the columns TableName, Operation, NofRowsOperation, and UnusedSpaceKB. Furthermore, to avoid unnecessary computation time for model building, it is usual practice to store a serialized model in a SQL table and simply deserialize it when running predictions. Lastly, since this was just a demo, make sure that the numbers used in the predictions make sense; as seen in our example, UsedSpaceKB would be much better predicted as an absolute per-operation value rather than a cumulative one, with the cumulative value calculated only afterwards.
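
As a hedged sketch of that serialized-model practice (the table dbo.R_Models and its layout are hypothetical, not part of the demo), the model could be trained once, serialized to varbinary(max), stored, and deserialized at prediction time:

-- Hypothetical sketch: train once, store the serialized model, reuse it for scoring
CREATE TABLE dbo.R_Models (model_name SYSNAME PRIMARY KEY, model VARBINARY(MAX));
GO
DECLARE @model VARBINARY(MAX);
EXEC sys.sp_execute_external_script
     @language = N'R'
    ,@script = N'library(RevoScaleR)
                 DPLogR <- rxGlm(UsedSpaceKB ~ Operation + NofRowsOperation + UnusedSpaceKB,
                                 data = DataPack_info, family = Gamma)
                 model <- serialize(DPLogR, connection = NULL)'
    ,@input_data_1 = N'SELECT CASE WHEN Operation = ''Insert'' THEN 1 ELSE 0 END AS Operation
                             ,NofRowsOperation, UsedSpaceKB, UnusedSpaceKB
                       FROM dbo.DataPack_Info_LARGE WHERE NofRowsOperation <> 0'
    ,@input_data_1_name = N'DataPack_info'
    ,@params = N'@model VARBINARY(MAX) OUTPUT'
    ,@model = @model OUTPUT;
INSERT INTO dbo.R_Models (model_name, model) VALUES (N'DPLogR', @model);
GO
-- At prediction time, pass the stored model back in and deserialize it
DECLARE @model VARBINARY(MAX) = (SELECT model FROM dbo.R_Models WHERE model_name = N'DPLogR');
EXEC sys.sp_execute_external_script
     @language = N'R'
    ,@script = N'library(RevoScaleR)
                 DPLogR <- unserialize(as.raw(model))
                 df_predict <- data.frame(Operation = 1, NofRowsOperation = 451, UnusedSpaceKB = 20)
                 OutputDataSet <- rxPredict(DPLogR, data = df_predict, predVarNames = "UsedSpaceKB",
                                            type = "response", checkFactorLevels = FALSE)'
    ,@params = N'@model VARBINARY(MAX)'
    ,@model = @model
WITH RESULT SETS ((UsedSpaceKB_predict FLOAT));
GO

With this pattern, retraining only happens when the workload changes, while everyday predictions just read the stored model.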

To sum up this rather long demo, let's create a procedure and run a couple of predictions to get a feeling for how this works. The stored procedure is:

CREATE PROCEDURE Predict_UsedSpace
(
 @TableName NVARCHAR(100)
,@Operation CHAR(1)  -- 1  = Insert; 0 = Delete
,@NofRowsOperation NVARCHAR(10)
,@UnusedSpaceKB NVARCHAR(10)
)
AS
DECLARE @SQL_input AS NVARCHAR(MAX)
SET @SQL_input = N'SELECT
                                  TableName
                                  ,CASE WHEN Operation = ''Insert'' THEN 1 ELSE 0 END AS Operation
                                  ,NofRowsOperation
                                  ,UsedSpaceKB
                                  ,UnusedSpaceKB
                                   FROM dbo.DataPack_Info_LARGE
                                   WHERE
                                         NofRowsOperation <> 0';
DECLARE @R_code AS NVARCHAR(MAX)
SET @R_code = N'library(RevoScaleR)
                DPLogR <- rxGlm(UsedSpaceKB ~ Operation + NofRowsOperation + UnusedSpaceKB, data = DataPack_info, family = Gamma)
                df_predict <- data.frame(TableName=("'+@TableName+'"), Operation=('+@Operation+'), 
                          NofRowsOperation=('+@NofRowsOperation+'), UnusedSpaceKB=('+@UnusedSpaceKB+'))
                predictions <- rxPredict(modelObject = DPLogR, data = df_predict, outData = NULL,  predVarNames = "UsedSpaceKB", type = "response",checkFactorLevels=FALSE);
                OutputDataSet <- predictions'
EXEC sys.sp_execute_external_script
     @language = N'R'
    ,@script = @R_code
    ,@input_data_1 = @SQL_input
    ,@input_data_1_name = N'DataPack_info'
WITH RESULT SETS ((
                                    UsedSpaceKB_predict INT
                                   ));
GO

Now running the procedure two times in a row:

EXECUTE Predict_UsedSpace
                     @TableName = 'DataPack'
                     ,@Operation = 1
                     ,@NofRowsOperation = 120
                     ,@UnusedSpaceKB = 2;
GO
EXECUTE Predict_UsedSpace
                     @TableName = 'DataPack'
                     ,@Operation = 1
                     ,@NofRowsOperation = 500
                     ,@UnusedSpaceKB = 12;
GO

We get these results:

Both predictions of used disk space are based on our demo data. Of course, some baseline statistics could also be included to make the predictions even better. And with every model, we also need to test how good its predictions are.
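
A rough sketch of such a test, under two assumptions not in the original demo (a random 80/20 train/test split and RMSE as the quality measure), could look like this:

-- Hold out 20 % of the collected rows and measure the prediction error (RMSE)
DECLARE @R_code NVARCHAR(MAX) = N'library(RevoScaleR)
    set.seed(2906)
    idx   <- sample(nrow(DataPack_info), size = floor(0.8 * nrow(DataPack_info)))
    train <- DataPack_info[idx, ]
    test  <- DataPack_info[-idx, ]
    DPLogR <- rxGlm(UsedSpaceKB ~ Operation + NofRowsOperation + UnusedSpaceKB,
                    data = train, family = Gamma)
    pred <- rxPredict(DPLogR, data = test, predVarNames = "Predicted",
                      type = "response", checkFactorLevels = FALSE)
    OutputDataSet <- data.frame(RMSE = sqrt(mean((test$UsedSpaceKB - pred$Predicted)^2)))';
DECLARE @SQL_input NVARCHAR(MAX) = N'SELECT
                                      CASE WHEN Operation = ''Insert'' THEN 1 ELSE 0 END AS Operation
                                     ,NofRowsOperation
                                     ,UsedSpaceKB
                                     ,UnusedSpaceKB
                                     FROM dbo.DataPack_Info_LARGE
                                     WHERE NofRowsOperation <> 0';
EXEC sys.sp_execute_external_script
     @language = N'R'
    ,@script = @R_code
    ,@input_data_1 = @SQL_input
    ,@input_data_1_name = N'DataPack_info'
WITH RESULT SETS ((RMSE FLOAT));
GO

The lower the RMSE relative to the typical UsedSpaceKB values, the more the predictions can be trusted.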

Clustering queries

In previous articles, we have talked about clustering sales data. Because the visualization technique is relatively easy to read and makes it quick to find the queries that might take a larger amount of resources, I have decided to include clustering here. Clustering executed queries helps detect where problems are piling up.

For this demo, I will use the WideWorldImportersDW database and create some random reads and I/O. First we will clear the procedure (plan) cache for the database:

-- clean buffer
DECLARE @dbid INTEGER
SELECT @dbid = [dbid]
FROM master..sysdatabases
WHERE name = 'WideWorldImportersDW'
DBCC FLUSHPROCINDB (@dbid);
GO

and then create some random load:

USE WideWorldImportersDW;
GO
-- Arbitrary query:
SELECT cu.[Customer Key] AS CustomerKey, cu.Customer,
  ci.[City Key] AS CityKey, ci.City,
  ci.[State Province] AS StateProvince, ci.[Sales Territory] AS SalesTeritory,
  d.Date, d.[Calendar Month Label] AS CalendarMonth,
  d.[Calendar Year] AS CalendarYear,
  s.[Stock Item Key] AS StockItemKey, s.[Stock Item] AS Product, s.Color,
  e.[Employee Key] AS EmployeeKey, e.Employee,
  f.Quantity, f.[Total Excluding Tax] AS TotalAmount, f.Profit
FROM Fact.Sale AS f
  INNER JOIN Dimension.Customer AS cu
    ON f.[Customer Key] = cu.[Customer Key]
  INNER JOIN Dimension.City AS ci
    ON f.[City Key] = ci.[City Key]
  INNER JOIN Dimension.[Stock Item] AS s
    ON f.[Stock Item Key] = s.[Stock Item Key]
  INNER JOIN Dimension.Employee AS e
    ON f.[Salesperson Key] = e.[Employee Key]
  INNER JOIN Dimension.Date AS d
    ON f.[Delivery Date Key] = d.Date;
GO 3

-- FactSales Query
SELECT * FROM Fact.Sale
GO 4

-- Person Query
SELECT * FROM [Dimension].[Customer]
WHERE [Buying Group] <> 'Tailspin Toys'
  OR [WWI Customer ID] > 500
ORDER BY [Customer],[Bill To Customer]
GO 4

-- Purchase Query
SELECT *
FROM [Fact].[Order] AS o
INNER JOIN [Fact].[Purchase] AS p
ON o.[Order Key] = p.[WWI Purchase Order ID]
GO 3

And we will collect query statistics for this purpose:

SELECT
    (total_logical_reads + total_logical_writes) AS total_logical_io
    ,(total_logical_reads / execution_count) AS avg_logical_reads
    ,(total_logical_writes / execution_count) AS avg_logical_writes
    ,(total_physical_reads / execution_count) AS avg_phys_reads
    ,substring(st.text,(qs.statement_start_offset / 2) + 1,  ((CASE qs.statement_end_offset
                                                                WHEN - 1 THEN datalength(st.text)
                                                                ELSE qs.statement_end_offset END  - qs.statement_start_offset) / 2) + 1) AS statement_text
    ,*
INTO query_stats_LOG_2
FROM
        sys.dm_exec_query_stats AS qs
    CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY
total_logical_io DESC

Once we have the data collected, we can use it in the R script:

CREATE PROCEDURE [dbo].[SP_Query_Stats_Cluster]
AS
DECLARE @RScript nvarchar(max)
SET @RScript = N'
                 library(cluster)
                 All <- InputDataSet
                 image_file <- tempfile()
                 jpeg(filename = image_file, width = 500, height = 500)
                    d <- dist(All, method = "euclidean")
                    fit <- hclust(d, method="ward.D")
                    plot(fit,xlab=" ", ylab=NULL, main=NULL, sub=" ")
                    groups <- cutree(fit, k=3)
                    rect.hclust(fit, k=3, border="DarkRed")            
                 dev.off()
                 OutputDataSet <- data.frame(data=readBin(file(image_file, "rb"), what=raw(), n=1e6))'
DECLARE @SQLScript nvarchar(max)
SET @SQLScript = N'
                SELECT
                     [total_logical_io]
                    ,[avg_logical_reads]
                    ,[avg_phys_reads]
                    ,execution_count
                    ,[total_physical_reads]
                    ,[total_elapsed_time]
                    ,total_dop
                    ,[text]
                    ,ROW_NUMBER() OVER (order by (select 1)) as ID
                 FROM query_stats_LOG_2
                WHERE Number is null';
EXECUTE sp_execute_external_script
@language = N'R',
@script = @RScript,
@input_data_1 = @SQLScript
WITH RESULT SETS ((Plot varbinary(max)));
GO

Executing this procedure generates a graph.

This graph tells us that the queries group together as follows:

  • Group 1: queries 5, 8, 20, and 33
  • Group 2: queries 12 and 25
  • Group 3: all the others

We could rename the queries beforehand and give them proper names (one way to do this is sketched below), but when running a larger number of queries this might be somewhat time consuming. On my test run, it looks something like this:
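
One way to avoid reading off the numeric IDs is to label the dendrogram leaves with a truncated piece of the query text itself. A minimal sketch, assuming the [text] column is kept in the input data frame All, changes only the plot call inside the procedure:

# label the leaves with the first 30 characters of each query instead of the row IDs
plot(fit, labels = substr(All$text, 1, 30), xlab = " ", ylab = NULL, main = NULL, sub = " ")
rect.hclust(fit, k = 3, border = "DarkRed")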

Now let's query the log table for just the queries from Group 1 and Group 2:

SELECT
 * 
FROM
 (
   SELECT
                     [total_logical_io]
                    ,[avg_logical_reads]
                    ,[avg_phys_reads]
                    ,execution_count
                    ,[total_physical_reads]
                    ,[total_elapsed_time]
                    ,total_dop
                    ,[text]
                    ,ROW_NUMBER() OVER (order by (select 1)) as ID
                 FROM query_stats_LOG_2
                WHERE Number is null
) AS x
WHERE x.ID IN (5,20,8,33,12,25)

From here you can further analyze where your problems lie and which queries cause resource, stall, or performance issues. Of course, you also have to be very careful about what kind of statistics and data you are feeding the clustering algorithm. This will certainly affect how the grouping is formed, since all these statistics act as attributes, and we group based on the similarity of these attributes.
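
One hedged adjustment worth considering (not part of the procedure above) is to standardize the numeric attributes before computing the Euclidean distances, so that large-valued statistics such as total_elapsed_time do not dominate the grouping:

# standardize numeric attributes (zero mean, unit variance) before clustering
num_cols <- sapply(All, is.numeric)
All_std  <- scale(All[, num_cols])
d   <- dist(All_std, method = "euclidean")
fit <- hclust(d, method = "ward.D")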

Plotting a Heatmap of Executed Queries

Clustering helped us very quickly group queries that belong together based on the selected attributes. With a heatmap, we get an overview of the correlations between those attributes. Since a heatmap is a great way to present multiple correlations across several different observations (in our case, several queries), I have decided to add it to this article.

To start, we will prepare another demo:

USE AdventureWorks;
GO
--- Turn on the query store
ALTER DATABASE AdventureWorks SET QUERY_STORE = ON;  
GO
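
As a quick sanity check (not part of the original demo), you can confirm that the Query Store is actually collecting data by querying sys.database_query_store_options:

-- Check the Query Store state for the current database
SELECT actual_state_desc, desired_state_desc, current_storage_size_mb
FROM sys.database_query_store_options;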

We will take advantage of the Query Store and the information it makes available. We will execute the following queries to gather demo statistics in the Query Store.

-- run bunch of queries :-)
SELECT * FROM AdventureWorks.[Production].[ScrapReason];
SELECT * FROM AdventureWorks.[HumanResources].[Shift];
SELECT * FROM AdventureWorks.[Production].[ProductCategory];
SELECT * FROM AdventureWorks.[Purchasing].[ShipMethod];
SELECT * FROM AdventureWorks.[Production].[ProductCostHistory];
SELECT * FROM AdventureWorks.[Production].[ProductDescription];
SELECT * FROM AdventureWorks.[Sales].[ShoppingCartItem];
SELECT * FROM AdventureWorks.[Production].[ProductDocument];
SELECT * FROM AdventureWorks.[dbo].[DatabaseLog];
SELECT * FROM AdventureWorks.[Production].[ProductInventory];
SELECT * FROM AdventureWorks.[Sales].[SpecialOffer];
SELECT * FROM AdventureWorks.[Production].[ProductListPriceHistory];
SELECT * FROM AdventureWorks.[Person].[Address];
SELECT * FROM AdventureWorks.[Sales].[SpecialOfferProduct];
SELECT * FROM AdventureWorks.[Production].[ProductModel];
SELECT * FROM AdventureWorks.[Person].[AddressType];
SELECT * FROM AdventureWorks.[Person].[StateProvince];
SELECT * FROM AdventureWorks.[Production].[ProductModelIllustration];
SELECT * FROM AdventureWorks.[Production].[ProductModelProductDescriptionCulture];
SELECT * FROM AdventureWorks.[Production].[BillOfMaterials];
SELECT * FROM AdventureWorks.[Sales].[Store];
SELECT * FROM AdventureWorks.[Production].[ProductPhoto];
SELECT * FROM AdventureWorks.[Production].[ProductProductPhoto];
SELECT * FROM AdventureWorks.[Production].[TransactionHistory];
SELECT * FROM AdventureWorks.[Production].[ProductReview];
SELECT * FROM AdventureWorks.[Person].[BusinessEntity];
SELECT * FROM AdventureWorks.[Production].[TransactionHistoryArchive];
SELECT * FROM AdventureWorks.[Production].[ProductSubcategory];
SELECT * FROM AdventureWorks.[Person].[BusinessEntityAddress];
SELECT * FROM AdventureWorks.[Purchasing].[ProductVendor];
SELECT * FROM AdventureWorks.[Person].[BusinessEntityContact];
SELECT * FROM AdventureWorks.[Production].[UnitMeasure];
SELECT * FROM AdventureWorks.[Purchasing].[Vendor];
SELECT * FROM AdventureWorks.[Person].[ContactType];
SELECT * FROM AdventureWorks.[Sales].[CountryRegionCurrency];
SELECT * FROM AdventureWorks.[Person].[CountryRegion];
SELECT * FROM AdventureWorks.[Production].[WorkOrder];
SELECT * FROM AdventureWorks.[Purchasing].[PurchaseOrderDetail];
SELECT * FROM AdventureWorks.[Sales].[CreditCard];
SELECT * FROM AdventureWorks.[Production].[Culture];
SELECT * FROM AdventureWorks.[Production].[WorkOrderRouting];
SELECT * FROM AdventureWorks.[Sales].[Currency];
SELECT * FROM AdventureWorks.[Purchasing].[PurchaseOrderHeader];
SELECT * FROM AdventureWorks.[Sales].[CurrencyRate];
SELECT * FROM AdventureWorks.[Sales].[Customer];
SELECT * FROM AdventureWorks.[HumanResources].[Department];
SELECT * FROM AdventureWorks.[Production].[Document];
SELECT * FROM AdventureWorks.[Sales].[SalesOrderDetail];
SELECT * FROM AdventureWorks.[Person].[EmailAddress];
SELECT * FROM AdventureWorks.[HumanResources].[Employee];
SELECT * FROM AdventureWorks.[Sales].[SalesOrderHeader];
SELECT * FROM AdventureWorks.[HumanResources].[EmployeeDepartmentHistory];
SELECT * FROM AdventureWorks.[HumanResources].[EmployeePayHistory];
SELECT * FROM AdventureWorks.[Sales].[SalesOrderHeaderSalesReason];
SELECT * FROM AdventureWorks.[Sales].[SalesPerson];
SELECT * FROM AdventureWorks.[Production].[Illustration];
SELECT * FROM AdventureWorks.[HumanResources].[JobCandidate];
SELECT * FROM AdventureWorks.[Production].[Location];
SELECT * FROM AdventureWorks.[Person].[Password];
SELECT * FROM AdventureWorks.[dbo].[Orders];
SELECT * FROM AdventureWorks.[Sales].[SalesPersonQuotaHistory];
SELECT * FROM AdventureWorks.[Person].[Person];
SELECT * FROM AdventureWorks.[Sales].[SalesReason];
SELECT * FROM AdventureWorks.[Sales].[SalesTaxRate];
SELECT * FROM AdventureWorks.[Sales].[PersonCreditCard];
SELECT * FROM AdventureWorks.[Person].[PersonPhone];
SELECT * FROM AdventureWorks.[Sales].[SalesTerritory];
SELECT * FROM AdventureWorks.[Person].[PhoneNumberType];
SELECT * FROM AdventureWorks.[Production].[Product];
SELECT * FROM AdventureWorks.[Sales].[SalesTerritoryHistory];

Information on these executions will be returned with the following query:

-- Collect the data from Query Store
SELECT
      qsq.*
     ,query_sql_text
    INTO QS_Query_stats_bck
FROM sys.query_store_query AS qsq
JOIN sys.query_store_query_text AS qsqt
ON qsq.query_text_id = qsqt.query_text_id
WHERE
        Query_id >= 41
ORDER BY
        qsq.query_id

Once we have this, we can retrieve the statistics available in the Query Store with the following query:

-- for R table
SELECT  
     LEFT(query_sql_text,70) AS Query_Name
    ,last_compile_batch_offset_start
    ,last_compile_batch_offset_end
    ,count_compiles
    ,avg_compile_duration
    ,avg_bind_duration
    ,avg_bind_cpu_time
    ,avg_optimize_duration
    ,avg_optimize_cpu_time
    ,avg_compile_memory_kb
               
FROM QS_Query_stats_bck
WHERE
    LEFT(query_sql_text,70) LIKE 'SELECT * FROM AdventureWorks.%'
ORDER BY  Query_Name

As you can see, I have cleaned up the output by taking only the queries that relate to AdventureWorks tables. In addition, I have chosen only the Query Store statistics concerning durations. With all this data, we simply run this R script:

DECLARE @SQL_heat NVARCHAR(MAX)
SET @SQL_heat = 'SELECT  
     LEFT(query_sql_text,70) AS Query_Name
    ,last_compile_batch_offset_start
    ,last_compile_batch_offset_end
    ,count_compiles
    ,avg_compile_duration
    ,avg_bind_duration
    ,avg_bind_cpu_time
    ,avg_optimize_duration
    ,avg_optimize_cpu_time
    ,avg_compile_memory_kb
              
        FROM QS_Query_stats_bck
        WHERE
            LEFT(query_sql_text,70) LIKE ''SELECT * FROM AdventureWorks.%''
        ORDER BY  Query_Name';
DECLARE @RScript NVARCHAR(MAX)
SET @RScript = N'
                 All <- InputDataSet
                 image_file <- tempfile()
                 jpeg(filename = image_file, width = 500, height = 500)        
                    # sort data
                 All <- All[order(All$avg_compile_duration),]
                    #row_names
                 row.names(All) <- All$Query_Name
                 All <- All[,2:10]
                 All_QS_matrix <- data.matrix(All)                            
                 heatmap(All_QS_matrix, Rowv=NA, Colv=NA, col = heat.colors(256), scale="column", margins=c(5,10))
                 dev.off()
                 OutputDataSet <- data.frame(data=readBin(file(image_file, "rb"), what=raw(), n=1e6))'
EXECUTE sp_execute_external_script
@language = N'R',
@script = @RScript,
@input_data_1 = @SQL_heat
WITH RESULT SETS ((Plot VARBINARY(MAX)));
GO

And again, we get a nice presentation of the attributes (statistics from the Query Store) against all the specified queries. Now we can see which queries are similar to or different from one another, or with respect to a particular attribute. By ordering the rows on a particular attribute, as the script does on avg_compile_duration, the view becomes easier to read.

Do not forget to turn off the Query Store if you no longer need it:

ALTER DATABASE AdventureWorks SET QUERY_STORE = OFF;

Conclusion

Using SQL Server R Services for DBA tasks, as we have shown here, is not always about hard-core statistics or predictive analytics; it can also be a simple statistical understanding of the connections and relations between queries, gathered statistics, indexes, and other attributes. But we have also seen that predicting events that are usually only monitored can be a huge advantage for a DBA and a very welcome feature for core systems.

With R integrated into SQL Server, such daily, weekly, or monthly tasks can be automated to an extent not possible before, and this can give different insights to the DBA as well as to the people responsible for system maintenance.

Author: Tomaz Kastrun (tomaz.kastrun@gmail.com)

Twitter: @tomaz_tsql

Blog: http://tomaztsql.wordpress.com
