SQLServerCentral Article

Using SQL Server and R Services for analyzing Sales data (Part 3)


In my previous articles, I discussed R Services from a more technical perspective, from a general overview to installing packages, and covered the general concepts of using R with SQL Server in an enterprise environment. With everything installed, the next step is putting R Services to use.

The purpose of this article is to show examples of how R Services can support your business decisions, shorten the time and effort needed to extract insights from your OLTP system or business warehouse, and bring the analytical processes that might have belonged to a different team or set of tasks closer to the data.

We will use SQL Server's new sample database, WideWorldImporters (WWI). Since this is a retail-oriented database, we will show how we can benefit from using R Services to understand retail processes. Some of the analysis can, of course, be done in T-SQL, but many statistics can be run much faster and with fewer lines of code using the R integration. The more complex the statistics, the more the R integration pays off.

Exploring data of any nature helps close the gap between generating data and understanding it, two activities that are largely independent of each other. Data generation usually happens much faster and on a grander scale, so understanding the data and looking for patterns is how we identify opportunities that can be turned into profit.

Since WWI is a retail-oriented dataset, we have to take that into consideration. Some approaches, or algorithms, in R favor a particular kind of dataset, while others are applicable to any dataset. Since this is more a matter of statistics and mathematics, I will not go into further detail here.

This article will cover:

  • General WWI dataset understanding
  • How to do cluster analysis
  • Association rules analysis

For each of these areas, I will also give hints on visualizing the data.

Get a General Data Understanding

This process is by far the most time consuming and tedious; getting to see and “feel” the data simply takes time. An advantage of having R integration in T-SQL is that you can do this more easily and faster, and also bring statistics into the T-SQL world that might otherwise have required data analysts/wranglers to write additional T-SQL functions, procedures, or even CLR objects.

To run frequency counts of sales per city, your query would look something like this:

USE WideWorldImportersDW;
GO
SELECT
        c.[City Key]
       ,c.[City]
       ,COUNT(*) AS products_sold
       ,SUM(fs.quantity) AS qof_products_sold
FROM
     Fact.Sale AS fs
JOIN dimension.city AS c
ON c.[City Key] = fs.[City Key]
GROUP BY
        c.[City Key]
       ,c.[City]
ORDER BY products_sold DESC;
GO

Here are the first 11 rows:

A simple query with a simple result set usually raises more questions about the business meaning. Now that we know the number of products sold per city, we would definitely want more information: the percentage sold in each city, the sales of each city, and some descriptive statistics (min, max, variance, median, skewness, etc.). If we wanted to add the percentage of products sold to each row, we would need some additional T-SQL code.

By adding a simple line to the SELECT list

,CAST(COUNT(*)*1.0/(SELECT COUNT(*) FROM Fact.Sale)*100 AS DECIMAL (5,3)) AS pct_total

one can get the percentage of products sold by city.

Of course, this percentage can also be computed more easily with the OVER() window function:

,CAST((COUNT(*)*1.0/SUM(COUNT(*)) OVER())*100 AS DECIMAL(5,3)) AS pct_total

If we want a cumulative percent of total, so that we can also identify quantiles or percentiles, the T-SQL code expands. Using an OVER() clause with UNBOUNDED PRECEDING AND CURRENT ROW, I was able to get the running total.

,SUM(fs.quantity) OVER (ORDER BY c.[City] ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS runTot

Unfortunately, I had to wrap the pre-calculated values in an additional SELECT to get the summed quantity.

SELECT
        inner_select.[City Key]
       ,inner_select.City
       ,inner_select.products_sold
       ,inner_select.qof_products_sold
       ,inner_select.pct_total
       ,SUM(inner_select.qof_products_sold) OVER (ORDER BY rn ASC ROWS BETWEEN UNBOUNDED 
        PRECEDING AND CURRENT ROW) AS RunningTotal
FROM
       (
       SELECT
               c.[City Key]
              ,c.[City]
              ,COUNT(*) AS products_sold
              ,SUM(fs.quantity) AS qof_products_sold
              ,CAST(COUNT(*)*1.0/(SELECT COUNT(*) FROM Fact.Sale)*100 AS DECIMAL (5,3)) AS pct_total
              ,ROW_NUMBER() OVER (ORDER BY COUNT(*) DESC) AS rn -- HelperRow
       FROM
               Fact.Sale AS fs
       JOIN dimension.city AS c
       ON c.[City Key] = fs.[City Key]
       GROUP BY
               c.[City Key]
              ,c.[City]
       ) AS inner_select
GROUP BY
        inner_select.[City Key]
       ,inner_select.City
       ,inner_select.products_sold
       ,inner_select.qof_products_sold
       ,inner_select.pct_total
       ,rn
ORDER BY rn ASC;
GO

Here are the first few results:

Of course, similar calculations can also be done in SSRS, or any other tool, but for understanding the data, let’s see how the cumulative percent of total can be delivered using R.

As the example above shows, the complexity can grow very fast: grouping data and running cumulative calculations can be time consuming. This is where R integration plays a key role in easier data wrangling. To achieve the same results, we use the sp_execute_external_script procedure with the original query:

DECLARE @SQLStat NVARCHAR(4000)
SET @SQLStat = 'SELECT
                     c.[City Key]
                    ,c.[City]
                    ,COUNT(*) AS products_sold
                    ,SUM(fs.quantity) AS qof_products_sold
                FROM
                     Fact.Sale AS fs
                JOIN dimension.city AS c
                ON c.[City Key] = fs.[City Key]
                GROUP BY
                     c.[City Key]
                    ,c.[City]
                ORDER BY products_sold DESC;'
DECLARE @RStat NVARCHAR(4000)
SET @RStat = 'library(dplyr)
              OutputDataSet <- Sales %>%
                    mutate(
                             running_total = cumsum(qof_products_sold)
                            ,running_total_avg = cummean(qof_products_sold)    
                           )'
EXECUTE sp_execute_external_script
      @language = N'R'
    ,@script = @RStat
    ,@input_data_1 = @SQLStat
    ,@input_data_1_name = N'Sales'
WITH RESULT SETS ((
                     [City Key]  NVARCHAR(100)
                    ,[City]   NVARCHAR(100)
                    ,products_sold INT
                    ,qof_products_sold INT
                    ,running_total INT
                    ,running_total_avg DECIMAL(10,3)
                    ))

With results returned as a table in SSMS:

The results are the same, and I have added a running_total_avg column that computes the running average of the summed quantities.

A couple of words on the R code:

DECLARE @RStat NVARCHAR(4000)
SET @RStat = 'library(dplyr)
               OutputDataSet <- Sales %>%
               mutate(
               running_total = cumsum(qof_products_sold)
              ,running_total_avg = cummean(qof_products_sold)
              )'

Using the dplyr package (to learn how to install additional packages, please refer to my previous article) and simply calling the cumsum function, I get a running total of the qof_products_sold column. In addition, I have used the cummean function to see the running average. Comparing the amount of extra R code with the extra T-SQL code, the balance quickly tips in favor of R. The cumulative aggregates used here come partly from base R and partly from dplyr: cumsum(), cummin(), and cummax() are part of base R, while cumall(), cumany(), and cummean() come from the dplyr package.
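
As a quick illustration of how these cumulative functions behave, here is a toy example (a made-up data frame, not the WWI data):

library(dplyr)

df <- data.frame(qty = c(5, 3, 8, 2))
df %>% mutate(
         run_sum  = cumsum(qty)   # base R: 5, 8, 16, 18
        ,run_max  = cummax(qty)   # base R: 5, 5, 8, 8
        ,run_mean = cummean(qty)  # dplyr:  5, 4, 5.33, 4.5
       )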

Besides frequency and cumulative values, we can also compute descriptive statistics based on a slightly changed original query. Since we are using the Fact.Sale table, I will group the data by the sale key, i.e. the invoice:

SELECT
        [WWI Invoice ID] AS SalesID
       ,fs.[City Key] AS CityKey
       ,c.[City] AS City
       ,c.[State Province] AS StateProvince
       ,c.[Sales Territory] AS SalesTerritory
       ,fs.[Customer Key] AS CustomerKey
       ,COUNT(fs.[Stock Item Key]) AS Nof_Items
       ,SUM(fs.[Quantity]) AS Quantity
       ,SUM(fs.[Total Including Tax]) AS Total
       ,SUM(fs.[Profit]) AS Profit
FROM
              Fact.Sale AS fs
JOIN dimension.city AS c
ON c.[City Key] = fs.[City Key]
GROUP BY
        [WWI Invoice ID]
       ,fs.[City Key]
       ,c.[City]
       ,c.[State Province]
       ,c.[Sales Territory]
       ,fs.[Customer Key];

We can proceed with calculating descriptive statistics, such as mean, max, quantiles, median, variance, and standard deviation. All of these measures describe a variable; let us say we want descriptive statistics for the Quantity column of the Fact.Sale table, using R and sp_execute_external_script.

DECLARE @SQLStat NVARCHAR(4000)
SET @SQLStat = 'SELECT
                                   [WWI Invoice ID] AS SalesID
                                  ,fs.[City Key] AS CityKey
                                  ,c.[City] AS City
                                  ,c.[State Province] AS StateProvince
                                  ,c.[Sales Territory] AS SalesTerritory
                                  ,fs.[Customer Key] AS CustomerKey
                                  ,COUNT(fs.[Stock Item Key]) AS Nof_Items
                                  ,SUM(fs.[Quantity]) AS Quantity
                                  ,SUM(fs.[Total Including Tax]) AS Total
                                  ,SUM(fs.[Profit]) AS Profit
                           FROM
                                         Fact.Sale AS fs
                           JOIN dimension.city AS c
                           ON c.[City Key] = fs.[City Key]
                           GROUP BY
                                   [WWI Invoice ID]
                                  ,fs.[City Key]
                                  ,c.[City]
                                  ,c.[State Province]
                                  ,c.[Sales Territory]
                                  ,fs.[Customer Key];'
DECLARE @RStat NVARCHAR(4000)
SET @RStat = 'library(dplyr)
                       OutputDataSet <- Sales %>%
                                         group_by(SalesTerritory) %>%
                                         mutate(
                                         QuantityI = as.integer(Quantity)) %>%
                                         summarise(
                                          nof_invoices = n()
                                         ,avg_q = mean(QuantityI)
                                         ,med_q = median(QuantityI)
                                         ,sum_q = sum(QuantityI)
                                         ,min_q = min(QuantityI)
                                         ,max_q = max(QuantityI)
                                         ,var_q = var(QuantityI)
                                         ,sd_q = sd(QuantityI)
                                         ,quan25 = quantile(QuantityI, probs=0.25)
                                         ,quan50 = quantile(QuantityI, probs=0.50)
                                         ,quan75 = quantile(QuantityI, probs=0.75)
                                         ,perc99 = quantile(QuantityI, probs=0.99)
                                                                             )'
EXECUTE sp_execute_external_script
        @language = N'R'
       ,@script = @RStat
       ,@input_data_1 = @SQLStat
       ,@input_data_1_name = N'Sales'
WITH RESULT SETS ((
                                   SalesTerritory  NVARCHAR(100)
                                  ,nof_invoices INT
                                  ,avg_q DECIMAL(10,2)
                                  ,med_q INT
                                  ,sum_q INT
                                  ,min_q INT
                                  ,max_q INT
                                  ,var_q DECIMAL(10,2)
                                  ,sd_q  DECIMAL(10,2)
                                  ,quan25 INT
                                  ,quan50 INT
                                  ,quan75 INT
                                  ,perc99 INT
                                  ))

And the descriptive statistics per sales territory:

Now we have more insight into the sales data per sales territory based on quantity. Evidently, there is little difference between the sales territories in terms of quantity: the median and average values are the same, as are the variance (var_q) and standard deviation (sd_q).

At the beginning I pointed out that this part is usually the most time consuming, because one has to understand the data and gain insight into it. I deliberately chose quantity, since it depends more on the products than on the actual sales. The variable Total should yield more interesting results.

Running the same query with the selected column changed (Total instead of Quantity) yields these results:

There are some differences in the total revenue per sales territory, as well as in the minimum and maximum invoice values. This gives us some additional insight, but now we can also take advantage of R and check whether these differences are statistically significant. We will check whether the invoice totals differ significantly between sales territories with the following query:

DECLARE @Signif VARCHAR(1000)
DECLARE @SQLStat NVARCHAR(4000)
SET @SQLStat = 'SELECT
                                   [WWI Invoice ID] AS SalesID
                                  ,fs.[City Key] AS CityKey
                                  ,c.[City] AS City
                                  ,c.[State Province] AS StateProvince
                                  ,c.[Sales Territory] AS SalesTerritory
                                  ,fs.[Customer Key] AS CustomerKey
                                  ,COUNT(fs.[Stock Item Key]) AS Nof_Items
                                   ,SUM(fs.[Quantity]) AS Quantity
                                  ,SUM(fs.[Total Including Tax]) AS Total
                                  ,SUM(fs.[Profit]) AS Profit
                                  ,CASE
                                  WHEN c.[Sales Territory] = ''Rocky Mountain'' THEN 1
                                  WHEN c.[Sales Territory] = ''Mideast'' THEN 2
                                  WHEN c.[Sales Territory] = ''New England'' THEN 3
                                  WHEN c.[Sales Territory] = ''Plains'' THEN 4
                                  WHEN c.[Sales Territory] = ''Southeast'' THEN 5
                                  WHEN c.[Sales Territory] = ''Great Lakes'' THEN 6
                                  WHEN c.[Sales Territory] = ''Southwest'' THEN 7
                                  WHEN c.[Sales Territory] = ''Far West'' THEN 8
                                   END AS SalesTerritoryID
                           FROM
                                         Fact.Sale AS fs
                           JOIN dimension.city AS c
                           ON c.[City Key] = fs.[City Key]
                           GROUP BY
                                   [WWI Invoice ID]
                                  ,fs.[City Key]
                                  ,c.[City]
                                  ,c.[State Province]
                                  ,c.[Sales Territory]
                                  ,fs.[Customer Key]
                                  ,CASE
                                  WHEN c.[Sales Territory] = ''Rocky Mountain'' THEN 1
                                  WHEN c.[Sales Territory] = ''Mideast'' THEN 2
                                  WHEN c.[Sales Territory] = ''New England'' THEN 3
                                  WHEN c.[Sales Territory] = ''Plains'' THEN 4
                                  WHEN c.[Sales Territory] = ''Southeast'' THEN 5
                                  WHEN c.[Sales Territory] = ''Great Lakes'' THEN 6
                                  WHEN c.[Sales Territory] = ''Southwest'' THEN 7
                                  WHEN c.[Sales Territory] = ''Far West'' THEN 8
                                   END;'
DECLARE @RStat NVARCHAR(4000)
SET @RStat = 'library(Hmisc)
                     cor <- Sales[,c("Total", "SalesTerritoryID")]
                     results <- rcorr(as.matrix(cor), type="spearman")
                     Sig <- as.character(results$P[2,1])
                     OutputDataSet <- data.frame(Value_of_corr = results$r[2,1])';
EXECUTE sp_execute_external_script
        @language = N'R'
       ,@script = @RStat
       ,@input_data_1 = @SQLStat
       ,@input_data_1_name = N'Sales'
       ,@params = N'@Sig VARCHAR(1000) OUTPUT'
       ,@Sig = @Signif OUTPUT
WITH RESULT SETS (( Value_of_corr DECIMAL(7,6)  ))
                                                            
SELECT @Signif AS Stats_significance

And receive the following result in SSMS:

So looking at the R code:

SET @RStat = 'library(Hmisc)
                     cor <- Sales[,c("Total", "SalesTerritoryID")]
                     results <- rcorr(as.matrix(cor), type="spearman")
                     Sig <- as.character(results$P[2,1])
                     OutputDataSet <- data.frame(Value_of_corr = results$r[2,1])';

We wanted to check for a statistically significant association between SalesTerritory and the total revenue per territory. We used the Hmisc package, which gives us quick access to correlations between the variables of a matrix, and I chose the Spearman correlation coefficient. The coefficient is 0.0071 (I will neglect the minus sign), which means there is very little correlation, practically zero, and the statistical significance confirms this.
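
As a side note, here is a minimal standalone sketch of the structure rcorr() returns, which the script above indexes into (the data is made up purely for illustration):

library(Hmisc)
set.seed(1)
# toy stand-ins for the Total and SalesTerritoryID columns
m <- cbind(Total = rnorm(100), SalesTerritoryID = sample(1:8, 100, replace = TRUE))
res <- rcorr(m, type = "spearman")
res$r[2,1]   # the Spearman correlation coefficient
res$P[2,1]   # its p-value, i.e. the significance returned through @Sig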

I have also done some recoding: SalesTerritory was changed from a string to a numeric variable. In the query, I have also used the @params argument of sp_execute_external_script, just for the convenience of getting both values in separate SELECT statements / variables; I could also have unlisted and grouped both values into a data.frame in R.

Based on the data wrangling so far, we have found that there is little difference in the variance of the sales, although the total revenue differs among the sales territories.

Cluster analysis

So far we have seen how data wrangling is done: it usually takes time to prepare the data and to find the descriptive information and correlations in it. Cluster analysis, or clustering, on the other hand, is a technique that divides instances into natural groups, in which the instances (in our case, the points of sale) bear a stronger resemblance to one another than to instances in other groups. We will use hierarchical clustering, where you choose the number of clusters based on the dendrogram. To gain more understanding of the WWI sales data, we will use this undirected method.
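
Before running this on WWI, here is a minimal sketch of the hierarchical clustering workflow on toy data (base R only; the values are made up for illustration):

# 10 random points in two dimensions
set.seed(42)
x <- matrix(rnorm(20), ncol = 2)
hc <- hclust(dist(x), method = 'average')  # average-linkage clustering
plot(hc)                                   # dendrogram: pick a sensible cluster count
groups <- cutree(hc, k = 3)                # cut the tree into 3 clusters
table(groups)                              # cluster sizes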

I will use a slightly changed query that aggregates data by SalesTerritory and CustomerKey; both will give us some interesting results. I have also added two filters to the WHERE clause: the External sales territory is excluded, as are customers whose Customer Key equals 0. Such rules can be added based on the previous data wrangling step as well as the data cleaning step.

SELECT                           
       SUM(fs.[Profit]) AS Profit
       ,c.[Sales Territory] AS SalesTerritory
       ,CASE
                     WHEN c.[Sales Territory] = 'Rocky Mountain' THEN 1
                     WHEN c.[Sales Territory] = 'Mideast' THEN 2
                     WHEN c.[Sales Territory] = 'New England' THEN 3
                     WHEN c.[Sales Territory] = 'Plains' THEN 4
                     WHEN c.[Sales Territory] = 'Southeast' THEN 5
                     WHEN c.[Sales Territory] = 'Great Lakes' THEN 6
                     WHEN c.[Sales Territory] = 'Southwest' THEN 7
                     WHEN c.[Sales Territory] = 'Far West' THEN 8
        END AS SalesTerritoryID
       ,fs.[Customer Key] AS CustomerKey   
       ,SUM(fs.[Quantity]) AS Quantity
       ,SUM(fs.[Total Including Tax]) AS Total
                                 
FROM [Fact].[Sale] AS  fs
       JOIN dimension.city AS c
       ON c.[City Key] = fs.[City Key]
       WHERE
              fs.[customer key] <> 0
       AND c.[Sales Territory] NOT IN ('External')
GROUP BY
       c.[Sales Territory]
       ,fs.[Customer Key]
       ,CASE
                     WHEN c.[Sales Territory] = 'Rocky Mountain' THEN 1
                     WHEN c.[Sales Territory] = 'Mideast' THEN 2
                     WHEN c.[Sales Territory] = 'New England' THEN 3
                     WHEN c.[Sales Territory] = 'Plains' THEN 4
                     WHEN c.[Sales Territory] = 'Southeast' THEN 5
                     WHEN c.[Sales Territory] = 'Great Lakes' THEN 6
                     WHEN c.[Sales Territory] = 'Southwest' THEN 7
                     WHEN c.[Sales Territory] = 'Far West' THEN 8
        END
ORDER BY CustomerKey

First we need to find out how many clusters appear in our dataset. For this, I will use the following R code:

clusters <- hclust(dist(cust.data[,c(1,3,5)]), method = 'average')  # columns 1, 3, 5: Profit, SalesTerritoryID, Quantity
plot(clusters)  # draw the dendrogram

And I will get a plotted dendrogram in return:

The dendrogram shows a clean split into 3 clusters. With this information, I can proceed with the following query:

DECLARE @SQLStat NVARCHAR(4000)
SET @SQLStat = 'SELECT                                
       SUM(fs.[Profit]) AS Profit
       ,c.[Sales Territory] AS SalesTerritory
       ,CASE
                     WHEN c.[Sales Territory] = ''Rocky Mountain'' THEN 1
                     WHEN c.[Sales Territory] = ''Mideast'' THEN 2
                     WHEN c.[Sales Territory] = ''New England'' THEN 3
                     WHEN c.[Sales Territory] = ''Plains'' THEN 4
                     WHEN c.[Sales Territory] = ''Southeast'' THEN 5
                     WHEN c.[Sales Territory] = ''Great Lakes'' THEN 6
                     WHEN c.[Sales Territory] = ''Southwest'' THEN 7
                     WHEN c.[Sales Territory] = ''Far West'' THEN 8
        END AS SalesTerritoryID
       ,fs.[Customer Key] AS CustomerKey   
       ,SUM(fs.[Quantity]) AS Quantity
FROM [Fact].[Sale] AS  fs
       JOIN dimension.city AS c
       ON c.[City Key] = fs.[City Key]
       WHERE
              fs.[customer key] <> 0
       AND c.[Sales Territory] NOT IN (''External'')
GROUP BY
       c.[Sales Territory]
       ,fs.[Customer Key]
       ,CASE
                     WHEN c.[Sales Territory] = ''Rocky Mountain'' THEN 1
                     WHEN c.[Sales Territory] = ''Mideast'' THEN 2
                     WHEN c.[Sales Territory] = ''New England'' THEN 3
                     WHEN c.[Sales Territory] = ''Plains'' THEN 4
                     WHEN c.[Sales Territory] = ''Southeast'' THEN 5
                     WHEN c.[Sales Territory] = ''Great Lakes'' THEN 6
                     WHEN c.[Sales Territory] = ''Southwest'' THEN 7
                     WHEN c.[Sales Territory] = ''Far West'' THEN 8
        END ;'
DECLARE @RStat NVARCHAR(4000)
SET @RStat = 'clusters <- hclust(dist(Sales[,c(1,3,5)]), method = ''average'')
                       cl <- data.frame(cutree(clusters, 3))
                       OutputDataSet <- data.frame(cbind(Sales[,c(1,2,5,4)],cl))'
EXECUTE sp_execute_external_script
         @language = N'R'
       ,@script = @RStat
       ,@input_data_1 = @SQLStat
       ,@input_data_1_name = N'Sales'
WITH RESULT SETS ((
                                    Profit INT
                                   ,SalesTerritory NVARCHAR(1000)
                                   ,Quantity INT
                                   ,CustomerKey INT
                                   ,ClusterCut INT
                                  ))

This query returns the cluster membership of each customer in each sales territory.

In other words, we get additional information on how our dataset groups into clusters based on similarities. What we need to do now is plot these customers by sales territory, profit, and quantity of products sold, with the following R code:

ggplot(cust.data, aes(Total, Quantity, color = SalesTerritory)) + 
geom_point(alpha = 0.4, size = 2.5) + geom_point(col = clusterCut) +   
scale_color_manual(values = c('black', 'red', 'green','yellow','blue','lightblue','magenta','brown'))

With full query:

DECLARE @SQLStat NVARCHAR(4000)
SET @SQLStat = 'SELECT                                
       SUM(fs.[Profit]) AS Profit
       ,c.[Sales Territory] AS SalesTerritory
       ,CASE
                     WHEN c.[Sales Territory] = ''Rocky Mountain'' THEN 1
                     WHEN c.[Sales Territory] = ''Mideast'' THEN 2
                     WHEN c.[Sales Territory] = ''New England'' THEN 3
                     WHEN c.[Sales Territory] = ''Plains'' THEN 4
                     WHEN c.[Sales Territory] = ''Southeast'' THEN 5
                     WHEN c.[Sales Territory] = ''Great Lakes'' THEN 6
                     WHEN c.[Sales Territory] = ''Southwest'' THEN 7
                     WHEN c.[Sales Territory] = ''Far West'' THEN 8
        END AS SalesTerritoryID
       ,fs.[Customer Key] AS CustomerKey   
       ,SUM(fs.[Quantity]) AS Quantity
       ,SUM(fs.[Total Including Tax]) AS Total -- needed by the ggplot call below
FROM [Fact].[Sale] AS  fs
       JOIN dimension.city AS c
       ON c.[City Key] = fs.[City Key]
       WHERE
              fs.[customer key] <> 0
       AND c.[Sales Territory] NOT IN (''External'')
GROUP BY
       c.[Sales Territory]
       ,fs.[Customer Key]
       ,CASE
                     WHEN c.[Sales Territory] = ''Rocky Mountain'' THEN 1
                     WHEN c.[Sales Territory] = ''Mideast'' THEN 2
                     WHEN c.[Sales Territory] = ''New England'' THEN 3
                     WHEN c.[Sales Territory] = ''Plains'' THEN 4
                     WHEN c.[Sales Territory] = ''Southeast'' THEN 5
                     WHEN c.[Sales Territory] = ''Great Lakes'' THEN 6
                     WHEN c.[Sales Territory] = ''Southwest'' THEN 7
                     WHEN c.[Sales Territory] = ''Far West'' THEN 8
        END ;'
DECLARE @RStat NVARCHAR(4000)
SET @RStat = 'library(ggplot2)
              image_file <- tempfile()
                       jpeg(filename = image_file, width = 400, height = 400)
                       clusters <- hclust(dist(Sales[,c(1,3,5)]), method = ''average'')
                       clusterCut <- cutree(clusters, 3)
                       p <- ggplot(Sales, aes(Total, Quantity, color = SalesTerritory)) +
                            geom_point(alpha = 0.4, size = 2.5) + geom_point(col = clusterCut) +
                            scale_color_manual(values = c(''black'', ''red'', ''green'',''yellow'',''blue'',''lightblue'',''magenta'',''brown''))
                       print(p)  # a ggplot object must be printed explicitly to render to the jpeg device
                       dev.off()
                    OutputDataSet <- data.frame(data=readBin(file(image_file, "rb"), what=raw(), n=1e6))'
EXECUTE sp_execute_external_script
        @language = N'R'
       ,@script = @RStat
       ,@input_data_1 = @SQLStat
       ,@input_data_1_name = N'Sales'
WITH RESULT SETS ((plot varbinary(max)))

Note that the query returns a varbinary blob, which is in fact the following graph, easily rendered in SSRS or Power BI.
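
If you fetch that varbinary column into R instead (for example over an ODBC connection), turning it back into a viewable image is a single call; a hypothetical sketch, assuming the bytes are already in a raw vector named blob:

# blob: raw vector holding the varbinary plot column (hypothetical variable)
writeBin(blob, 'cluster_plot.jpg')  # write the bytes back out as a JPEG file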

This clearly shows how the customers in each sales territory cluster based on similarities. Again, it is a very interesting way to see the data and uncover what might otherwise stay hidden. Based on the plot, customers within each sales territory are clearly clustering by total revenue and quantity of products sold, a difference we could not see in the first part of the article.

Association Rules

Unlike cluster analysis, association rules can predict any attribute, not just the class. The rules that come out of the analysis are not intended to be used together as a set, because each rule expresses a different business/sales process captured in the dataset. Both clustering and association rules will be familiar to Microsoft SSAS users, as both algorithms have been available there since SQL Server 2005, and retail/sales-oriented data is particularly well suited to understanding purchase and sales processes with them.

Association rules, as implemented in the arules package, require the dataset to be prepared slightly differently than in the previous two examples: the data must be shaped as arrays of products ordered/sold together on an invoice. I will use the following query against the WWI database:

SELECT
        [Stock Item Key]
       ,[WWI Stock Item ID]
       ,[Stock Item]
       ,LEFT([Stock Item], 8) AS L8DESC
       ,ROW_NUMBER() OVER (PARTITION BY LEFT([Stock Item], 8) ORDER BY ([Stock Item])) AS RN_ID_PR
       ,DENSE_RANK() OVER (ORDER BY (LEFT([Stock Item], 8))) AS PRODUCT_GROUP
FROM [Dimension].[Stock Item]

The products in the Stock Item dimension table do not carry any grouping, nor any logic for finding the unique product behind the product variants (based on brand, size, color, etc.). I have used a very simple yet efficient trick: the DENSE_RANK function over a substring of the product description. It is not 100% correct, but good enough for the purposes of this demo. This query returns the product group that will be used for the association rules.

Once this is covered, the following query returns the rules.

DECLARE @TSQL AS NVARCHAR(MAX)
SET @TSQL = N'WITH PRODUCT
                        AS
                      (SELECT
                      [Stock Item Key]
                      ,[WWI Stock Item ID]
                      ,[Stock Item]
                      ,LEFT([Stock Item], 8) AS L8DESC
                      ,ROW_NUMBER() OVER (PARTITION BY LEFT([Stock Item], 8) ORDER BY ([Stock Item])) AS RN_ID_PR
                      ,DENSE_RANK() OVER (ORDER BY (LEFT([Stock Item], 8))) AS PRODUCT_GROUP
                      FROM [Dimension].[Stock Item]
                      )
                     SELECT
                      O.[WWI Order ID] AS OrderID
                      ,P.PRODUCT_GROUP AS ProductGroup
                      ,LEFT([Stock Item],8) AS ProductDescription
                      FROM [Fact].[Order] AS O
                      JOIN PRODUCT AS P
                      ON P.[Stock Item Key] = O.[Stock Item Key]
                      GROUP BY
                       O.[WWI Order ID]
                      ,P.PRODUCT_GROUP
                      ,LEFT([Stock Item],8)
                      ORDER BY
                      O.[WWI Order ID]'
DECLARE @RScript AS NVARCHAR(MAX)
SET @RScript = N'
        library(arules)
        cust.data <- InputDataSet
        cd_f <- data.frame(OrderID=as.factor(cust.data$OrderID), ProductGroup=as.factor(cust.data$ProductGroup))
        cd_f2_tran  <- as(split(cd_f[,"ProductGroup"], cd_f[,"OrderID"]), "transactions")
        rules <- apriori(cd_f2_tran, parameter=list(support=0.01, confidence=0.1))
        OutputDataSet <- data.frame(inspect(rules))'
EXEC sys.sp_execute_external_script
           @language = N'R'
          ,@script = @RScript
          ,@input_data_1 = @TSQL
WITH RESULT SETS ((
     lhs NVARCHAR(500)
    ,[Var.2] NVARCHAR(10)
    ,rhs NVARCHAR(500)
    ,support DECIMAL(18,3)
    ,confidence DECIMAL(18,3)
    ,lift DECIMAL(18,3)
                 ));

Please note that this sample uses the arules package, designed specifically for mining such rules. With this set of parameters, I get 125 rules; here are just some of them:

With these rules, we essentially see which product groups are sold together on the same invoice. Each rule is represented as:

{7}          =>           {17}

This is interpreted as: a product from product group 7 appears on the same invoice as a product from product group 17, with the given support and confidence. Such rules, like the cluster analysis, are easily understandable and can be quickly converted into business rules or insights for your business.
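
For reference, these are the standard association-rule measures: support is the share of invoices containing both product groups, confidence is the share of invoices containing the left-hand side that also contain the right-hand side, and lift is confidence divided by the support of the right-hand side, so a lift above 1 means the groups co-occur more often than chance would predict. A minimal sketch, assuming the rules object produced above, of ranking the rules by lift with arules:

library(arules)
rules_by_lift <- sort(rules, by = 'lift')  # strongest co-occurrence first
inspect(head(rules_by_lift, 5))            # print the five strongest rules
quality(head(rules_by_lift, 5))            # their support, confidence and lift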

We can plot the rules with R using the following command

library(arulesViz)  # the grouped plot method for rule sets comes from arulesViz
plot(rules, method="grouped", control=list(k=20));

and resulting in:

Conclusion

Analyzing data takes time, including numerous attempts at getting a sense of, and insight into, the data. With R integration in SQL Server, the process of analyzing OLTP or OLAP data can happen much faster, with less data transfer, better and more interesting visualizations, and a clearer distribution of tasks between data wranglers, data analysts, and data scientists (among others). In this article, we have seen that an initial data analysis which does not yield a concrete conclusion can still reveal “hidden” patterns through further analysis and playing with the data. Based on different business processes and datasets, we can apply particular approaches, algorithms, and methods to extract information from the data.

Author: Tomaz Kastrun (tomaz.kastrun@gmail.com)

Twitter: @tomaz_tsql

Blog: http://tomaztsql.wordpress.com
