SQLServerCentral Article

A genuine use for a SQL CLR Aggregate


When SQL CLR functions were added to SQL Server 2005 my interest was piqued by the possibility of introducing new aggregate functions. I quickly realised that most of the aggregate work I wanted to do was either better handled by T-SQL or downright dangerous inside a CLR function. For example, you could write a CLR aggregate to work out the median (middle value) of a set, but because it would have to hold the whole set in memory it would risk consuming a huge amount of memory on a large data set.

Judging by the lack of examples on the Internet, it seems that most people reached the same conclusion. In fact, most of the examples that do exist are entirely academic exercises or face the same problem as my median example. The CLR aggregate examples I have seen are:

  • Concatenate strings (lots of plagiarism for this one)
  • A baker's dozen example (when would I ever use it?)
  • A weighted average calculation (interesting, but of limited use)

Characteristics of a good candidate for a CLR Aggregate

An aggregate operates over a set of data, which could be anything from zero rows to whatever the limit is for rows in a SQL Server table. Potentially the aggregate could be performing a huge volume of work, therefore poor candidates for a CLR aggregate have the following characteristics:

  • Require a substantial portion of the set to be held in memory (median and mode calculations)
  • Require string manipulation
  • Require the instantiation of many objects
  • Produce output that is of little use in the real world

A good candidate would be one that does something simple like COUNT, SUM, AVG, STDEV, STDEVP, VAR, VARP, RANK, NTILE etc, but as these functions already exist in T-SQL I couldn't think of a genuinely useful, safe-to-run aggregate, so I put CLR aggregates to the back of my mind: academically interesting but not useful in the real world.

A game changer for aggregates, SQL2008

All this changed when SQL2008 introduced the ability for an aggregate to take more than one argument. In the world of statistics there are a number of functions that take a set of pairs of data and calculate a figure to say how closely related they are.

For example, if you look at trend lines in Microsoft Excel you can display both the formula of the trend line and the R² value revealing how good a fit that trend line is. If we look at the formulae used for linear regression we can see that it meets all the criteria for being a good candidate for a CLR aggregate:

  • Only requires a few primitive variables to be maintained
  • Does not instantiate many objects
  • Does not involve string manipulation
  • Produces genuinely useful output
  • Offers a "better" solution than T-SQL, for any of these reasons:

    • A T-SQL implementation may not be possible
    • A T-SQL implementation may be more complex
    • A T-SQL implementation may be less efficient.

Linear Regression

The formula for a straight line is y = mx+c

  • m = the slope of the graph
  • c = a constant

R² is a figure that indicates how closely the calculated trend line fits the real data.

Linear regression allows us to calculate the slope and goodness of fit of a trend line for a set of data.

The formulae below show the mathematical representation and the T-SQL equivalent.

Formula to determine the slope of a trend line in linear regression
Formula to determine the goodness of fit or R squared value
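
Written out in conventional notation (these match the T-SQL and the CLR Terminate methods shown later in the article), for n pairs of values (x, y) the slope m and the R² value are:

$$m = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2}$$

$$R^2 = \left(\frac{n\sum xy - \sum x \sum y}{\sqrt{\left(n\sum x^2 - \left(\sum x\right)^2\right)\left(n\sum y^2 - \left(\sum y\right)^2\right)}}\right)^2$$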

First steps

My first step was to create a very simple test table in the Adventureworks database and populate it with some dummy values.

CREATE TABLE dbo.test_table(
    col1 FLOAT NOT NULL ,
    col2 FLOAT NOT NULL 
)
GO
INSERT INTO dbo.test_table
SELECT TOP 30000 ROW_NUMBER() OVER(ORDER BY s1.id DESC),ROW_NUMBER() OVER(ORDER BY s1.id ASC)
FROM sysobjects,sysobjects s1

I deliberately limited the test values to 30,000 because I wanted to test my CLR aggregate and T-SQL equivalents against the Microsoft Excel equivalent trend lines.

The Excel chart appears as shown below.

Excel graph to illustrate the trend line formula and R squared value

The T-SQL solution

The next stage was to write a traditional query which, if correct, should give me exactly the same values as the Excel equivalent.

DECLARE
    @x FLOAT,
    @y FLOAT,
    @xy FLOAT,
    @x2 FLOAT,
    @n FLOAT
SELECT 
    @n =  COUNT(*),
    @x =  SUM(col1),
    @y =  SUM(col2),
    @xy =  SUM(col1 * col2),
    @x2 =  SUM(col1*col1)
FROM dbo.test_table
SELECT ((@n * @xy) - (@x*@y)) / (@n * @x2 - (@x*@x))

This produced exactly the same value as the Excel equivalent but I was curious to see if amalgamating the two selects into a single formula would make any difference to the execution plans.

SELECT (COUNT(*) * SUM(col1*col2) - (SUM(col1) * SUM(col2))) / (COUNT(*) * SUM(col1*col1) - (SUM(col1) * SUM(col1)))
FROM dbo.test_table

I also tested both queries with SET STATISTICS TIME ON

These both produced identical execution plans. However over several runs the CPU time for the former query was marginally lower than the latter. Realistically the difference ranged from 0 to 16ms.
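
For completeness, an equivalent single-statement calculation of the R² value can be written in the same way; the following is a sketch based on the R² formula used later in the RSquared aggregate, not one of the timed queries:

SELECT SQUARE(
           (COUNT(*) * SUM(col1*col2) - SUM(col1) * SUM(col2))
           / SQRT( (COUNT(*) * SUM(col1*col1) - SQUARE(SUM(col1)))
                 * (COUNT(*) * SUM(col2*col2) - SQUARE(SUM(col2))) )
       )
FROM dbo.test_table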

Beginning my first CLR Aggregate

My first port of call was to do some basic background research, and the principal articles I found were as follows:

http://www.mssqltips.com/tip.asp?tip=2022

http://blog.effiproz.com/2010/06/multi-parameter-clr-user-defined.html

A basic CLR aggregate has four methods:

  • Init: called when the aggregate is first fired up. It is a bit like a constructor.
  • Accumulate: as the name suggests, this is where the data for the aggregate accumulates. In my case I count the valid entries and sum the arguments, their squares and their product.
  • Merge: if SQL Server decides to parallelise the query using the aggregate, this method brings the different streams back together again.
  • Terminate: this is where the aggregate actually returns its result and ends.

The basic aggregate also has four attributes, which are described in Andy Novak's article:

  • Format: a compulsory enum with the options Native, Unknown and UserDefined. In my case I used Native.
  • IsInvariantToNulls: setting this to true tells SQL Server that the aggregate will produce the same result whether or not any of the values are NULL. An example of an aggregate where this is false is COUNT(field), which only counts instances that are NOT NULL.
  • IsInvariantToDuplicates: setting this to true tells SQL Server that the aggregate will produce the same result even if there are duplicate entries. MAX() is a good example of this, as no matter how many duplicates there are the MAX value will still be the MAX value.
  • IsInvariantToOrder: setting this to true affects how SQL Server calls the Merge method of the aggregate. If you had a string concatenation aggregate that was parallelised then it would be vitally important that the streams merge back together in the correct order.

As well as affecting the way that the aggregate behaves, these attribute values will also affect its performance.

So with a bit of basic knowledge I fired up Visual C# 2008, created a database project called Regression and began.

The aggregate header

When you first add an aggregate to a Visual Studio solution you get the C# equivalent of include files, followed by the attribute declarations and the struct that will hold the aggregate's state:

        using System;
        using System.Data;
        using System.Data.SqlClient;
        using System.Data.SqlTypes;
        using Microsoft.SqlServer.Server;

        [Serializable]
        [SqlUserDefinedAggregate(
            Format.Native,                   // use native serialization for the intermediate result
            IsInvariantToNulls = false,      // optimizer property
            IsInvariantToDuplicates = false, // optimizer property
            IsInvariantToOrder = true        // optimizer property
        )]
        public struct Slope
        {
            private long count;      // number of non-NULL (x, y) pairs
            private SqlDouble SumY;  // sum of y
            private SqlDouble SumXY; // sum of x * y
            private SqlDouble SumX2; // sum of x squared
            private SqlDouble SumY2; // sum of y squared
            private SqlDouble SumX;  // sum of x

As you can see I have used a struct and my aggregate is called "Slope", but it doesn't really matter whether you use a struct or a class.

The private variables are simply those that I am going to use within my aggregate. Notice that I am explicitly using the CLR Sql types.

The Init method

The Init method is the simplest part of the aggregate. All I do is set my private member variables to zero.

        public void Init()
        {
            count = 0;
            SumX = SumY = SumXY = SumX2 = SumY2 = 0;
        }

The Accumulate method

This is the engine room for my aggregate. All I am doing here is counting the iterations and summing x, y, their squares and their product.

        public void Accumulate(SqlDouble x, SqlDouble y )
        {
            if (!x.IsNull && !y.IsNull){
                count++;
                SumX += (SqlDouble)x;
                SumY += (SqlDouble)y;
                SumXY += (SqlDouble)x * (SqlDouble)y;
                SumX2 += (SqlDouble)x * (SqlDouble)x;
                SumY2 += (SqlDouble)y * (SqlDouble)y;
            }
        }

The Merge method

        public void Merge(Slope other)
        {
            count += other.count;
            SumY += other.SumY;
            SumXY += other.SumXY;
            SumX2 += other.SumX2;
            SumY2 += other.SumY2;
            SumX += other.SumX;
        }

As stated earlier, the Merge method simply brings the different streams back together when SQL Server decides to parallelise the query. If you get this method wrong and the query parallelises, you will get strange results.

If SQL Server does parallelise a query with a CLR aggregate it is worth checking that the query produces the same results when run with MAXDOP 1. If it doesn't, then you have probably got something wrong.
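
For example, assuming the Slope aggregate and test table used elsewhere in this article, a quick way to force a serial plan for comparison is the MAXDOP query hint:

-- Run the aggregate with parallelism disabled and compare the result with the parallel run
SELECT dbo.Slope(col1, col2)
FROM dbo.test_table
OPTION (MAXDOP 1)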

The Terminate Method

Finally, the Terminate method takes your accumulated data and returns a value back to SQL Server.

        public SqlDouble Terminate()
        {
            if (count > 0)
            {
                SqlDouble value = (count * SumXY - (SumX * SumY)) / ((count * SumX2) - (SumX * SumX));
                return value;
            }
            else
            {
                return 0;
            }
        }

At this point I should like to offer a big thank you to Bob Beauchemin, the MVP who helped me to get the aggregate up and running.

I also set up a second struct called RSquared that was identical apart from the Terminate method, which was as follows:

        public SqlDouble Terminate()
        {
            if (count > 0)
            {
                SqlDouble value = (count * SumXY - (SumX * SumY)) / Math.Sqrt((double)((count * SumX2) - (SumX * SumX)) * (double)((count * SumY2) - (SumY * SumY)));
                return value * value;
            }
            else
            {
                return 0;
            }
        }

Deploying the Aggregate

Visual Studio 2008 will happily compile a multi-parameter CLR aggregate but it won't deploy it. If you try it then you will get a message warning that a user defined aggregate should have only one parameter.

To deploy the aggregate you need to use SQL Server Management Studio. My deployment script is as shown below.

USE AdventureWorks2008
GO
IF EXISTS(SELECT * FROM sys.assemblies WHERE name = 'LinearRegression')
    BEGIN
        DROP ASSEMBLY LINEARREGRESSION
    END
GO
CREATE ASSEMBLY LinearRegression FROM 'C:\Documents and Settings\David\My Documents\Visual Studio 2008\Projects\Regression\Regression\bin\Debug\Regression.dll'
GO
CREATE AGGREGATE Slope(@x FLOAT, @y FLOAT)
RETURNS FLOAT
EXTERNAL NAME LinearRegression.Slope;
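
The RSquared aggregate needs a matching CREATE AGGREGATE against the same assembly, presumably along these lines:

CREATE AGGREGATE RSquared(@x FLOAT, @y FLOAT)
RETURNS FLOAT
EXTERNAL NAME LinearRegression.RSquared;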

Testing the CLR Aggregate

To run my aggregate I simply executed the query below

SELECT dbo.Slope(col1,col2)
FROM dbo.test_table
GO

This produced exactly the same result as my T-SQL equivalent.

My next step was to compare the execution plans.

Execution plans for T-SQL vs CLR equivalent

As you can see, the CLR aggregate has a marginally better execution plan. This difference was maintained when the number of test records was increased to 500,000.

Query type   30,000 records   500,000 records
CLR          0.122431         2.10203
T-SQL        0.125431         2.15459

However, looking at the CPU time, the results are considerably less encouraging. I found that the CLR aggregate consistently required more CPU time, and this got worse as the sample size got larger.

Query type   30,000 records   500,000 records
CLR          62 ms            1,400 ms
T-SQL        47 ms            767 ms

My next test was to add a field to my table by which I could do some grouping. I simply called my field GroupingColumn and populated it with CAST(col1 AS DECIMAL(18,5)) % 10, along the lines of the sketch below.
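
A sketch of that step (the FLOAT data type for the new column is an assumption):

ALTER TABLE dbo.test_table ADD GroupingColumn FLOAT
GO
UPDATE dbo.test_table
SET GroupingColumn = CAST(col1 AS DECIMAL(18,5)) % 10
GO
-- Grouped version of the CLR aggregate query
SELECT GroupingColumn, dbo.Slope(col1, col2)
FROM dbo.test_table
GROUP BY GroupingColumn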

The results from this provided a shock.

CLR execution plan with GROUP BY

For whatever reason my CLR aggregate has decided to throw a sort into the mix, which accounts for 94% of the query cost. Without that sort the CLR function would have been substantially cheaper than the T-SQL equivalent. Even if an ORDER BY is added to the T-SQL query there is only a slight change to its execution plan, and the ordering is done after the aggregation, whereas in the CLR plan the sort takes place before the aggregation!

At first I thought it might be something to do with the IsInvariantToOrder setting however rebuilding and re-creating the aggregate didn't make any difference.
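
For reference, a grouped T-SQL equivalent (a reconstruction following the same pattern as the earlier query, not necessarily the exact query behind the comparison) looks like this:

SELECT GroupingColumn,
       (COUNT(*) * SUM(col1*col2) - SUM(col1) * SUM(col2))
       / (COUNT(*) * SUM(col1*col1) - SQUARE(SUM(col1))) AS Slope
FROM dbo.test_table
GROUP BY GroupingColumn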

Playing with indexes

The first thing I tried after seeing the sort was to create an index on my grouping column.

IF EXISTS(SELECT * FROM sys.indexes WHERE object_id = OBJECT_ID(N'dbo.test_table') AND name = N'idx_test_table_groupingcolumn') 
    DROP INDEX idx_test_table_groupingcolumn ON dbo.test_table
GO
CREATE CLUSTERED INDEX idx_test_table_groupingcolumn ON dbo.test_table (Groupingcolumn) 
GO 

This succeeded in reverting the execution plan back to the original 49:51 CLR:T-SQL proportions. However, I may want to correlate figures grouped by a number of different columns, either singly or in combination, so a single clustered index per table is not going to get me very far.

Conclusions

There is a distinct absence of good CLR Aggregate examples on the web and not a massive amount of reference material to help a beginner. It is quite possible that I have missed something fundamental that would eliminate the CLR sort from the aggregate.

At the heart of it, my linear regression example can be written to execute efficiently using T-SQL. The T-SQL may appear convoluted, but it is really doing something fairly simple.

There is clearly a significant overhead in using CLR aggregates but in the real world linear regression functions fall more into the data warehousing space than OLTP. In my particular case the penalty is (just) bearable because the aggregate is not something that will be run many times a second. I would be nervous about running the CLR aggregate against a multi-million row dataset.

To run the aggregate safely you would need a sizable TEMPDB on fast disks; however, I would regard this as a prerequisite for a DB server supporting analytics functions.

What the CLR aggregate does offer is simplicity for the data analyst: someone who is at heart a statistician, perfectly capable of writing complex T-SQL but frankly more interested in what the data tells them than in the mechanics of retrieving it.

Such a person is more likely to use a statistics tool to retrieve huge volumes of data and then grind it all up in their favourite stats package. This being the case, providing them with some simple CLR aggregates to carry out core statistical functions may still have less impact on the server than letting them retrieve huge volumes of data.
