A genuine use for a SQL CLR Aggregate

  • Comments posted to this topic are about the item "A genuine use for a SQL CLR Aggregate".

  • On the subject of the expensive sort in the execution plan, I had a note back from Bob Beauchemin saying that there is an internal Stream Aggregate operation that causes the sort, and unfortunately there is no way around it other than to put the clustered index on the column(s) used in the GROUP BY clause. :crying:

  • Thanks for putting this together. It was a fascinating exercise that I really enjoyed. Like many others, I have struggled to find a genuine use case for this feature. I agree that SQL Server 2008 made the feature more usable, but I'm still not sure the trade-offs make it worthwhile. Great article.

    Bob

    "Beliefs" get in the way of learning.

  • Another good use for CLR is GMT/local/Unix time handling. Doing this in T-SQL is near impossible, whereas in CLR it's simple and faster.
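
    For illustration, a minimal sketch of the kind of CLR helper being described might look like the following; the class and function names (TimeFunctions, UnixToUtc, UtcToUnix) are made up for this example rather than taken from any existing library:

    using System;
    using System.Data.SqlTypes;
    using Microsoft.SqlServer.Server;

    // Illustrative sketch only: CLR scalar functions for Unix/UTC conversion.
    public static class TimeFunctions
    {
        private static readonly DateTime UnixEpoch =
            new DateTime(1970, 1, 1, 0, 0, 0, DateTimeKind.Utc);

        // Convert a Unix timestamp (seconds since 1970-01-01 UTC) to a DATETIME.
        [SqlFunction(IsDeterministic = true, DataAccess = DataAccessKind.None)]
        public static SqlDateTime UnixToUtc(SqlInt64 unixSeconds)
        {
            if (unixSeconds.IsNull) return SqlDateTime.Null;
            return new SqlDateTime(UnixEpoch.AddSeconds(unixSeconds.Value));
        }

        // Convert a UTC DATETIME back to a Unix timestamp.
        [SqlFunction(IsDeterministic = true, DataAccess = DataAccessKind.None)]
        public static SqlInt64 UtcToUnix(SqlDateTime utcValue)
        {
            if (utcValue.IsNull) return SqlInt64.Null;
            return new SqlInt64((long)(utcValue.Value - UnixEpoch).TotalSeconds);
        }
    }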

  • See what happens when you add a materialized (indexed) view... this will be much quicker:

    CREATE VIEW vx_agg WITH SCHEMABINDING
    AS
    SELECT  -- (COUNT_BIG(*) * SUM(col1*col2) - (SUM(col1) * SUM(col2))) / (COUNT_BIG(*) * SUM(col1*col1) - (SUM(col1) * SUM(col1))) AS AGG
            COUNT_BIG(*)   AS cnt,
            SUM(col1*col2) AS col1col2,
            SUM(col1)      AS col1,
            SUM(col1*col1) AS col1sq,
            SUM(col2)      AS col2
    FROM dbo.test_table;
    GO

    CREATE UNIQUE CLUSTERED INDEX IX_AGG ON vx_agg (cnt);

    SELECT (COUNT(*) * SUM(col1*col2) - (SUM(col1) * SUM(col2))) / (COUNT(*) * SUM(col1*col1) - (SUM(col1) * SUM(col1)))
    FROM dbo.test_table;

  • We have developed a Project Planning and Management system, and I have used CLR routines in two main places. The user wanted cost escalation at user-defined periods of the project in the planned budget calculation. These calculated amounts had to be used by UI screens and reports and had to be consistent. Each work unit is inspected to see whether it falls within one or more escalation periods that occur during the life of the project, and how much of the work falls within each period. A planned budget amount is calculated and aggregated based on the cost, the amount of work and the escalation percentage. There are also status codes used to filter the work units involved in the calculation, multiple tables of data and many loops through the data, so it was easier for me to use a CLR routine in C# to gather the data I needed and loop through the work units to produce an escalated work-unit planned budget.

    Did I mention the different contract types that use these escalated amounts in different ways to create a budget? I could also encapsulate that logic in this routine so I could return a planned budget amount for every contract type in the system. We allow budgeting at the lowest work-breakdown level but have to provide aggregated budget amounts at every parent level, so there is another loop through the data to roll this aggregation up to the parents.

    Since the CLR routine works as a table-valued function, we can write queries against it for both UI and reporting. Every UI screen and report shows the same planned budget value because the math, the rounding and the contract types are encapsulated in this one routine. It seems that if I divide two numbers in T-SQL, C# and Crystal Reports I get a different number of decimal places and different decimal values. I am not sure I could have done this without a CLR routine to perform the looping and provide consistently rounded amounts. (A bare-bones sketch of the table-valued-function pattern appears at the end of this post.)

    The other place I use it is to provide a cross tab type output but I will save that for another day as I have spouted enough.
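
    A bare-bones sketch of that CLR table-valued-function pattern is below; the names (BudgetFunctions, GetEscalatedBudgets) and the table definition are placeholders, and the dummy row stands in for the real escalation calculation described above:

    using System.Collections;
    using System.Collections.Generic;
    using System.Data.SqlTypes;
    using Microsoft.SqlServer.Server;

    public static class BudgetFunctions
    {
        // Simple carrier for one output row.
        private class WorkUnitBudget
        {
            public int WorkUnitId;
            public decimal PlannedBudget;
        }

        [SqlFunction(
            DataAccess = DataAccessKind.Read,
            FillRowMethodName = "FillRow",
            TableDefinition = "WorkUnitId INT, PlannedBudget DECIMAL(19, 4)")]
        public static IEnumerable GetEscalatedBudgets(SqlInt32 projectId)
        {
            var results = new List<WorkUnitBudget>();

            // The real routine would read the work units, escalation periods,
            // status codes and contract types here (via the context connection),
            // loop through them, and apply the escalation and rounding rules.
            results.Add(new WorkUnitBudget { WorkUnitId = 1, PlannedBudget = 1000.00m });

            return results;
        }

        // Called once per element of the returned IEnumerable to fill an output row.
        public static void FillRow(object row, out SqlInt32 workUnitId, out SqlDecimal plannedBudget)
        {
            var item = (WorkUnitBudget)row;
            workUnitId = new SqlInt32(item.WorkUnitId);
            plannedBudget = new SqlDecimal(item.PlannedBudget);
        }
    }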

  • Thanks for sharing, David. I agree most CLR aggregates have not had much success beyond academic exercises. I think part of this limitation is due to the inability to apply an ORDER BY clause to an aggregate.

    For example, let's say I create a CLR aggregate that concatenates strings together (a rough C# sketch of such an aggregate appears at the end of this post), i.e.

    SELECT dbo.List(myStringColumn,',') -- Where ',' means comma delimited

    FROM myTable

    Assuming I've implemented it correctly, this could produce output such as

    firstRowValue,secondRowValue,thirdRowValue

    or maybe parallelism kicks in (on a large set) and produces output where the order changes, i.e.

    secondRowValue,firstRowValue,thirdRowValue

    Now, if we had the ability to use an ORDER BY with the OVER clause, i.e.

    SELECT dbo.List(myStringColumn,',') OVER(ORDER BY myStringColumn)

    FROM myTable

    Then we could guarantee the correct (expected) order of output.

    If you believe this would be a useful feature, and I do, please vote for it on connect.

    https://connect.microsoft.com/SQLServer/feedback/details/254387/over-clause-enhancement-request-order-by-for-aggregates

    Regarding the IsInvariantToOrder attribute property: it is not supported; see the following link.

    http://msdn.microsoft.com/en-us/library/microsoft.sqlserver.server.sqluserdefinedaggregateattribute.isinvarianttoorder.aspx

    Apparently it will be implemented in the "future", which after 5 years evidently hasn't arrived yet.
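
    For what it's worth, a rough sketch of such a concatenating aggregate is below (assuming SQL Server 2008 or later for the two-parameter Accumulate and MaxByteSize = -1); the struct name List mirrors the hypothetical dbo.List above, and, as discussed, nothing in it can guarantee the order of the input:

    using System;
    using System.Data.SqlTypes;
    using System.IO;
    using System.Text;
    using Microsoft.SqlServer.Server;

    [Serializable]
    [SqlUserDefinedAggregate(
        Format.UserDefined,
        MaxByteSize = -1,
        IsInvariantToNulls = true,
        IsInvariantToDuplicates = false)]
    public struct List : IBinarySerialize
    {
        private StringBuilder _accumulator;
        private string _delimiter;

        public void Init()
        {
            _accumulator = new StringBuilder();
            _delimiter = ",";
        }

        public void Accumulate(SqlString value, SqlString delimiter)
        {
            if (!delimiter.IsNull) _delimiter = delimiter.Value;
            if (value.IsNull) return;
            if (_accumulator.Length > 0) _accumulator.Append(_delimiter);
            _accumulator.Append(value.Value);
        }

        // Called when partial aggregates are combined (e.g. under parallelism);
        // this is exactly where the ordering problem described above comes from.
        public void Merge(List other)
        {
            if (other._accumulator.Length == 0) return;
            if (_accumulator.Length > 0) _accumulator.Append(_delimiter);
            _accumulator.Append(other._accumulator.ToString());
        }

        [return: SqlFacet(MaxSize = -1)]
        public SqlString Terminate()
        {
            return new SqlString(_accumulator.ToString());
        }

        public void Read(BinaryReader r)
        {
            _delimiter = r.ReadString();
            _accumulator = new StringBuilder(r.ReadString());
        }

        public void Write(BinaryWriter w)
        {
            w.Write(_delimiter);
            w.Write(_accumulator.ToString());
        }
    }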

  • Hello David and thanks for a great article on yet another good use of SQL CLR.

    The only issue I have is with your statement that "median" is not a good case for a CLR aggregate as it might "require a substantial portion of the set to be held in memory". I disagree because it is possible to reduce the size of data in memory by compressing it, as I have shown in this article:

    http://www.sqlservercentral.com/articles/SS2K5+-+CLR+Integration/3208/

    And that article was written before SQL Server 2008 added the ability to go beyond the 8k serialization limit (with MaxByteSize set to -1), so adding that option to the mix solves the problem in most cases, apart from the CPU cost of compressing/decompressing the dataset. At that point I would argue that if your situation requires a median calculation (and it appears to for quite a few people), then a little bit of extra processing time is worth the ability to do it at all. (For reference, a bare-bones, uncompressed version of such an aggregate is sketched after this post.)

    SQL#: https://SQLsharp.com/ (SQLCLR library of over 340 functions and procedures)
    Sql Quantum Lift: https://SqlQuantumLift.com/ (company)
    Sql Quantum Leap: https://SqlQuantumLeap.com/ (blog)
    Info sites: Collations • Module Signing • SQLCLR
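
    A bare-bones, uncompressed median aggregate along the lines discussed above might look like this sketch; unlike the compressed approach in the linked article it simply keeps every value in memory, relying on MaxByteSize = -1 (SQL Server 2008 and later) to get past the old 8k serialization limit:

    using System;
    using System.Collections.Generic;
    using System.Data.SqlTypes;
    using System.IO;
    using Microsoft.SqlServer.Server;

    [Serializable]
    [SqlUserDefinedAggregate(Format.UserDefined, MaxByteSize = -1)]
    public struct Median : IBinarySerialize
    {
        private List<double> _values;

        public void Init()
        {
            _values = new List<double>();
        }

        public void Accumulate(SqlDouble value)
        {
            if (!value.IsNull) _values.Add(value.Value);
        }

        public void Merge(Median other)
        {
            _values.AddRange(other._values);
        }

        public SqlDouble Terminate()
        {
            if (_values.Count == 0) return SqlDouble.Null;
            _values.Sort();
            int mid = _values.Count / 2;
            return (_values.Count % 2 == 1)
                ? new SqlDouble(_values[mid])
                : new SqlDouble((_values[mid - 1] + _values[mid]) / 2.0);
        }

        public void Read(BinaryReader r)
        {
            int count = r.ReadInt32();
            _values = new List<double>(count);
            for (int i = 0; i < count; i++) _values.Add(r.ReadDouble());
        }

        public void Write(BinaryWriter w)
        {
            w.Write(_values.Count);
            foreach (double v in _values) w.Write(v);
        }
    }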

  • Solomon Rutzky (12/29/2010): And that article was written before SQL Server 2008 added the ability to go beyond the 8k serialization limit (with MaxByteSize set to -1), so adding that option to the mix solves the problem in most cases, apart from the CPU cost of compressing/decompressing the dataset. At that point I would argue that if your situation requires a median calculation (and it appears to for quite a few people), then a little bit of extra processing time is worth the ability to do it at all.

    Have you compared the performance of doing a median calculation w/compression against performing the same calculation using Joe Celko's ROW_NUMBER technique?

    http://sqlblog.com/blogs/adam_machanic/archive/2006/12/18/medians-row-numbers-and-performance.aspx

    Compression is far from cheap, and I suspect that the ROW_NUMBER solution will scale much better. But that's just a suspicion and I don't have time to test myself at the moment.

    --
    Adam Machanic
    whoisactive

  • The images have been corrected. My apologies for the issues.

  • A fantastic article. I think I would rather approach this particular problem in pure T-SQL (or bring it into something better suited to thorough statistical analysis, depending on the project at hand), but this was still a well-written intro to the topic.

    ---
    Timothy A Wiseman
    SQL Blog: http://timothyawiseman.wordpress.com/

  • Adam Machanic (12/29/2010):


    Have you compared the performance of doing a median calculation w/compression against performing the same calculation using Joe Celko's ROW_NUMBER technique?

    http://sqlblog.com/blogs/adam_machanic/archive/2006/12/18/medians-row-numbers-and-performance.aspx

    Compression is far from cheap, and I suspect that the ROW_NUMBER solution will scale much better. But that's just a suspicion and I don't have time to test myself at the moment.

    Hi Adam. Thanks for mentioning that article, as I had not seen it. I just tested this against the AdventureWorks2008 DB on SQL Server 2008 SP2, comparing the Celko method, the Ben-Gan method, and my CLR UDA. The CLR method has the same number of reads as the Celko method but slightly higher CPU and elapsed times compared to both the Celko and Ben-Gan methods.

    So yes, the pure T-SQL methods do appear to be slightly more efficient. However, for the record, I did not say that compression was cheap. In fact, I said that it would be slightly less efficient but that it might prove to be a worthwhile trade-off depending on the circumstances.

    My testing is as follows (the first two queries are copied directly from your blog that you noted above):

    ------------------------------------------------------------------------
    DBCC DROPCLEANBUFFERS
    DBCC FREEPROCCACHE

    SET STATISTICS IO ON
    SET STATISTICS TIME ON

    SELECT CustomerId,
           AVG(TotalDue)
    FROM (
          SELECT CustomerId,
                 TotalDue,
                 ROW_NUMBER() OVER (PARTITION BY CustomerId
                                    ORDER BY TotalDue ASC, SalesOrderId ASC) AS RowAsc,
                 ROW_NUMBER() OVER (PARTITION BY CustomerId
                                    ORDER BY TotalDue DESC, SalesOrderId DESC) AS RowDesc
          FROM Sales.SalesOrderHeader SOH
         ) x
    WHERE RowAsc IN (RowDesc, RowDesc - 1, RowDesc + 1)
    GROUP BY CustomerId
    ORDER BY CustomerId;

    SET STATISTICS IO OFF
    SET STATISTICS TIME OFF

    -- logical reads 686
    -- CPU time = 47 ms, elapsed time = 1087 ms. (when DROPCLEANBUFFERS and FREEPROCCACHE are run)
    -- CPU time = 62 ms, elapsed time = 235 ms. (when cached)

    ------------------------------------------------------------------------
    DBCC DROPCLEANBUFFERS
    DBCC FREEPROCCACHE

    SET STATISTICS IO ON
    SET STATISTICS TIME ON

    SELECT CustomerId,
           AVG(TotalDue)
    FROM (
          SELECT CustomerId,
                 TotalDue,
                 ROW_NUMBER() OVER (PARTITION BY CustomerId
                                    ORDER BY TotalDue) AS RowNum,
                 COUNT(*) OVER (PARTITION BY CustomerId) AS RowCnt
          FROM Sales.SalesOrderHeader
         ) x
    WHERE RowNum IN ((RowCnt + 1) / 2, (RowCnt + 2) / 2)
    GROUP BY CustomerId
    ORDER BY CustomerId;

    SET STATISTICS IO OFF
    SET STATISTICS TIME OFF

    -- logical reads 686 + 139407 for Worktable = 140093 total
    -- CPU time = 344 ms, elapsed time = 1085 ms. (when DROPCLEANBUFFERS and FREEPROCCACHE are run)
    -- CPU time = 328 ms, elapsed time = 374 ms. (when cached)

    ------------------------------------------------------------------------
    DBCC DROPCLEANBUFFERS
    DBCC FREEPROCCACHE

    SET STATISTICS IO ON
    SET STATISTICS TIME ON

    SELECT CustomerId,
           SQL#.SQL#.Agg_Median(TotalDue)
    FROM Sales.SalesOrderHeader SOH
    GROUP BY CustomerId
    ORDER BY CustomerId;

    SET STATISTICS IO OFF
    SET STATISTICS TIME OFF

    -- logical reads 686
    -- CPU time = 1812 ms, elapsed time = 2707 ms. (when DROPCLEANBUFFERS and FREEPROCCACHE are run)
    -- CPU time = 1906 ms, elapsed time = 1948 ms. (when cached)

    ------------------------------------------------------------------------

    SQL#: https://SQLsharp.com/ (SQLCLR library of over 340 functions and procedures)
    Sql Quantum Lift: https://SqlQuantumLift.com/ (company)
    Sql Quantum Leap: https://SqlQuantumLeap.com/ (blog)
    Info sites: Collations • Module Signing • SQLCLR

  • I agree, I haven't been able to really leverage CLR functions successfully. With that said, I have had one victory using CLR functions: creating a CLR function to extend the functionality of the REPLACE() function within SQL. We ran into a situation where we needed to find and replace strings longer than the lengths the native REPLACE() function would handle for us. Now, with BIGREPLACE (creativity is not my strong suit), we're able to pass the function three nvarchar(max) arguments. I guess CLR functions also give us T-SQL fellas an excuse to poke around in some C# :p
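
    A hypothetical sketch of that kind of function is below (the real BIGREPLACE implementation may well differ); the SqlFacet attributes are what map the parameters and the return value to nvarchar(max):

    using System.Data.SqlTypes;
    using Microsoft.SqlServer.Server;

    public static class StringFunctions
    {
        [SqlFunction(IsDeterministic = true, DataAccess = DataAccessKind.None)]
        [return: SqlFacet(MaxSize = -1)]
        public static SqlString BigReplace(
            [SqlFacet(MaxSize = -1)] SqlString input,
            [SqlFacet(MaxSize = -1)] SqlString pattern,
            [SqlFacet(MaxSize = -1)] SqlString replacement)
        {
            if (input.IsNull || pattern.IsNull || replacement.IsNull)
                return SqlString.Null;

            // string.Replace handles arbitrarily long nvarchar(max) inputs,
            // sidestepping the length issues described with native REPLACE().
            return new SqlString(input.Value.Replace(pattern.Value, replacement.Value));
        }
    }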

  • Solomon Rutzky (12/29/2010):


    The only issue I have is with your statement that "median" is not a good case for a CLR aggregate as it might "require a substantial portion of the set to be held in memory". I disagree because it is possible to reduce the size of data in memory by compressing it, as I have shown in this article:

    http://www.sqlservercentral.com/articles/SS2K5+-+CLR+Integration/3208/

    Thanks Solomon. What I was getting at with the median is that to find the middle value you have to keep track of all the values, and my worry (as yet unsubstantiated) is that you could run the aggregate over such a large record set that you consume all the RAM on your system and destabilize your server.

    It may be that the point at which this happens is beyond the bounds of what most users will ever hit.

    The other thing that jumps out at me is that most stats functions seem to rely on what I will call the primitive aggregates, i.e. the built-in aggregates. Effectively, CLR aggregates provide a shorthand for data analysts so they can focus on what the data is telling them rather than on the mechanics.

    I should like to see Microsoft consider adding more built-in statistical functions, given the way the BI world is heading. I should also like to see some focus on integration with external stats packages such as SPSS and SAS.

