A genuine use for a SQL CLR Aggregate

Question

Post reply

A genuine use for a SQL CLR Aggregate

David.Poole

SSC Guru

Points: 76419
More actions
December 28, 2010 at 9:09 pm

#95677

Comments posted to this topic are about the item A genuine use for a SQL CLR Aggregate
LinkedIn Profile

Viewing 15 posts - 1 through 15 (of 45 total)

You must be logged in to reply to this topic. Login to reply

tcm_mail Grasshopper Points: 23 More actions · Answer 1

Please, can you repair the broken links of the pictures:

http://www.sqlservercentral.com/articles/SQL+CLR/71942/m.gif

http://www.sqlservercentral.com/articles/SQL+CLR/71942/regression.gif

http://www.sqlservercentral.com/articles/SQL+CLR/71942/Regressiongraph.gif

http://www.sqlservercentral.com/articles/SQL+CLR/71942/CLRAggregateExecutionPlan.JPG

http://www.sqlservercentral.com/articles/SQL+CLR/71942/CLRAggWithGroupExecutionPlan.JPG

David.Poole SSC Guru Points: 76419 More actions · Answer 2

On the subject of the expensive sort in the execution plan I had a note back from Bob Beauchamp saying that there is an internal stream aggregator operation that causes the sort and unfortunately there is no way around it other than to put the clustered index on the column(s) used in the GROUP BY statement.:crying:

LinkedIn Profile

Robert Frasca SSCertifiable Points: 6027 More actions · Answer 3

Thanks for putting this together. It was a fascinating exercise that I really enjoyed. Like many others I have struggled to find a genuine use case for this feature. I agree that SQL Server 2008 made the feature more usable but I'm still not sure the tradeoffs make it worthwhile. Great article.

Bob

"Beliefs" get in the way of learning.

rdw-769613 Grasshopper Points: 12 More actions · Answer 4

Another good example to use CLR is GMT/Local/Unix time handling. Trying this in T-SQL is near imposible and within CLR it's simple and faster.

alchemy9 SSC Veteran Points: 203 More actions · Answer 5

see what happens when you add a materialized view... this will be much quicker..

create VIEW vx_agg WITH SCHEMABINDING

as

SELECT --( COUNT_big(*) * SUM(col1*col2) - ( SUM(col1) * SUM(col2)))/( COUNT_big(*)* SUM(col1*col1)-( SUM(col1)* SUM(col1))) AGG

COUNT_big(*) cnt

,SUM(col1*col2) col1col2

,SUM(col1) col1

,SUM(col1*col1) col1sq

,SUM(col2) col2

FROM dbo.test_table

GO

CREATE UNIQUE clustered INDEX IX_AGG ON vx_agg (cnt)

SELECT ( COUNT(*) * SUM(col1*col2) - ( SUM(col1) * SUM(col2)))/( COUNT(*)* SUM(col1*col1)-( SUM(col1)* SUM(col1)))

FROM dbo.test_table

bbalok Grasshopper Points: 15 More actions · Answer 6

We have developed a Project Planning and Management system and I have used CLR routines in two main places. The user wanted cost escalation at user defined periods of the project in the planned budget calculation. These calculated amounts had to be used by UI screens and reports and had to be consistent. Each work unit is inspected to see if it falls within one to many escalation periods that occur during the life of the project and how much of the work falls within a period. A planned budget amount is calculated and aggregated based on the cost, amount of work and escalation percent. There are also status codes that are used to filter the work units involved in the calculation, multiple tables of data and many loops through the data so it was easier for me to use a CLR and C# to gather the data I needed and loop through the work units to provide an escalated work unit planned budget.

Did I mention different contract types that use these escalated amounts in different ways to create a budget? I could also encapsulate this logic in this routine so I could return a planned budget amount for every contract type in the system. We allow budgeting at the lowest work breakdown level but have to provide aggregated budget amounts at every parent level so another loop through the data to perform this aggregation to the parent.

Since the CLR works as a Table Valued Function we can write queries against it for UI and reporting. Every UI or report will have the same planned budget value because the math and rounding and contract types are encapsulated in this routine. It seems that if I do division of two numbers in T-Sql , C#, Crystal Reports I get a different number of decimals and value of decimals. I am not sure I could have done this without the use of a CLR routine to perform the looping and provide consistent rounded amounts.

The other place I use it is to provide a cross tab type output but I will save that for another day as I have spouted enough.

Bradley Deem SSCrazy Points: 2565 More actions · Answer 7

Thanks for sharing David. I agree most CLR aggregates have not had much success beyond academics. I think part of this limitation is due to the lack of ability to perform an ORDER BY clause with an aggregate.

For example, lets say I create a CLR aggregate that concatenates strings together ie

SELECT dbo.List(myStringColumn,',') -- Where ',' means comma delimited

FROM myTable

Assuming I've implemented it correctly, this could produce output such as

firstRowValue,secondRowValue,thirdRowValue

or maybe parallelism kicks in (on a large set) and produces output where the order changes ie

secondRowValue,firstRowValue,thirdRowValue

Now, if we had we had the ability to perform the ORDER BY clause with the OVER clause ie

SELECT dbo.List(myStringColumn,',') OVER(ORDER BY myStringColumn)

FROM myTable

Then we could guarantee the correct (expected) order of output.

If you believe this would be a useful feature, and I do, please vote for it on connect.

https://connect.microsoft.com/SQLServer/feedback/details/254387/over-clause-enhancement-request-order-by-for-aggregates

In regards to the attribute IsInvariantToOrder, it's not supported, see the following link.

http://msdn.microsoft.com/en-us/library/microsoft.sqlserver.server.sqluserdefinedaggregateattribute.isinvarianttoorder.aspx

Apparently, it will be implemented in the "future" which after 5 years hasn't been far enough into the future.

Solomon Rutzky SSCoach Points: 16431 More actions · Answer 8

Hello David and thanks for a great article on yet another good use of SQL CLR.

The only issue I have is with your statement that "median" is not a good case for a CLR aggregate as it might "require a substantial portion of the set to be held in memory". I disagree because it is possible to reduce the size of data in memory by compressing it, as I have shown in this article:

http://www.sqlservercentral.com/articles/SS2K5+-+CLR+Integration/3208/

And that article was done before SQL Server 2008 added the ability to go beyond the 8k memory limit (with a MaxSize set to -1) so adding that option to the mix solves all problems (generally) outside of the possible efficiency of the CPU it takes to compress/decompress the dataset. At that point I would argue that if your situation requires a Median calculation (and it appears to for quite a few people) then a little bit of extra processing time is worth the ability to do this in the first place.

SQL# — https://SQLsharp.com/ ( SQLCLR library ofover 340 Functions and Procedures)
Sql Quantum Lift — https://SqlQuantumLift.com/ ( company )
Sql Quantum Leap — https://SqlQuantumLeap.com/ ( blog )
Info sites — Collations • Module Signing • SQLCLR

Adam Machanic SSCoach Points: 15370 More actions · Answer 9

Solomon Rutzky (12/29/2010)And that article was done before SQL Server 2008 added the ability to go beyond the 8k memory limit (with a MaxSize set to -1) so adding that option to the mix solves all problems (generally) outside of the possible efficiency of the CPU it takes to compress/decompress the dataset. At that point I would argue that if your situation requires a Median calculation (and it appears to for quite a few people) then a little bit of extra processing time is worth the ability to do this in the first place.

Have you compared the performance of doing a median calculation w/compression against performing the same calculation using Joe Celko's ROW_NUMBER technique?

http://sqlblog.com/blogs/adam_machanic/archive/2006/12/18/medians-row-numbers-and-performance.aspx

Compression is far from cheap, and I suspect that the ROW_NUMBER solution will scale much better. But that's just a suspicion and I don't have time to test myself at the moment.

--
Adam Machanic
whoisactive

Steve Jones - SSC Editor SSC Guru Points: 742706 More actions · Answer 10

The images have been corrected. My apologies for the issues.

TimothyAWiseman SSCrazy Eights Points: 8819 More actions · Answer 11

A fantastic article. I think I would rather approach this particular problem in pure T-SQL (or bringing it into something more suited for thorough statistical analysis, depending on the project at hand), but this was still a well written intro to the topic.

---
Timothy A Wiseman
SQL Blog: http://timothyawiseman.wordpress.com/

Solomon Rutzky SSCoach Points: 16431 More actions · Answer 12

Adam Machanic (12/29/2010)
Have you compared the performance of doing a median calculation w/compression against performing the same calculation using Joe Celko's ROW_NUMBER technique?
http://sqlblog.com/blogs/adam_machanic/archive/2006/12/18/medians-row-numbers-and-performance.aspx
Compression is far from cheap, and I suspect that the ROW_NUMBER solution will scale much better. But that's just a suspicion and I don't have time to test myself at the moment.

Hi Adam. Thanks for mentioning that article as I had not seen it. I did just test this against the AdventureWorks2008 DB using SQL Server 2008 SP2 using the Celko method, Ben-Gan method, and my CLR UDA. The CLR method has the same number of reads as the Celko method but has slightly higher CPU and Elapsed times as compared to both Celko and Ben-Gan methods.

So yes, the pure T-SQL methods do appear to be slightly more efficient. However, for the record I did not say that compression was cheap. In fact, I did say that it would be slightly less efficient to do that but that it might prove to be a worthwhile trade-off depending on the circumstances.

My testing is as follows (the first two queries are copied directly from your blog that you noted above):

------------------------------------------------------------------------

DBCC DROPCLEANBUFFERS

DBCC FREEPROCCACHE

SET STATISTICS IO ON

SET STATISTICS TIME ON

SELECT

CustomerId,

AVG(TotalDue)

FROM

(

SELECT

CustomerId,

TotalDue,

ROW_NUMBER() OVER (

PARTITION BY CustomerId

ORDER BY TotalDue ASC, SalesOrderId ASC) AS RowAsc,

ROW_NUMBER() OVER (

PARTITION BY CustomerId

ORDER BY TotalDue DESC, SalesOrderId DESC) AS RowDesc

FROM Sales.SalesOrderHeader SOH

) x

WHERE

RowAsc IN (RowDesc, RowDesc - 1, RowDesc + 1)

GROUP BY CustomerId

ORDER BY CustomerId;

SET STATISTICS IO OFF

SET STATISTICS TIME OFF

-- logical reads 686

-- CPU time = 47 ms, elapsed time = 1087 ms. (when DROPCLEANBUFFERS and FREEPROCCACHE are run)

-- CPU time = 62 ms, elapsed time = 235 ms. (when cached)

------------------------------------------------------------------------

DBCC DROPCLEANBUFFERS

DBCC FREEPROCCACHE

SET STATISTICS IO ON

SET STATISTICS TIME ON

SELECT

CustomerId,

AVG(TotalDue)

FROM

(

SELECT

CustomerId,

TotalDue,

ROW_NUMBER() OVER (

PARTITION BY CustomerId

ORDER BY TotalDue) AS RowNum,

COUNT(*) OVER (

PARTITION BY CustomerId) AS RowCnt

FROM Sales.SalesOrderHeader

) x

WHERE

RowNum IN ((RowCnt + 1) / 2, (RowCnt + 2) / 2)

GROUP BY CustomerId

ORDER BY CustomerId;

SET STATISTICS IO OFF

SET STATISTICS TIME OFF

-- logical reads 686 + 139407 for Worktable = 140093 total

-- CPU time = 344 ms, elapsed time = 1085 ms. (when DROPCLEANBUFFERS and FREEPROCCACHE are run)

-- CPU time = 328 ms, elapsed time = 374 ms. (when cached)

------------------------------------------------------------------------

DBCC DROPCLEANBUFFERS

DBCC FREEPROCCACHE

SET STATISTICS IO ON

SET STATISTICS TIME ON

SELECTCustomerId,

SQL#.SQL#.Agg_Median(TotalDue)

FROM Sales.SalesOrderHeader SOH

GROUP BY CustomerId

ORDER BY CustomerId;

SET STATISTICS IO OFF

SET STATISTICS TIME OFF

-- logical reads 686

-- CPU time = 1812 ms, elapsed time = 2707 ms. (when DROPCLEANBUFFERS and FREEPROCCACHE are run)

-- CPU time = 1906 ms, elapsed time = 1948 ms. (when cached)

------------------------------------------------------------------------

SQL# — https://SQLsharp.com/ ( SQLCLR library ofover 340 Functions and Procedures)
Sql Quantum Lift — https://SqlQuantumLift.com/ ( company )
Sql Quantum Leap — https://SqlQuantumLeap.com/ ( blog )
Info sites — Collations • Module Signing • SQLCLR

thirst SSC Rookie Points: 28 More actions · Answer 13

I agree, i haven't been able to really leverage CLR functions successfully. With that said, I have had one victory using CLR fuctions. That was creating a CLR function to extend the functionality of the REPLACE() function within SQL. We ran into a situation where we needed to find and replace strings greater than the lengths allowed using the native REPLACE() function. Now with BIGREPLACE (creativity is not my strongsuit) we're able to pass the function 3 nvarchar(max) arguments. I guess CLR functions also give us t-sql fellas an excuse to poke around in some c# :p

David.Poole SSC Guru Points: 76419 More actions · Answer 14

Solomon Rutzky (12/29/2010)
The only issue I have is with your statement that "median" is not a good case for a CLR aggregate as it might "require a substantial portion of the set to be held in memory". I disagree because it is possible to reduce the size of data in memory by compressing it, as I have shown in this article:
http://www.sqlservercentral.com/articles/SS2K5+-+CLR+Integration/3208/

Thanks Solomon, what I was getting at with the Median is that to find the middle value you have to keep track of all the values and my worry (as yet unsubstantiated) is that you could run an aggregate on such a large recordset that you consume all the RAM on your system and destabilize your server.

It may be that the point where this happens is beyond the bounds of most users.

The other thing that jumps out at me is that most stats functions seem to rely on what I will call the primitive aggregates; the built in aggregates. Effectively the CLR aggregates provide a short-hand for data analysts so they can focus on what the data is telling them rather than the mechanics.

I should like to see Microsoft consider adding more in-built statistical functions given the way that the BI world is heading. I should also like to see some focusing on integration with external stats packages such as SPSS and SAS.

LinkedIn Profile