A genuine use for a SQL CLR Aggregate
Posted Wednesday, December 29, 2010 9:36 AM


The images have been corrected. My apologies for the issues.






Follow me on Twitter: @way0utwest

Forum Etiquette: How to post data/code on a forum to get the best help
Post #1040413
Posted Wednesday, December 29, 2010 10:46 AM


rdw-769613 (12/29/2010)
Another good use for the CLR is GMT/local/Unix time handling. Trying this in T-SQL is nearly impossible, while in the CLR it is simple and faster.


I always did this by mimicking the ANSI/ISO Standard system tables for the WITH TIME ZONE option in the DDL; everything is then done with VIEWs. No need for the CLR.
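As an aside, here is a minimal sketch of the kind of CLR function the quote above describes, converting a Unix epoch value to a UTC DATETIME. The class and function names are illustrative, not taken from either poster's code:

using System;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;

public partial class TimeFunctions
{
    // Converts a Unix epoch timestamp (seconds since 1970-01-01 00:00:00 UTC)
    // to a UTC DATETIME. Returns NULL for NULL input, like a built-in function.
    [SqlFunction(IsDeterministic = true, IsPrecise = true)]
    public static SqlDateTime UnixToUtcDateTime(SqlInt64 unixSeconds)
    {
        if (unixSeconds.IsNull)
            return SqlDateTime.Null;

        DateTime epoch = new DateTime(1970, 1, 1, 0, 0, 0, DateTimeKind.Utc);
        return new SqlDateTime(epoch.AddSeconds(unixSeconds.Value));
    }
}

Once the assembly is deployed and the function bound (assuming the name dbo.UnixToUtcDateTime), it is called like any scalar UDF: SELECT dbo.UnixToUtcDateTime(1293642960);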


Books in Celko Series for Morgan-Kaufmann Publishing
Analytics and OLAP in SQL
Data and Databases: Concepts in Practice
Data, Measurements and Standards in SQL
SQL for Smarties
SQL Programming Style
SQL Puzzles and Answers
Thinking in Sets
Trees and Hierarchies in SQL
Post #1040454
Posted Wednesday, December 29, 2010 11:57 AM


A fantastic article. I think I would rather approach this particular problem in pure T-SQL (or bring it into something better suited to thorough statistical analysis, depending on the project at hand), but this was still a well-written intro to the topic.

---
Timothy A Wiseman
SQL Blog: http://timothyawiseman.wordpress.com/
Post #1040519
Posted Wednesday, December 29, 2010 12:03 PM


Adam Machanic (12/29/2010)
Have you compared the performance of doing a median calculation w/compression against performing the same calculation using Joe Celko's ROW_NUMBER technique?

http://sqlblog.com/blogs/adam_machanic/archive/2006/12/18/medians-row-numbers-and-performance.aspx

Compression is far from cheap, and I suspect that the ROW_NUMBER solution will scale much better. But that's just a suspicion and I don't have time to test myself at the moment.


Hi Adam. Thanks for mentioning that article; I had not seen it. I just tested this against the AdventureWorks2008 database on SQL Server 2008 SP2, comparing the Celko method, the Ben-Gan method, and my CLR UDA. The CLR method has the same number of reads as the Celko method but slightly higher CPU and elapsed times than both the Celko and Ben-Gan methods.

So yes, the pure T-SQL methods do appear to be slightly more efficient. However, for the record, I did not say that compression was cheap. In fact, I said it would be slightly less efficient but might prove a worthwhile trade-off depending on the circumstances.

My testing is as follows (the first two queries are copied directly from the blog post you linked above):

------------------------------------------------------------------------
DBCC DROPCLEANBUFFERS
DBCC FREEPROCCACHE
SET STATISTICS IO ON
SET STATISTICS TIME ON

SELECT
    CustomerId,
    AVG(TotalDue)
FROM
(
    SELECT
        CustomerId,
        TotalDue,
        ROW_NUMBER() OVER (
            PARTITION BY CustomerId
            ORDER BY TotalDue ASC, SalesOrderId ASC) AS RowAsc,
        ROW_NUMBER() OVER (
            PARTITION BY CustomerId
            ORDER BY TotalDue DESC, SalesOrderId DESC) AS RowDesc
    FROM Sales.SalesOrderHeader SOH
) x
WHERE RowAsc IN (RowDesc, RowDesc - 1, RowDesc + 1)
GROUP BY CustomerId
ORDER BY CustomerId;

SET STATISTICS IO OFF
SET STATISTICS TIME OFF

-- logical reads 686
-- CPU time = 47 ms, elapsed time = 1087 ms. (when DROPCLEANBUFFERS and FREEPROCCACHE are run)
-- CPU time = 62 ms, elapsed time = 235 ms. (when cached)
------------------------------------------------------------------------
DBCC DROPCLEANBUFFERS
DBCC FREEPROCCACHE
SET STATISTICS IO ON
SET STATISTICS TIME ON

SELECT
    CustomerId,
    AVG(TotalDue)
FROM
(
    SELECT
        CustomerId,
        TotalDue,
        ROW_NUMBER() OVER (
            PARTITION BY CustomerId
            ORDER BY TotalDue) AS RowNum,
        COUNT(*) OVER (
            PARTITION BY CustomerId) AS RowCnt
    FROM Sales.SalesOrderHeader
) x
WHERE RowNum IN ((RowCnt + 1) / 2, (RowCnt + 2) / 2)
GROUP BY CustomerId
ORDER BY CustomerId;

SET STATISTICS IO OFF
SET STATISTICS TIME OFF

-- logical reads 686 + 139407 for Worktable = 140093 total
-- CPU time = 344 ms, elapsed time = 1085 ms. (when DROPCLEANBUFFERS and FREEPROCCACHE are run)
-- CPU time = 328 ms, elapsed time = 374 ms. (when cached)
------------------------------------------------------------------------
DBCC DROPCLEANBUFFERS
DBCC FREEPROCCACHE
SET STATISTICS IO ON
SET STATISTICS TIME ON

SELECT
    CustomerId,
    SQL#.SQL#.Agg_Median(TotalDue)
FROM Sales.SalesOrderHeader SOH
GROUP BY CustomerId
ORDER BY CustomerId;

SET STATISTICS IO OFF
SET STATISTICS TIME OFF

-- logical reads 686
-- CPU time = 1812 ms, elapsed time = 2707 ms. (when DROPCLEANBUFFERS and FREEPROCCACHE are run)
-- CPU time = 1906 ms, elapsed time = 1948 ms. (when cached)
------------------------------------------------------------------------






SQL# - http://www.SQLsharp.com/
Post #1040526
Posted Wednesday, December 29, 2010 12:30 PM
I agree; I haven't been able to really leverage CLR functions successfully. That said, I have had one victory with them: creating a CLR function to extend the functionality of the REPLACE() function within SQL. We ran into a situation where we needed to find and replace strings longer than the lengths the native REPLACE() function allows. Now, with BIGREPLACE (creativity is not my strong suit), we can pass the function three NVARCHAR(MAX) arguments. I guess CLR functions also give us T-SQL fellas an excuse to poke around in some C# :p
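For the curious, a minimal sketch of what a BIGREPLACE-style function might look like. This is an illustrative reconstruction, not the poster's actual code, and note one behavioral difference: string.Replace does an ordinal (case-sensitive) match, whereas native REPLACE() follows the collation of its inputs.

using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;

public partial class StringFunctions
{
    // A REPLACE() variant that accepts NVARCHAR(MAX) for all three arguments.
    // SqlFacet(MaxSize = -1) maps each SqlString to NVARCHAR(MAX).
    [SqlFunction(IsDeterministic = true, IsPrecise = true)]
    [return: SqlFacet(MaxSize = -1)]
    public static SqlString BigReplace(
        [SqlFacet(MaxSize = -1)] SqlString input,
        [SqlFacet(MaxSize = -1)] SqlString pattern,
        [SqlFacet(MaxSize = -1)] SqlString replacement)
    {
        if (input.IsNull || pattern.IsNull || replacement.IsNull)
            return SqlString.Null;

        // Ordinal, case-sensitive match; unlike T-SQL REPLACE, which
        // honors the column collation.
        return new SqlString(input.Value.Replace(pattern.Value, replacement.Value));
    }
}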
Post #1040549
Posted Wednesday, December 29, 2010 3:22 PM
Solomon Rutzky (12/29/2010)

The only issue I have is with your statement that "median" is not a good case for a CLR aggregate as it might "require a substantial portion of the set to be held in memory". I disagree because it is possible to reduce the size of data in memory by compressing it, as I have shown in this article:

http://www.sqlservercentral.com/articles/SS2K5+-+CLR+Integration/3208/

Thanks Solomon. What I was getting at with the median is that, to find the middle value, you have to keep track of all the values. My worry (as yet unsubstantiated) is that you could run such an aggregate over a record set so large that you consume all the RAM on the system and destabilize your server.

It may be that the point where this happens is beyond the bounds of most users.

The other thing that jumps out at me is that most stats functions seem to rely on what I will call the primitive aggregates, i.e. the built-in ones. Effectively, CLR aggregates provide a shorthand for data analysts so they can focus on what the data is telling them rather than on the mechanics.

I should like to see Microsoft consider adding more built-in statistical functions, given the way the BI world is heading. I should also like to see some focus on integration with external stats packages such as SPSS and SAS.


LinkedIn Profile
Newbie on www.simple-talk.com
Post #1040614
Posted Wednesday, December 29, 2010 3:23 PM
Have you tried using SqlSingle instead of SqlDouble in the CLR?
Post #1040615
Posted Wednesday, December 29, 2010 3:38 PM


David.Poole (12/29/2010)
Thanks Solomon. What I was getting at with the median is that, to find the middle value, you have to keep track of all the values. My worry (as yet unsubstantiated) is that you could run such an aggregate over a record set so large that you consume all the RAM on the system and destabilize your server.


Hello David. I am not sure this is a real issue: in SQL Server 2005 the maximum memory a UDA's state can take up is 8 KB, and even with MaxByteSize = -1, available starting in SQL Server 2008, you only get up to 2 GB. Yes, 2 GB is a lot, but it is a constraint that is presumably less than the total memory available to SQL Server, and with the compression the memory usage will be lower still.

Of course, as Adam pointed out, it is possible to do the median in straight T-SQL. But I have also included "Geometric Average" and "Root Mean Square" aggregates in my SQL# library, and neither one needs to keep the entire set in memory, so it does not matter how large the dataset is. And to your final point, I also think Microsoft should include at least those two ("Geometric Average" and "Root Mean Square"), if not all three, as built-in aggregates.
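To illustrate why aggregates like these need no per-row storage, here is a minimal Root Mean Square UDA. It is not the SQL# implementation, just a sketch showing that the running state is two numbers no matter how many rows are aggregated:

using System;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;

// Root Mean Square as a user-defined aggregate. The only state carried
// between rows is a sum of squares and a count, so memory use is constant.
[Serializable]
[SqlUserDefinedAggregate(Format.Native, IsInvariantToNulls = true,
    IsInvariantToDuplicates = false, IsInvariantToOrder = true)]
public struct AggRootMeanSquare
{
    private double sumOfSquares;
    private long count;

    public void Init()
    {
        sumOfSquares = 0.0;
        count = 0;
    }

    public void Accumulate(SqlDouble value)
    {
        if (value.IsNull) return;
        sumOfSquares += value.Value * value.Value;
        count++;
    }

    // Called when a parallel plan combines partial aggregates.
    public void Merge(AggRootMeanSquare other)
    {
        sumOfSquares += other.sumOfSquares;
        count += other.count;
    }

    public SqlDouble Terminate()
    {
        if (count == 0) return SqlDouble.Null;
        return new SqlDouble(Math.Sqrt(sumOfSquares / count));
    }
}

Because the state is blittable (a double and a long), Format.Native handles serialization automatically, and Merge() keeps the aggregate correct under parallel plans.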





SQL# - http://www.SQLsharp.com/
Post #1040618
Posted Wednesday, December 29, 2010 3:59 PM


Bradley Deem (12/29/2010)
I think part of this limitation is due to the inability to use an ORDER BY clause with an aggregate.

Assuming I've implemented it correctly, this could produce output such as
firstRowValue,secondRowValue,thirdRowValue

or maybe parallelism kicks in (on a large set) and produces output where the order changes, i.e.
secondRowValue,firstRowValue,thirdRowValue

Now, if we had the ability to use an ORDER BY clause with the OVER clause, i.e.
SELECT dbo.List(myStringColumn, ',') OVER (ORDER BY myStringColumn)
FROM myTable

then we could guarantee the correct (expected) order of the output.


Hello Bradley. I am not sure I agree that this cannot be done in the aggregate (hence requiring the ORDER BY). Maybe I am thinking of something that is not accurate, but I believe this can be handled in the Terminate() method. The parallelism issue that you mention is handled via the Merge() method, but the output still has to go through Terminate(), so it is there that you could properly order the string. Would that work?
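A minimal sketch of that idea (illustrative only; the separator is hardcoded to a comma and ordinal string sorting is assumed): values are collected in whatever order Accumulate() and Merge() deliver them, and the ordering is imposed exactly once, in Terminate().

using System;
using System.Collections.Generic;
using System.Data.SqlTypes;
using System.IO;
using Microsoft.SqlServer.Server;

// A string-concatenation aggregate that defers ordering to Terminate(),
// which is what makes the output deterministic even under parallelism.
[Serializable]
[SqlUserDefinedAggregate(Format.UserDefined, MaxByteSize = -1,
    IsInvariantToNulls = true, IsInvariantToOrder = true)]
public struct ListSorted : IBinarySerialize
{
    private List<string> values;

    public void Init() { values = new List<string>(); }

    public void Accumulate(SqlString value)
    {
        if (!value.IsNull) values.Add(value.Value);
    }

    public void Merge(ListSorted other) { values.AddRange(other.values); }

    public SqlString Terminate()
    {
        values.Sort(StringComparer.Ordinal); // the order is imposed here
        return new SqlString(string.Join(",", values.ToArray()));
    }

    // Format.UserDefined requires manual serialization of the state.
    public void Read(BinaryReader r)
    {
        int n = r.ReadInt32();
        values = new List<string>(n);
        for (int i = 0; i < n; i++) values.Add(r.ReadString());
    }

    public void Write(BinaryWriter w)
    {
        w.Write(values.Count);
        foreach (string s in values) w.Write(s);
    }
}

Note that MaxByteSize = -1 requires SQL Server 2008 or later; on 2005 the serialized state is capped at 8000 bytes, which limits how long the concatenated list can get.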





SQL# - http://www.SQLsharp.com/
Post #1040626
Posted Wednesday, December 29, 2010 4:25 PM
Solomon Rutzky (12/29/2010)
...I believe this can be handled in the Terminate() method. The parallelism issue that you mention is handled via the Merge() method, but the output still has to go through Terminate(), so it is there that you could properly order the string. Would that work?


You are correct: you could handle this in the Terminate() method. But who wants to work out the fastest CLR method to sort strings (or some other input) when we have SQL at our fingertips?
Post #1040632