April 27, 2011 at 10:23 am
I've successfully built and run a custom CLR aggregate that returns the sum of the absolute differences between consecutive rows, i.e. for the following data:
0
1
2
the difference between rows 2 and 1 is 1, and between rows 3 and 2 is 1, so the sum is 2.
On initial testing, this seemed to return the correct result; however, on larger data sets the result varied. I believe this is due to the order in which SQL Server presents the rows to the aggregate function (irrespective of any ORDER BY clauses). So, in the above example, if the order were to change to, say:
0
2
1
The result would now be 3.
Does anyone know of a way to force/control the order of rows presented to an aggregate function?
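For context, the aggregate is called roughly like this (the aggregate, table, and column names below are placeholders, not the real ones):

-- Hypothetical call to the custom CLR aggregate; dbo.SumAbsDiff is assumed
-- to sum ABS(current value - previous value) over the rows in each group,
-- in whatever order SQL Server happens to feed them in.
SELECT grp,
       dbo.SumAbsDiff(val) AS total_movement
FROM dbo.SampleData
GROUP BY grp;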
Thanks.
April 27, 2011 at 2:29 pm
How many rows are we talking about? You could try inserting the rows into a table variable using an ORDER BY and running the aggregate against the table variable. That should run on one thread.
Another thing to try is the MAXDOP query hint.
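A rough sketch (table, column, and aggregate names are placeholders for yours; there's no hard guarantee the aggregate sees the rows in that order, but it's worth a try):

-- Load the rows into a table variable in the desired order,
-- then run the CLR aggregate against it on a single thread.
DECLARE @ordered TABLE (grp INT, seq INT, val FLOAT);

INSERT INTO @ordered (grp, seq, val)
SELECT grp, seq, val
FROM dbo.SampleData
ORDER BY grp, seq;

SELECT grp,
       dbo.SumAbsDiff(val) AS total_movement
FROM @ordered
GROUP BY grp
OPTION (MAXDOP 1); -- force a serial plan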
The probability of survival is inversely proportional to the angle of arrival.
April 27, 2011 at 2:46 pm
Why use a CLR in the first place?
Wouldn't something along these lines do the job, too?
WITH sample_data AS
(
    SELECT 0 AS n UNION ALL
    SELECT 2 UNION ALL
    SELECT 1
), cte AS
(
    -- number the rows in the desired order
    SELECT n, ROW_NUMBER() OVER (ORDER BY n) AS rn
    FROM sample_data
)
-- sum the differences between each row and its predecessor
SELECT SUM(ABS(a.n - b.n))
FROM cte a
INNER JOIN cte b ON a.rn = b.rn + 1;
April 27, 2011 at 2:57 pm
LutzM (4/27/2011)
Why use a CLR in the first place? Wouldn't something along these lines do the job, too?
Excellent point. It appears that this custom "aggregation" depends upon a certain ordering of the rows. Rather than trying to defeat parallelism to make it work, it is more logical to leverage set-based processing and an ORDER BY.
Good observation Lutz!
The probability of survival is inversely proportional to the angle of arrival.
April 28, 2011 at 1:11 am
Hi
Thanks for the replies.
The number of rows is currently around 65,000 and about to double. The GROUP BY will split that into sets of about 30 rows per group. The CLR seemed like a neat solution to avoid potentially huge joins. I'm also using a CLR aggregate to do other statistics, like Anderson-Darling (a normality check), on the same set of data, which works fine as it isn't order-dependent.
I take the point about being able to do it in plain SQL, and I'll go back to that method if I can't solve this particular problem. The current run time to produce all the stats for 65,000 rows is under 3 seconds, but if it doesn't return the correct result due to the aggregate ordering issue then it's worthless!
Thanks for your help.
April 28, 2011 at 4:41 am
Just for info:
I tried the MAXDOP hint, but this had no effect.
I don't think it's a threading issue. My understanding is that if multiple threads are used for the aggregate, then the Merge method within the CLR is called. I changed my code to flag whether Merge was used, and it never got called. I believe this issue is down to the fact that SQL Server assumes that the order of rows sent to the aggregate doesn't matter (even if IsInvariantToOrder is set to false).
Note: I just read this on a Microsoft documentation page for IsInvariantToOrder:
Reserved for future use. This property is not currently used by the query processor: order is currently not guaranteed.
Looks like I'm going to have to use standard SQL!
April 28, 2011 at 10:56 am
If your original data includes a column with a sequential number, you could avoid the sort caused by ROW_NUMBER() and just use the very last part (the self-join). This should return the result for just 65k rows in much less than three seconds.
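For example, assuming your table already has a grouping column and a sequential column (the names below are made up):

-- Self-join each row to its predecessor within the same group,
-- using the existing sequence column instead of ROW_NUMBER().
SELECT a.grp,
       SUM(ABS(a.val - b.val)) AS total_movement
FROM dbo.SampleData a
INNER JOIN dbo.SampleData b
        ON b.grp = a.grp
       AND b.seq = a.seq - 1
GROUP BY a.grp;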