Adam Machanic (2/5/2010)
This is certainly an interesting and promising idea and I'll have to play with it a bit. I'm not sure, however, whether it helps to solve windowing problems, which is what I had in mind for ordered aggregates. If you need a 1:1 "aggregation," like a running sum, you have to group by something distinct like the row number. But if you do that the QP will spin up a new instance of the aggregate per row, which means you won't have access to the previous row's value. And if you group by anything else you'll have fewer rows in the output than you will in the input. Have you gotten around that somehow? Or do you have some other class of problem where you're able to leverage this?
This is what my code did (in very broad outline only, for space reasons):
I had a UDA called dbo.Pack which took three parameters: the primary key of the source record, the sequence number generated by ROW_NUMBER, and the data value to be running-summed.
This UDA was called once per group, so N transaction records went into producing one aggregated output value.
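To make the shape concrete, here is a heavily simplified sketch of the aggregate. The names match my description, but the parameter types and the naive fixed-width encoding are illustrative only, not the real code:

using System;
using System.Data.SqlTypes;
using System.IO;
using Microsoft.SqlServer.Server;

// Heavily simplified sketch of dbo.Pack - illustrative only.
[Serializable]
[SqlUserDefinedAggregate(Format.UserDefined, MaxByteSize = -1)]
public struct Pack : IBinarySerialize
{
    private MemoryStream buffer;
    private BinaryWriter writer;
    private long expectedSeq;   // checked, never stored

    public void Init()
    {
        buffer = new MemoryStream();
        writer = new BinaryWriter(buffer);
        expectedSeq = 1;
    }

    public void Accumulate(SqlInt32 pk, SqlInt64 seq, SqlInt32 value)
    {
        // The sequence number only verifies that rows arrive in order;
        // the QP makes no guarantee about the order Accumulate sees rows.
        if (seq.Value != expectedSeq++)
            throw new InvalidOperationException("Rows arrived out of order.");

        // Naive fixed-width encoding for clarity; the real format is far
        // more compact (see below).
        writer.Write(pk.Value);
        writer.Write(value.Value);
    }

    public void Merge(Pack other)
    {
        // Merging partial aggregates would break the ordering guarantee,
        // so parallel plans must be avoided for this aggregate.
        throw new InvalidOperationException("Merge is not supported.");
    }

    public SqlBytes Terminate()
    {
        writer.Flush();
        return new SqlBytes(buffer.ToArray());
    }

    // Format.UserDefined requires custom serialization of the
    // intermediate state.
    public void Write(BinaryWriter w)
    {
        writer.Flush();
        w.Write(expectedSeq);
        w.Write((int)buffer.Length);
        w.Write(buffer.ToArray());
    }

    public void Read(BinaryReader r)
    {
        Init();
        expectedSeq = r.ReadInt64();
        int length = r.ReadInt32();
        buffer.Write(r.ReadBytes(length), 0, length);
    }
}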
Taking advantage of 2008's ability to return VARBINARY(MAX) from a UDA, I stored the PK and data value in a highly compacted form inside that SqlBytes output. (I got it down to under three bytes per row for some data sets.) The sequence number is just there to check that Accumulate receives rows in the right order, so it doesn't need to be stored.
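To give a flavour of how that kind of density is possible (illustrative only, not necessarily my exact scheme): delta-encode consecutive keys and values, then store each delta as a ZigZag-encoded 7-bit varint, so small deltas cost a single byte:

using System.IO;

// Illustrative compact encoding: store each value as a delta from its
// predecessor, ZigZag-encoded so small negative deltas stay small, then
// written as a 7-bit variable-length integer.
internal static class VarIntCodec
{
    public static void WriteDelta(BinaryWriter w, long delta)
    {
        // ZigZag: map ..., -2, -1, 0, 1, 2, ... to 3, 1, 0, 2, 4, ...
        ulong zz = (ulong)((delta << 1) ^ (delta >> 63));
        while (zz >= 0x80)
        {
            w.Write((byte)(zz | 0x80));   // low 7 bits, continuation bit set
            zz >>= 7;
        }
        w.Write((byte)zz);                // final byte, continuation bit clear
    }

    public static long ReadDelta(BinaryReader r)
    {
        ulong zz = 0;
        int shift = 0;
        byte b;
        do
        {
            b = r.ReadByte();
            zz |= (ulong)(b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return (long)(zz >> 1) ^ -(long)(zz & 1);   // undo ZigZag
    }
}

With clustered keys and smoothly varying data values, most rows reduce to a couple of one-byte deltas.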
This SqlBytes object was passed as input to a CLR TVF (dbo.Unpack). The byte-stuffed format was expanded as needed inside the MoveNext method, and the running sum was computed at that stage.
The output streams via the FillRow method as normal, and includes the primary key in case a join back to the original data is ever required.
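Again a simplified sketch, matching the naive format from the Pack sketch above rather than the real byte-stuffed one:

using System.Collections;
using System.Collections.Generic;
using System.Data.SqlTypes;
using System.IO;
using Microsoft.SqlServer.Server;

public partial class UserDefinedFunctions
{
    // Illustrative sketch of dbo.Unpack.
    [SqlFunction(
        FillRowMethodName = "UnpackFillRow",
        TableDefinition = "pk INT, running_sum BIGINT")]
    public static IEnumerable Unpack(SqlBytes packed)
    {
        long runningSum = 0;

        // Reading from packed.Stream (rather than packed.Value) is what
        // keeps this streaming - and also what trips the wrong-thread
        // exception mentioned below.
        using (BinaryReader reader = new BinaryReader(packed.Stream))
        {
            while (reader.BaseStream.Position < reader.BaseStream.Length)
            {
                int pk = reader.ReadInt32();
                int value = reader.ReadInt32();
                runningSum += value;   // running sum computed as rows stream
                yield return new KeyValuePair<int, long>(pk, runningSum);
            }
        }
    }

    public static void UnpackFillRow(
        object row, out SqlInt32 pk, out SqlInt64 runningSum)
    {
        KeyValuePair<int, long> kvp = (KeyValuePair<int, long>)row;
        pk = kvp.Key;
        runningSum = kvp.Value;
    }
}

Because the iterator computes the sum as it yields, rows flow out one at a time; nothing forces the whole blob to be decoded up front.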
I am leaving a lot of detail out, so please read between the lines a little. I actually wrote several different versions with different approaches, but the description above captures the core of it.
The obvious concern is memory usage for very large input data sets; hence my interest in spilling to disk or something similar. Again, the details don't really matter.
The things that stopped me publishing the article are:
1. The wrong-thread exception when passing my SqlBytes from the UDA's Terminate method to the TVF's input. I could get around this by reading the stream's buffer into a byte array (sketched after this list), but that's not at all nice from a memory allocation point of view.
2. The perf spool you get when APPLYing the UDA output to the TVF input! OK, I could try the trace flag to disable it, but that's no fun either. You only get a perf spool if the UDA produces more than one group, so single-group runs fly; the spool hurts a lot for larger numbers of groups, though.
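For completeness, the workaround for item 1 looks roughly like this (illustrative; the helper name is mine):

using System.Data.SqlTypes;
using System.IO;

internal static class StreamWorkaround
{
    // Copy the VARBINARY(MAX) into a managed byte array so the TVF never
    // touches the UDA's stream from the wrong thread. The copy defeats
    // streaming and roughly doubles peak memory for the blob.
    public static BinaryReader OpenBuffered(SqlBytes packed)
    {
        byte[] copy = packed.Value;   // materializes the entire value
        return new BinaryReader(new MemoryStream(copy, false));
    }
}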
I should mention, though, that the whole thing was still very fast: very nearly as fast as Jeff's and Hugo's methods, which was quite something considering all that work. (There was actually even more work than I mentioned, since I used streaming compression in the UDA and decompression in the TVF for large sets.)
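The compression layer is just a stream wrapper in each direction; something along these lines (DeflateStream shown for illustration):

using System.IO;
using System.IO.Compression;

internal static class PackCompression
{
    // Wrap the UDA's output buffer in a compressing stream on the way out...
    public static Stream Compressor(Stream destination)
    {
        return new DeflateStream(destination, CompressionMode.Compress, true);
    }

    // ...and the TVF's input in a decompressing stream on the way in.
    // Both directions stay streaming; nothing is inflated up front.
    public static Stream Decompressor(Stream source)
    {
        return new DeflateStream(source, CompressionMode.Decompress, true);
    }
}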
So I will wait for the wrong-thread issue to be fixed, maybe enable the trace flag, and go with that for the time being. I hope you can see why I am so keen on a combined streaming-in-and-out CLR UDA-and-TVF object?!