CLR Table-Valued Function Example with Full Streaming (STVF)

  • Jeff Moden (12/24/2009)


    Matt Miller (#4) (12/23/2009)


    I had attempted to use something like this to bring in data from an external file. One of those "ugly delimited" files, kind of delimited and kind of not, that I needed to bring in. This would have offered a way to parse it and clean it up in one single shot, and I would simply need to insert the results from the TVF into my destination table, skipping the whole staging table, clean up the data, etc...

    I ended up abandoning it because it was trying to build the entire table in mem before returning it, so the CLR function would crash before getting anything out. So - I ended up going in another direction and got it done that way.

    I had also looked at building the old running totals this way, running into the same problem (order is in fact guaranteed this way so no controversy over documented vs not). I got around that using the CLR SP (since you can stream the results without having to build the entire return set first), but still use the "other way" until I am formally forbidden from doing so , since it is still faster than the CLR SP (although not by a whole lot).

    That's one of the problems that I seem to see popping up a lot... CLR's running the system out of memory. I've not asked any DBA in particular, but I'm thinking that's one of the reasons why many DBA's simply say "No! No CLR's. Not on my box."

    Shifting gears, it would be fun to see the file you were trying to split if it's not proprietary and has no private info in it.

    Heh... it's also good to hear that the "other" method for running aggregates beats the CLR method... thanks for the feedback on that. 🙂

    Sorry to say - those files are long gone for another company I no longer deal with. They were some delimited files, with varying numbers of inputs on each line (the generating system would "skip" the ending inputs with no values).

    Again nothing too horrible, and I wanted to play with SQLCLR early on with it, so I took a shot.

    I've been pleasantly surprised to see my 2008 doing better with resource mgmt (probably due to the 64-bit in part), so CLR does seems to be doing somewhat better from my small testing with it. Don't have much in the way of play time these days, so CLR doesn't get much air time. There is one reasonably interesting application I am attempting which might be useful (CLR call to an ERE or enterprise rules engine), but I haven't had a chance to stress-test it.

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • Solomon Rutzky (12/24/2009)


    RBarryYoung (12/24/2009)


    Actually, Jeff, the line-splitting functions and (less often)parsing functions are usually best done as TVF's, whether SQL or CLR. Unfortunately, as Matt pointed out, CLR TVF's by default do not want to do a streaming return. Solomon's "trick" here is really quite significant. Up till now, I have not even heard any of the "Big Guns" in CLR's say anything about it other than to complain about the situation.

    Ok, so English lesson's aside ;-), I really cannot lay claim to this being my technique. I was hoping to indicate in the article that this is just something that I came across in my workings with SQLCLR. I think I picked it up from my last article dealing with Streaming INTO SQL Server from an application and figured I would give the "yield return" syntax a shot in a TVF and was very pleasantly surprised when it worked. However, I have since done some searches on "yield return" and have found that a few other people have been doing this so maybe it is documented somewhere. Unfortunately the examples are all in blog posts about something not working correctly and not attempting to show a good way of doing TVFs. Hence nobody looking for "how to do a streaming TVF" (or maybe, "how to do a TVF properly" since I am not sure why anyone would choose to store the entire collection in memory, unless maybe you needed to do secondary logic over the set) will ever find those posts. This is why I felt it was important to do this article: because, as you mentioned, for some odd reason nobody else has.

    Fair enough. And I should mention here that my concern with conversion here is with "yield break", not with "yield return". There are well known (though difficult) ways to emulate "yield return" in VB.Net; the two most common are 1) Inherit someone else's IEnumerable and then force it to do your bidding (this is the canonical article on that technique: http://visualstudiomagazine.com/columns/article.aspx?editorialsid=2972), and 2) use VB's "Static" keyword (not the same as C#'s "static", which is called "Shared" in VB) to write a state-machine to emulate the Co-routine context retention that "yield return" provides.

    However, "yield break" is another story altogether. I have not heard of any way by anyone of a way to emulate a "yield break" in Vb other than the obvious: execute a hard return. So if you can rewrite your CLR so that you can replace the "yield break" with your functions exit/return, then it *should* be possible for me to convert it into VB.Net.

    Now, I don't really know much about VB.Net but I did mention in my article that there is the standard implementation of IEnumerator that does the same thing. I would have to assume that while the examples are in C# that they would work the same in VB.Net since they do not rely upon the "yield" construct. Here are two:

    1) http://www32.brinkster.com/srisamp/sqlArticles/article_46.htm (bottom half of the page)

    2) http://sqlblog.com/blogs/adam_machanic/archive/2009/04/28/sqlclr-string-splitting-part-2-even-faster-even-more-scalable.aspx

    Right. I also assume that that's the same as I mentioned above. The "yield break" is the sticking point there.

    As for the Errors that you were getting in your routine, I assume that those are actually related to T-SQL's inability to create and maintain a class-instance (object) context in any consistent way, and conversely, CLR's great difficulty telling different calling contexts & threads apart. When I wrote my own CLR "running totals" solution, that was the biggest problem that I ran into (and not the sequencing problem). I did finally come up a couple of solutions to it, but they are somewhat kludgey and not at all elegant.

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • Matt Miller (#4) (12/24/2009)


    Sorry to say - those files are long gone for another company I no longer deal with. They were some delimited files, with varying numbers of inputs on each line (the generating system would "skip" the ending inputs with no values).

    Again nothing too horrible, and I wanted to play with SQLCLR early on with it, so I took a shot.

    I've been pleasantly surprised to see my 2008 doing better with resource mgmt (probably due to the 64-bit in part), so CLR does seems to be doing somewhat better from my small testing with it. Don't have much in the way of play time these days, so CLR doesn't get much air time. There is one reasonably interesting application I am attempting which might be useful (CLR call to an ERE or enterprise rules engine), but I haven't had a chance to stress-test it.

    Ah... got it. I ran into a couple of similar buzz saws last year. I had delimited files like the ones you speak of where empty "fields" at the end of the row were simply dropped just as you described. I also had some where there were an unknown number of fields where there was a two row column header and if they contained certain words, either had to be handled as column pairs or column quads and THEN be routed to the correct tables. I ended up using my old friend the Tally table. The original job for the imports and splits were written in Perl and each file would take about 40 minutes to render out. Between Bulk Insert, xp_DirTree, and the Tally table, I could load, render, glean, and clean 8 files in less than 2 minutes.

    The files also had "file headers" and "file footers". Heh... while everyone else was pitching a fit, I used the information in those to confirm what was in the file, made sure it matched the file name, and used the data in the footer to confirm the entire file was sent and received in good order because it also contained a row count.

    It's funny that you'd mention the memory problems with CLRs... the Perl scripts had the same bloody problem. As business increased, the files got bigger and memory usage became a real issue. Do you think this streaming method will solve the memory issues on extremely large files? If it does, then that makes this article really worth the price of admission because, as Barry mentioned, one of the code places to use CLR's is for parsing raw data. (Actually, that question is for anyone who might like to jump in).

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (12/24/2009)


    ...Do you think this streaming method will solve the memory issues on extremely large files? If it does, then that makes this article really worth the price of admission because, as Barry mentioned, one of the code places to use CLR's is for parsing raw data. (Actually, that question is for anyone who might like to jump in).

    Oh yeah, absolutely. Streaming is all the rage in the .Net set these days (even aside from SQL, SQL CLR, etc.), it solves the memory overflow problems by yards, because its basically pipelining the recordsets internally, it never has to have more than a fraction of them allocated at a time.

    But more than that, its just way faster, even when you don't have a max memory problem. First it saves on the overhead of memory allocation, because it can just keep reusing those buffers that it freed up. And secondly, it can heavily leverage multi-processors/cores through on-server co-processing, because the sender and receiver are both working at the same time, instead of first the sender in one long huge lump, and then the receiver, in an equally long huge lump.

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • And of course, Streaming is a dream for scalability. You no longer have worry about scaling up physical memory to match the size of your Import/Export datasets, if they never have to have more than a couple of hundred rows in memory at anyone time.

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • Jeff Moden (12/24/2009)


    The files also had "file headers" and "file footers". Heh... while everyone else was pitching a fit, I used the information in those to confirm what was in the file, made sure it matched the file name, and used the data in the footer to confirm the entire file was sent and received in good order because it also contained a row count.

    Hey Jeff. As I mentioned in a reply to one of Matt's posts, I basically solved this problem, in a general sense, using a CLR Stored Proc in my SQL# project called File_SpllitIntoFields which takes a Regular Expression delimiter. Using a RegEx delimiter means that you can truly parse CSV files that have text-qualification on some (not all) fields an embedded double-quotes. It also returns "empty" fields and can skip any number of "header" rows. And because it releases each row as it reads it from disk, it consumes very little memory.

    It's funny that you'd mention the memory problems with CLRs... the Perl scripts had the same bloody problem. As business increased, the files got bigger and memory usage became a real issue. Do you think this streaming method will solve the memory issues on extremely large files? If it does, then that makes this article really worth the price of admission because, as Barry mentioned, one of the code places to use CLR's is for parsing raw data. (Actually, that question is for anyone who might like to jump in).

    As I have been saying (and now Barry has been helpful in explaining the technical "why") there is no reason to not always do this (or even the full IEnumerator syntax as noted via links in my previous post) since keeping one row (or a small set of data) in memory is always preferable over the entire file / set. For example, I have a GZip CLR Function in my SQL# project that will compress a file on disk. I open the read and write streams at the same time and process a small amount of data per each read/write so the file size never matters. I use this function at work and I GZip 3 gig files down to 700 megs in about 5 minutes. In fact, the 3 Gig file is generated by the DB_BulkExport CLR Proc that is also in the SQL# project that exports the result set of a passed-in Query. This will also not consume much memory as it only has one row in memory at any given time since it writes each row to disk as it is read from the datareader.

    So file size should never really matter (unless for some reason you need to make multiple passes on the full set of data). And if someone is processing large files and does not want to use (or even enable) CLR, there is a wonderful alternative if they are using SQL Server 2008 which is to stream the data in from an application, which is the basis of my previous article (which is 100% set-based and no RBAR :-)): http://www.sqlservercentral.com/articles/SQL+Server+2008/66554/

    Take care,

    Solomon...

    SQL#https://SQLsharp.com/ ( SQLCLR library ofover 340 Functions and Procedures)
    Sql Quantum Lifthttps://SqlQuantumLift.com/ ( company )
    Sql Quantum Leaphttps://SqlQuantumLeap.com/ ( blog )
    Info sitesCollations     •     Module Signing     •     SQLCLR

  • RBarryYoung (12/24/2009)


    However, "yield break" is another story altogether. I have not heard of any way by anyone of a way to emulate a "yield break" in Vb other than the obvious: execute a hard return. So if you can rewrite your CLR so that you can replace the "yield break" with your functions exit/return, then it *should* be possible for me to convert it into VB.Net.

    Hey there. I tried reworking my example to use a "return" in a few different manners but always got this error:

    Cannot return a value from an iterator. Use the yield return statement to return a value, or yield break to end the iteration.

    Sorry I could not help more here. 🙁

    SQL#https://SQLsharp.com/ ( SQLCLR library ofover 340 Functions and Procedures)
    Sql Quantum Lifthttps://SqlQuantumLift.com/ ( company )
    Sql Quantum Leaphttps://SqlQuantumLeap.com/ ( blog )
    Info sitesCollations     •     Module Signing     •     SQLCLR

  • RBarryYoung (12/25/2009)


    And of course, Streaming is a dream for scalability. You no longer have worry about scaling up physical memory to match the size of your Import/Export datasets, if they never have to have more than a couple of hundred rows in memory at anyone time.

    Agreed, and if you can get the return stream out of context while keeping the incoming stream "in context", you tend to get very good perf (although this seems to not work using a loopback OPENQUERY for that). using the context connection for both isn't bad, but it's not quite as fast as the other method.

    I'm just surprised they decided not to build this into functions when they made it so very easy to do in Stored procs.

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • RBarryYoung (12/25/2009)


    Jeff Moden (12/24/2009)


    ...Do you think this streaming method will solve the memory issues on extremely large files? If it does, then that makes this article really worth the price of admission because, as Barry mentioned, one of the code places to use CLR's is for parsing raw data. (Actually, that question is for anyone who might like to jump in).

    Oh yeah, absolutely. Streaming is all the rage in the .Net set these days (even aside from SQL, SQL CLR, etc.), it solves the memory overflow problems by yards, because its basically pipelining the recordsets internally, it never has to have more than a fraction of them allocated at a time.

    But more than that, its just way faster, even when you don't have a max memory problem. First it saves on the overhead of memory allocation, because it can just keep reusing those buffers that it freed up. And secondly, it can heavily leverage multi-processors/cores through on-server co-processing, because the sender and receiver are both working at the same time, instead of first the sender in one long huge lump, and then the receiver, in an equally long huge lump.

    Ah... now I get it. I've not worked with C and I haven't worked with VB in about 7 years. I couldn't tell you an IENumerator from a Terminator. Thanks for the simple answer.

    It's ironic, though... seems we've come full circle back to the "old way" of doing things. We've come back to the natural cycle of read a line, process a line, write a line, loop until done. No memory worries, no scalability problems, no speed problems. Open two "file" connections and go like hell one line or one controlable batch of lines at a time. It's about time.

    Thanks, Barry.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Solomon Rutzky (12/25/2009)


    Hey Jeff. As I mentioned in a reply to one of Matt's posts, I basically solved this problem, in a general sense, using a CLR Stored Proc in my SQL# project called File_SpllitIntoFields which takes a Regular Expression delimiter. Using a RegEx delimiter means that you can truly parse CSV files that have text-qualification on some (not all) fields an embedded double-quotes. It also returns "empty" fields and can skip any number of "header" rows. And because it releases each row as it reads it from disk, it consumes very little memory.

    Thanks, Solomon. Well done. You've absolutely nailed several of my main complaints with the likes of BCP and Bulk Insert. And Yep... I get that part and have previously said that a well written CLR is the way to go for splitters. It's easy to do well and can be done with some excellent performance especially when it comes to the splitting and text-qualifier recognition. Splitters are one of those places where a CLR will just about always beat T-SQL. Not sure you need the minor overhead of RegEx for such a thing either but, being pretty much a hard core data troll, I've not worked outside SQL Server very much in the last 7 years and wouldn't know for sure. Heh... as a side bar, I gave all that other stuff up when my manager (at the time) insisted that I make a "warning" field on the GUI with White lettering and a pale Yellow background and a "danger" field with a particular color Pink lettering on a Fusia background.

    Anyway, thank you (and the others) very much for answering my questions and entertaining my strong bit of skepticism on the subject of CLRs. Just like the Friday poll I put out a couple of weeks ago, it seems that people (on this thread, anyway), are using CLRs in a highly appropriate fashion. Thanks to Barry for taking the time to dumb down what you're doing in the article for me, all that's left to say is "Well done, Solomon" and "It's about time folks started thinking this way". 🙂

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Heh, glad I could help,Jeff. 🙂

    And as far as things coming full circle, yeah. "The more things change the more they stay the same". Or as I like to say sometimes "those who forget the lessons of history are doomed to repeat them. But those who remember them, will eventually be able to exploit them." 😀

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • Solomon Rutzky (12/25/2009)


    RBarryYoung (12/24/2009)


    However, "yield break" is another story altogether. I have not heard of any way by anyone of a way to emulate a "yield break" in Vb other than the obvious: execute a hard return. So if you can rewrite your CLR so that you can replace the "yield break" with your functions exit/return, then it *should* be possible for me to convert it into VB.Net.

    Hey there. I tried reworking my example to use a "return" in a few different manners but always got this error:

    Cannot return a value from an iterator. Use the yield return statement to return a value, or yield break to end the iteration.

    Sorry I could not help more here. 🙁

    I don't think I explained myself very well (I have a hard time remember the right terminology for the C world sometimes). Anyway what I meant by a "hard return", was not an explicit [font="Courier New"]return[/font] statement, but rather the implicit return marked by the physical end of the method. (looking at it now, my attempted description seems awful to me). In other words, you have to arrange the logic so that the "yield break" is replaced by the physical end of the routine.

    (EDIT: Hmmm, or maybe that message means that you must use "return" without a value? *sigh* I'm just not good enough at C# to figure these subtle differences out by myself ... :().

    That's my understanding anyway, which is probably dicey at best. I'll try and dig up some reference that might make this clearer...

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • OK, Solomon, here's a quote from Jon Skeet at his C# site (http://www.yoda.arachsys.com/csharp/csharp2/iterators.html), that I think explains what I am getting at:

    As mentioned before, yield break is used to stop iterating. Usually this is not needed, as you naturally reach the end of the iterator block. As well as stopping iterating, yield break can also be used to create a simple "empty" iterator which doesn't yield anything. If you had a completely empty method body, the compiler wouldn't know whether you wanted to write an iterator block or a "normal" block (with normal return statements etc). A single yield break; statement as the whole method body is enough to satisfy the compiler.

    yield break can be useful if you want to stop iterating due to some external signal - the user clicking on a "cancel" button for instance. Sometimes it is easier to stop the code which is providing the data than the code which is requesting that data. In simple cases, of course, you can just use while loops to only keep going while more data is really wanted. In more complicated scenarios, however, that can make the code messy - yield break ends the method abruptly in the same way that a normal return statement does, with no need to make sure that every level of iteration checks whether or not to continue. Here's an example:...

    In other words, if "yield break" is used to replace the end-of-routine return when you don't actually want to end your routine yet, then (hopefully) you should be able to do the converse as well: use the end of your C# routine instead of "yield break". Except of course for some of those edge cases that he mentions, and the whole question of whether the code logic can be gerrymandered around like that in each specific case (which is why there's no deterministic solution that works all the time, just usually).

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • Jeff Moden (12/25/2009)


    Not sure you need the minor overhead of RegEx for such a thing either but, being pretty much a hard core data troll, I've not worked outside SQL Server very much in the last 7 years and wouldn't know for sure.

    Hey Jeff, actually, the RegEx delimiter is needed (as far as I can tell) in order to do true CSV / TSV parsing. The problem with splitting on a regular comma comes in when you have commas embedded in data fields. At this point you do text-qualification but need to be able to determine if it is being used (might not be on all fields) and even if it is all fields you still need to know that a comma inside of a text-qualified region is not a field-separator. And at that point it gets a little trickier since now you need to also rule out instances of the text-qualifier character that are likewise embedded in the text-qualified data field.

    Anyway, thank you (and the others) very much for answering my questions and entertaining my strong bit of skepticism on the subject of CLRs. Just like the Friday poll I put out a couple of weeks ago, it seems that people (on this thread, anyway), are using CLRs in a highly appropriate fashion. Thanks to Barry for taking the time to dumb down what you're doing in the article for me, all that's left to say is "Well done, Solomon" and "It's about time folks started thinking this way". 🙂

    Thank you for that and you are welcome. I think this has been a good discussion on the topic of CLR. And yes, thanks also to Barry for explaining that in more technical detail than I could have :-).

    RBarryYoung (12/25/2009)


    In other words, if "yield break" is used to replace the end-of-routine return when you don't actually want to end your routine yet, then (hopefully) you should be able to do the converse as well: use the end of your C# routine instead of "yield break". Except of course for some of those edge cases that he mentions, and the whole question of whether the code logic can be gerrymandered around like that in each specific case (which is why there's no deterministic solution that works all the time, just usually).

    Ok, I understand what you are asking for now but am still not sure that I can help. The only two things I can think of I am sure you have already thought about:

    1) Structure the code differently (I think you did mention this earlier). In my example, I could have reversed the initial IF statement to be "if parameter is not null proceed" and then left the "else" condition to simply dump out at the end of the function.

    2) This one is not really recommended, but technically possible: use the evil "goto" statement. If C# even allows for this I could have done that inside of that initial IF statement and gone directly to the end of the function. Sloppy but workable if the goto statement is available.

    3) Ok so maybe 3 things, but this one is really more tongue-in-cheeck than serious: you could always work in C# instead of VB.Net ;-). I know, I know, not very helpful.

    Other than that I really don't know. I don't even really know what it is that I stumbled upon other than it works great! ;-). I really wish I could be of more help.

    SQL#https://SQLsharp.com/ ( SQLCLR library ofover 340 Functions and Procedures)
    Sql Quantum Lifthttps://SqlQuantumLift.com/ ( company )
    Sql Quantum Leaphttps://SqlQuantumLeap.com/ ( blog )
    Info sitesCollations     •     Module Signing     •     SQLCLR

  • Solomon Rutzky (12/25/2009)


    Hey Jeff, actually, the RegEx delimiter is needed (as far as I can tell) in order to do true CSV / TSV parsing.

    That's what I'm talking about... You don't need RegEx to parse True CSV although using RegEx is one of the easiest ways to pull that task off.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 15 posts - 31 through 45 (of 60 total)

You must be logged in to reply to this topic. Login to reply