Tally OH! An Improved SQL 8K “CSV Splitter” Function

  • David Data (5/2/2011)


    All these methods use CHARINDEX to efficiently see where the next delimiter is. But what about handling CSV strings with quoted text, e.g. where "John Smith","23, The High Street","Sometown" has to be split into 3 strings? Do you have a good method for this sort of data?

    (This is why I prefer tab separated to comma separated, with all its different ways to handle embedded comas, but usually you have to process the client's data as they give it to you, not as you'd like it :rolleyes: )

    That would be "True" CSV as it was meant to be instead of the this "Comedy Separated Value" stuff that comma separated is referred to as. You're correct that this splitter won't handle it especially when the quotes may be conditional. There IS a way to handle true CSV's using ACE drivers (64 bit) or Jet drivers (32 bit). I guess a "Spackle" article on those would do.

    I also usually prefer TAB delimited or, even better, ASCII characters 28 through 31 (yep, there are characters designed for this type of stuff but no one uses them anymore). But, again, you're correct that you usually have to take whatever some 3rd party is giving you. Of course, that's not a problem because they always do it "right", huh? 😛

    Thanks for stopping by, David. I'll see what I can do for the true CSV stuff, soon. 🙂

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (5/2/2011)


    David Data (5/2/2011)


    All these methods use CHARINDEX to efficiently see where the next delimiter is. But what about handling CSV strings with quoted text, e.g. where "John Smith","23, The High Street","Sometown" has to be split into 3 strings? Do you have a good method for this sort of data?

    (This is why I prefer tab separated to comma separated, with all its different ways to handle embedded comas, but usually you have to process the client's data as they give it to you, not as you'd like it :rolleyes: )

    That would be "True" CSV as it was meant to be instead of the this "Comedy Separated Value" stuff that comma separated is referred to as. You're correct that this splitter won't handle it especially when the quotes may be conditional. There IS a way to handle true CSV's using ACE drivers (64 bit) or Jet drivers (32 bit). I guess a "Spackle" article on those would do.

    I also usually prefer TAB delimited or, even better, ASCII characters 28 through 31 (yep, there are characters designed for this type of stuff but no one uses them anymore). But, again, you're correct that you usually have to take whatever some 3rd party is giving you. Of course, that's not a problem because they always do it "right", huh? 😛

    David and Jeff: the real issue here is that ANY delimiter, whether single or multiple characters, can be embedded in the data itself. The amount of code it would take to turn off and back on again the parsing within text-qualified sections of the data would likely render the code much less efficient (and I am not sure if it would truly be set-based). However, there is a way: Regular Expressions via SQLCLR ;-). A RegEx Split function can handle text-qualified data with embedded delimiters. You can find some examples around the net on doing this yourself or you can simply download an already done free set of functions at: http://www.SQLsharp.com/.

    That said, if Jeff is able to make a set-based function to do this then I would love to see it as I know that I would learn something from it :-).

    Take care,

    Solomon...

    SQL#https://SQLsharp.com/ ( SQLCLR library ofover 340 Functions and Procedures)
    Sql Quantum Lifthttps://SqlQuantumLift.com/ ( company )
    Sql Quantum Leaphttps://SqlQuantumLeap.com/ ( blog )
    Info sitesCollations     •     Module Signing     •     SQLCLR

  • Jeff, your exposition of how you mentally wrestle with a problem was exceptional. That alone made your article worth reading! You've got a rare combination of technical skill, lucid writing style, and humor.

    One question: where's the CLR function? I didn't see it in the attachments.

    Thanks,

    Rich

  • Congratulations, You just rediscovered a LRLA(1) parser (left to right lookahead one), invented in 20 years ago by DICK GRUNE and CERIEL JACOBS. PARSING TECHNIQUES A Practical Guide. I am impress you implemented it in TSQL. Here is another experiment you can try. I think it should come out equal in speed; but lower in memory because same parsing technique but manages memory with CLR types. Take a look at http://irony.codeplex.com/. This is a very good implementation of a LRLA(1) parser. It should be able to run in the SQLCCLR as safe code there are no external dependencies. Grab the BNF grammar for CSV from http://www.ietf.org/rfc/rfc4180.txt

    One more idea to explore.

    The LRLA(1) parser can take generic grammars. You could easily parse JSON http://www.ietf.org/rfc/rfc4627.txt into SQL table. Source data is data type richer compared to CSV because JSON values are type, not just tuples of strings for CSV.

  • Gianluca Sartori (5/2/2011)


    Jeff, your article is great!

    The final splitter function could have been 3 times slower than the original one, and the article would have been memorable anyway. It's a tale of stubborn resistance to surrender and a "thinking outside the box" lesson for all of us.

    Thank you very much.

    I'm humbled by your observation, Gianluca. :blush: I didn't think that anyone would understand that especially with the silly and fun metaphors I made about my angst on the subject. In real life, it was killing me that the original function performed so terribly on the high end of strings and I had to either fix it or apologize for it. Thank you for the very kind words.

    For the record, I really do have the artwork with the cat in the box. I commissioned a friend to make it for me several years ago to remind me that "Before you can think outside the box, you must first realize... you're in a box!"

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Lol I finally made a mention in one of Jeff's articles.

    One day I'll have to collaborate with Jeff to define just what this alliance is.

    On another note though, I notice something about the final graph of performance around the 270 mark on the X axis. The pure SQL set based method seems to continue on a straight line path, while the CLR methods seems to be starting to curve upwards similarly to what Jeff was complaining about the original Tally splitter doing, just later in the X series than the original Tally did.

    Btw, before I start experimenting with Jeff's downloads, if you were to concatenate the deliminters onto the string before running it through the splitter, wouldn't that help with the performance issue? As I see it, the CHARINDEX searches to find the delimiters is repeating the concatenation over and over within the where clause, but I could be seeing this wrong.

    SELECT ItemNumber = ROW_NUMBER() OVER (ORDER BY t.N),

    Item = SUBSTRING(@pString, t.N, CHARINDEX(@pDelimiter, @pString + @pDelimiter, t.N) - t.N)

    FROM cteTally t

    WHERE t.N <= DATALENGTH(@pString)+1 --DATATLENGTH allows for trailing space delimiters

    AND SUBSTRING(@pDelimiter + @pString, t.N, 1) = @pDelimiter

    ;

    If that were rewritten as

    SET @pString = ',' + @pString + ',';

    SELECT ItemNumber = ROW_NUMBER() OVER (ORDER BY t.N),

    Item = SUBSTRING(@pString, t.N, CHARINDEX(@pDelimiter, @pString, t.N) - t.N)

    FROM cteTally t

    WHERE t.N <= DATALENGTH(@pString)+1 --DATATLENGTH allows for trailing space delimiters

    AND SUBSTRING(@pString, t.N, 1) = @pDelimiter

    ;

    Doesn't that help the performance hit from the concatenation in the where clause?



    --Mark Tassin
    MCITP - SQL Server DBA
    Proud member of the Anti-RBAR alliance.
    For help with Performance click this link[/url]
    For tips on how to post your problems[/url]

  • rmechaber (5/2/2011)


    Jeff, your exposition of how you mentally wrestle with a problem was exceptional. That alone made your article worth reading! You've got a rare combination of technical skill, lucid writing style, and humor.

    One question: where's the CLR function? I didn't see it in the attachments.

    Thanks,

    Rich

    Very kind words. Thanks Rich.

    I kind of blew it on the CLR function. I asked someone to write it for me and then he came down with a really nasty flu and couldn't get to it. As a result, I had actually submitted the article without a chart or any info on the CLR except to say that it would probably still be faster. When I did finally get the CLR code, I was in a dead rush to update the article before the publishing date and I forgot to ask if I could include his actual code. Although I'm sure he'll say it's OK to attach his code, I still have to ask the question as a professional courtesy. If he says "yes", I'll be sure to add it to the attachments.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (5/2/2011)


    I kind of blew it on the CLR function. I asked someone to write it for me and then he came down with a really nasty flu and couldn't get to it. As a result, I had actually submitted the article without a chart or any info on the CLR except to say that it would probably still be faster. When I did finally get the CLR code, I was in a dead rush to update the article before the publishing date and I forgot to ask if I could include his actual code. Although I'm sure he'll say it's OK to attach his code, I still have to ask the question as a professional courtesy. If he says "yes", I'll be sure to add it to the attachments.

    Ah, great, that makes sense. I look forward to seeing it.

    And here I was thinking you'd intentionally left out that nasty-non-SQL-not-set-based-procedural-dingo-droppings code. ("If I don't post it, no one can use it?") 😉

    I am lucky (or unlucky for my goals of professional development) to work with small- and moderate-sized data sets that perform well using most of these approaches, but better code is .... better. Having just used CLR to solve a difficult problem ("which students can earn a special degree certificate?") involving a lot of forwards and backwards computations, scenario splitting, etc., I have to say I'm a convert. I don't see that having this CLR TVF is any more or less secure than a SQL-based TVF, as I still have to GRANT users rights to execute it.

    Thanks again for another great article, Jeff,

    Rich

  • mtassin (5/2/2011)


    Lol I finally made a mention in one of Jeff's articles.

    One day I'll have to collaborate with Jeff to define just what this alliance is.

    On another note though, I notice something about the final graph of performance around the 270 mark on the X axis. The pure SQL set based method seems to continue on a straight line path, while the CLR methods seems to be starting to curve upwards similarly to what Jeff was complaining about the original Tally splitter doing, just later in the X series than the original Tally did.

    Btw, before I start experimenting with Jeff's downloads, if you were to concatenate the deliminters onto the string before running it through the splitter, wouldn't that help with the performance issue? As I see it, the CHARINDEX searches to find the delimiters is repeating the concatenation over and over within the where clause, but I could be seeing this wrong.

    SELECT ItemNumber = ROW_NUMBER() OVER (ORDER BY t.N),

    Item = SUBSTRING(@pString, t.N, CHARINDEX(@pDelimiter, @pString + @pDelimiter, t.N) - t.N)

    FROM cteTally t

    WHERE t.N <= DATALENGTH(@pString)+1 --DATATLENGTH allows for trailing space delimiters

    AND SUBSTRING(@pDelimiter + @pString, t.N, 1) = @pDelimiter

    ;

    If that were rewritten as

    SET @pString = ',' + @pString + ',';

    SELECT ItemNumber = ROW_NUMBER() OVER (ORDER BY t.N),

    Item = SUBSTRING(@pString, t.N, CHARINDEX(@pDelimiter, @pString, t.N) - t.N)

    FROM cteTally t

    WHERE t.N <= DATALENGTH(@pString)+1 --DATATLENGTH allows for trailing space delimiters

    AND SUBSTRING(@pString, t.N, 1) = @pDelimiter

    ;

    Doesn't that help the performance hit from the concatenation in the where clause?

    Hi Mark,

    Heh... I'm thinking T-Shirts with the classic "No RBAR" symbol on them to start with! 😀

    Shifting gears... I actually tried the pre-concatenation but it left the original two problems. Sure, it makes the splitter function look like it's doing a better job if you only consider the splitter function but, overall, you still have the time consumed by the concatenation. The other problem that I was working on but didn't emphasize in the article is that such concatenation means that you can never split a full 8K bytes. Depending on whether you concatenate 1 or 2 delimiters (the original function used 1 on one end or the other at different times), you'll always fall 1 or 2 bytes short of a full 8K. The new splitter can handle a full 8K.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Robert Livermore (5/2/2011)


    Congratulations, You just rediscovered a LRLA(1) parser (left to right lookahead one), invented in 20 years ago by DICK GRUNE and CERIEL JACOBS. PARSING TECHNIQUES A Practical Guide. I am impress you implemented it in TSQL. Here is another experiment you can try. I think it should come out equal in speed; but lower in memory because same parsing technique but manages memory with CLR types. Take a look at http://irony.codeplex.com/. This is a very good implementation of a LRLA(1) parser. It should be able to run in the SQLCCLR as safe code there are no external dependencies. Grab the BNF grammar for CSV from http://www.ietf.org/rfc/rfc4180.txt

    One more idea to explore.

    The LRLA(1) parser can take generic grammars. You could easily parse JSON http://www.ietf.org/rfc/rfc4627.txt into SQL table. Source data is data type richer compared to CSV because JSON values are type, not just tuples of strings for CSV.

    Nah... didn't "just" discover it. The old splitter, the basis of which has been around for years before I came along, used the same approach. It's good to know the history behind it, though. And thanks for the kudo.

    I took a look at the code you pointed to. Very clever indeed. Will it parse "True" CSV with double-quoted text qualifiers?

    Also, you should send a note to Phil Factor. He's been working on splitters for JSON and he'd be very interested in the "Irony" code.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (5/2/2011)


    mtassin (5/2/2011)


    Lol I finally made a mention in one of Jeff's articles.

    Doesn't that help the performance hit from the concatenation in the where clause?

    Hi Mark,

    Heh... I'm thinking T-Shirts with the classic "No RBAR" symbol on them to start with! 😀

    Shifting gears... I actually tried the pre-concatenation but it left the original two problems. Sure, it makes the splitter function look like it's doing a better job if you only consider the splitter function but, overall, you still have the time consumed by the concatenation. The other problem that I was working on but didn't emphasize in the article is that such concatenation means that you can never split a full 8K bytes. Depending on whether you concatenate 1 or 2 delimiters (the original function used 1 on one end or the other at different times), you'll always fall 1 or 2 bytes short of a full 8K. The new splitter can handle a full 8K.

    I suppose, though these days if I suspect I'm going to exceed or even come close to exceeding the 8k, I'd switch to varchar(max). Which I know this article doesn't cover (because you said you'd get to it).

    Today I get to learn something, I figured the issue with the concatenation was that as it was part of a WHERE clause that was re-applied as the tally table progressed through the set, that this was causing the concatentation to be reapplied for each value of N in the tally table.

    I just don't understand how concatenating a pair of commas to a string can be such a performance impact, or how it can cause the performance to change slope like that later in the graph. I can see it causing a lag as the concatenation is done, just not that lag increasing as the size of the string does.



    --Mark Tassin
    MCITP - SQL Server DBA
    Proud member of the Anti-RBAR alliance.
    For help with Performance click this link[/url]
    For tips on how to post your problems[/url]

  • Solomon Rutzky (5/2/2011)


    Jeff Moden (5/2/2011)


    David Data (5/2/2011)


    All these methods use CHARINDEX to efficiently see where the next delimiter is. But what about handling CSV strings with quoted text, e.g. where "John Smith","23, The High Street","Sometown" has to be split into 3 strings? Do you have a good method for this sort of data?

    (This is why I prefer tab separated to comma separated, with all its different ways to handle embedded comas, but usually you have to process the client's data as they give it to you, not as you'd like it :rolleyes: )

    That would be "True" CSV as it was meant to be instead of the this "Comedy Separated Value" stuff that comma separated is referred to as. You're correct that this splitter won't handle it especially when the quotes may be conditional. There IS a way to handle true CSV's using ACE drivers (64 bit) or Jet drivers (32 bit). I guess a "Spackle" article on those would do.

    I also usually prefer TAB delimited or, even better, ASCII characters 28 through 31 (yep, there are characters designed for this type of stuff but no one uses them anymore). But, again, you're correct that you usually have to take whatever some 3rd party is giving you. Of course, that's not a problem because they always do it "right", huh? 😛

    David and Jeff: the real issue here is that ANY delimiter, whether single or multiple characters, can be embedded in the data itself. The amount of code it would take to turn off and back on again the parsing within text-qualified sections of the data would likely render the code much less efficient (and I am not sure if it would truly be set-based). However, there is a way: Regular Expressions via SQLCLR ;-). A RegEx Split function can handle text-qualified data with embedded delimiters. You can find some examples around the net on doing this yourself or you can simply download an already done free set of functions at: http://www.SQLsharp.com/.

    That said, if Jeff is able to make a set-based function to do this then I would love to see it as I know that I would learn something from it :-).

    Take care,

    Solomon...

    Several years ago, I actually did make a pure T-SQL solution for "True" CSV which allowed not only embedded delimiters, but also allowed for "escaped" double quotes to be embedded. It worked very nicely but was a fair bit slower as you would expect it be even though a Tally Table certainly came into play.

    I set it aside in favor of the more efficient splitters available in Jet (and now, ACE) although those two methods are a bit more difficult to use. The algorithm I used would make a heck of a nice CLR, though. Since it's "dedicated code" instead of "general purpose" such as RegEx is, it may be even faster.

    I'm impressed by the wide array of functions available in SQL# but also a bit disappointed. I'm definitely NOT an anti_CLR zealot, but there are a lot of functions (random numbers, for example) that are just too simple to do correctly in T-SQL. I AM, however, seriously interested in the BULK EXPORT I see there. I wish MS had thought of such a thing and it's always perplexed by as to why they didn't.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (5/2/2011)


    Hi Mark,

    Heh... I'm thinking T-Shirts with the classic "No RBAR" symbol on them to start with! 😀

    For just $19.95 and if you're one of the first 20 callers we'll send you another one absolutely free (except for an additional $8.95/shirt S&H of course).

    🙂



    --Mark Tassin
    MCITP - SQL Server DBA
    Proud member of the Anti-RBAR alliance.
    For help with Performance click this link[/url]
    For tips on how to post your problems[/url]

  • mtassin (5/2/2011)


    Today I get to learn something, I figured the issue with the concatenation was that as it was part of a WHERE clause that was re-applied as the tally table progressed through the set, that this was causing the concatentation to be reapplied for each value of N in the tally table.

    Ah... let me make a correction here. I was talking about pre-concatenation within a function and you're talking about doing it outside the function. You are correct... pre-concatenation does definitely speed things up for the reasons you stated and allows things like the XML method to run fairly quickly and in a linear fashion. I see no reason why it wouldn't also do so for the code you offered.

    To be absolutely honest, I didn't try pre-concatenation outside of the function. It's likely going to be very fast... maybe even faster than this new function because the new function does an ISNULL/NULLIF on every element whereas your fine example does not. But, can you imagine the outcry if I settled for pre-concatenation outside of the function? :Whistling:

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • @mark-3,

    IIRC, I believe I tried doing pre-aggregation in an mTVF and it slowed things down too much (I stated something to that effect in the article). I can't find my notes on it so I must have dismissed it pretty early in the game but I'll revisit it. Still, even with using a zero based Tally Table and only concatenating a delimiter at the end, it'll still only allow 7999 bytes which may prove to be a distraction to some because it won't error on any truncation that will occur when passed a full 8K string.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 15 posts - 16 through 30 (of 990 total)

You must be logged in to reply to this topic. Login to reply