Tally OH! An Improved SQL 8K “CSV Splitter” Function

  • See this post from Paul White:

    http://www.sqlservercentral.com/Forums/FindPost1570980.aspx

    Paul White (5/14/2014)


    Is everyone running Jeff's original/updated test rig from the article to get these results?

    If so, you might not be testing what you think you are. There is a weakness in the test script that can result in the Csv8K table containing duplicates (i.e. every row is the same). This is related to hiding the NEWID call in the view. In (at least) SQL Server 2012 and 2014, this NEWID value may be cached, resulting in every row being the same.

    It is obviously not a very good test to simply split the string once, cache the results in a table spool, and replay that for each duplicate row from the Csv8K table. This particularly (and unfairly) benefits the non-CLR splitters by the way.

    As a quick hack fix, you can use undocumented trace flag 8690 to prevent caching the NEWID value in this part of the script:

    While the plan is identical, you'll discover that disabling the spool should yield a different performance characteristic for delimitedsplit8k.
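
    The script fragment Paul refers to isn't reproduced here, but the general shape of the fix is to attach the trace flag as a query-level hint. A minimal sketch, assuming a Csv8K table with a CSV column (the column name is a placeholder, not necessarily the article's actual test rig):

    ```sql
    -- Undocumented trace flag 8690 suppresses table spools for this query,
    -- so the NEWID()-derived rows are re-evaluated per row instead of being
    -- cached once and replayed for every duplicate row.
    SELECT s.ItemNumber, s.Item
    FROM dbo.Csv8K AS c
    CROSS APPLY dbo.DelimitedSplit8K(c.CSV, ',') AS s
    OPTION (QUERYTRACEON 8690);
    ```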

  • Also, the sqlperformance article seems to suggest the new STRING_SPLIT performs fairly well...

    http://sqlperformance.com/2016/03/t-sql-queries/string-split

    Obviously, his T-SQL and XML-based splitters are poor choices.
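
    For reference, the SQL Server 2016 function under discussion takes a string and a single-character separator and returns one row per element; note that this version exposes no ordinal column:

    ```sql
    -- STRING_SPLIT (SQL Server 2016+): returns a single column named value,
    -- with no guaranteed ordering of the output rows.
    SELECT value
    FROM STRING_SPLIT('A,B,C', ',');
    ```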

  • Goldie Lesser (3/18/2016)


    Ed Wagner (3/17/2016)


    If it doesn't have a sequence number, I won't be using it. 'Nuff said.

    I use Jeff's function all the time, and I can't even think of a single case where I've used the sequence number.

    Can someone give a case of where the ordinal number is important?

    Thanks, had forgotten about that one (see the next post after Paul's)

    😎

  • mburbea (3/18/2016)


    Also, the sqlperformance article seems to suggest the new STRING_SPLIT performs fairly well...

    http://sqlperformance.com/2016/03/t-sql-queries/string-split

    Obviously, his T-SQL and XML-based splitters are poor choices.

    Did read Aaron's post, thanks again

    😎

  • mburbea (3/18/2016)


    see this post from Paul White :

    http://www.sqlservercentral.com/Forums/FindPost1570980.aspx


    Slightly further on this: the randomness of the test set generation is one thing, and the execution and performance of the string_split function is another; let's not mix the two. I've tested the functions on different harnesses and found the results to be similar. Do you have any evidence otherwise?

    😎

  • mburbea (3/18/2016)


    see this post from Paul White :

    http://www.sqlservercentral.com/Forums/FindPost1570980.aspx


    I'm not sure what else happened when Paul ran the test data generation script, because I've personally executed it on all versions from 2005 through 2012 and it has never produced the symptom he cited. Wayne Sheffield also ran it on all versions from 2005 through 2016, and the only version where the test data generator produced identical rows was 2016. That's also something to watch out for if you generate regular ol' XML using FOR XML PATH.

    As a bit of a sidebar, I certainly wouldn't call that an "improvement".

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Goldie Lesser (3/18/2016)



    Can someone give a case of where the ordinal number is important?

    I want the third item in the list. With any other splitter, you can't do this reliably. Many of them will return the results in the same order as the delimited list, but to do something like get the third one, you would have to wrap the splitter in a CTE and add ROW_NUMBER so you can get to it.

    I have to agree that 99.99% of the time I don't really care about the position, but in the times I do, I am incredibly glad it is there.
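
    The difference can be sketched like this. The first query uses DelimitedSplit8K's ItemNumber directly; the second shows the CTE + ROW_NUMBER workaround needed for a splitter without a sequence column (and ROW_NUMBER over an arbitrary order carries no real guarantee):

    ```sql
    -- DelimitedSplit8K exposes the ordinal, so the third item is one predicate:
    SELECT Item
    FROM dbo.DelimitedSplit8K('A,B,C,D', ',')
    WHERE ItemNumber = 3;

    -- With STRING_SPLIT (no ordinal column in 2016), you have to bolt one on:
    WITH numbered AS
    (
        SELECT value,
               ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS rn
        FROM STRING_SPLIT('A,B,C,D', ',')
    )
    SELECT value
    FROM numbered
    WHERE rn = 3;
    ```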

    _______________________________________________________________

    Need help? Help us help you.

    Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

    Need to split a string? Try Jeff Moden's splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

    Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
    Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
    Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
    Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

  • Eirikur Eiriksson (3/18/2016)



    Slightly further on this: the randomness of the test set generation is one thing, and the execution and performance of the string_split function is another; let's not mix the two. I've tested the functions on different harnesses and found the results to be similar. Do you have any evidence otherwise?

    😎

    +1000. You beat me to it by a couple of minutes. I'm not sure why anyone is concentrating on the performance of the test data generator. That's not the point.

    --Jeff Moden

  • The trace flag 8690 in the test set generation actually makes the case worse for the new string_split function. By the looks of it, it was benefiting more from the consistent input than the inline table-valued functions were. I wonder if it's the coding or the implementation of the streaming CLR UDF table-valued function (PhyOp_StreamingTabUDF).

    😎

  • Sadly, I don't have access to a box with 2016 at the moment, but here is a quick summary of the effect of the trace flag on my test script.

    Without the flag on line 451 (flag off):

    SplitterName        total_d     avg_d
    -----------------   ---------   ------------------
    delimitedsplit8kb   3.419344    0.0569890666666667
    fn_split            3.065303    0.0510883833333333
    SqlBaseline         3.373335    0.05622225
    xmlsplit            3.015298    0.0502549666666666

    With the flag off on a SQL Server 2012 box, it appears there is little incentive to use CLR, since the performance of delimitedsplit8k and xmlsplit is pretty good. Unfortunately, this is not true at all. The performance is too good, because the CSV column in every row is exactly the same.

    With the trace flag on:

    SplitterName        total_d      avg_d
    -----------------   ----------   ------------------
    delimitedsplit8kb   22.964298    0.3827383
    fn_split            3.862384     0.0643730666666666
    SqlBaseline         5.213518     0.0868919666666666
    xmlsplit            23.786376    0.3964396

    This is a pretty drastic difference, and it becomes clear that using the CLR is very advantageous and that delimitedsplit8k is a better approach than using an XML splitter. While not shown here, the xmlsplit in my gist degrades rapidly as the size of the input table grows (the tipping point is 10K rows, if I remember correctly). I suspect that string_split will dominate this benchmark and hopefully beat my CLR function.

  • mburbea (3/18/2016)


    Sadly, I don't have access to a box with 2016 at the moment, but here is a quick summary of the effect of the trace flag on my test script.


    The 2012 and 2014 versions do display this kind of behaviour, although I've not seen it this drastic. Could you please share the specs/configs of your test rig? Really curious here.

    😎

  • Goldie Lesser (3/18/2016)



    Can someone give a case of where the ordinal number is important?

    Sure.

    One classic (simple) case I've had to use on many occasions is where someone passes two separate CSVs like the following (for example)...

    A,B,C --Contents of the first parameter (@P1)

    1,2,3 --Contents of the second parameter (@P2)

    In this case, @P1 contains the entity identifiers (labels) for the values contained in @P2, and they have to be matched up and added to a table before use. Without an ordinal position value for each element, there is no ORDER BY and, therefore, no guarantee that SQL Server would match A with 1, B with 2, and C with 3.

    A much more sophisticated (complex) example would be in the form of a poor-man's full text search where a word must exist, say, somewhere between word #6 and word #10 and a different word must be within 3 words of the first word (proximity search).

    There's also the (medium-complexity) example of string cleansing, where a string is ripped apart at the character level and must be reassembled in the correct order, excluding the characters that have been omitted.

    I've needed it for all sorts of encode/decode functionality, all sorts of different string manipulation, and I've also used it to validate the order of CSVs. For example, checking that an "EAV-like" parameter with 20 sets of 3 elements each contains only numeric data in the first 2 elements of each set while the third contains only varchar, and then using those same ordinal element numbers as the driver for a CROSSTAB to correctly pivot the information into a 3-column result set.

    Tons and tons of uses for me. I probably couldn't name all of the different things I've done with the ordinal position value.
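
    Jeff's first (simple) case above can be sketched by joining the two splits on their ordinal position:

    ```sql
    -- Pair the labels in @P1 with the values in @P2 by ordinal position.
    DECLARE @P1 VARCHAR(8000) = 'A,B,C',
            @P2 VARCHAR(8000) = '1,2,3';

    SELECT lbl.Item AS Label,
           val.Item AS Value
    FROM dbo.DelimitedSplit8K(@P1, ',') AS lbl
    JOIN dbo.DelimitedSplit8K(@P2, ',') AS val
      ON val.ItemNumber = lbl.ItemNumber; -- ItemNumber pairs A-1, B-2, C-3
    ```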

    --Jeff Moden

  • Jeff Moden (3/18/2016)



    +1000, He he, this time you beat me to it

    😎

  • Yeah, I just ran it. Sorry, I misspoke: this box is running SQL Server 2014 Developer Edition, but the compatibility mode is set to 2012 to match my production environment.

    Operating System: Windows 7 Enterprise 64-bit (6.1, Build 7601) Service Pack 1 (7601.win7sp1_gdr.160121-1718)

    Processor: Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz (8 CPUs), ~3.4GHz

    Memory: 16384MB RAM

  • mburbea (3/18/2016)



    Hardware specs similar or almost identical to my 2016 test box; the OS and of course the SQL version differ on the 2016 rig, but they're identical to my 2014 test rig, where I haven't been able to reproduce this as drastically as your results. Leaves something to be explored.

    😎

Viewing 15 posts - 856 through 870 (of 990 total)
