Tally OH! An Improved SQL 8K “CSV Splitter” Function

  • mburbea (3/18/2016)


    Also, the sqlperformance article seems to suggest the thing performs fairly well...

    http://sqlperformance.com/2016/03/t-sql-queries/string-split

    Obviously, his t-sql and xml based splitters are poor choices.

    Whoever wrote the numbers table splitter doesn't have a clue how to do it correctly. Concatenation is killing the performance there. And, for a one-off it won't matter, but using a recursive CTE to create an incrementing sequence is as slow as, and more resource intensive than, even a WHILE loop. I wish authors would get away from that bad habit so that uninformed people don't latch onto it as a solution where it actually counts (no pun intended).
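    For comparison, here's a minimal sketch of the two approaches to generating a 1,000,000 row sequence (illustrative only; the cascading-CTE version is the same basic idea as the fnTally function linked below):

    -- Slow: a recursive CTE produces the sequence one row per iteration.
    WITH rCTE AS
    (
        SELECT N = 1
        UNION ALL
        SELECT N + 1 FROM rCTE WHERE N < 1000000
    )
    SELECT N FROM rCTE OPTION (MAXRECURSION 0);

    -- Fast: a "cascading CTE" tally builds the rows set-based from cross-joined
    -- constants, with no recursion at all.
    WITH E1(N) AS (SELECT 1 FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) v(N)), -- 10 rows
         E3(N) AS (SELECT 1 FROM E1 a, E1 b, E1 c),  -- 1,000 rows
         E6(N) AS (SELECT 1 FROM E3 a, E3 b)         -- 1,000,000 rows
    SELECT TOP (1000000) N = ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
    FROM E6;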

    It's also REALLY difficult for me to have any faith in an article where the primary test code of...

    SELECT SYSDATETIME();
    GO
    DECLARE @x VARCHAR(8000) =
    (
        SELECT f.Value
        FROM dbo.SourceTable AS s
        CROSS APPLY dbo.SplitStrings_XML/* method */(s.StringValue,',') AS f
    );
    GO 100
    SELECT SYSDATETIME();

    ...produces the following output.

    Beginning execution loop

    Msg 512, Level 16, State 1, Line 1

    Subquery returned more than 1 value. This is not permitted when the subquery follows =, !=, <, <= , >, >= or when the subquery is used as an expression.

    ... almost 100 times. :sick:

    After repairing that code to work as follows

    SELECT SYSDATETIME();
    GO
    DECLARE @x VARCHAR(8000)
    SELECT @x = f.Value
    FROM dbo.SourceTable AS s
    CROSS APPLY dbo.SplitStrings_XML/* method */(s.StringValue,',') AS f
    ;
    GO 100
    SELECT SYSDATETIME();

    ... I verified that it did come very close to what he posted for the run time of the XML splitter: 89.476 seconds average compared to his 90.31. Damned close enough.

    After accommodating the column names and using the ItemNumber column without otherwise modifying the code...

    SELECT SYSDATETIME();
    GO
    DECLARE @x VARCHAR(8000)
    SELECT @x = f.Item
    FROM dbo.SourceTable AS s
    CROSS APPLY dbo.DelimitedSplit8K/* method */(s.StringValue,',') AS f
    WHERE f.ItemNumber > 0
    ;
    GO 100
    SELECT SYSDATETIME();

    ... the DelimitedSplit8k came in at 61.078 seconds... which isn't too bad considering the CLR came in at 52 seconds.

    This also proves that his earlier articles that made the claim that the XML splitter beat DelimitedSplit8K were based on faulty "identical" row test data.

    I'll wait for Wayne Sheffield's article on the subject and for 2016 to hit the streets (I don't trust anything prior to RTM). I recommend you do the same.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff,

    I 100% agree with you that his test script wasn't the best. You've already dressed down his numbers table splitter, and that XML splitter doesn't use any LOB caching techniques, which forces it to recreate the XML thousands of times over.

    I installed a copy of 2016 RC1 on my laptop. This is a fresh copy and this laptop has never had sql installed on it before. Specs are as such:

    Operating System: Windows 10 Pro 64-bit (10.0, Build 10586) (10586.th2_release_sec.160223-1728)

    Processor: Intel(R) Core(TM) i5-5300U CPU @ 2.30GHz (4 CPUs), ~2.3GHz

    Memory: 8192MB RAM

    FYI, this machine also shows that issue with the trace flag causing a drastic shift in performance.

    I added the following function to my test script to make it more in line with the other functions in the race:

    if object_id('stringsplit') is not null drop function dbo.stringsplit
    go
    create function dbo.stringsplit(@pstring varchar(8000), @delimiter char(1))
    returns table
    with schemabinding as
    return
    select itemNumber = ROW_NUMBER() over (order by (select 1)),
           item = value
    from string_split(@pstring, @delimiter);
    go

    I realize this isn't perfect as there is no mention of an ordering guarantee.

    I added it to my test script. As I suspected, stringsplit performs well:

    SplitterName       total_d     avg_d
    delimitedsplit8k   23.489696   0.398130440677966
    fn_split            3.312561   0.0561451016949153
    SqlBaseline         4.91419    0.0832913559322034
    stringsplit         3.674841   0.0622854406779661
    xmlsplit           27.028163   0.458104457627119

    I'm kind of shocked that my hybrid split approach actually beats the built-in. Admittedly, though, string_split is a polymorphic function, which gives it an advantage over my hybrid split approach, which can only handle varchar and strings under 8K.

    For a final test, I modified the test script once more: I didn't capture the item ordinal, renamed the expected output column in my test script to value, and tried to capture how much overhead the wrapper iTVF added. I got the following result:

    SplitterName       total_d     avg_d
    delimitedsplit8k   22.761461   0.385787474576271
    fn_split            3.034557   0.0514331694915254
    SqlBaseline         2.250052   0.0381364745762712
    string_split        3.015764   0.0511146440677966
    stringsplit         3.078383   0.0521759830508475
    xmlsplit           26.197668   0.444028271186441

    Now it barely ekes out a victory over fn_split. For some reason the wrapper function adds a pretty decent overhead to string_split, and I'm not sure why. It could be the coercion of the argument to varchar(8000). Also, if I make the wrapper take varchar(max), the performance is bad (roughly 3x slower) but still faster than delimitedsplit.

    The built-in, once it adds an item ordinal, should be fast enough, and while a custom CLR-based approach can always be a bit faster, it's a marginal improvement at most.

    Full gist with splitter source code and test script

    and if you just want the final version of the test

    2016 version of script

  • mburbea (3/19/2016)


    Jeff,

    I 100% agree with you that his test script wasn't the best. You've already dressed down his numbers table splitter, and that XML splitter doesn't use any LOB caching techniques, which forces it to recreate the XML thousands of times over.

    I installed a copy of 2016 RC1 on my laptop. This is a fresh copy and this laptop has never had sql installed on it before. Specs are as such:

    Operating System: Windows 10 Pro 64-bit (10.0, Build 10586) (10586.th2_release_sec.160223-1728)

    Processor: Intel(R) Core(TM) i5-5300U CPU @ 2.30GHz (4 CPUs), ~2.3GHz

    Memory: 8192MB RAM

    This machine also shows that issue with the traceflag causing a drastic shift in performance FYI.

    I added the following function to my test script to make it more in line with the other functions in the race:

    if object_id('stringsplit') is not null drop function dbo.stringsplit
    go
    create function dbo.stringsplit(@pstring varchar(8000), @delimiter char(1))
    returns table
    with schemabinding as
    return
    select itemNumber = ROW_NUMBER() over (order by (select 1)),
           item = value
    from string_split(@pstring, @delimiter);
    go

    I realize this isn't perfect as there is no mention of an ordering guarantee.

    I added it to my test script. As I suspected, stringsplit performs well:

    SplitterName       total_d     avg_d
    delimitedsplit8k   23.489696   0.398130440677966
    fn_split            3.312561   0.0561451016949153
    SqlBaseline         4.91419    0.0832913559322034
    stringsplit         3.674841   0.0622854406779661
    xmlsplit           27.028163   0.458104457627119

    I'm kind of shocked that my hybrid split approach actually beats the built-in. Admittedly, though, string_split is a polymorphic function, which gives it an advantage over my hybrid split approach, which can only handle varchar and strings under 8K.

    For a final test, I modified the test script once more: I didn't capture the item ordinal, renamed the expected output column in my test script to value, and tried to capture how much overhead the wrapper iTVF added. I got the following result:

    SplitterName       total_d     avg_d
    delimitedsplit8k   22.761461   0.385787474576271
    fn_split            3.034557   0.0514331694915254
    SqlBaseline         2.250052   0.0381364745762712
    string_split        3.015764   0.0511146440677966
    stringsplit         3.078383   0.0521759830508475
    xmlsplit           26.197668   0.444028271186441

    Now it barely ekes out a victory over fn_split. For some reason the wrapper function adds a pretty decent overhead to string_split, and I'm not sure why. It could be the coercion of the argument to varchar(8000). Also, if I make the wrapper take varchar(max), the performance is bad (roughly 3x slower) but still faster than delimitedsplit.

    The built-in, once it adds an item ordinal, should be fast enough, and while a custom CLR-based approach can always be a bit faster, it's a marginal improvement at most.

    Full gist with splitter source code and test script

    and if you just want the final version of the test

    2016 version of script

    Something is wrong if your results show that the DelimitedSplit8K is only 12% faster than the XML splitter but, admittedly, I've not installed 2016 to test with and probably won't until at least sp1. I don't even like to test on the bleeding edge. Apologies for being a bit old fashioned that way.

    Also, how many rows and elements are the test results you posted for? And what is the min, max, and average length of the elements themselves?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

    This test script is basically the one in your article, just modified a bit so that it uses dynamic SQL to run the cases and requires less effort to add new functions to the benchmark. It also builds the CSV table using trace flag 8690. You can inspect it at my GitHub link. Each table has 1000 rows, and the average row is an array of 10-100 elements. I threw in a few extra extreme cases.
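    For anyone who doesn't want to open the gist, here's a rough sketch of the shape of that harness (table, column, and splitter names here are illustrative only, not the actual script):

    DECLARE @Results TABLE (SplitterName sysname, Duration_s decimal(18,6));
    DECLARE @SplitterName sysname, @sql nvarchar(max), @t0 datetime2(7);

    DECLARE SplitterCursor CURSOR LOCAL FAST_FORWARD FOR
        SELECT s.name FROM (VALUES ('DelimitedSplit8K'), ('stringsplit'), ('xmlsplit')) AS s(name);
    OPEN SplitterCursor;
    FETCH NEXT FROM SplitterCursor INTO @SplitterName;
    WHILE @@FETCH_STATUS = 0
    BEGIN
        -- Each splitter is invoked through dynamic SQL so adding a new one only means
        -- adding a name to the list above (output column names differ per splitter in
        -- the real script; Item is used here for simplicity).
        SET @sql = N'DECLARE @x varchar(8000);
                     SELECT @x = f.Item
                     FROM dbo.CsvTest AS s
                     CROSS APPLY dbo.' + QUOTENAME(@SplitterName) + N'(s.Csv, '','') AS f;';
        SET @t0 = SYSDATETIME();
        EXEC sys.sp_executesql @sql;
        INSERT @Results VALUES (@SplitterName, DATEDIFF(MICROSECOND, @t0, SYSDATETIME()) / 1000000.0);
        FETCH NEXT FROM SplitterCursor INTO @SplitterName;
    END;
    CLOSE SplitterCursor;
    DEALLOCATE SplitterCursor;

    SELECT SplitterName, total_d = SUM(Duration_s), avg_d = AVG(Duration_s)
    FROM @Results
    GROUP BY SplitterName;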

    As for the xmlsplitter being only 12% worse than delimited split, again I don't think it's that far-fetched. This version of the function uses the lag() analytic function to do LOB caching via the in-memory spool table, which means it will perform a lot better than a naive XML split function. The setup also favors it, as each table has only 1000 rows, which tends to work well with this technique; it falls apart at about 10k+ rows.

    Here I've added a naive xmlsplit which does not do anything to handle the LOB caching...

    if(object_id('xmlsplit2') is not null) drop function xmlsplit2;
    go
    create function xmlsplit2(@string varchar(max), @delimiter char(1))
    returns table
    with schemabinding as
    return
    select ItemNumber = ROW_NUMBER() OVER(ORDER BY (select null)),
           value = x.i.value('./text()[1]', 'varchar(8000)')
    from (VALUES
        (convert(xml,'<r>'+replace(@string,@delimiter,'</r><r>')+'</r>',0))
    ) r(a)
    cross apply a.nodes('./r') x(i)

    And the results are not pretty. This version crawls. I stopped it after 12.5 minutes.

    SplitterName       total_d      avg_d
    delimitedsplit8k     3.613165    0.328469545454545
    fn_split             0.918295    0.0834813636363636
    SqlBaseline          0.785266    0.0713878181818182
    string_split         0.744064    0.0676421818181818
    stringsplit          0.747936    0.0679941818181818
    xmlsplit             7.586331    0.689666454545455
    xmlsplit2          736.681697   73.6681697

    Yes, you saw that correctly: 736 seconds, after only 11 of the 60 test cases completed. I guess I'm bad at writing naive SQL, but this naive approach just hurts to execute.

  • mburbea (3/19/2016)


    As for the xmlsplitter being only 12% worse than delimited split, again I don't think it's that far-fetched. This version of the function uses the lag() analytic function to do LOB caching via the in-memory spool table, which means it will perform a lot better than a naive XML split function. The setup also favors it, as each table has only 1000 rows, which tends to work well with this technique; it falls apart at about 10k+ rows.

    Ah. Understood on the XML splitter used. I didn't look at the code... just saw "another" XML splitter. Well done on that. Thanks for taking the time to explain.

    To be sure, I won't justify using a method based on low row counts. I've seen too many good people get bit by that.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • mburbea (3/19/2016)


    I installed a copy of 2016 RC1 on my laptop. This is a fresh copy and this laptop has never had sql installed on it before. Specs are as such:

    Operating System: Windows 10 Pro 64-bit (10.0, Build 10586) (10586.th2_release_sec.160223-1728)

    Processor: Intel(R) Core(TM) i5-5300U CPU @ 2.30GHz (4 CPUs), ~2.3GHz

    Memory: 8192MB RAM

    Quick question on the server settings

    1. What is the "Cost Threshold for Parallelism" setting?

    2. What is the "Max Degree of Parallelism" setting?

    😎

    This was run with default settings on the box, so probably 5 for the cost threshold and whatever the default is for DOP.

    I decided to try to change the number of rows to 100,000 (by just having the TOP changed to rows*100) and run the trial with just delimitedsplit, fn_split (CLR hybrid split approach), and stringsplit. It's still going, but after about half an hour here are the results so far...

    SplitterName       total_d    avg_d              trials
    delimitedsplit8k   709.455    32.2479545454545   22
    fn_split           145.391     6.60868181818182  22
    SqlBaseline        213.14      9.68818181818182  22
    stringsplit        145.662     6.621              22

    Yeah, I'd say that string_split works fairly well.

    I removed the xml splitter for my own sanity as it trails far behind each of the other splitters.

  • mburbea (3/20/2016)


    This was run with default settings on the box, so probably 5 for the cost threshold and whatever the default is for DOP.

    I decided to try to change the number of rows to 100,000 (by just having the TOP changed to rows*100) and run the trial with just delimitedsplit, fn_split (CLR hybrid split approach), and stringsplit. It's still going, but after about half an hour here are the results so far...

    SplitterName       total_d    avg_d              trials
    delimitedsplit8k   709.455    32.2479545454545   22
    fn_split           145.391     6.60868181818182  22
    SqlBaseline        213.14      9.68818181818182  22
    stringsplit        145.662     6.621              22

    Yeah, I'd say that string_split works fairly well.

    I removed the xml splitter for my own sanity as it trails far behind each of the other splitters.

    The reason for asking is that I often find parallel execution messes up total-work-per-operation assessments when the sole measure is timing. Obviously, measuring an operation on an idle test box is a totally different thing from assessing the impact on a busy production system.

    😎

    I bet it does, but I'm just trying to measure how fast each function is, and I can say that stringsplit is fast. I let the test finish; at over 76 minutes, here's the final result...

    SplitterName       total_d     avg_d
    delimitedsplit8k   2433.954    41.2534576271186
    fn_split            329.751     5.589
    SqlBaseline         497.738     8.43623728813559
    stringsplit         366.724     6.21566101694915

    Yes, delimited split was nearly 40 minutes of this contest. The baseline took a little under 9 minutes (it's not a very good baseline :(), and the other two solutions took about 6 minutes apiece.

    It seems that string_split slows as the size of the elements to return increases. It also seems that longer strings are faster in my CLR implementation (particularly the trial with the 3000-3500 character elements).

    Even though I'm pleased to see my fn_split beat the built-in, I'm happy enough with the performance of the built-in to use it and turn off CLR integration. The difference in performance is not usually enough to care about.

    I plotted the data now with pretty pictures...

    You can see that string_split, even if incomplete, is still really, really fast, and is only beaten by an unorthodox CLR-based splitter that avoids many of the common costs associated with it.

  • Sean Lange (3/18/2016)


    Goldie Lesser (3/18/2016)


    Ed Wagner (3/17/2016)


    If it doesn't have a sequence number, I won't be using it. 'Nuff said.

    I use Jeff's function all the time, and I can't even think of a single case where I've used the sequence number.

    Can someone give a case of where the ordinal number is important?

    I want the third item in the list. Using any other splitter you can't do this reliably. Many of them will order the results in the same order as the delimited list, but to do something like getting the third one you would have to wrap the splitter in a CTE and add ROW_NUMBER so you can get to it.

    I have to agree that 99.99% of the time I don't really care about the position but in those times I do I am incredibly glad it is there.

    I look at IP address activity and need to determine the traffic from class C ranges. I take octets < 4 and concatenate them together to grab the class C address. Then there's any split that you're going to crosstab - without the position number, you're dead in the water.

    I'm not even getting to the poor performance. I'm just talking about basic functionality that's missing from the string_split, which was made to replace what I already have. The poor performance is something they should definitely fix, but they should finish it at the same time.
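    To make both points concrete, here's a minimal sketch assuming the DelimitedSplit8K function from the article (the literals and IP value are just examples):

    -- Third element of a delimited list, picked out directly by ItemNumber:
    SELECT s.Item
    FROM dbo.DelimitedSplit8K('10,20,30,40', ',') AS s
    WHERE s.ItemNumber = 3;

    -- Class C portion of an IP address: keep octets 1-3 and stitch them back together.
    DECLARE @IP varchar(15) = '192.168.10.47';
    SELECT ClassC = STUFF(
           (SELECT '.' + s.Item
            FROM dbo.DelimitedSplit8K(@IP, '.') AS s
            WHERE s.ItemNumber < 4
            ORDER BY s.ItemNumber
            FOR XML PATH('')), 1, 1, '');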

  • I 100% agree that without a guarantee of ordering, or an explicit item ordinal the function is not complete.

    However, I want to reiterate that the function does perform very well and beats most conventional CLR functions. Its polymorphic nature is an advantage that no CLR function can really match, as it allows varchar, nvarchar, and LOB types and offers consistently excellent performance.

    It is only barely beaten by my fn_split, which avoids most of the CLR costs but can only operate on varchar(8000) or less.
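    A quick sketch of what that polymorphism means in practice (illustrative queries only, not from the test harness):

    -- The built-in takes varchar, nvarchar, and the MAX types directly...
    SELECT value FROM STRING_SPLIT('a,b,c', ',');        -- varchar input
    SELECT value FROM STRING_SPLIT(N'a,b,c', N',');      -- nvarchar input

    -- ...including LOB input well past the 8K limit of a varchar(8000) wrapper,
    -- which would otherwise force a conversion (and truncation) of the argument.
    DECLARE @big varchar(max) = REPLICATE(CONVERT(varchar(max), 'x,'), 100000);
    SELECT elements = COUNT(*) FROM STRING_SPLIT(@big, ',');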

    String splitting has at least two fundamentally independent components that both require more specific attention than any full split function can provide (hence all the discussions).

    Components

    1. Identifying split locations in a larger string and the subsequent length of substrings.

    2. Cutting out substrings for follow up processing.

    Both these steps can be split further, but hear me out, this is not needed.

    Component 1 is typically done by a find() or regular expression reFind() function that delivers implicit ordinal position, substring-position, and identified substring length (of the matched string, the marker in our case).

    Component 2 is typically done, when required, by explicitly coded follow up code.

    I see no point why T-SQL should be much different.

    When we have a typical find function, plus the position of the following match, we have the first step covered (and can add optional removal of what effectively are empty sub-strings as an option).

    It can be a very efficient function that operates on all types and lengths of strings, even binary for that matter.

    This is so because its output does not contain any variable length stuff!

    For component 2, I propose a new cast_substring function that operates just like a normal cast( X as ... ) around a substring( source, pos, len ), but only takes varchar, nvarchar, varbinary (and their max versions) as input.

    This eliminates overhead of making new costly nvarchar(max) substrings, that end up existing only as input for a cast to integer or varchar(11).

    This construct allows target aware parsing of the original input and simplifies T-SQL code at the same time.

    A full cast could look like:

    select cast_substring( source, pos, len, int )

    or

    select cast_substring( source, pos, len, varchar(11) )

    Instead of cast_substring, convert_substring would work fine too.

    It is with these tools easy to implement any efficient custom splitting function as needed, on top of what gets shipped out of the box.

    It also introduces a find function that is generic and can be used for many other things (a double win in my book).

    Please give me feedback and/or help out getting this on connect!
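    To show roughly what the output of such a find function could look like, here's a sketch of "component 1" done today with a tally-style approach (it assumes the fnTally function linked elsewhere in this thread; it is not the proposed built-in itself, and only fixed-width columns come back):

    DECLARE @s varchar(8000) = 'abc,12345,x,',
            @d char(1)       = ',';

    WITH Delims AS
    (   -- positions of every delimiter in the source string
        SELECT DelimPos = t.N
        FROM dbo.fnTally(1, LEN(@s)) AS t
        WHERE SUBSTRING(@s, t.N, 1) = @d
    ),
    Bounds AS
    (   -- start position of every element: 1, plus one past each delimiter
        SELECT StartPos = 1
        UNION ALL
        SELECT DelimPos + 1 FROM Delims
    )
    SELECT Ordinal  = ROW_NUMBER() OVER (ORDER BY b.StartPos),
           StartPos = b.StartPos,
           Length   = ISNULL(NULLIF(CHARINDEX(@d, @s, b.StartPos), 0), LEN(@s) + 1) - b.StartPos
    FROM Bounds AS b;
    -- "Component 2" would then be, e.g., CAST(SUBSTRING(@s, StartPos, Length) AS int) per
    -- element, which is exactly the step the proposed cast_substring would fold into one call.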

  • mburbea (3/23/2016)


    I 100% agree that without a guarantee of ordering, or an explicit item ordinal the function is not complete.

    However, I want to reiterate that the function does perform very well and beats most conventional CLR functions. Its polymorphic nature is an advantage that no CLR function can really match, as it allows varchar, nvarchar, and LOB types and offers consistently excellent performance.

    It is only barely beaten by my fn_split, which avoids most of the CLR costs but can only operate on varchar(8000) or less.

    Reminder that your fn_split function is actually broken when it comes to trailing delimiters.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • peter-757102 (3/24/2016)


    Out of pure frustration I will comment my idea here as the connect site is completely and utterly broken for me (can't post or comment even after signing in).

    Hopefully someone here can forward my idea, preferably in a new topic.

    String splitting has at least two fundamentally independent components that both require more specific attention than any full split function can provide (hence all the discussions).

    Components

    1. Identifying split locations in a larger string and the subsequent length of substrings.

    2. Cutting out substrings for follow up processing.

    Both these steps can be split further, but hear me out, this is not needed.

    Component 1 is typically done by a find() or regular expression reFind() function that delivers implicit ordinal position, substring-position, and identified substring length (of the matched string, the marker in our case).

    Component 2 is typically done, when required, by explicitly coded follow up code.

    I see no point why T-SQL should be much different.

    When we have a typical find function, plus the position of the following match, we have the first step covered (and can add optional removal of what effectively are empty sub-strings as an option).

    It can be a very efficient function that operates on all types and lengths of strings, even binary for that matter.

    This is so because its output does not contain any variable length stuff!

    For component 2, I propose a new cast_substring function that operates just like a normal cast( X as ... ) around a substring( source, pos, len ), but only takes varchar, nvarchar, varbinary (and their max versions) as input.

    This eliminates overhead of making new costly nvarchar(max) substrings, that end up existing only as input for a cast to integer or varchar(11).

    This construct allows target aware parsing of the original input and simplifies T-SQL code at the same time.

    A full cast could look like:

    select cast_substring( source, pos, len, int )

    or

    select cast_substring( source, pos, len, varchar(11) )

    Instead of cast_substring, convert_substring would work fine too.

    It is with these tools easy to implement any efficient custom splitting function as needed, on top of what gets shipped out of the box.

    It also introduces a find function that is generic and can be used for many other things (a double win in my book).

    Please give me feedback and/or help out getting this on connect!

    Just my thoughts... LEN is easily calculated outside of the function and can be done only when needed. Just like it's normally a good idea to avoid SELECT *, I'd personally prefer to not have anything extra returned because it does require extra data movement.

    As for the rest of the stuff, I think it should be kept uber simple, even behind the scenes. I can speak for no one else, but I use DelimitedSplit8K for a whole lot. Having this new splitter be so much faster than even a CLR is absolutely wonderful, and I wouldn't want to see it slowed down even by 1%, even if it can't currently handle VARBINARY.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • @jeff Moden,

    These are two new functions I'd like to see, with a broader scope than just string splitting (but they are suited for it).

    If I had these I would not even care how limited string_split() turns out to be, as long as it is efficient at what it does.

    For general use, string_split looks inadequate to me, like every construct that has to be surrounded by a lot of code to make it do what is actually needed. Making a fast function that does little is easy. Speed is easily lost when follow-up code is needed, precisely because the output of the function is an intermediate result or the input is too limited and pre-processing is required.

    As for extra fixed-length columns causing extra data movement:

    I believe that has virtually no impact when streaming rows.

    Whatever column is not consumed upon use is discarded after every row iteration.

    Missing the extra column, on the other hand, causes extra processing and storage; it is not even a fair fight in the slightest.
