Tally OH! An Improved SQL 8K “CSV Splitter” Function

  • WayneS (5/4/2011)


    Jeff Moden (5/3/2011)


    What I find really strange is why do people wait until I write an article to bring these wonders forward? 😛

    To go along with this... where were they the past year that we've been posting the "old" DelimitedSplit8K code for solutions? Jeez, it sure would have been nice to have seen this back then!

    You guys say binkie and I have visions of this:

    I would really like to see Jeff or Wayne wrapped up in one of these.

    Jason...AKA CirqueDeSQLeil
    _______________________________________________
    I have given a name to my pain...MCM SQL Server, MVP
    SQL RNNR
    Posting Performance Based Questions - Gail Shaw[/url]
    Learn Extended Events

  • WayneS (5/4/2011)


    Jeff Moden (5/4/2011)


    Nicely done, Nadrek!

    Yes indeed, very nice!

    The cool part about all of this is if you run into a DBA that won't allow auxiliary tables such as a Tally Table, the cteTally method isn't that far off.

    I think that any of these methods would be very good ones to have.

    How did Peso's splitter compare? He usually has some wickedly fast routines...

    I must agree - I think they are all awesome. Thanks to all who contributed good fast code.

    Jason...AKA CirqueDeSQLeil
    _______________________________________________
    I have given a name to my pain...MCM SQL Server, MVP
    SQL RNNR
    Posting Performance Based Questions - Gail Shaw[/url]
    Learn Extended Events

  • Code for the tests that produced the data for the above charts is attached to this post.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • WayneS (5/4/2011)


    How did Peso's splitter compare? He usually has some wickedly fast routines...

    Dang it! I knew I was forgetting something! I'll be back.

    Oh... and the dust bunnies said they made a new batch and it's in the freezer as we speak. Gotta love courteous dust bunnies. 😀

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (5/4/2011)


    Oh... and the dust bunnies said they made a new batch and it's in the freezer as we speak. Gotta love courteous dust bunnies. 😀

    They said that, did they? Oh Jeff... I've got this nice tight jacket for you to slip into... why yes, it does go on backwards.... oh, don't worry about those straps....

    Wayne
    Microsoft Certified Master: SQL Server 2008
    Author - SQL Server T-SQL Recipes


    If you can't explain to another person how the code that you're copying from the internet works, then DON'T USE IT on a production system! After all, you will be the one supporting it!
    Links:
    For better assistance in answering your questions
    Performance Problems
    Common date/time routines
    Understanding and Using APPLY Part 1 & Part 2

  • whoops...

    Wayne
    Microsoft Certified Master: SQL Server 2008
    Author - SQL Server T-SQL Recipes


    If you can't explain to another person how the code that you're copying from the internet works, then DON'T USE IT on a production system! After all, you will be the one supporting it!
    Links:
    For better assistance in answering your questions
    Performance Problems
    Common date/time routines
    Understanding and Using APPLY Part 1 & Part 2

  • WayneS (5/4/2011)


    Jeff Moden (5/4/2011)


    Oh... and the dust bunnies said they made a new batch and it's in the freezer as we speak. Gotta love courteous dust bunnies. 😀

    They said that, did they? Oh Jeff... I've got this nice tight jacket for you to slip into... why yes, it does go on backwards.... oh, don't worry about those straps....

    It's ok. I won't need my hands so long as that pink toy that Jason showed a picture of is beer flavored. 😛

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Agh... I ran into a glitch with Peter's code and I sent it back to him. No more tests tonight.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (5/4/2011)


    ...

    I guess I'm going blind or slowly insane. I didn't look that closely at Nadrek's functions before. The only mods he made were on the Tally Table source...

    ...

    Correct; the two fast ones are merely different ways of using permanent Tally Tables instead of the multi-CTE temporary.

    Thank you for doing the graphs; It's very interesting to see that the Union based version is only (slightly) faster than the 0 based version for a small segment at fairly low sizes; for even lower sizes, the 0 based is faster, and for any medium or higher sizes, the 0 based is nontrivially faster.

    And you're correct; when you don't have a permanent tally table available, the internal multi-CTE version is less than 16% off.

    I would expect the permanent tally table lead to increase significantly at VARCHAR(MAX) string sizes, though there have been many surprises on the SQL based string splitting front so far.

    P.S. The only significant derivation I did was to take it down to only one CTE total; for reasons yet to be determined, that was horrifically slow.

  • Jeff Moden (5/4/2011)


    gary.rumble (5/3/2011)


    Jeff Moden (5/3/2011)


    gary.rumble (5/3/2011)


    Jeff Moden (5/2/2011)


    gary.rumble (5/2/2011)


    Well, so far I got:

    ...

    I think your code is too much for my server. 😉

    Yowch. The code is even split into batches. What you may have to do is run the code a section at a time up to where the test loop begins and the let the test loop rip.

    Thanks for trying, Gary. If for some reason, your server just won't take it, let me know and we'll test your code for you.

    I set up a new SS instance on my laptop and got the tests to run. Looks like my code tracks your results fairly closely, but of course it doesn't create the numbers table on the fly so it probably has an advantage there.

    I emailed you the results. I couldn't generate your pretty graphs, though.

    Thanks for the article.

    No problem, Gary. You obviously had to convert your code to a function to run it through my test harness. Could you post your function, please. Thanks.

    Gary... sorry, my friend. I'm not sure how you got the results you sent me because I had to stop your code because it only made it to the 1-10 element size with only 512 elements on each row. Your code will be in the attachment when I publish my race results for Nadrek's code but the actual test is commented out. You'll need to send me your actual test and setup for the Tally Table so I can see if I'm doing something wrong.

    Apparently I created cold fusion. I wasn't able to reproduce those results either. That's what I get for wiring up a test harness at 2:00 in the morning, I guess. Mea Culpa.

    It looks like it doesn't even perform as well as the slowest of the original examples. I must have completely short circuited the code somehow.

    My sincere apologies for wasting your time.

    14090 SW TENNESSEE LN

  • gary.rumble (5/5/2011)


    My sincere apologies for wasting your time.

    I appreciate your humility, but things like that just can't be classified as a waste of time, Gary. What if your code smoked even a CLR? You just don't know for certain until it's been tested and verified by a 3rd party. Thanks for letting me be the 3rd party. 🙂

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • SQLkiwi

    Does the stringbuilder class also suffer from the same performance 'hit' as the string object? (in this usage context)

  • mark hutchinson (5/5/2011)


    Does the stringbuilder class also suffer from the same performance 'hit' as the string object? (in this usage context)

    Well not exactly in the current context (StringBuilder builds strings, and we want to split them).

    Nevertheless, the general comparison is usually between StringBuilder and concatenation. The answer is, naturally, 'it depends'. To over-simplify, if you do it right, and end up using the builder enough times, it is usually faster and more efficient than repeated concatenation.

    This is one of those topics (like string splitting) which has a habit of polarizing opinion and descending into argument, so you'll find many contrasting views out there. Here's just one link, use your favourite search engine to find others:

    http://blogs.msdn.com/b/rahulso/archive/2006/08/29/string-concatenation-vs-string-builder-_2d00_-the-performance-hit_2100_-see-it-to-believe-it-_3a00_o_2900_.aspx

    Paul

  • Hi Jeff,

    I finaly got a quick look at this work of you and other subsequent posts and it looks like great work.

    I did notice someting that striked me as odd tho, it just does not feel "right".

    It is about this bit of code:

    DATALENGTH(ISNULL(@pString,1))

    I read this as: if string is null, then convert 1 to a varchar '1', and after that a datalength is taken of the resulting varchar which then is indeed 1. I expected more along the lines of:

    ISNULL(DATALENGTH(@pString),0)

    Which is not the same, I must have missed why with a NULL as input, there is need to perform a top 1 instead of a top 0 and return (0,1) in the tally instead of only (0).

    I be looking at things a bit more later.

  • Nadrek (5/3/2011)


    Jeff Moden (5/2/2011)


    Well, that makes 3 of us. We have a quorum! 😛 Phil Factor wrote an article fairly recently on all the XML, JSON, and other gobilty-gook and suggested that someone should come up with a better way. I was considering writing an article about the all-too-"secret" characters 28-31 and a couple of other goodies in "control character land". Whatcha tink? Worthwhile or not?

    I haven't run it but your real Tally Table splitter looks right. You picked up on everything including the use of the "TOP".

    Change your test harness script to use CHAR(30) instead of ',' and start leading by example :); examples are what will help with the control characters. Peoplesoft names, valid uses of tabs in text (formatted text blurbs, for example), valid uses of pipes in VARCHAR fields (program documentation, storing REGEX expressions in a database, secure usernames with special characters allowed), etc.

    Also, an easy way to get SQL Server to output SELECT results to a special character delimited VARCHAR() table, and directly to a text file would be very worthwhile for the article. The main uses I see are getting data to/from arbitrary SELECT results into a string with one or two delimiters/dimensions, and the same thing to/from a text file.

    As far as my previous single CTE real Tally Table (0 based) splitter, please ignore it. It takes too long on higher numbers of elements, much like the original tally table splitter's problem. Instead, try one or both of the two double CTE versions below:

    CREATE FUNCTION [dbo].[YourFunctionUnion](@pString [varchar](8000), @pDelimiter [char](1))

    RETURNS TABLE

    RETURN

    WITH

    cteTally(N) AS (--==== This provides the "zero base" and limits the number of rows right up front

    -- for both a performance gain and prevention of accidental "overruns"

    SELECT 0 UNION ALL

    SELECT TOP (DATALENGTH(ISNULL(@pString,1))) N FROM YourDB.dbo.YourTallyTable1Based WITH (nolock) -- the 1 based + Union All is almost identical in performance to a 0 based TOP (Datalength()+1)

    ),

    cteStart(N1) AS (--==== This returns N+1 (starting position of each "element" just once for each delimiter)

    SELECT t.N+1

    FROM cteTally t

    WHERE (SUBSTRING(@pString,t.N,1) = @pDelimiter OR t.N = 0)

    )

    --===== Do the actual split. The ISNULL/NULLIF combo handles the length for the final element when no delimiter is found.

    SELECT ItemNumber = ROW_NUMBER() OVER(ORDER BY s.N1), -- not including StringNumber actually increases duration and CPU, but uses slightly fewer writes

    Item = SUBSTRING(@pString,s.N1,ISNULL(NULLIF(CHARINDEX(@pDelimiter,@pString,s.N1),0)-s.N1,8000))

    FROM cteStart s WITH (nolock)

    ;

    CREATE FUNCTION [dbo].[YourFunction0Based](@pString [varchar](8000), @pDelimiter [char](1))

    RETURNS TABLE

    RETURN

    WITH

    cteTally(N) AS (--==== This limits the number of rows right up front

    -- for both a performance gain and prevention of accidental "overruns"

    SELECT TOP (DATALENGTH(ISNULL(@pString,1))+1) N FROM YourDB.dbo.YourTallyTable0Based WITH (nolock) -- the 1 based + Union All is almost identical in performance to a 0 based TOP (Datalength()+1)

    ),

    cteStart(N1) AS (--==== This returns N+1 (starting position of each "element" just once for each delimiter)

    SELECT t.N+1

    FROM cteTally t

    WHERE (SUBSTRING(@pString,t.N,1) = @pDelimiter OR t.N = 0)

    )

    --===== Do the actual split. The ISNULL/NULLIF combo handles the length for the final element when no delimiter is found.

    SELECT ItemNumber = ROW_NUMBER() OVER(ORDER BY s.N1), -- not including StringNumber actually increases duration and CPU, but uses slightly fewer writes

    Item = SUBSTRING(@pString,s.N1,ISNULL(NULLIF(CHARINDEX(@pDelimiter,@pString,s.N1),0)-s.N1,8000))

    FROM cteStart s WITH (nolock)

    ;

    The UNION version is up to about 5% faster on small but reasonably useful numbers of elements and uses the "standard" 1 based tally table, and the 0 based version is up to about 5% faster on medium to large numbers of elements, and uses a special tally table that starts at 0. Both can be slightly slower than your CTE Tally example on very small numbers of elements/string sizes, depending on some combination of SQL version and platform parameters.

    These functions seem to be missing an order by in their respective cteTally. I suspect it makes no difference performance wise as its scanning a clustered index anyway, but to guarantee correct numbers being returned the order by is required!

Viewing 15 posts - 166 through 180 (of 990 total)

You must be logged in to reply to this topic. Login to reply