Tally OH! An Improved SQL 8K “CSV Splitter” Function

  • UMG Developer (5/2/2011)


    Thanks for the excellent article that was both entertaining and instructional. I will be making use of your new splitter soon!

    Thanks, UMG. Like I said to someone else, usage is the highest form of flattery. 🙂 Thanks for the kind words.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • gary.rumble (5/2/2011)


    I'd be interested in seeing the data you used for benchmarking.

    It's in the third paragraph of the Prologue, Gary.

    Shifting gears, there are 3 "obviously named" attachments to this article. [font="Arial Black"]One is a fully documented stand-alone test script[/font], one is a 5 sheet Excel spreadsheet with a performance chart on each sheet, and one is a zipped file which contains copies of the new VARCHAR(8000) and NVARCHAR(4000) splitters that I think you'll like.

    I tried a method that used a cte to generate a table of offsets for the delimiters. Then I joined that on itself to find the starts and ends and used that to get the substrings from the original string.

    I was curious how that would perform compared to the other methods.

    Now that you know where the test code is, give it a try a let us know. 🙂

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff,

    Glad I brought a wide smile or laugh to your face/heart. But I was serious. I'm used to minimizing operations in the normal course of performance-tuning. If this minimized-operation approach performs worse than the tally table approach, I'd really like to know so I might adjust my thinking in the T-SQL environment.

    Mark

  • tfifield (5/2/2011)


    Jeff,

    Brilliant piece of work! I did something a bit similar but cheated by always appending the last delimiter to the string in the function call. I always felt that it was a hack but I was under some pressure to get the project done. Yours is a much cleaner and way more versatile approach.

    The article is so well written that I was coming up with the ISNULL/NULLIF right along with you - a testament to the logical progression of the thing.

    The switch to always getting N+1 and starting with zero was also brilliant.

    By the way, I always referred to the 'Nibbler' approach as a 'Shuffler' approach. I think I like 'Nibbler' a bit better.

    Todd Fifield

    Awesome feedback, Todd. :blush: Glad you liked it. BTW... the dust bunnies are tickled that you like the "Nibbler" name... it was their idea. 😀

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Nadrek (5/2/2011)


    Jeff Moden (5/2/2011)


    ...

    I also usually prefer TAB delimited or, even better, ASCII characters 28 through 31 (yep, there are characters designed for this type of stuff but no one uses them anymore).

    ...

    Thank you for the excellent reminder; I looked at my ASCII chart for a refresher:

    CHAR(28) --File Separator

    CHAR(29) --Group Separator

    CHAR(30) --Record Separator

    CHAR(31) --Unit Separator

    I move we start trying to use these separator characters again! 30 and 31 are particularly appropriate for the usual two-dimensional split into table techniques.

    On another note, which tricks did I miss for a permanent, 0-based tally table adaptation?

    CREATE FUNCTION [dbo].[YourFunction](@pString [varchar](8000), @pDelimiter [char](1))

    RETURNS TABLE

    RETURN

    WITH cteStart(N1) AS (--==== This limits the number of rows right up front

    -- for both a performance gain and prevention of accidental "overruns"

    SELECT TOP ((DATALENGTH(ISNULL(@pString,1)))+1) N+1

    FROM YourDatabase.dbo.YourZeroBasedTallyTable t

    WHERE (SUBSTRING(@pString,t.N,1) = @pDelimiter OR t.N = 0)

    ORDER BY N

    )

    --===== Do the actual split. The ISNULL/NULLIF combo handles the length for the final element when no delimiter is found.

    SELECT Item = SUBSTRING(@pString,s.N1,ISNULL(NULLIF(CHARINDEX(@pDelimiter,@pString,s.N1),0)-s.N1,8000))

    FROM cteStart s

    ;

    And Record Separator test:

    select * from YourFunction('first'+CHAR(30)+'second'+CHAR(30)+REPLICATE('last',60),CHAR(30));

    And thank you for giving me a solid reason to finally have a good old-fashioned 0-based set as well.

    Well, that makes 3 of us. We have a quorum! 😛 Phil Factor wrote an article fairly recently on all the XML, JSON, and other gobilty-gook and suggested that someone should come up with a better way. I was considering writing an article about the all-too-"secret" characters 28-31 and a couple of other goodies in "control character land". Whatcha tink? Worthwhile or not?

    I haven't run it but your real Tally Table splitter looks right. You picked up on everything including the use of the "TOP".

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Well, so far I got:

    (2 row(s) affected)

    Msg 8623, Level 16, State 1, Line 3

    The query processor ran out of internal resources and could not produce a query plan. This is a rare event and only expected for extremely complex queries or queries that reference a very large number of tables or partitions. Please simplify the query. If you believe you have received this message in error, contact Customer Support Services for more information.

    Msg 3701, Level 11, State 5, Line 4

    Cannot drop the table 'dbo.Csv8K', because it does not exist or you do not have permission.

    I think your code is too much for my server. 😉

    14090 SW TENNESSEE LN

  • UMG Developer (5/2/2011)


    mtassin (5/2/2011)


    On another note though, I notice something about the final graph of performance around the 270 mark on the X axis. The pure SQL set based method seems to continue on a straight line path, while the CLR methods seems to be starting to curve upwards similarly to what Jeff was complaining about the original Tally splitter doing, just later in the X series than the original Tally did.

    I too noticed that, and would be interested in seeing the graph compared a few little further out to see if the CLR solution continued it surge to the top.

    I think the darned thing is running so fast that just about any blip on the CPU will cause a delay. BUT, you can bet I'll test it because that CLR will also be used in the split "MAX" article when I can finally get to it. There was also a downward trend that's real hard to see early in the game on that chart. It might be why the final point looks like an upward trend with line smoothing turned on.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • gary.rumble (5/2/2011)


    Well, so far I got:

    ...

    I think your code is too much for my server. 😉

    Yowch. The code is even split into batches. What you may have to do is run the code a section at a time up to where the test loop begins and the let the test loop rip.

    Thanks for trying, Gary. If for some reason, your server just won't take it, let me know and we'll test your code for you.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • mark hutchinson (5/2/2011)


    Jeff,

    Glad I brought a wide smile or laugh to your face/heart. But I was serious. I'm used to minimizing operations in the normal course of performance-tuning. If this minimized-operation approach performs worse than the tally table approach, I'd really like to know so I might adjust my thinking in the T-SQL environment.

    Mark

    I really did think you were simply being ironic but wasn't sure.

    I can't imagine that a cursor stepping through the individual characters to find a delimiter would be faster but that might be a box that a lot of us are stuck in that we should try to get out of. In fact, that's the way the CLR was written. Do you have some T-SQL code for what you propose?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff

    I think you already identified the basic code in your articles. Since you have a test bed for performance comparison, I thought you might have tested a loop to populate a temp table with CharIndex() values of the delimiter positions as opposed to the Substring() operation for every character in the string.

    Once you have the delimiter positions in your temp table, use a cursor to iterate through the delimiter positions, keeping track of the prior position in a local variable (initialized to zero, perhaps).

    In answer to your question...No. I don't have any code to offer. I just thought that you were comparing your Tally Table method against a less-than-optimized looping method.

    I'm even interested in a performance comparison between a CharIndex() loop to find delimiter positions and your Tally Table method. The Tally Table method invokes a SubString() function for every character to get the same result. The optimized OH method invokes a CharIndex(), to return the next delimiter position, in a similar manner that I suggest for the initial delimiter position calculation.

    I've been around relational databases for a long time (mid 80s), but my DB duties are much less interesting than this topic. I hope to learn something from you in this post-article discussion.

  • As usual, you have written another excellent article Jeff. It was great one.:-)

  • It's a pity some obscure character caught the attention of CNN, Bloomberg, Al Jazera, ... the day your great article got published.

    For sure it would have been a headliner !

    Great code, great benchmark info,

    Great job Jeff

    Thanks again for sharing the knowledge.

    Johan

    Learn to play, play to learn !

    Dont drive faster than your guardian angel can fly ...
    but keeping both feet on the ground wont get you anywhere :w00t:

    - How to post Performance Problems
    - How to post data/code to get the best help[/url]

    - How to prevent a sore throat after hours of presenting ppt

    press F1 for solution, press shift+F1 for urgent solution 😀

    Need a bit of Powershell? How about this

    Who am I ? Sometimes this is me but most of the time this is me

  • Great job! Thank you all those who have written, contributed or inspired Jeff.

    I have done some tests on real data before using this new one, executing both under same parameters for 10.000 times and logging the times.

    The previous splitter function we used took 1084, 1986 and 633 milliseconds (average, max and min) on a given parameter. This new one takes 286, 416 and 176 milliseconds (again avg, max, min).

    It's around 4 times faster in my scenario.

    So, just congrats and thanks!

  • Great Job,

    I will use these code for future use.

  • mark hutchinson (5/2/2011)


    Jeff

    I think you already identified the basic code in your articles. Since you have a test bed for performance comparison, I thought you might have tested a loop to populate a temp table with CharIndex() values of the delimiter positions as opposed to the Substring() operation for every character in the string.

    Once you have the delimiter positions in your temp table, use a cursor to iterate through the delimiter positions, keeping track of the prior position in a local variable (initialized to zero, perhaps).

    In answer to your question...No. I don't have any code to offer. I just thought that you were comparing your Tally Table method against a less-than-optimized looping method.

    I'm even interested in a performance comparison between a CharIndex() loop to find delimiter positions and your Tally Table method. The Tally Table method invokes a SubString() function for every character to get the same result. The optimized OH method invokes a CharIndex(), to return the next delimiter position, in a similar manner that I suggest for the initial delimiter position calculation.

    I've been around relational databases for a long time (mid 80s), but my DB duties are much less interesting than this topic. I hope to learn something from you in this post-article discussion.

    The "Inchworm" While Loop method I tested uses the optimization you speak of.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 15 posts - 61 through 75 (of 990 total)

You must be logged in to reply to this topic. Login to reply