Tally OH! An Improved SQL 8K “CSV Splitter” Function

  • Anipaul (5/3/2011)


    As usual, you have written another excellent article Jeff. It was great one.:-)

    Thanks, Anipaul. Good to "see" you again. 🙂

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • adnan.korkmaz (5/3/2011)


    Great job! Thank you all those who have written, contributed or inspired Jeff.

    I have done some tests on real data before using this new one, executing both under same parameters for 10.000 times and logging the times.

    The previous splitter function we used took 1084, 1986 and 633 milliseconds (average, max and min) on a given parameter. This new one takes 286, 416 and 176 milliseconds (again avg, max, min).

    It's around 4 times faster in my scenario.

    So, just congrats and thanks!

    Very cool feedback, Adnan. Is there a chance of you posting your previous splitter function so we can take a gander at the differences the two may have?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • srikant maurya (5/3/2011)


    Great Job,

    I will use these code for future use.

    Thanks, Srikant. I appreciate the feedback.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • ALZDBA (5/3/2011)


    It's a pity some obscure character caught the attention of CNN, Bloomberg, Al Jazera, ... the day your great article got published.

    For sure it would have been a headliner !

    Great code, great benchmark info,

    Great job Jeff

    Thanks again for sharing the knowledge.

    Thanks, Johan. Always good to hear from you and thanks for the laugh. 🙂

    As a side bar, I've submitted one abstract for PASS so far and will submit another soon. I can't afford to go unless I present because I have to pay for everything out of my own pocket. If I get selected, it would be a real pleasure to meet with you again. Do you know if you're going?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (5/3/2011)


    adnan.korkmaz (5/3/2011)


    Great job! Thank you all those who have written, contributed or inspired Jeff.

    I have done some tests on real data before using this new one, executing both under same parameters for 10.000 times and logging the times.

    The previous splitter function we used took 1084, 1986 and 633 milliseconds (average, max and min) on a given parameter. This new one takes 286, 416 and 176 milliseconds (again avg, max, min).

    It's around 4 times faster in my scenario.

    So, just congrats and thanks!

    Very cool feedback, Adnan. Is there a chance of you posting your previous splitter function so we can take a gander at the differences the two may have?

    Sure, here it is:

    ALTER FUNCTION [dbo].[FastSplit] (@s varchar(512))

    RETURNS table

    AS

    RETURN (

    WITH Pieces(pn, start, stop) AS (

    SELECT 1, 1, CHARINDEX(',', @S)

    UNION ALL

    SELECT pn + 1, stop + 1, CHARINDEX(',', @S, stop + 1)

    FROM Pieces

    WHERE stop > 0

    )

    SELECT pn,

    SUBSTRING(@s, start, CASE WHEN stop > 0 THEN stop-start ELSE 512 END) AS ID

    FROM Pieces

    )

  • Except for the way they handled the final element, that's almost identical to the rCTE code I have in the test harness I included with the article. Thanks for posting it, Adnan. 🙂

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (5/3/2011)


    ALZDBA (5/3/2011)


    It's a pity some obscure character caught the attention of CNN, Bloomberg, Al Jazera, ... the day your great article got published.

    For sure it would have been a headliner !

    Great code, great benchmark info,

    Great job Jeff

    Thanks again for sharing the knowledge.

    Thanks, Johan. Always good to hear from you and thanks for the laugh. 🙂

    As a side bar, I've submitted one abstract for PASS so far and will submit another soon. I can't afford to go unless I present because I have to pay for everything out of my own pocket. If I get selected, it would be a real pleasure to meet with you again. Do you know if you're going?

    Probably I'll apply for travel approval by the end of this month because SQLPass will have published an agenda by then on which I can build my POI that can serve our needs.

    I'll keep you updated.

    Like in the previous years, as soon as I get my approval it will show up in my forum signature 😎

    Fingers crossed :Whistling:

    Johan

    Learn to play, play to learn !

    Dont drive faster than your guardian angel can fly ...
    but keeping both feet on the ground wont get you anywhere :w00t:

    - How to post Performance Problems
    - How to post data/code to get the best help[/url]

    - How to prevent a sore throat after hours of presenting ppt

    press F1 for solution, press shift+F1 for urgent solution 😀

    Need a bit of Powershell? How about this

    Who am I ? Sometimes this is me but most of the time this is me

  • As usual, great post Jeff. Five star article.. 🙂

    -Roy

  • Jeff, please please please write a book, without changing your style. Give us all an opportunity to repay you for what you've done for us.


    [font="Arial"]Low-hanging fruit picker and defender of the moggies[/font]

    For better assistance in answering your questions, please read this[/url].


    Understanding and using APPLY, (I)[/url] and (II)[/url] Paul White[/url]

    Hidden RBAR: Triangular Joins[/url] / The "Numbers" or "Tally" Table: What it is and how it replaces a loop[/url] Jeff Moden[/url]

  • Jeff

    Thank you for your patience. I must not be asking my question in the right way. Sorry. I'll try one last time.

    There are two parts to the parsing:

    * finding delimiter positions

    * parsing the string

    When I do performance comparisons, I time the individual components. I wanted to know the performance difference between using the CharIndex() in a loop and the Tally Table to populate a temp table of delimiter positions.

    Although we would both use an "inchworm" method to parse the string, I was curious about the performance of a cursor on the delimiter table, compared to the NullIf(IsNull()).

    As I stated earlier, traditional performance tuning primarily involves simplification, reuse, and pre-calculation. If that methodology and approach is not valid in the TSQL environment, I'll gladly get a new toolbox.

  • mark hutchinson (5/3/2011)


    Jeff

    Thank you for your patience. I must not be asking my question in the right way. Sorry. I'll try one last time.

    There are two parts to the parsing:

    * finding delimiter positions

    * parsing the string

    When I do performance comparisons, I time the individual components. I wanted to know the performance difference between using the CharIndex() in a loop and the Tally Table to populate a temp table of delimiter positions.

    Although we would both use an "inchworm" method to parse the string, I was curious about the performance of a cursor on the delimiter table, compared to the NullIf(IsNull()).

    As I stated earlier, traditional performance tuning primarily involves simplification, reuse, and pre-calculation. If that methodology and approach is not valid in the TSQL environment, I'll gladly get a new toolbox.

    The biggest reason for not using a while loop or cursor to step through each position of a string looking for the specified delimiter is that you cannot use those in an inline TVF (iTVF). You would have to use a multiline TVF (mTVF) and those are slower.

  • Lynn

    Thanks. Then this is an environmental property of 'compiled' TSQL of which I was not aware.

  • Jeff Moden (5/2/2011)


    gary.rumble (5/2/2011)


    Well, so far I got:

    ...

    I think your code is too much for my server. 😉

    Yowch. The code is even split into batches. What you may have to do is run the code a section at a time up to where the test loop begins and the let the test loop rip.

    Thanks for trying, Gary. If for some reason, your server just won't take it, let me know and we'll test your code for you.

    I set up a new SS instance on my laptop and got the tests to run. Looks like my code tracks your results fairly closely, but of course it doesn't create the numbers table on the fly so it probably has an advantage there.

    I emailed you the results. I couldn't generate your pretty graphs, though.

    Thanks for the article.

    14090 SW TENNESSEE LN

  • Thanks Jeff - great stuff

    Jason...AKA CirqueDeSQLeil
    _______________________________________________
    I have given a name to my pain...MCM SQL Server, MVP
    SQL RNNR
    Posting Performance Based Questions - Gail Shaw[/url]
    Learn Extended Events

  • Jeff Moden (5/2/2011)


    Well, that makes 3 of us. We have a quorum! 😛 Phil Factor wrote an article fairly recently on all the XML, JSON, and other gobilty-gook and suggested that someone should come up with a better way. I was considering writing an article about the all-too-"secret" characters 28-31 and a couple of other goodies in "control character land". Whatcha tink? Worthwhile or not?

    I haven't run it but your real Tally Table splitter looks right. You picked up on everything including the use of the "TOP".

    Change your test harness script to use CHAR(30) instead of ',' and start leading by example :); examples are what will help with the control characters. Peoplesoft names, valid uses of tabs in text (formatted text blurbs, for example), valid uses of pipes in VARCHAR fields (program documentation, storing REGEX expressions in a database, secure usernames with special characters allowed), etc.

    Also, an easy way to get SQL Server to output SELECT results to a special character delimited VARCHAR() table, and directly to a text file would be very worthwhile for the article. The main uses I see are getting data to/from arbitrary SELECT results into a string with one or two delimiters/dimensions, and the same thing to/from a text file.

    As far as my previous single CTE real Tally Table (0 based) splitter, please ignore it. It takes too long on higher numbers of elements, much like the original tally table splitter's problem. Instead, try one or both of the two double CTE versions below:

    CREATE FUNCTION [dbo].[YourFunctionUnion](@pString [varchar](8000), @pDelimiter [char](1))

    RETURNS TABLE

    RETURN

    WITH

    cteTally(N) AS (--==== This provides the "zero base" and limits the number of rows right up front

    -- for both a performance gain and prevention of accidental "overruns"

    SELECT 0 UNION ALL

    SELECT TOP (DATALENGTH(ISNULL(@pString,1))) N FROM YourDB.dbo.YourTallyTable1Based WITH (nolock) -- the 1 based + Union All is almost identical in performance to a 0 based TOP (Datalength()+1)

    ),

    cteStart(N1) AS (--==== This returns N+1 (starting position of each "element" just once for each delimiter)

    SELECT t.N+1

    FROM cteTally t

    WHERE (SUBSTRING(@pString,t.N,1) = @pDelimiter OR t.N = 0)

    )

    --===== Do the actual split. The ISNULL/NULLIF combo handles the length for the final element when no delimiter is found.

    SELECT ItemNumber = ROW_NUMBER() OVER(ORDER BY s.N1), -- not including StringNumber actually increases duration and CPU, but uses slightly fewer writes

    Item = SUBSTRING(@pString,s.N1,ISNULL(NULLIF(CHARINDEX(@pDelimiter,@pString,s.N1),0)-s.N1,8000))

    FROM cteStart s WITH (nolock)

    ;

    CREATE FUNCTION [dbo].[YourFunction0Based](@pString [varchar](8000), @pDelimiter [char](1))

    RETURNS TABLE

    RETURN

    WITH

    cteTally(N) AS (--==== This limits the number of rows right up front

    -- for both a performance gain and prevention of accidental "overruns"

    SELECT TOP (DATALENGTH(ISNULL(@pString,1))+1) N FROM YourDB.dbo.YourTallyTable0Based WITH (nolock) -- the 1 based + Union All is almost identical in performance to a 0 based TOP (Datalength()+1)

    ),

    cteStart(N1) AS (--==== This returns N+1 (starting position of each "element" just once for each delimiter)

    SELECT t.N+1

    FROM cteTally t

    WHERE (SUBSTRING(@pString,t.N,1) = @pDelimiter OR t.N = 0)

    )

    --===== Do the actual split. The ISNULL/NULLIF combo handles the length for the final element when no delimiter is found.

    SELECT ItemNumber = ROW_NUMBER() OVER(ORDER BY s.N1), -- not including StringNumber actually increases duration and CPU, but uses slightly fewer writes

    Item = SUBSTRING(@pString,s.N1,ISNULL(NULLIF(CHARINDEX(@pDelimiter,@pString,s.N1),0)-s.N1,8000))

    FROM cteStart s WITH (nolock)

    ;

    The UNION version is up to about 5% faster on small but reasonably useful numbers of elements and uses the "standard" 1 based tally table, and the 0 based version is up to about 5% faster on medium to large numbers of elements, and uses a special tally table that starts at 0. Both can be slightly slower than your CTE Tally example on very small numbers of elements/string sizes, depending on some combination of SQL version and platform parameters.

Viewing 15 posts - 76 through 90 (of 990 total)

You must be logged in to reply to this topic. Login to reply