Tally OH! An Improved SQL 8K “CSV Splitter” Function

  • Lynn Pettis (12/30/2012)


    John Hardin (12/30/2012)


    L' Eomot Inversé (12/30/2012)


    Jeff Moden (12/28/2012)


    If you look at the execution plan, it doesn't actually use the "same list". CTEs have some wonderful advantages but they also have a hidden fault. Instead of using the same internal result set from a CTE when things like self joins are used, the entire CTE is re-executed.

    Now, if only one could force the optimiser to spool the CTE and reuse it, CTEs would be far more useful.

    Don't become so blinded by CTEs that you forget about temporary tables or table-type variables; another problem with CTEs is they are not indexed.

    On page 27 I posted a suggested version of this that avoided CHARINDEX() and used an indexed table-type variable and it showed performance gains over the pure-CTE version, but I don't think anyone has seen fit to comment on that suggested version... 🙁

    Page 27 passed on how many posts per page? I ask as I display 50 posts per page. It would help if you posted the url of your post so we could go directly to that post.

    D'oh! I didn't consider non-default posts-per-page settings. Sorry. Here's the direct link: http://www.sqlservercentral.com/Forums/FindPost1129283.aspx

    The second code block is what I am referring to.

  • L' Eomot Inversé (12/30/2012)


    Why can't the optimiser look and see if spooling would enhance performance (only in the cases where it make no semantic difference, of course) and use it when it does? Or if that's too difficult a task for the optimiser (it shouldn't be, but they haven't done it so I can imagine them claiming it is) why can't we have a query hint that tells them to do it? I hope Paul or someone will jump in and explain all this - maybe tell me I've got it all wrong, but that's OK too, I like learning.

    You're right that the optimizer doesn't consider CTE materialization today (they are strictly in-line view definitions, expanded once per reference) and there is no hint for materialization either.

    There is a 2006 active Connect suggestion by Adam Machanic for a hint. Though some duplicates of that item have been closed as Won't Fix (= perhaps for the time being) my overall impression from public Microsoft comments is that something along these lines is being actively considered for a future release. We will have to wait and see.

    Related: more generalized subtree matching and reuse has been trialled in the product, see this Microsoft Research paper (PDF).

  • Tony.l (12/29/2012)


    Hi

    unbelievable - so good. I have been doing these puter things for 30 years now and i am still amazed how little i know or probably at just how clever other people are. Really well explained and totally useful.

    keep up the good work.

    APL. My initials not the language.

    Thanks for the great feedback, Tony. I feel the same way. I've been working with computers in one form or another since 1968 and I'm still learning something new about them everyday.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • John Hardin (12/30/2012)


    On page 27 I posted a suggested version of this that avoided CHARINDEX() and used an indexed table-type variable and it showed performance gains over the pure-CTE version, but I don't think anyone has seen fit to comment on that suggested version... 🙁

    You'll have to forgive me a bit on that. If you look at even the most recent pages of this thread, there are still people saying "Try this" or "How about this"? Even you said...

    How about this to get rid of the CHARINDEX() string operation:

    {snip}

    ...how does that affect performance?

    Two days later (lots of water had flowed under the bridge by then and we're not notified by edits), you added...

    edited to add: Actually performance goes down a bit because the subquery in making cteStartEnd is not indexed. This version does exhibit performance gains in my very modest testing:

    ...and followed that up with...

    (Note: I'm doing my version as a SP so the above exact code hasn't actually been tested by me...)

    I just didn't (and still don't) have the time to test everyone's suggestion. That's why I spent so much time on building a test harness for everyone to use.

    If you think your method has merit in the area of performance, please test it and post the test results. Who knows? You might be on to something.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • SQL Kiwi (12/30/2012)


    There is a 2006 active Connect suggestion by Adam Machanic for a hint. Though some duplicates of that item have been closed as Won't Fix (= perhaps for the time being) my overall impression from public Microsoft comments is that something along these lines is being actively considered for a future release. We will have to wait and see.

    Well, I've added my vote to that. It will probably make no difference, but that isn't an excuse for not doing so whem I think it's right.

    Related: more generalized subtree matching and reuse has been trialled in the product, see this Microsoft Research paper (PDF).

    For some reason the link doesn't work. But looking in the obvious place indicated by your link (http://research.microsoft.com/en-us/um/people/jrzhou/publications.htm) finds it and the link from there, although it appears to be identical to your link, works fine. I have no idea what's happening there.

    Edit: yes, I see what was happening: your link is relative, not absolute, so it tries to connect to http://www.sqlservercentral.com/Forums/research.microsoft.com/en-us/um/people/jrzhou/pub/cse.pdf instead of to the real target. I suspect I've done that sometimes too - probably every time I've typed the link instead of copyying and pasting from the address field of my browser.

    Tom

  • L' Eomot Inversé (12/31/2012)


    For some reason the link doesn't work. But looking in the obvious place indicated by your link (http://research.microsoft.com/en-us/um/people/jrzhou/publications.htm) finds it and the link from there, although it appears to be identical to your link, works fine. I have no idea what's happening there.

    Edit: yes, I see what was happening: your link is relative, not absolute, so it tries to connect to http://www.sqlservercentral.com/Forums/research.microsoft.com/en-us/um/people/jrzhou/pub/cse.pdf instead of to the real target. I suspect I've done that sometimes too - probably every time I've typed the link instead of copyying and pasting from the address field of my browser.

    Thanks Tom, I have corrected the link in the original post.

  • Nice update to this article Jeff!

    I especially liked your mysteriously labelled black line (????). Nice trick with the 0 based Tally table too - will need to remember that one.

    I have already replaced the old DelimitedSplit8K in my sandbox and now I must seek out other places where it may have been used and do the same.


    My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]

    My thought question: Have you ever been told that your query runs too fast?

    My advice:
    INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
    The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

    Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
    Since random numbers are too important to be left to chance, let's generate some![/url]
    Learn to understand recursive CTEs by example.[/url]
    [url url=http://www.sqlservercentral.com/articles/St

  • Interesting stuff, thank you.

    Table type parameters for stored procedures have eliminated much of my need for doing this within SQL, but it's always nice to have options.

  • Jeff or Paul:

    I didn't see the source code for the CLR function. I also didn't read all 473 posts before mine, so maybe there's a link to it somewhere..

    Paul, would you mind sharing that with us? I would love to see its guts.

    Kurt

  • klini (1/3/2013)


    I didn't see the source code for the CLR function. I also didn't read all 473 posts before mine, so maybe there's a link to it somewhere..

    It's in one of the zip files at the bottom of the original article: Tally OH! An Improved SQL 8K “CSV Splitter” Function[/url]

     

  • hi sorry i haven't tested for performance or sure if this has been done but removed the ctelen

    IF OBJECT_ID('dbo.DelimitedSplit8K') IS NOT NULL

    DROP FUNCTION [dbo].[DelimitedSplit8K]

    go

    CREATE FUNCTION [dbo].[DelimitedSplit8K]

    (@pString VARCHAR(8000), @pDelimiter CHAR(1))

    RETURNS TABLE WITH SCHEMABINDING AS

    RETURN

    --===== "Inline" CTE Driven "Tally Table" produces values from 0 up to 10,000...

    -- enough to cover NVARCHAR(4000)

    WITH E1(N) AS (

    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL

    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL

    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1

    ), --10E+1 or 10 rows

    E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows

    E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max

    cteTally(N) AS (--==== This provides the "base" CTE and limits the number of rows right up front

    -- for both a performance gain and prevention of accidental "overruns"

    SELECT TOP (ISNULL(DATALENGTH(@pString),0)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4

    )

    , cteStart(N1,L1) AS (

    SELECT 1,CHARINDEX(@pDelimiter,@pString,1) -1

    UNION ALL

    SELECT t.N+1

    , ISNULL(NULLIF(CHARINDEX(@pDelimiter,@pString,t.N+1),0)-t.N,8001) -1

    FROM cteTally t

    WHERE SUBSTRING(@pString,t.N,1) = @pDelimiter

    )

    --select * from cteStart

    SELECT ItemNumber = ROW_NUMBER() OVER(ORDER BY l.N1),

    Item = SUBSTRING(@pString, l.N1, l.L1)

    FROM cteStart l

  • telcogod (1/4/2013)


    hi sorry i haven't tested for performance or sure if this has been done but removed the ctelen

    IF OBJECT_ID('dbo.DelimitedSplit8K') IS NOT NULL

    DROP FUNCTION [dbo].[DelimitedSplit8K]

    go

    CREATE FUNCTION [dbo].[DelimitedSplit8K]

    (@pString VARCHAR(8000), @pDelimiter CHAR(1))

    RETURNS TABLE WITH SCHEMABINDING AS

    RETURN

    --===== "Inline" CTE Driven "Tally Table" produces values from 0 up to 10,000...

    -- enough to cover NVARCHAR(4000)

    WITH E1(N) AS (

    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL

    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL

    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1

    ), --10E+1 or 10 rows

    E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows

    E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max

    cteTally(N) AS (--==== This provides the "base" CTE and limits the number of rows right up front

    -- for both a performance gain and prevention of accidental "overruns"

    SELECT TOP (ISNULL(DATALENGTH(@pString),0)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4

    )

    , cteStart(N1,L1) AS (

    SELECT 1,CHARINDEX(@pDelimiter,@pString,1) -1

    UNION ALL

    SELECT t.N+1

    , ISNULL(NULLIF(CHARINDEX(@pDelimiter,@pString,t.N+1),0)-t.N,8001) -1

    FROM cteTally t

    WHERE SUBSTRING(@pString,t.N,1) = @pDelimiter

    )

    --select * from cteStart

    SELECT ItemNumber = ROW_NUMBER() OVER(ORDER BY l.N1),

    Item = SUBSTRING(@pString, l.N1, l.L1)

    FROM cteStart l

    Pretty good performance compared to DelimitedSplit8K! However, when running the tests using Jeff Moden's test harness the function failed whenever there was only one element and was still not quite as fast. Very close in performance though, so the change you made didn't really make much difference.

    RowNumSplitterNameNumberOfRowsNumberOfElementsMinElementLengthMaxElementLengthDurationMinLengthAvgLengthMaxLength

    1DelimitedSplit8K100011100.123001510

    2DelimitedSplit_TELCO100011100.076001510

    3DelimitedSplit8K100021100.1530031121

    4DelimitedSplit_TELCO100021100.1860031121

    5DelimitedSplit8K100041100.2600092443

    6DelimitedSplit_TELCO100041100.1500092443

    7DelimitedSplit8K100081100.33000215073

    8DelimitedSplit_TELCO100081100.13000215073

    9DelimitedSplit8K1000161100.3460064102140

    10DelimitedSplit_TELCO1000161100.4660064102140

    11DelimitedSplit8K1000321100.33300158206268

    12DelimitedSplit_TELCO1000321100.41600158206268

    13DelimitedSplit8K1000641100.53600349415511

    14DelimitedSplit_TELCO1000641100.48600349415511

    15DelimitedSplit8K10001281100.87600701830920

    16DelimitedSplit_TELCO10001281100.96600701830920

    17DelimitedSplit8K10002561101.59600153016631835

    18DelimitedSplit_TELCO10002561101.96000153016631835

    19DelimitedSplit8K10005121103.09300307533233541

    20DelimitedSplit_TELCO10005121103.32300307533233541

    21DelimitedSplit8K100011501107.23000714174667790

    22DelimitedSplit_TELCO100011501108.41600714174667790

    23DelimitedSplit8K1000110200.01000101420

    24DelimitedSplit_TELCO1000110200.01000101420

    25DelimitedSplit8K1000210200.03600213041

    26DelimitedSplit_TELCO1000210200.05300213041

    27DelimitedSplit8K1000410200.10000476282

    28DelimitedSplit_TELCO1000410200.08000476282

    29DelimitedSplit8K1000810200.1200093126155

    30DelimitedSplit_TELCO1000810200.1260093126155

    31DelimitedSplit8K10001610200.20000219254292

    32DelimitedSplit_TELCO10001610200.23600219254292

    33DelimitedSplit8K10003210200.36000458511559

    34DelimitedSplit_TELCO10003210200.39300458511559

    35DelimitedSplit8K10006410200.7100093110211099

    36DelimitedSplit_TELCO10006410200.7600093110211099

    37DelimitedSplit8K100012810201.45300194320472158

    38DelimitedSplit_TELCO100012810201.48300194320472158

    39DelimitedSplit8K100025610202.92000394640954238

    40DelimitedSplit_TELCO100025610202.98300394640954238

    41DelimitedSplit8K100048010205.37000743776767900

    42DelimitedSplit_TELCO100048010205.66300743776767900

    43DelimitedSplit8K1000120300.05000202530

    44DelimitedSplit_TELCO1000120300.01000202530

    45DelimitedSplit8K1000220300.02300415161

    46DelimitedSplit_TELCO1000220300.05600415161

    47DelimitedSplit8K1000420300.0700084102120

    48DelimitedSplit_TELCO1000420300.1030084102120

    49DelimitedSplit8K1000820300.13600179206232

    50DelimitedSplit_TELCO1000820300.16300179206232

    51DelimitedSplit8K10001620300.26600372414449

    52DelimitedSplit_TELCO10001620300.32600372414449

    53DelimitedSplit8K10003220300.52600768830896

    54DelimitedSplit_TELCO10003220300.57000768830896

    55DelimitedSplit8K10006420301.04300159316631733

    56DelimitedSplit_TELCO10006420301.09600159316631733

    57DelimitedSplit8K100012820302.09000318233283441

    58DelimitedSplit_TELCO100012820302.20000318233283441

    59DelimitedSplit8K100025620304.24000649766546803

    60DelimitedSplit_TELCO100025620304.30300649766546803

    61DelimitedSplit8K100029020304.75300733175387696

    62DelimitedSplit_TELCO100029020304.88600733175387696

    63DelimitedSplit8K1000130400.01300303440

    64DelimitedSplit_TELCO1000130400.01000303440

    65DelimitedSplit8K1000230400.02600617181

    66DelimitedSplit_TELCO1000230400.03000617181

    67DelimitedSplit8K1000430400.09300126143161

    68DelimitedSplit_TELCO1000430400.12300126143161

    69DelimitedSplit8K1000830400.18300263287315

    70DelimitedSplit_TELCO1000830400.23300263287315

    71DelimitedSplit8K10001630400.35300529575619

    72DelimitedSplit_TELCO10001630400.38300529575619

    73DelimitedSplit8K10003230400.69300109811501212

    74DelimitedSplit_TELCO10003230400.73300109811501212

    75DelimitedSplit8K10006430401.38300221923012391

    76DelimitedSplit_TELCO10006430401.43300221923012391

    77DelimitedSplit8K100012830402.87300449646054736

    78DelimitedSplit_TELCO100012830402.83300449646054736

    79DelimitedSplit8K100021030404.57600741275617710

    80DelimitedSplit_TELCO100021030404.67000741275617710

    81DelimitedSplit8K1000140500.01600404450

    82DelimitedSplit_TELCO1000140500.01000404450

    83DelimitedSplit8K1000240500.043008191101

    84DelimitedSplit_TELCO1000240500.030008191101

    85DelimitedSplit8K1000440500.14000167182201

    86DelimitedSplit_TELCO1000440500.14300167182201

    87DelimitedSplit8K1000840500.22000342366395

    88DelimitedSplit_TELCO1000840500.25300342366395

    89DelimitedSplit8K10001640500.43600693735779

    90DelimitedSplit_TELCO10001640500.49000693735779

    91DelimitedSplit8K10003240500.86000139914701530

    92DelimitedSplit_TELCO10003240500.91300139914701530

    93DelimitedSplit8K10006440501.71600285229423022

    94DelimitedSplit_TELCO10006440501.81300285229423022

    95DelimitedSplit8K100012840503.56600576758886012

    96DelimitedSplit_TELCO100012840503.59000576758886012

    97DelimitedSplit8K100016540504.44600746575907717

    98DelimitedSplit_TELCO100016540504.55000746575907717

     

  • ah well, I have seen that before where you think removing the extra select will improve performance but it doesn't. Anyway I am working on a look ma no ddl version as well. This is brilliant stuff I always knew had to be a better way than select substring(string,charindex(delimiter,string), substring(string(charindex(delimiter,string,(charindex(delimiter,string))) etc. Thanks for checking I do not have a sandbox at work where is ok to freeproccache.

  • SQL Kiwi (12/30/2012)


    Related: more generalized subtree matching and reuse has been trialled in the product, see this Microsoft Research paper (PDF).

    More generalized matching is exactly what I always wanted, and would hate to see a query hint specific for subexpression reuse. CTE's expanding as they do now is logical as on each site, the "inline view" defined in the CTE can be used differently and thus must be optimised seperately as well. A more generalised optimisation as in the paper will exploit the similarities still found afer that process.

    As for the paper, I only quickly scanned it to see what they were up to, and hope their table signatures are based on what is there after expanding contributing code to a query/batch. You want CTE's and non inline views expanded to have a wide as possible base for finding matches. Thus the optimisation should be 100% seperate from the SQL code and no query hint should be wished for!

  • The performance of this is amazing, as is the thought behind the code. These kind of posts really make you think and learn a whole new set of features in SQL

    Thanks Jeff.

    A couple of questions (sorry if it has been addressed - I didn't see it mentioned in the thread):

    I would not expect a record back if the string being sent in is NULL, a single space or all spaces. In my case, I have records where the string being split is often blank and I would want no records coming back from the CROSS APPLY. I could filter them out using in the WHERE clause. I fixed this by adding LTRIM and RTRIM to the source string in the function.

    Also, some of my "not so good" data also had multiple spaces in between the values causing numerous records returned that were blank.

    For example:

    dbo.DelimitedSplit8k('123 467 789', ' ') -- There are two spaces between 467 and 789

    ItemNumber Item

    1 123

    2 467

    3 <-- Empty String

    4 789

    Thanks Again,

    Anton

Viewing 15 posts - 466 through 480 (of 990 total)

You must be logged in to reply to this topic. Login to reply