Tally OH! An Improved SQL 8K “CSV Splitter” Function

  • Eirikur Eiriksson - Tuesday, March 19, 2019 7:18 AM

    Quick question, are there any multi-line entries or can one use the new line bytes as a row delimiter?
    😎

    To whom are you directing that question?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".
    "Dear Lord... I'm a DBA so please give me patience because, if you give me strength, I'm going to need bail money too!"

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden - Tuesday, March 19, 2019 7:26 AM

    Eirikur Eiriksson - Tuesday, March 19, 2019 7:18 AM

    Quick question, are there any multi-line entries or can one use the new line bytes as a row delimiter?
    😎

    To whom are you directing that question?

    The OP
    😎

    The spanner in the works is the multi row entries, big difference from handling single row entries 😉

  • Eirikur Eiriksson - Tuesday, March 19, 2019 8:52 AM

    Jeff Moden - Tuesday, March 19, 2019 7:26 AM

    Eirikur Eiriksson - Tuesday, March 19, 2019 7:18 AM

    Quick question, are there any multi-line entries or can one use the new line bytes as a row delimiter?
    😎

    To whom are you directing that question?

    The OP
    😎

    The spanner in the works is the multi row entries, big difference from handling single row entries 😉

    Agreed but not much of one if each "record" has the same pattern of multi-line entries.  I do use format files for such a thing if the multi-line entries are consistent.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".
    "Dear Lord... I'm a DBA so please give me patience because, if you give me strength, I'm going to need bail money too!"

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Hey, Jeff, will you update this yet again to include the new split_string() function in SQL Server 2016+?

     

  • corey lawson wrote:

    Hey, Jeff, will you update this yet again to include the new split_string() function in SQL Server 2016+?

    Probably not. The STRING_SPLIT function is relatively crippled for those that need to know the position of the split out elements.  That's been pretty well covered by Wayne Sheffield when it first came out.  Here's the link...

    https://www.sqlservercentral.com/articles/splitting-strings-in-sql-server-2016

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".
    "Dear Lord... I'm a DBA so please give me patience because, if you give me strength, I'm going to need bail money too!"

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden wrote:

    corey lawson wrote:

    Hey, Jeff, will you update this yet again to include the new split_string() function in SQL Server 2016+?

    Probably not. The STRING_SPLIT function is relatively crippled for those that need to know the position of the split out elements.  That's been pretty well covered by Wayne Sheffield when it first came out.  Here's the link...

    https://www.sqlservercentral.com/articles/splitting-strings-in-sql-server-2016

    Long story short, here's a schemabound ordinal splitter tvf which uses the string_split function as its base and then uses charindex with the "double separator append" trick.  Maybe this is an efficient approach?

    drop function if exists dbo.string_split_kvp;
    go
    create function dbo.string_split_kvp(
    @string varchar(8000),
    @separator char(1))
    returns table with schemabinding as return
    select row_number() over (order by
    charindex(concat(@separator, [value], @separator),
    concat(@separator, @string, @separator))) ,
    [value]
    from string_split(@string, @separator);

    Example

    select * from dbo.string_split_kvp('this is a space separated string', ' ');

    Output

    key	value
    1 this
    2 is
    3 a
    4 space
    5 separated
    6 string

     

    Aus dem Paradies, das Cantor uns geschaffen, soll uns niemand vertreiben können

  • Steve Collins wrote:

    Jeff Moden wrote:

    corey lawson wrote:

    Hey, Jeff, will you update this yet again to include the new split_string() function in SQL Server 2016+?

    Probably not. The STRING_SPLIT function is relatively crippled for those that need to know the position of the split out elements.  That's been pretty well covered by Wayne Sheffield when it first came out.  Here's the link...

    https://www.sqlservercentral.com/articles/splitting-strings-in-sql-server-2016

    Long story short, here's a schemabound ordinal splitter tvf which uses the string_split function as its base and then uses charindex with the "double separator append" trick.  Maybe this is an efficient approach?

    drop function if exists dbo.string_split_kvp;
    go
    create function dbo.string_split_kvp(
    @string varchar(8000),
    @separator char(1))
    returns table with schemabinding as return
    select row_number() over (order by
    charindex(concat(@separator, [value], @separator),
    concat(@separator, @string, @separator))) ,
    [value]
    from string_split(@string, @separator);

    Example

    select * from dbo.string_split_kvp('this is a space separated string', ' ');

    Output

    key	value
    1 this
    2 is
    3 a
    4 space
    5 separated
    6 string

    apart from the performance that is likely to be quite bad it has errors on its construct.

    try it with

    select * from dbo.string_split_kvp('this has 2 repeated has', ' ');
  • apart from the performance that is likely to be quite bad has errors on its construct

    The Construct is a problem but I suspect the performance should be fine, STRING_SPLIT is quite fast. The logic will force a sort but you will only sort up to 8001 rows which is quite low.

    That said, the logic is broke. STRING_SPLIT is great when you don't need the items in a specific order. I use it for things like parsing multi-select strings in SSRS. If order is important then STRING_SPLIT is not the solution.

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • Yes the logic is broke.  I agree with all of Alan's comments.  It seems possible to take corrective steps using cte(s) and additional windowing functions where the cardinality of the set(s) is the word count and not the character count.  Maybe that doesn't get any performance gains though.  I'll keep trying things. This issue of ordering split strings seems to come up over and over.

    Aus dem Paradies, das Cantor uns geschaffen, soll uns niemand vertreiben können

  • How about "This string is a space separated string"?

    will this function work?

  • Sergiy wrote:

    How about "This string is a space separated string"?

    will this function work?

    By "work" if you mean "give the right answer" then no.  It doesn't return any error though.  But I do have another idea to try.  Once I'm back at home I'll test it out.

    Aus dem Paradies, das Cantor uns geschaffen, soll uns niemand vertreiben können

  • @ Steve Collins,

    Great try/good thinking.  I'm thinking, though, that just about anything you add to it is going to slow it way down... especially since, if you read the article, the original performance problem was concatenation.

    @ Alan,

    I'm honestly a little bit surprised that you're justifying a sort because it has a low row count.  We know how that usually works out.

    @Everyone,

    The function is this article served well for a long time but it has soundly been beaten for performance after 2012 came out by Eirikur Eiriksson.   His good code still avoids concatenation and his change was super simple... Basically, he replaced CHAR index with LEAD.  Prior to him posting his solution, he asked me to test it for him and I tested for accurate functionality and performance...  It IS ACCURATE and it IS TWICE AS FAST as the original (getting real close to CLR speeds).

    With that I'll say that if you're using 2012 or above, I strongly recommend you upgrade your function to his code, with due credit, of course.

    Here's the link to Eirikur's article (the code is in Part 1 of his fine article).

    https://www.sqlservercentral.com/articles/reaping-the-benefits-of-the-window-functions-in-t-sql-2

    If you want to performance test against something, you should performance test against Eirikur's code and I should have Steve Jones update this article to point to Eirikur's.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".
    "Dear Lord... I'm a DBA so please give me patience because, if you give me strength, I'm going to need bail money too!"

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • @ Alan,

    I'm honestly a little bit surprised that you're justifying a sort because it has a low row count.  We know how that usually works out.

    I was a little rushed in my response. I'll clarify when I get time 😉

    • This reply was modified 3 months, 3 weeks ago by  Alan Burstein.
    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

Viewing 13 posts - 946 through 958 (of 958 total)

You must be logged in to reply to this topic. Login to reply