An Overview of STRING_SPLIT()

  • Comments posted to this topic are about the item An Overview of STRING_SPLIT()

  • Nice article Steve

    Would be interested in seeing how the new function stacks up against the "custom" split functions that every DBA seems to have a few versions of.

    Cheers
    Phil

    --------------------
    Colt 45 - the original point and click interface

  • Thanks for the write-up Steve.
    😎
    One thing, there are performance references but no mention of the fastest non CLR string splitting function (on SQL Server 2012 and later).

  • Eirikur Eiriksson - Sunday, December 9, 2018 11:36 PM

    Thanks for the write-up Steve.
    😎
    One thing, there are performance references but no mention of the fastest non CLR string splitting function (on SQL Server 2012 and later).

    Added a link

  • philcart - Sunday, December 9, 2018 10:46 PM

    Nice article Steve

    Would be interested in seeing how the new function stacks up against the "custom" split functions that every DBA seems to have a few versions of.

    Cheers
    Phil

    If you look at the reference to Aaron Bertrand's article, he rates performance in different ways. STRING_SPLIT() stacks up nicely.

  • Steve Jones - SSC Editor - Monday, December 10, 2018 7:03 AM

    philcart - Sunday, December 9, 2018 10:46 PM

    Nice article Steve

    Would be interested in seeing how the new function stacks up against the "custom" split functions that every DBA seems to have a few versions of.

    Cheers
    Phil

    If you look at the reference to Aaron Bertrand's article, he rates performance in different ways. STRING_SPLIT() stacks up nicely.

    If you look at Aaron Bertrand's article, he comes to the conclusion that the XML method for splitting is as good as the DemilitedSplit8K method.  Nothing could be further from the truth.  The problem is with the "Devils in the Data" (I've gotta finish the articles on that subject for you soon).  He creates "grooved" test data that has a cardinality of "1".  That seriously skews the test results because the XML method can take advantage of such low cardinalities for a large set.  But, if you have a table of, say, 10,000 rows with a cardinality of 10,000, the XML splitter method becomes a huge performance issue.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Thank you for sharing good news.
    From other side I used to use KNIME free analytical platform for data manipulation
    and most of the time it is necessary to add columns with the values than lines.

  • Jeff Moden - Monday, December 10, 2018 8:24 AM

    If you look at Aaron Bertrand's article, he comes to the conclusion that the XML method for splitting is as good as the DemilitedSplit8K method.  Nothing could be further from the truth.  The problem is with the "Devils in the Data" (I've gotta finish the articles on that subject for you soon).  He creates "grooved" test data that has a cardinality of "1".  That seriously skews the test results because the XML method can take advantage of such low cardinalities for a large set.  But, if you have a table of, say, 10,000 rows with a cardinality of 10,000, the XML splitter method becomes a huge performance issue.

    +100
    😎

  • Steve Jones - SSC Editor - Monday, December 10, 2018 7:03 AM

    Eirikur Eiriksson - Sunday, December 9, 2018 11:36 PM

    Thanks for the write-up Steve.
    😎
    One thing, there are performance references but no mention of the fastest non CLR string splitting function (on SQL Server 2012 and later).

    Added a link

    Thanks Steve, don't think there are too many (poor buggers) on 2008 or earlier,
    😎

  • Jeff Moden - Monday, December 10, 2018 8:24 AM

    If you look at Aaron Bertrand's article, he comes to the conclusion that the XML method for splitting is as good as the DemilitedSplit8K method.  Nothing could be further from the truth.  The problem is with the "Devils in the Data" (I've gotta finish the articles on that subject for you soon).  He creates "grooved" test data that has a cardinality of "1".  That seriously skews the test results because the XML method can take advantage of such low cardinalities for a large set.  But, if you have a table of, say, 10,000 rows with a cardinality of 10,000, the XML splitter method becomes a huge performance issue.

    AB's article isn't bad but it's incomplete and hasn't been appropriately updated.
    😎
    One thing AB missed entirely is the efficiency of fixed width parameter strings, blows the chops of any other methods by a considerable margin (must finish that article)

  • Eirikur Eiriksson - Monday, December 10, 2018 9:09 AM

    AB's article isn't bad but it's incomplete and hasn't been appropriately updated.
    😎
    One thing AB missed entirely is the efficiency of fixed width parameter strings, blows the chops of any other methods by a considerable margin (must finish that article)

    Sorry but have to disagree there.  I did try to coach Aaron publically and privately and, instead, he simply stated he wasn't going to do any additional testing and closed the article to any further comment.  Anyone reading it can easily come to the same wrong conclusion that he did because of the "Holy Grail" nature (an article that contains code to test functionality and performance as well as having charts that "prove" the findings, which are horribly wrong in that case) of his article and people end up drinking from the wrong cup and "dying" when the deploy such poor methods to a production environment.  even worse, his article didn't take into account any special characters and the XML code he published causes a failure there..

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Eirikur Eiriksson - Monday, December 10, 2018 9:09 AM

    Jeff Moden - Monday, December 10, 2018 8:24 AM

    If you look at Aaron Bertrand's article, he comes to the conclusion that the XML method for splitting is as good as the DemilitedSplit8K method.  Nothing could be further from the truth.  The problem is with the "Devils in the Data" (I've gotta finish the articles on that subject for you soon).  He creates "grooved" test data that has a cardinality of "1".  That seriously skews the test results because the XML method can take advantage of such low cardinalities for a large set.  But, if you have a table of, say, 10,000 rows with a cardinality of 10,000, the XML splitter method becomes a huge performance issue.

    AB's article isn't bad but it's incomplete and hasn't been appropriately updated.
    😎
    One thing AB missed entirely is the efficiency of fixed width parameter strings, blows the chops of any other methods by a considerable margin (must finish that article)

    If you consider the conclusions reached because of the incorrect test data used, it's bad.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Eirikur Eiriksson - Monday, December 10, 2018 9:05 AM

    Steve Jones - SSC Editor - Monday, December 10, 2018 7:03 AM

    Eirikur Eiriksson - Sunday, December 9, 2018 11:36 PM

    Thanks for the write-up Steve.
    😎
    One thing, there are performance references but no mention of the fastest non CLR string splitting function (on SQL Server 2012 and later).

    Added a link

    Thanks Steve, don't think there are too many (poor buggers) on 2008 or earlier,
    😎

    Our shop is still one of the few "poor buggers" on 2008 R2.  (Sigh.)  Working on management for a while now...

    Just sharing.  😉   This has been a really helpful article and discussion.  Thank you all.  I'm thinking positively that we will upgrade soon, and I will be able to start using all the new, helpful functionality.

  • Here's an idea that I've been kicking around for a little while now but haven't had the chance to torture test yet...
    It's one of those "not guaranteed to work every time... but... seems to work every time" kind of deals. 
    The idea is simple. Just encapsulate ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) with the STRING_SPLIT function so that THE ROW_NUMBER() is able to register the rows as they are coming out of the STRING_SPLIT before any other part of the query can affect the ordering..

    Like I said, I haven't had a chance to really go nuts trying to break it, but it has stood up the few things I have tried. I figured I'd throw it out and see if anyone wants to take a crack at proving that an idea that shouldn't, doesn't work... (because so far it does)

    The original idea...
    SET QUOTED_IDENTIFIER ON;
    GO
    SET ANSI_NULLS ON;
    GO

    ALTER FUNCTION dbo.SPLIT_STRING_RNv1
    /* ===============================================================================
    12/10/2018 JL, Created: Adds a ROW_NUMBER() to the built in STRING_SPLIT function.    
            IMPORTANT! Fo STRING_SPLIT BOL: "The order is not guaranteed to match the
                order of the substrings in the input string."
            Just becuase this trick works "most of the time" it is not guaranteed to
            work every time. DO NOT use in a business critical situation where the
            original order is critical.
    =============================================================================== */
    --===== Define I/O parameters
    (
        @string NVARCHAR(MAX),
        @separator NCHAR(1)
    )
    RETURNS TABLE WITH SCHEMABINDING AS
    RETURN
        SELECT
            rn = ROW_NUMBER() OVER (ORDER BY (SELECT NULL)),
            ss.[Value]
        FROM
            STRING_SPLIT(@string, @separator) ss;
    GO

    This one was more of an odd-nut idea that popped into my head after recalling that using the TOP w/ ORDER BY forces an internal sort to handle the TOP(N) criteria, even when N is large enough to handle any result set. I have absolutely no idea if it's doing anything at all... I haven't been able to get the 1st one to screw up to see if this one does any better...
    SET QUOTED_IDENTIFIER ON;
    GO
    SET ANSI_NULLS ON;
    GO

    ALTER FUNCTION dbo.SPLIT_STRING_RNv2
    /* ===============================================================================
    12/10/2018 JL, Created: Adds a ROW_NUMBER() to the built in STRING_SPLIT function.    
            This is an alternate version that adds a TOP & ORDER BY to the query
                in order to see if it behaves differently.
            IMPORTANT! Fo STRING_SPLIT BOL: "The order is not guaranteed to match the
                order of the substrings in the input string."
            Just becuase this trick works "most of the time" it is not guaranteed to
            work every time. DO NOT use in a business critical situation where the
            original order is critical.
    =============================================================================== */
    --===== Define I/O parameters
    (
        @string NVARCHAR(MAX),
        @separator NCHAR(1)
    )
    RETURNS TABLE WITH SCHEMABINDING AS
    RETURN
        SELECT TOP (2147483647)
            rn = ROW_NUMBER() OVER (ORDER BY (SELECT NULL)),
            ss.[Value]
        FROM
            STRING_SPLIT(@string, @separator) ss
        ORDER BY
            rn;
    GO

    This 3rd option is based on good idea from Aaron Bertrand. It's guaranteed to work as long as your input string is composed of unique values and you're willing to eat the resulting sort operation.
    It works by using CHARINDEX to find the Values position on the original input string...
    If you want to see the conversation that prompted me to actually start testing the idea (or read Aaron's thoughts on the idea [spoiler: not a fan]), Here's a link: https://stackoverflow.com/questions/53562331/selecting-data-against-numeric-values-saved-as-comma-separated-string/53563089#53563089
    SET QUOTED_IDENTIFIER ON;
    GO
    SET ANSI_NULLS ON;
    GO

    CREATE FUNCTION dbo.SPLIT_STRING_RNvAB
    /* ===============================================================================
    12/10/2018 JL, Created: Adds a ROW_NUMBER() to the built in STRING_SPLIT function.
            Based on an Aaron Butyrand's idea of using CHARINDEX to find the items location
            in the original string.
            Note: Array items must be DISTINCT for this approach to work properly.
    =============================================================================== */
    --===== Define I/O parameters
    (
        @string NVARCHAR(MAX),
        @separator NCHAR(1)
    )
    RETURNS TABLE WITH SCHEMABINDING AS
    RETURN
        SELECT
            rn = ROW_NUMBER() OVER (ORDER BY value_position),
            ss.[Value]
        FROM
            STRING_SPLIT(@string, @separator) ss
            CROSS APPLY ( VALUES (CHARINDEX(@separator + ss.Value + @separator, @separator + @string + @separator)) ) vp (value_position);
    GO

    Here's a simple test script I threw together...
    CREATE TABLE #TestData (Num INT NOT NULL PRIMARY KEY);    -- Adding the clustered key to impose a "natural sort" that causes the stock
    INSERT #TestData (Num) VALUES (0),(1),(2),(3),(4),(5),(6),(7),(8),(9);        --STRING_SPLIT to not display in the order of the original string.

    DECLARE @string VARCHAR(8000) = '9,6,3,1,4,8,5,6,1,0,2';

    -- original STRING_SPLIT function...
    SELECT
        td.Num,
        ss.value
    FROM
        STRING_SPLIT(@string, ',') ss
        JOIN #TestData td
            ON CONVERT(INT, ss.Value) = td.Num;

    -- ROW_NUMBER() alone
    SELECT
        td.Num,
        ssr.rn,
        ssr.Value
    FROM
        dbo.SPLIT_STRING_RNv1(@string, ',') ssr
        JOIN #TestData td
            ON CONVERT(INT, ssr.Value) = td.Num

    -- ROW_NUMBER() w/ TOP & ORDER BY
    SELECT
        td.Num,
        ssr.rn,
        ssr.Value
    FROM
        dbo.SPLIT_STRING_RNv2(@string, ',') ssr
        JOIN #TestData td
            ON CONVERT(INT, ssr.Value) = td.Num;

    -- ROW_NUMBER() with order set by CHARINDEX
    SELECT
        td.Num,
        ssr.rn,
        ssr.Value
    FROM
        dbo.SPLIT_STRING_RNvAB(@string, ',') ssr
        JOIN #TestData td
            ON CONVERT(INT, ssr.Value) = td.Num;

    DROP TABLE #TestData;

  • Lot's of people have been doing that trick with RowNumber.  Thinking about the code behind the scenes, it should be guaranteed but MS refuses to state that or even imply that.  It's really a shame they didn't take the time to return the element ordinal.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 15 posts - 1 through 15 (of 20 total)

You must be logged in to reply to this topic. Login to reply