Split string using XML

  • I'd seen XML string splitting and concatenation before, but ran into issues with characters that aren't valid in XML. I found that just escaping all the invalid characters prior to splitting frequently took more time than the entire split operation using a tally table.

    That said, if you can qualify your data enough to ensure you won't ever have those sorts of issues, then it is certainly easier to code than some other solutions.

  • Senthilnathan.Karunakaran (6/25/2009)


    We did performance and scaling test to split the comma separated string value using XML query and SQL function.

    The plan was better with XML split than SQL function; however XML query performance degraded when number of concurrent users increase. SQL function did better in scalability test.

    Cool. Let's see the test code and the data generator.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".
    "Dear Lord... I'm a DBA so please give me patience because, if you give me strength, I'm going to need bail money too!"

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • We are converting SS2000 DTS packages to SSIS 2005.

    I need help using a User Defined Function within a query on a table.

    The function performs correctly.

    select * from fn_Split('13.0,13.1,14.0,14.1' ,',')

    idx value

    0 13.0

    1 13.1

    2 14.0

    3 14.1

    This is the table to run the function against, with a few sample record results.

    Select versions from versiontable

    13.0,13.1,14.0,14.1

    13.0,13.1,14.0,14.1

    13.0,13.1,14.0,14.1

    13.0,13.1,14.0,14.1

    11.0,11.1,12.0,12.1,13.0,13.1,14.0

    I tried this. Now it no longer recognises the function. Is the syntax wrong?

    select fn_Split(versions, ',') as SingleVersion from versiontable where versions is not null

    Server: Msg 195, Level 15, State 10, Line 1

    'fn_Split' is not a recognized function name.

    Tried this also.

    select dbo.fn_Split(versions, ',') as Version from versiontable where versions is not null

    Server: Msg 208, Level 16, State 1, Line 1

    Invalid object name 'dbo.fn_Split'.

    Any suggestions to get the query to work or even better, a nifty method to accomplish this within my data flow task using one of the Data Flow Transformation. Seems like the pivot might be useful for this?

  • jo stovall (6/25/2009)


    We are converting SS2000 DTS packages to SSIS 2005.

    I need help using a User Defined Function within a query on a table.

    The function performs correctly.

    select * from fn_Split('13.0,13.1,14.0,14.1' ,',')

    idx value

    0 13.0

    1 13.1

    2 14.0

    3 14.1

    This is the table to run the function against, with a few sample record results.

    Select versions from versiontable

    13.0,13.1,14.0,14.1

    13.0,13.1,14.0,14.1

    13.0,13.1,14.0,14.1

    13.0,13.1,14.0,14.1

    11.0,11.1,12.0,12.1,13.0,13.1,14.0

    I tried this. Now it no longer recognises the function. Is the syntax wrong?

    select fn_Split(versions, ',') as SingleVersion from versiontable where versions is not null

    Server: Msg 195, Level 15, State 10, Line 1

    'fn_Split' is not a recognized function name.

    Tried this also.

    select dbo.fn_Split(versions, ',') as Version from versiontable where versions is not null

    Server: Msg 208, Level 16, State 1, Line 1

    Invalid object name 'dbo.fn_Split'.

    Any suggestions to get the query to work or even better, a nifty method to accomplish this within my data flow task using one of the Data Flow Transformation. Seems like the pivot might be useful for this?

    What do you get when you run the following?

    SELECT *

    FROM sys.Objects

    WHERE Name = 'fn_Split'

    If the answer is nothing, then you're either in the wrong database or the CREATE FUNCTION code didn't actually work.

    As a side bar, you'd probably get a lot more "hits" on your question if you asked it in the proper forum instead of a thread dedicated to the discusssion of an article.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".
    "Dear Lord... I'm a DBA so please give me patience because, if you give me strength, I'm going to need bail money too!"

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Thanks for responding.

    I reposted on the SSIS forum.

  • I have error:

    Msg 170, Level 15, State 1, Line 12

    Line 12: Incorrect syntax near '('.

    Msg 170, Level 15, State 1, Line 13

    Line 13: Incorrect syntax near '('.

    DECLARE @t TABLE( ID INT IDENTITY, data VARCHAR(50))

    INSERT INTO @t(data) SELECT 'AA,AB,AC,AD'

    INSERT INTO @t(data) SELECT 'BA,BB,BC'

    SELECT F1.id,

    F1.data,

    O.splitdata

    FROM

    (

    SELECT *, cast(''+replace(F.data,',','')+'' as XML) as xmlfilter from @t F

    )F1

    CROSS

    (

    SELECT fdata.D.value('.','varchar(50)') as splitdata FROM f1.xmlfilter.nodes('X') as fdata(D)) as O

  • I should probably add to my previous submitted method.

    On our server a split of 131,072 is taking 160 milli seconds.

    The server is fairly standard db server spec.

    Sam

    select getdate()

    declare @STR varchar(max)

    set @STR = 'AAAA,BBBB,CCCC,DDDD'

    set @STR = @STR + ',' + @STR

    set @STR = @STR + ',' + @STR

    set @STR = @STR + ',' + @STR

    set @STR = @STR + ',' + @STR

    set @STR = @STR + ',' + @STR

    set @STR = @STR + ',' + @STR

    set @STR = @STR + ',' + @STR

    set @STR = @STR + ',' + @STR

    set @STR = @STR + ',' + @STR

    set @STR = @STR + ',' + @STR

    set @STR = @STR + ',' + @STR

    set @STR = @STR + ',' + @STR

    set @STR = @STR + ',' + @STR

    set @STR = @STR + ',' + @STR

    set @STR = @STR + ',' + @STR

    select count(*) from dbSystem..FnSplitQuick(@str,0,',')

    select getdate()

  • I just ran across one of J. Sebastians article on how to do a string split using XML:http://www.sqlservercentral.com/articles/XML/61618/. Looks comparable...



    Lutz
    A pessimist is an optimist with experience.

    How to get fast answers to your question[/url]
    How to post performance related questions[/url]
    Links for Tally Table [/url] , Cross Tabs [/url] and Dynamic Cross Tabs [/url], Delimited Split Function[/url]

  • The article is well written and my hat is off to the author for that. I have to agree with the others, though... using XML to split delimited data is a wee bit slow compared to many other methods.

    Also, I see lot's of claims in the area of performance, but I don't see much in the line of test data generation code or test harnesses so other people can substantiate those claims. Like Sergiy says, "A Developer must not guess... a Developer must KNOW!" With that little pearl of wisdom still ringing in all our ears, let's do some testing...

    First, if you don't have a Tally table yet, you need to build one. For this exercise, we'll write and execute all of the code, including the code for the Tally table, in a nice save place where we likely won't mess with anyone's real tables... TempDB...


    [font="Courier New"]--===== Do this in a nice safe place...

        USE TempDB

    --===== Create and populate the Tally table on the fly

     SELECT TOP 1000000

            IDENTITY(INT,1,1AS N

       INTO dbo.Tally

       FROM Master.dbo.SysColumns sc1,

            Master.dbo.SysColumns sc2

    --===== Add a Primary Key to maximize performance

      ALTER TABLE dbo.Tally

        ADD CONSTRAINT PK_Tally_N 

            PRIMARY KEY CLUSTERED (NWITH FILLFACTOR = 100

    --===== Allow the general public to use it

      GRANT SELECT ON dbo.Tally TO PUBLIC[/font]


    Next, we need some test data. Instead of the normal fixed width stuff, which probably doesn't reflect real life very well, we'll make some random length elements. Read the comments in the code folks...


    [font="Courier New"]--=============================================================================

    --      Setup

    --=============================================================================

    --===== Do this in a nice safe place

        USE TempDB

    --===== Declare some obviously named variables

    DECLARE @NumberOfElements INT,

            @MaxElementSize   INT,

            @NumberOfRows     INT

    --===== Presets

     SELECT @NumberOfElements 6,

            @MaxElementSize   6,

            @NumberOfRows     10000

    --=============================================================================

    --      Create the test table on the fly

    --=============================================================================

    --===== Conditionally drop the test table

         IF OBJECT_ID('TempDB..TestTable','U'IS NOT NULL

            DROP TABLE TestTable

    --===== Create and populate the test table on the fly.

         -- Number of elements, max size of elements, and number of rows are all

         -- programable based on the variables at the beginning of this scirpt.

     SELECT t.N AS ID,

            STUFF((SELECT ',' LEFT(NEWID(),ABS(CHECKSUM(NEWID()))%@MaxElementSize+1)

                     FROM dbo.Tally t1

                    WHERE t1.N <= @NumberOfElements

                      AND t.N   --External reference forces NEWID() to be recalculated

                      FOR XML PATH(''))

             ,1,1,''AS Data

       INTO TestTable

       FROM dbo.Tally t

      WHERE t.N <= @NumberOfRows[/font]


    Next, we need to build some functions that folks included in some of their responses either directly or indirectly. Not all the functions that were mentioned are included because some of them are the same...


    [font="Courier New"]--===== Do this in a nice safe place

        USE TEMPDB

    --===== Create Eric Sommarskog's inline function (identical to the code I use outside a function)

     CREATE FUNCTION InLineSplit(@String VARCHAR(8000))

    RETURNS TABLE AS

     RETURN (SELECT SUBSTRING(',' @String ','N+1CHARINDEX(','@String ','N)-NAS SplitData

               FROM dbo.Tally t

              WHERE <= LEN(','  @String)

                AND SUBSTRING(',' @StringN1',')

    GO

    --===== Create Rafidheen's split function

    CREATE function Split_fn

    (

       @split_string   varchar(max),

      @deli_char      varchar(3

    )

    returns @list table

    (

       SeqNo       int,

        SplitString varchar(max)

    )

    as

    begin

        declare @from_loc       int

     declare @to_loc         int

     if charindex(@deli_char,@split_string,0) <= 0

        begin

           insert into @list(seqnoSplitStringvalues (1@split_string)

         return

      end

        if charindex(@deli_char,@split_string,0) > 0

     begin

           select @from_loc    0

         select @to_loc      = charindex(@deli_char,@split_string,0)

     end

      

        if charindex(@deli_char,@split_string,0) <= 0

        begin

           select @to_loc null

       end

     while @to_loc is not null

       begin

           if substring(@split_string,@from_loc@to_loc @from_loc) <> ''

          begin

               insert into @list(seqnoSplitString)

               select isnull(max(seqno),0) + 1substring(@split_string,@from_loc@to_loc @from_loc)

                from    @list

           end

      

            select @from_loc = charindex(@deli_char,@split_string,@from_loc+len(@deli_char)) + len(@deli_char)

         select @to_loc = charindex(@deli_char,@split_string,@from_loc)

         if @to_loc 

         begin

               if substring(@split_string,@from_loc, (len(@split_string)  - @from_loc) + len(@deli_char)) <> ''

             begin

                   insert into @list(seqnoSplitString)

                   select isnull(max(seqno),0) + 1substring(@split_string,@from_loc, (len(@split_string)  - @from_loc) + len(@deli_char))

                   from    @list

               end

                select @to_loc null

           end

     end

     return

    end

    go[/font]


    Finally, we get to some test code. This prints some results in the messages window of SSMS. I recommend you be in the "Grid Mode" when you run these to keep from skewing the results and to make the performance messages easier to read....


    [font="Courier New"]--=============================================================================

    --      Setup

    --=============================================================================

    --===== Do this in a nice safe place

        USE TempDB

    --===== Declare some obviously named variables

    DECLARE @NumberOfElements INT,

            @MaxElementSize   INT,

            @NumberOfRows     INT

    --===== Presets

     SELECT @NumberOfElements 6,

            @MaxElementSize   6,

            @NumberOfRows     10000

    --=============================================================================

    --      Create the test table on the fly

    --=============================================================================

    --===== Conditionally drop the test table

         IF OBJECT_ID('TempDB..TestTable','U'IS NOT NULL

            DROP TABLE TestTable

    --===== Create and populate the test table on the fly.

         -- Number of elements, max size of elements, and number of rows are all

         -- programable based on the variables at the beginning of this scirpt.

     SELECT t.N AS ID,

            STUFF((SELECT ',' LEFT(NEWID(),ABS(CHECKSUM(NEWID()))%@MaxElementSize+1)

                     FROM dbo.Tally t1

                    WHERE t1.N <= @NumberOfElements

                      AND t.N   --External reference forces NEWID() to be recalculated

                      FOR XML PATH(''))

             ,1,1,''AS Data

       INTO TestTable

       FROM dbo.Tally t

      WHERE t.N <= @NumberOfRows

    GO

    --=============================================================================

    --      Test the various methods from this thread including the method from

    --      the article.

    --=============================================================================

    --========== XML from article =================================================

      PRINT '========== XML from article =========================================='

        SET STATISTICS TIME ON

    select F1.id,

      f1.data,

      O.splitdata  

     from

     (

     select *,

      cast('<X>'+replace(F.data,',','</X><X>')+'</X>' as XMLas xmlfilter 

     from TestTable F 

     )F1 

     cross apply

     

     select fdata.D.value('.','varchar(50)'as splitdata  

     from f1.xmlfilter.nodes('X'as fdata(D)) O

    --========== rafidheenm''s function ===========================================

       SET STATISTICS TIME OFF

      PRINT '========== Rafidheen''s function ===================================='

        SET STATISTICS TIME ON

     SELECT tt.ID,

            tt.Data,

            ca.SplitString

       FROM TestTable tt

      CROSS APPLY

            (SELECT SplitString

               FROM dbo.split_fn(tt.Data,',')

            ) ca

    --========== Tally Table Split ================================================

        SET STATISTICS TIME OFF

      PRINT '========== Tally Table Split ========================================='

        SET STATISTICS TIME ON

     SELECT tt.ID,

            tt.Data,

            SUBSTRING(',' tt.Data ','N+1CHARINDEX(','tt.Data ','N)-NAS SplitData

       FROM dbo.Tally t

      CROSS JOIN TestTable tt

      WHERE <= LEN(',' tt.Data)

        AND SUBSTRING(',' tt.DataN1','

    --========== Tally Table Cross Apply ==========================================

        SET STATISTICS TIME OFF

      PRINT '========== Tally Table Cross Apply ==================================='

        SET STATISTICS TIME ON

     SELECT tt.ID,

            tt.Data

            ca.SplitData

       FROM TestTable tt

      CROSS APPLY

            (

             SELECT SUBSTRING(',' tt1.Data ','N+1CHARINDEX(','tt1.Data ','N)-NAS SplitData

               FROM dbo.Tally t

              CROSS JOIN TestTable tt1

              WHERE <= LEN(',' tt1.Data)

                AND SUBSTRING(',' tt1.DataN1','

                AND tt.ID tt1.ID

            ca

    --========== Inline Table Function ============================================

        SET STATISTICS TIME OFF

      PRINT '========== Inline Table Function ====================================='

        SET STATISTICS TIME ON

     SELECT tt.ID,

            tt.Data,

            ca.SplitDatta

       FROM TestTable tt

      CROSS APPLY

            (SELECT SplitData

               FROM InLineSplit(tt.Data)

            ) ca

        SET STATISTICS TIME OFF

    [/font]


    Here's the results. I should point out that my computer is a 7 year old p4 1.8GHz single cpu with old IDE drives and only a Gig of RAM running SQL Server 2005 Dev Edition at sp2 (haven't had the chance to install sp3, yet).


    [font="Courier New"]

    (1000 row(s) affected)

    ========== XML from article ==========================================

    (6000 row(s) affected)

    SQL Server Execution Times:

    CPU time = 3000 ms, elapsed time = 3573 ms.

    ========== rafidheenm's function ====================================

    (6000 row(s) affected)

    SQL Server Execution Times:

    CPU time = 3656 ms, elapsed time = 5958 ms.

    ========== Tally Table Split =========================================

    (6000 row(s) affected)

    SQL Server Execution Times:

    CPU time = 422 ms, elapsed time = 749 ms.

    ========== Tally Table Cross Apply ===================================

    (6000 row(s) affected)

    SQL Server Execution Times:

    CPU time = 859 ms, elapsed time = 1208 ms.

    ========== Inline Table Function =====================================

    (6000 row(s) affected)

    SQL Server Execution Times:

    CPU time = 156 ms, elapsed time = 406 ms.

    [/font]


    As you can see, the straight Tally Table method smokes most of the other methods. Only the Inline Table Function method is faster and that smokes the straight method. Of course, the straight Tally Table method works in all versions of SQL Server that I know of.

    CLR's can be written to beat the Tally table but only if you're careful. Florian Reischl has a lead on a great CLR that I can just about guarantee will beat everything else for performance and simplicity including other CLR's for the same task.

    Just a couple of end notes... the Cross-Apply Tally Table method works very quickly if all of the elements are exactly the same size. It seems that the optimizer can figure that out and will produce speeds nearly as fast as the Inline function method. I haven't tried the Inline function method in 2k yet mostly because I'd have to convert the Cross-Apply to a correlated subquery and simply haven't taken the time to do it.

    I'm also very pro on " T-SQL only" solutions. If a CLR runs at the same speed or even 1/2 the duration of the Inline function or the Straight Tally table, I can just about guarantee that a CLR will not find it's way into my code. 😉

    Also, the Tally Table (and several other methods) has a bit of a difficult time with "out of row" variables such as VARCHAR(MAX) especially if the distance between delimiters is large. Ths is because of the JOIN the Tally Table does with the data... and it doesn't like joining with "out of row" stuff. Depending on the distance between delimiters and the actual size of each row, you may have to shift to a very tight WHILE LOOP to get superior performance in the presence of VARCHAR(MAX). I do know that the XML method will absolutely croak not only on VARCHAR(MAX), but anytime there are a large number of elements in the same row to be split. Of course, you don't have to believe me because you now have code to test with. Remember that "One experiment is worth a thousand expert opinions".

    If you'd like to read more about what a Tally Table is and how it replaces certain kinds of While Loops, please see the following article...

    http://www.sqlservercentral.com/articles/T-SQL/62867/

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".
    "Dear Lord... I'm a DBA so please give me patience because, if you give me strength, I'm going to need bail money too!"

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff, is there any chance you could try the partial CLR method as well?

    I submitted the code in previous article.

    use CLR to convert the items to being a fixed width string

    There's some additional code in there to determine the required fixed width length, if the size limit is known.

    ie

    aa,gggggggg,dd,ccc

    becomes

    aa______ggggggggdd______ccc______

    and then split the result with a tally table, and substring function.

  • Ignore that last post.

    I've just simply compared on my server.

    Rows 4096 based on huge repitition of AAAA,BB,CCCCCCCCC,D

    Inline function

    126 ms

    123 ms

    130 ms

    Partial CLR plus tally table

    3 ms

    3 ms

    3 ms

    I used count(*), what did you use for your measurements?

    I'm also very pro on " T-SQL only" solutions. If a CLR runs at the same speed or even 1/2 the duration of the Inline function or the Straight Tally table, I can just about guarantee that a CLR will not find it's way into my code.

    I personally don't mind a little CLR, where it is non specific string ops, and the performance is that much better.

    Sam

  • We are having only five characters ( , ', ", & ) that need escape character in XML.

    We can use replace function for handling them.

    DECLARE @test-2 xml

    DECLARE @chars nvarchar(500)

    set @chars = '1&4,4,5,6'

    set @chars = '' + REPLACE(REPLACE(REPLACE(REPLACE(@chars,'&','&'),'>','>'),'<','<') ,',','') + ''

    select @test-2 = @chars

    select c.value('.','nvarchar(10)')

    from @test.nodes('X') as t(c)

  • sam.walker (6/29/2009)


    Ignore that last post.

    I've just simply compared on my server.

    Rows 4096 based on huge repitition of AAAA,BB,CCCCCCCCC,D

    Inline function

    126 ms

    123 ms

    130 ms

    Partial CLR plus tally table

    3 ms

    3 ms

    3 ms

    I used count(*), what did you use for your measurements?

    I'm also very pro on " T-SQL only" solutions. If a CLR runs at the same speed or even 1/2 the duration of the Inline function or the Straight Tally table, I can just about guarantee that a CLR will not find it's way into my code.

    I personally don't mind a little CLR, where it is non specific string ops, and the performance is that much better.

    Sam

    Very cool, Sam. To answer your question, I just let the natural rowcount put out the 6000 count.

    The data you said you built as a "huge repetition" may fall into the category where the "fixed field" anomoly does trick SQL Server into evaluating the split just once. That's why I built the test table where the length of each element is random. Then again, considering what the CLR does, maybe not.

    So far as running a CLR, no can do... I don't have C# or VB on my box and, since I'm a bit over-cautious, won't run someone else's DLL.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".
    "Dear Lord... I'm a DBA so please give me patience because, if you give me strength, I'm going to need bail money too!"

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Chandra Sekhar Manchukonda (6/29/2009)


    We are having only five characters ( , ', ", & ) that need escape character in XML.

    We can use replace function for handling them.

    DECLARE @test-2 xml

    DECLARE @chars nvarchar(500)

    set @chars = '1&4,4,5,6'

    set @chars = '' + REPLACE(REPLACE(REPLACE(REPLACE(@chars,'&','&'),'>','>'),'<','<') ,',','') + ''

    select @test-2 = @chars

    select c.value('.','nvarchar(10)')

    from @test.nodes('X') as t(c)

    Just curious... are you suggesting that you're still going to use XML as a splitter even though it's just been proven how slow it is and that it requires such special character replacement?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".
    "Dear Lord... I'm a DBA so please give me patience because, if you give me strength, I'm going to need bail money too!"

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • I have a set of comma delimited strings which have between 1 and 100's of invoice records per string (with each invoice record having the same 10 data fields). Your code turns the data into rows with admirable efficiency. However, I can't work out how to transform the results into columns with one row per invoice record. Can anyone help?

Viewing 15 posts - 16 through 30 (of 49 total)

You must be logged in to reply to this topic. Login to reply