Parse a very large delimited string

  • Trying to parse a very basic table with 2 colums, and ~9,000 rows. Columns are ID and delimited_string. ID is a varchar (50), and delimited_string is a varchar (max). The max len of the delimited_string is 313,895 characters. The delimiter is a semi-colon. I've tried a bunch of different methods, but can't get the query to finish, including this one (which has worked well for me in the past):

    WITH tmp (

    [ID]

    ,DataItem

    ,[delimited_string]

    )

    AS (

    SELECT [ID]

    ,CAST(LEFT([delimited_string], CHARINDEX(';', [delimited_string] + ';') - 1) AS VARCHAR(MAX))

    ,STUFF([delimited_string], 1, CHARINDEX(';', [delimited_string]+ ';'), '')

    FROM input_table

    UNION ALL

    SELECT [ID]

    ,CAST(LEFT([delimited_string], CHARINDEX(';', [delimited_string]+ ';') - 1) AS VARCHAR(MAX))

    ,STUFF([delimited_string], 1, CHARINDEX(';', [delimited_string] + ';'), '')

    FROM tmp

    WHERE [delimited_string] > ''

    )

    SELECT [ID]

    ,DataItem

    INTO split_table

    FROM tmp

    OPTION (MAXRECURSION 0)

    GO

    Any thoughts?

  • Thats where you need a tally table and to split an alternative method.

    Take a look at delimitedsplit8k (https://www.sqlservercentral.com/articles/tally-oh-an-improved-sql-8k-%e2%80%9ccsv-splitter%e2%80%9d-function) and work that into the solution, it's a much more efficient delimited string splitting function.

  • create table #longdelim (ID VARCHAR(50), DelimString VARCHAR(MAX))
    insert into #longdelim values ('123','a;b;c;d;e;f;g;h;i;j;k;l;m;n;o;p;q;r;s;t;u;v;w;x;y;z')
    select tab.id, spl.item from #longdelim tab
    cross apply dbo.delimitedsplit8k(DelimString,';') spl

     

  • @Ant-Green -- Thanks for the info! While this works well, like I said, I have input strings that are up to 314,000 characters long, and this looks to only parse the first 8000 characters. The link you mentioned explicitly said not to change to VARCHAR(MAX). Is there a way I can amend the function to parse all of the string? I understand that performance will suffer (I did run it as is, and it was crazy quick), but I haven't found even a usable method for accomplishing this task yet. Thanks in advance for the reply!

  • As you're on SQL Server 2017, why not use STRING_SPLIT?

    Thom~

    Excuse my typos and sometimes awful grammar. My fingers work faster than my brain does.
    Larnu.uk

  • @thom-2 A -- Thank you for the suggestion! I was able to achieve my goal. For anyone looking to do something similar in the future, here's what I did:

    CREATE TABLE #TempSubProduct (
    [ID] VARCHAR(100)
    ,SubName VARCHAR(8000) ------ Made it large in case there were some very large parsed strings
    )

    DECLARE @PID VARCHAR(100)
    DECLARE @Name VARCHAR(max)

    DECLARE Split_Product CURSOR LOCAL FAST_FORWARD
    FOR
    SELECT [ID]
    ,[delimited_string]
    FROM input_file

    OPEN Split_Product

    FETCH NEXT
    FROM Split_Product
    INTO @PID
    ,@Name

    WHILE @@FETCH_STATUS = 0
    BEGIN
    INSERT INTO #TempSubProduct
    SELECT @PID
    ,SPL.value
    FROM STRING_SPLIT(@Name, ';') AS SPL

    FETCH NEXT
    FROM Split_Product
    INTO @PID
    ,@Name
    END

    SELECT *
    FROM #TempSubProduct
    GO

    --Trim leading spaces from names
    UPDATE #TempSubProduct
    SET subname = LTRIM(subname)
    GO

    SELECT count(*)
    FROM #TempSubProduct

    --73,987,928 strings parsed
    --Get distinct count
    SELECT subname
    ,count(*) AS appearance_count
    INTO #appearance_count
    FROM #TempSubProduct
    GROUP BY subname

    SELECT *
    FROM #appearance_count
    ORDER BY appearance_count DESC
    GO

    This block ran in a matter of a few minutes. Thanks to all for the help in getting this solved!

  • Why go to all the work of coding a cursor?    It's easy enough to do set-based if you CROSS APPLY the STRING_SPLIT() function.

    SELECT [ID], [SubName] = SPL.value
    INTO #TempSubProduct
    FROM input_file
    CROSS APPLY STRING_SPLIT([delimited_string], ';') SPL

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • I'[m always curious...  what is the data in the VARCHAR(MAX) column that is being split?  Thanks.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Large amount of email recipients (> 5000 per row). Final count of split values was ~73 million records

  • That! is a valid question Dixie. I'm new to using SQL server regularly so saw an sample cursor that I could easily adapt and used it. Appreciate the simplification!

  • zackattack05 wrote:

    Large amount of email recipients (> 5000 per row). Final count of split values was ~73 million records

    Been there and done that in the past!  Thank you for the info!

    As a bit of a sidebar, I was totally gobsmacked that a single email could actually have something like this and then, as a part of a job I was doing for a company, actually saw a shedload of emails that had this trait.  It was some hierarchical email tracking software that I was writing for the auto-discover of "legs" of admissible emails.  As a bit of a sidebar, that worked out really well.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Here https://sqlperformance.com/2012/07/t-sql-queries/split-strings you have a performance comparison for some methods.

    As you can see, the CLR versions have no comparison with the rest. You can build them with the code at: http://dataeducation.com/sqlclr-string-splitting-part-2-even-faster-even-more-scalable/

    However, you can already find them made at: https://www.inforcustom.duckdns.org/en-us/documentacion/insertbulkbestia/ibbdetalleobjetoscomunes#ibbclr.StringToTableStringsPairs You can use them for free, for as long as you want.

     

    Francisco

  • fgrodriguez wrote:

    Here https://sqlperformance.com/2012/07/t-sql-queries/split-strings you have a performance comparison for some methods.

    As you can see, the CLR versions have no comparison with the rest. You can build them with the code at: http://dataeducation.com/sqlclr-string-splitting-part-2-even-faster-even-more-scalable/

    However, you can already find them made at: https://www.inforcustom.duckdns.org/en-us/documentacion/insertbulkbestia/ibbdetalleobjetoscomunes#ibbclr.StringToTableStringsPairs You can use them for free, for as long as you want.

    Francisco

    A big note on that: it was not a fair test on Jeff Moden's function, DelimitedSplit8k, as Bertrand altered the function in the test causing significant performance degradation.

    Thom~

    Excuse my typos and sometimes awful grammar. My fingers work faster than my brain does.
    Larnu.uk

  • Thom A wrote:

    fgrodriguez wrote:

    Here https://sqlperformance.com/2012/07/t-sql-queries/split-strings you have a performance comparison for some methods.

    As you can see, the CLR versions have no comparison with the rest. You can build them with the code at: http://dataeducation.com/sqlclr-string-splitting-part-2-even-faster-even-more-scalable/

    However, you can already find them made at: https://www.inforcustom.duckdns.org/en-us/documentacion/insertbulkbestia/ibbdetalleobjetoscomunes#ibbclr.StringToTableStringsPairs You can use them for free, for as long as you want.

    Francisco

    A big note on that: it was not a fair test on Jeff Moden's function, DelimitedSplit8k, as Bertrand altered the function in the test causing significant performance degradation.

    Thanks for that.  I'll also mention that the conclusion that he came to would be faulty even if he made the perfect copy because his test data is "grooved".  In other words, it repeats the same text for test data for all of the rows giving the data a cardinality of 1... XML loves that and it makes it look like the XML is a contender.  Try it with real data.  The XML will die.  In a presentation I did (I forget how many rows were involved) but using the same number of elements per row, DelimitedSplit8K took 22 seconds where the XML method took 8 MINUTES.

    Unfortunately, the closed that thread before I could reply with that response and that thread has led many astray.  The "Devil's in the Data" folks.

    Heh... sounds like a good title for an article I started on the subject a long time ago.

    {EDIT} I just looked it up... it was 100K rows and it was 25 seconds for the DelimitedSplitN4K function (I didn't change his stuff like he did mine and so had to use DelimitedN4K instead) and about 3 seconds more than 8 MINUTES for the XML method.  The XML method also broke under certain conditions where the DelimitedSplitN4K did not.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Also... make sure you test the "Devil" out of any SQLCLR you might use for this... a lot of people have a lot of different ideas of what should happen if you have (for example) adjacent delimiters, leading delimiters, trailing delimiters, and what should be returned if you pass and empty or null string and whether or not things like spaces should be trimmed from returned elements or not.  Like I said, the "Devil's in the Data". 😉

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 15 posts - 1 through 14 (of 14 total)

You must be logged in to reply to this topic. Login to reply