Recursive CTE comma list parser

  • I recently had to import data from a third-party system that had (ugh) comma-separated fields in a table. I came up with a recursive CTE to do the parsing. I thought I'd post the query here and ask for critiques and criticisms.

    CREATE TABLE #tmp (id int, string varchar(1000))

    INSERT INTO #tmp (id, string)
    SELECT 1, 'abcd, efgh, ijkl, mnop, qrst, uvwx, yz' UNION ALL
    SELECT 2, 'zyxw, vuts, rqpo, nmlk, jihg, fedc, ba' UNION ALL
    SELECT 3, 'the, quick, brown, fox, jumped, over, the, lazy, dog'

    ;WITH test (id, lft, rght, idx) AS
    (
        -- Anchor: peel the first item off each string. The CASEs guard against a
        -- string with no delimiter at all (CHARINDEX returns 0, so LEFT(s, -1) would error)
        SELECT t.id
            ,CASE WHEN CHARINDEX(', ', t.string) = 0 THEN t.string
                  ELSE LEFT(t.string, CHARINDEX(', ', t.string) - 1) END
            ,CASE WHEN CHARINDEX(', ', t.string) = 0 THEN ''
                  ELSE SUBSTRING(t.string, CHARINDEX(', ', t.string) + 2, DATALENGTH(t.string)) END
            ,0
        FROM #tmp t
        UNION ALL
        -- Recursive member: keep peeling items off the remainder (rght) until it is empty
        SELECT c.id
            ,CASE WHEN CHARINDEX(', ', c.rght) = 0 THEN c.rght
                  ELSE LEFT(c.rght, CHARINDEX(', ', c.rght) - 1) END
            ,CASE WHEN CHARINDEX(', ', c.rght) > 0
                  THEN SUBSTRING(c.rght, CHARINDEX(', ', c.rght) + 2, DATALENGTH(c.rght))
                  ELSE '' END
            ,c.idx + 1
        FROM test c
        WHERE DATALENGTH(c.rght) > 0
    )
    SELECT * FROM test
    ORDER BY id, idx
    OPTION (MAXRECURSION 0) -- lists with more than 100 items would otherwise hit the default recursion limit

    DROP TABLE #tmp
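
    For id = 1, the rows come out like this (idx 2 through 4 elided):

    -- id  lft    rght                               idx
    -- 1   abcd   efgh, ijkl, mnop, qrst, uvwx, yz   0
    -- 1   efgh   ijkl, mnop, qrst, uvwx, yz         1
    -- ...
    -- 1   uvwx   yz                                 5
    -- 1   yz                                        6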

  • Nice, I like it better than some of the comma split functions I have seen.

    What have you seen this scale out to?

    For curiosity's sake, I decided to see what kind of performance I would get if I grew this from a 3-row temp table to a 10,000- and then a 100,000-row temp table. I also changed your id to an identity for insert ease with the bigger row counts.

    I saw processing time jump from near-instantaneous to 6 seconds to 1m01s across those row-count levels.

    If I filter on a specific ID after the CTE is done, the processing time drops dramatically. Moving that filter into the CTE itself yields still more significant gains (see the sketch at the end of this post).

    The following is only for the CTE and the SELECT statement immediately following it.

    Rows      Scan count   Logical reads   Physical reads   Processing time
    10        2            603             0                125 ms
    1,000     2            52,262          0                437 ms
    10,000    2            522,008         0                4,390 ms
    100,000   2            7,796,635       0                43,406 ms

    Now if I query a single ID from the 100,000-row temp table, the processing time drops to 15 ms (scan count 2, logical reads 43, physical reads 0).

    I think this query can be very useful - depending on implementation and scale.
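
    For clarity, this is roughly what I mean by filtering on a specific ID inside the CTE (a sketch; @id is a made-up local variable, and the CTE body is the original poster's):

    DECLARE @id int = 42 -- hypothetical target row

    ;WITH test (id, lft, rght, idx) AS
    (
        SELECT t.id
            ,CASE WHEN CHARINDEX(', ', t.string) = 0 THEN t.string
                  ELSE LEFT(t.string, CHARINDEX(', ', t.string) - 1) END
            ,CASE WHEN CHARINDEX(', ', t.string) = 0 THEN ''
                  ELSE SUBSTRING(t.string, CHARINDEX(', ', t.string) + 2, DATALENGTH(t.string)) END
            ,0
        FROM #tmp t
        WHERE t.id = @id -- filter here, not after the CTE runs
        UNION ALL
        SELECT c.id
            ,CASE WHEN CHARINDEX(', ', c.rght) = 0 THEN c.rght
                  ELSE LEFT(c.rght, CHARINDEX(', ', c.rght) - 1) END
            ,CASE WHEN CHARINDEX(', ', c.rght) > 0
                  THEN SUBSTRING(c.rght, CHARINDEX(', ', c.rght) + 2, DATALENGTH(c.rght))
                  ELSE '' END
            ,c.idx + 1
        FROM test c
        WHERE DATALENGTH(c.rght) > 0
    )
    SELECT * FROM test ORDER BY idx
    OPTION (MAXRECURSION 0)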

  • What have you seen this scale out to?

    It's funny you should ask. I'm using a modified version of it in testing right now. I've run it against about 180,000 records so far. The performance has been (mostly) acceptable, but the list I'm splitting uses a more complicated delimiter, so I have to find it with PATINDEX('%[A-Z]_[A-Z]_****%') instead of CHARINDEX (simplified sketch below). I'm about to bump the data set up to around a million rows (in dev).

    I'll update later with some more info.
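
    The idea, with a simplified one-character pattern standing in for my real delimiter (the variable name and pattern here are illustrative only):

    -- Illustrative only: find a pattern delimiter with PATINDEX instead of CHARINDEX.
    -- My real pattern is wider, but the peel-off logic in the CTE stays the same.
    DECLARE @s varchar(1000) = 'abcd1efgh2ijkl'
    SELECT
         LEFT(@s, PATINDEX('%[0-9]%', @s) - 1)                      AS lft  -- 'abcd'
        ,SUBSTRING(@s, PATINDEX('%[0-9]%', @s) + 1, DATALENGTH(@s)) AS rght -- 'efgh2ijkl'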

  • First run on a set with 1.3 million rows. The result set expanded to 3.8 million rows.

    I had 167,323,868 logical reads against my temp table, 0 physical.

    Unfortunately, that took 772,563 ms (just under 13 minutes). On the plus side, the other options I've tried for splitting that dataset used a WHILE loop and took much longer.

    UPDATE

    I broke the dataset up into 100,000-row chunks (roughly as sketched below) and I'm seeing performance comparable to CirqueDeSQLeil's. Running it against the whole dataset takes about 10x as many logical reads as running it against 1/10th of the dataset.
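
    Something like this, with table and column names assumed to match the original post's example:

    -- Sketch: run the recursive splitter over 100,000-row slices of the source
    -- table so each pass recurses over a bounded set. Names are illustrative.
    CREATE TABLE #out (id int, item varchar(1000), idx int)

    DECLARE @batch int, @lo int, @max int
    SELECT @batch = 100000, @lo = MIN(id), @max = MAX(id) FROM #tmp

    WHILE @lo <= @max
    BEGIN
        ;WITH test (id, lft, rght, idx) AS
        (
            SELECT t.id
                ,CASE WHEN CHARINDEX(', ', t.string) = 0 THEN t.string
                      ELSE LEFT(t.string, CHARINDEX(', ', t.string) - 1) END
                ,CASE WHEN CHARINDEX(', ', t.string) = 0 THEN ''
                      ELSE SUBSTRING(t.string, CHARINDEX(', ', t.string) + 2, DATALENGTH(t.string)) END
                ,0
            FROM #tmp t
            WHERE t.id BETWEEN @lo AND @lo + @batch - 1 -- the chunk filter
            UNION ALL
            SELECT c.id
                ,CASE WHEN CHARINDEX(', ', c.rght) = 0 THEN c.rght
                      ELSE LEFT(c.rght, CHARINDEX(', ', c.rght) - 1) END
                ,CASE WHEN CHARINDEX(', ', c.rght) > 0
                      THEN SUBSTRING(c.rght, CHARINDEX(', ', c.rght) + 2, DATALENGTH(c.rght))
                      ELSE '' END
                ,c.idx + 1
            FROM test c
            WHERE DATALENGTH(c.rght) > 0
        )
        INSERT INTO #out (id, item, idx)
        SELECT id, lft, idx FROM test
        OPTION (MAXRECURSION 0)

        SET @lo = @lo + @batch
    END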

  • That seems to be scaling about the same as my tests.

    I do like the performance gains when a specific ID can be passed, though. This script could prove very handy.

  • Give this one a try. You may need to modify it a bit, but if you give me some sample data I can do some of the work.

    USE [SandBox]
    GO

    /****** Object: UserDefinedFunction [dbo].[DelimitedSplit] Script Date: 12/08/2009 11:17:39 ******/
    SET ANSI_NULLS ON
    GO

    SET QUOTED_IDENTIFIER ON
    GO

    CREATE FUNCTION [dbo].[DelimitedSplit] (
        @pString varchar(max),
        @pDelimiter char(1)
    )
    RETURNS table
    AS
    RETURN
    WITH
    -- a1 is ten rows of 1; each cross join below multiplies the running row count
    a1 as (select 1 as N union all select 1 union all select 1 union all
           select 1 union all select 1 union all select 1 union all
           select 1 union all select 1 union all select 1 union all select 1),
    a2 as (select 1 as N from a1 as a cross join a1 as b),   -- 10 * 10 = 100 rows
    a3 as (select 1 as N from a2 as a cross join a2 as b),   -- 100 * 100 = 10,000 rows
    --a4 as (select 1 as N from a3 as a cross join a2 as b), -- 10,000 * 100 = 1,000,000 rows, if even more numbers are needed
    -- Tally: the numbers 1 through LEN(@pString); a3 cross join a2 yields up to 1,000,000 of them
    -- (note: a string that ends in the delimiter will drop its trailing empty item)
    Tally as (select top (len(@pString))
                  row_number() over (order by a.N) as N
              from a3 as a cross join a2 as b),
    ItemSplit (ItemOrder, Item) as (
        SELECT N,
               SUBSTRING(@pDelimiter + @pString + @pDelimiter, N + 1,
                         CHARINDEX(@pDelimiter, @pDelimiter + @pString + @pDelimiter, N + 1) - N - 1)
        FROM Tally
        WHERE N < LEN(@pDelimiter + @pString + @pDelimiter)
          AND SUBSTRING(@pDelimiter + @pString + @pDelimiter, N, 1) = @pDelimiter -- Notice how we find the delimiter
    )
    SELECT
        row_number() over (order by ItemOrder) as ItemID,
        Item
    FROM ItemSplit
    ;
    GO
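
    For example, against the #tmp table from the original post you could call it with CROSS APPLY; the LTRIM is there because the function splits on the single ',' character, which leaves the space that follows each comma:

    -- Usage sketch against the original post's #tmp table
    SELECT t.id, ds.ItemID, LTRIM(ds.Item) AS Item
    FROM #tmp t
    CROSS APPLY dbo.DelimitedSplit(t.string, ',') AS ds
    ORDER BY t.id, ds.ItemID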

    Also, the reason yours doesn't scale well is the recursion. It is basically RBAR ("row by agonizing row"), just like the splitters that use a WHILE loop, whereas the Tally table lets the whole split run as a single set-based pass.
