Set-Based String Splitting table function

  • Jeff Moden - Tuesday, March 20, 2018 3:10 PM

    bevanward - Tuesday, March 20, 2018 3:04 PM

    Jeff Moden - Monday, March 19, 2018 7:06 PM

    Cool feedback.  Thanks, Bevan.

    I think the concern of some of the other posters has to do with performance.  Generally speaking and with few exceptions, an mTVF (Multi-statement Table Valued Function) is going to be slower than an iTVF (Inline Table Valued Function).  Your use of LEAD may help overcome that but, if you can use LEAD, take a look at a modernized version of DelimitedSplit8K that a good online friend, Eirikur Eiriksson, wrote.  I've tested it and it's twice as fast as DelimitedSplit8K and approaches CLR speeds very closely.  Of course, it requires 2012+.

    http://www.sqlservercentral.com/articles/SQL+Server+2012/106903/
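
    For anyone unfamiliar with the distinction, here's a minimal sketch of the two forms (placeholder function names and bodies, not the actual splitters):

    -- mTVF: the body declares and fills a table variable in one or more
    -- statements. The optimizer treats the call as a black box.
    CREATE FUNCTION dbo.fn_Example_mTVF (@pValue INT)
    RETURNS @Result TABLE (N INT)
    AS
    BEGIN
       INSERT INTO @Result (N) SELECT @pValue;
       RETURN;
    END
    GO

    -- iTVF: a single RETURN of one SELECT. The optimizer can expand it
    -- inline into the calling query, much like a parameterized view.
    CREATE FUNCTION dbo.fn_Example_iTVF (@pValue INT)
    RETURNS TABLE WITH SCHEMABINDING AS
    RETURN SELECT N = @pValue;
    GO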

    Hi Jeff

    I have modified Eirikur's version for multiple character delimiters and longer fields. It certainly runs fast from what I've seen so far, but I have to test more.
    The only issue I have stumbled into so far is where delimiters overlap and it splits multiple times - for example, when splitting on three dots and passing through an area with lots of dots. Any thoughts appreciated. Thanks for pointing this all out.
    Thanks
    Bevan

    Post the modifications you made.

    This is what it looks like - I did not realise you cannot attach .sql files.

    Thanks
    Bevan

    use tempdb
    go

    drop function if exists DelimitedSplit_Max_Multi_LAG
    go

    create function [dbo].[DelimitedSplit_Max_Multi_LAG]
    /*
    Based on
    http://www.sqlservercentral.com/articles/SQL+Server+2012/106903/
    and http://www.sqlservercentral.com/articles/Tally+Table/72993/

    */

    --===== Define I/O parameters
       (@pString NVARCHAR(max), @pDelimiter NVARCHAR(1000))
    RETURNS TABLE WITH SCHEMABINDING AS
    RETURN
    --===== "Inline" CTE Driven "Tally Table†produces values from 0 up to 100,000,000...
      -- enough to cover a big file
    WITH E1(N) AS (
          SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
          SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
          SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
          ),         --10E+1 or 10 rows
       E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows
       E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows
       E6(N) AS (SELECT 1 FROM E4 a, E4 b), --10E+6 or 1,000,000
       E8(N) AS (SELECT 1 FROM E6 a, E6 b), --10E+8 or 100,000,000
         
    cteTally(N) AS (--==== This provides the "zero base" and limits the number of rows right up front
           -- for both a performance gain and prevention of accidental "overruns"
          SELECT 0 UNION ALL
          SELECT TOP (DATALENGTH(ISNULL(@pString,1))/2) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E8
          ),
    cteStart(N1) AS (--==== Find right edge of delimiter
          SELECT t.N+case when t.N = 0 then 0 else datalength(@pDelimiter)/2 end
           FROM cteTally t
          WHERE (SUBSTRING(@pString,t.N,datalength(@pDelimiter)/2) = @pDelimiter OR t.N = 0)
          )
    --===== Do the actual split. The ISNULL/NULLIF combo handles the length for the final element when no delimiter is found.
    SELECT ItemNumber = ROW_NUMBER() OVER(ORDER BY s.N1),
       Item = SUBSTRING(@pString,s.N1,ISNULL(NULLIF((LEAD(s.N1,1,1) OVER (ORDER BY s.N1) - case when s.N1=1 or count(*) over (partition by (select null)) = ROW_NUMBER() OVER(ORDER BY s.N1) then 1 else (datalength(@pDelimiter)/2) end),0)-s.N1,datalength(@pString)/2))
     FROM cteStart s
    ;
    GO

    select * from dbo.DelimitedSplit_Max_Multi_LAG('The shortbeard codling can grow to a maximum standard length of about 33 cm (13 in) and takes the form of a somewhat spindle-shaped cylinder. The snout is fairly blunt and the upper jaw is slightly longer than the lower jaw. There is a short barbel on the chin and a flattened spine on the operculum. There are two dorsal fins with no spines and a total of from 69 to 76 soft rays. The first dorsal fin has 6 rays, the second of which is the longest, and the second dorsal fin originates immediately behind the first fin. Each pectoral fin has 17 to 23 soft rays. The pelvic fins are widely separated and each has two elongated soft rays; these fins originate to the front of the pectoral fins. The anal fin has no spines and has 57 to 63 soft rays. The caudal fin is small and rounded. The general colour of this fish is tan, the dorsal and anal fins having dark edges.','. ')
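
    A small repro of the overlap issue mentioned above: the two-character delimiter '..' matches at consecutive positions inside a run of three dots, so cteStart yields start positions one character apart:

    -- '..' matches inside '...' at two overlapping positions, so extra
    -- fragments appear around the three-dot run.
    select * from dbo.DelimitedSplit_Max_Multi_LAG('one..two...three','..')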

  • bevanward - Tuesday, March 20, 2018 1:55 AM

    Hi Scott
    Thanks for your message! It is hard to know how long to look for something before starting to write. What Jeff wrote looks perfect.
    Had you seen the modernised version that Jeff mentions above? Seems to be worth considering from the looks of the performance.

    Thanks again keep well
    Bevan

    Hi Bevan,

    Your article has triggered a dialogue which is a good thing - you've found Jeff's code, and no, I was not aware of the modernised version that Jeff mentioned.  So your post has triggered further education on my part 🙂
    Since I'm on SS 2012, I'll definitely look at the code from  Eirikur Eiriksson.
    Not to hijack this post, but since it's related, I thought I'd share my use case. 
    I've got a source table where multiple values are stored in a pipe-delimited string.  So, instead of columns foo1-foo50, there's a single foo column, with data "A|B|C|D|etc".  It would be rare (but possible) to have a complete set of 50 tokens.
    There are multiple such columns, and not always a one-to-one match of tokens within the different columns (e.g. foo might contain 20 tokens, bar might contain 15 tokens, and blah might contain 3 tokens).
    I have to "unwind" this data, converting multiple tokens into multiple rows, essentially normalizing the data.  A bit like UNPIVOT, but with a concatenated string rather than multiple columns.
    Jeff's code has been invaluable in helping me to code this up.  I never would have come up with his approach on my own.
    Here is my code.  fn_MoH_SplitDelimitedStringToTable is identical to fn_DelimitedSplit8K, just a different name.

    CREATE VIEW [phi].[vwPROCEDURE_V]
    AS
    WITH cteProcedure_Code AS
    (
      SELECT src.FACILITY_STAY_EPISODE_ID
        ,CAST(ds.ItemNumber AS SMALLINT) AS ItemNumber
        ,CAST(ds.Item AS VARCHAR(8))  AS Item
      FROM trn.PROCEDURE src
      CROSS APPLY
      dbo.fn_MoH_SplitDelimitedStringToTable('|',
       CONCAT(ISNULL(src.procedure_code_p,'-'),'|',ISNULL(src.procedure_code_sec,'-'))
      ) ds
    )
    ,
    cteProcedure_Date AS
    (
      SELECT src.FACILITY_STAY_EPISODE_ID
        ,CAST(ds.ItemNumber AS SMALLINT) AS ItemNumber
        ,CAST(ds.Item AS VARCHAR(9))  AS Item
      FROM trn.PROCEDURE src
      CROSS APPLY
      dbo.fn_MoH_SplitDelimitedStringToTable('|',
       CONCAT(ISNULL(CONVERT(VARCHAR(8),src.procedure_date_p,112),'-'),'|',ISNULL(src.procedure_date_sec,'-'))
      ) ds
    )
    ,
    cteProcedure_Location AS
    (
      SELECT src.FACILITY_STAY_EPISODE_ID
        ,CAST(ds.ItemNumber AS SMALLINT) AS ItemNumber
        ,CAST(ds.Item AS CHAR(1))   AS Item
      FROM trn.PROCEDURE src
      CROSS APPLY
      dbo.fn_MoH_SplitDelimitedStringToTable('|',
       CONCAT(ISNULL(src.procedure_location_p,'-'),'|',ISNULL(src.procedure_location_sec,'-'))
      ) ds
    )
    ,
    cteBlock_Num AS
    (
      SELECT src.FACILITY_STAY_EPISODE_ID
        ,CAST(ds.ItemNumber AS SMALLINT) AS ItemNumber
        ,CAST(ds.Item AS VARCHAR(4))  AS Item
      FROM trn.PROCEDURE src
      CROSS APPLY
      dbo.fn_MoH_SplitDelimitedStringToTable('|',
       CONCAT(ISNULL(src.block_num_p,'-'),'|',ISNULL(src.block_num_sec,'-'))
      ) ds
    )
    SELECT
       -- Source
       a.FACILITY_STAY_EPISODE_ID
      ,a.FACILITY_STAY_ID
      ,a.FACILITY_ID
      ,a.clinical_codeset
      
      -- Procedure Code
      ,b.ItemNumber-1 AS [ROW_NUMBER]
      ,b.ItemNumber-1 AS procedure_code_number
      ,CASE WHEN b.Item='-' THEN NULL ELSE b.Item END AS procedure_code
      
      -- Procedure Date
      ,c.ItemNumber-1 AS procedure_date_number
      ,CASE WHEN c.Item='-' THEN NULL ELSE CAST(c.Item AS DATE) END AS procedure_date
      
      -- Procedure Location
      ,d.ItemNumber-1 AS procedure_location_number
      ,CASE WHEN d.Item='-' THEN NULL ELSE d.Item END AS procedure_location
      
      -- Procedure Type
      ,CAST(
       CASE
        WHEN b.ItemNumber-1 = 0 THEN 'P'
        ELSE FORMAT(b.ItemNumber-1,'00')
       END AS CHAR(2)
       )      AS procedure_type
      
       -- Block Number
      ,e.ItemNumber-1 AS block_num_number
      ,CASE WHEN e.Item='-' THEN NULL ELSE e.Item END AS block_num

    -- Source
    FROM trn.PROCEDURE a

    -- Procedure Code  
    LEFT JOIN cteProcedure_Code b
    ON  a.FACILITY_STAY_EPISODE_ID=b.FACILITY_STAY_EPISODE_ID

    -- Procedure Date
    LEFT JOIN cteProcedure_Date c
    ON  a.FACILITY_STAY_EPISODE_ID=c.FACILITY_STAY_EPISODE_ID
    AND  b.ItemNumber=c.ItemNumber

    -- Procedure Location
    LEFT JOIN cteProcedure_Location d
    ON  a.FACILITY_STAY_EPISODE_ID=d.FACILITY_STAY_EPISODE_ID
    AND  b.ItemNumber=d.ItemNumber

    -- Block Num
    LEFT JOIN cteBlock_Num e
    ON  a.FACILITY_STAY_EPISODE_ID=e.FACILITY_STAY_EPISODE_ID
    AND  b.ItemNumber=e.ItemNumber
    GO

    The source table has about 5M rows; the output has about 17M rows.  We don't have the fastest server, and the load takes about 4 hours to run.  Any performance improvements (e.g. Eirikur's code) or improvements to my own approach will certainly help.

    Regards,
    Scott

  • I have previously modified my copy of Jeff's code to only return non-blank items, and return them stripped of leading and trailing spaces.

    Here is an excerpt of my changes:

    --===== Do the actual split. The ISNULL/NULLIF combo handles the length for the final element when no delimiter is found.
       --===== Strip leading and trailing spaces, only return non-empty results.

      SELECT ItemNumber = ROW_NUMBER() OVER (ORDER BY l.N1),
        Item = LTRIM(RTRIM(SUBSTRING(@pString, l.N1, l.L1)))
      FROM cteLen l
      WHERE LEN(SUBSTRING(@pString, l.N1, l.L1)) > 0

    I want to do a similar thing for Eirikur's code, but it's a bit trickier due to the LEAD() windowing function. 

    I include my changes in their entirety, including a longer header with use cases.  Some of the header comments are meant to help less TSQL-aware colleagues.  The changes are at the bottom of the code.

    Compare the results of my use cases with the original code vs. my changes - my desired output should be obvious by comparing the two results.

    If there is a better (i.e. more performant) way to implement these changes please let me know.


    /*=====================================================================
    Function Name    : fn_MoH_DelimitedSplit8K_LEAD.sql
    Purpose      : User Defined Function to split a string based
             on a delimiter and return the tokens in the
             string as a table object. Uses the LEAD
             windowing function for performance gains over
             the DelimitedSplit8K function.
    SQL Server Version  : Microsoft SQL Server 2012 (SP3) (KB3072779) - 11.0.6020.0 (X64)

    Other Code Called   : None

    Originally Written by : Scott In Sydney
    Date       : 21MAR2018
    Stored Process Version : 1.0

    =======================================================================

    This code was originally published at:
    http://www.sqlservercentral.com/articles/SQL+Server+2012/106903/

    See also:
    http://www.sqlservercentral.com/articles/Tally+Table/72993/

    Full attribution is given to the original author Eirikur Eiriksson by
    virtue of the above URL. My assumption is that this code is in the
    public domain by virtue of that URL.

    =======================================================================

    Modification History  : Original version

    =====================================================================*/

    /*---------------------------------------------------------------------
    Usage:

    -- Create a synonym to this function in the desired database
    USE Dev
    GO

    DROP SYNONYM [dbo].[fn_MoH_SplitDelimitedStringToTable]
    CREATE SYNONYM [dbo].[fn_MoH_SplitDelimitedStringToTable]
    FOR [master].[dbo].[fn_MoH_DelimitedSplit8K_LEAD]

    SELECT ds.ItemNumber, CAST(ds.Item AS VARCHAR(5)) AS Item
    FROM dbo.fn_MoH_SplitDelimitedStringToTable(' ','foo bar blah') ds

    -- the default delimiter is a blank space, but a UDF requires all parameters to be specified; use "DEFAULT" for the default delimiter
    SELECT ds.ItemNumber, CAST(ds.Item AS VARCHAR(5)) AS Item
    FROM dbo.fn_MoH_SplitDelimitedStringToTable(DEFAULT,'foo bar  blah') ds

    Returns:

    foo
    bar
    blah

    as a table, with Column Name "Item".

    SELECT ds.ItemNumber, CAST(ds.Item AS VARCHAR(5)) AS Item
    FROM master.dbo.fn_MoH_DelimitedSplit8K_LEAD(',','fie, fi,,,, fo, fum') ds

    Returns:

    fie
    fi
    fo
    fum

    as a table, with Column Name "Item".

    =======================================================================

    -- Cross Apply with another table
    SELECT s.name, ds.Item
    FROM (
      SELECT TOP 3 * FROM sys.databases
    ) s
    CROSS APPLY
    dbo.fn_MoH_SplitDelimitedStringToTable(' ','foo bar blah') ds

    SELECT s.name, ds.Item
    FROM (
      SELECT TOP 3 * FROM sys.databases
    ) s
    CROSS APPLY
    dbo.fn_MoH_SplitDelimitedStringToTable(' ',' foo bar blah ') ds

    SELECT s.name, ds.Item
    FROM (
      SELECT TOP 3 * FROM sys.databases
    ) s
    CROSS APPLY
    master.dbo.fn_MoH_DelimitedSplit8K_LEAD('^','foo^^bar^^^blah') ds

    Returns foo/bar/blah cross joined (Cartesian Product) with the first
    three rows from sys.databases.

    =======================================================================

    -- Split a delimited string supplied by a table instead of hard coded
    IF OBJECT_ID('tempdb..#temp') IS NOT NULL DROP TABLE #temp

    SELECT *
    INTO #temp
    FROM (
       SELECT ' foo| bar| blah '  
       UNION ALL
       SELECT 'fie|fi|fo|fum'
       ) d (string)

    SELECT * FROM #temp

    SELECT t.string, ds.ItemNumber, ds.Item
    FROM #temp t
    CROSS APPLY
    master.dbo.fn_MoH_DelimitedSplit8K_LEAD('|',t.string) ds

    Returns the delimited string data split by the supplied delimiter,
    with the string data supplied by a column in a table.

    -- Nested query
    SELECT s.name, x.ItemNumber, x.Item
    FROM (
      SELECT TOP 3 * FROM sys.databases
    ) s
    CROSS APPLY
    (
      SELECT ds.*
      FROM #temp t
      CROSS APPLY
      master.dbo.fn_MoH_DelimitedSplit8K_LEAD('|',t.string) ds
    ) x

    IF OBJECT_ID('tempdb..#temp') IS NOT NULL DROP TABLE #temp

    Returns the delimited string data split by the supplied delimiter,
    with the string data supplied by a column in a table,
    cross joined (Cartesian Product) with the first
    three rows from sys.databases.

    =======================================================================

    -----------------------------------------------------------------------
    Notes:

    The input delimited string is limited to 8000 characters.

    Do not mark this user defined function as a system function via
    sp_MS_marksystemobject. This only works for stored procedures.

    Instead, create a synonym in each database requiring this UDF.
    This will allow a two-level name to be used to call this UDF.

    Alternatively, use a three-level name to call this UDF,
    since it does not require the calling database context.

    ---------------------------------------------------------------------*/

    ALTER FUNCTION [dbo].[fn_MoH_DelimitedSplit8K_LEAD]
    --===== Define I/O parameters
    ( @pDelimiter        CHAR(1) = ' '
    , @pString         VARCHAR(8000)
    )
    --WARNING!!! DO NOT USE MAX DATA-TYPES HERE! IT WILL KILL PERFORMANCE!
    RETURNS TABLE WITH SCHEMABINDING AS
    RETURN
      --===== "Inline" CTE Driven "Tally Table" produces values from 1 up to 10,000...
      -- enough to cover VARCHAR(8000)
      WITH E1(N) AS (
       SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
       SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
       SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
      ), --10E+1 or 10 rows
      E2(N) AS (
       SELECT 1 FROM E1 a, E1 b
      ), --10E+2 or 100 rows
      E4(N) AS (
       SELECT 1 FROM E2 a, E2 b
      ), --10E+4 or 10,000 rows max
      cteTally(N) AS (
       --==== This provides the "zero base" and limits the number of rows right up front
       -- for both a performance gain and prevention of accidental "overruns"
       SELECT 0
       UNION ALL
       SELECT TOP (DATALENGTH(ISNULL(@pString,1))) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4  
      ),
      cteStart(N1) AS (
       --==== This returns N+1 (starting position of each "element" just once for each delimiter)
       SELECT t.N+1
       FROM cteTally t
       WHERE (SUBSTRING(@pString,t.N,1) = @pDelimiter OR t.N = 0)
      ),
      cteResults AS (
       --===== Do the actual split. The ISNULL/NULLIF combo handles the length for the final element when no delimiter is found.
       SELECT Item = SUBSTRING(@pString,s.N1,ISNULL(NULLIF((LEAD(s.N1,1,1) OVER (ORDER BY s.N1) - 1),0)-s.N1,8000))
       FROM cteStart s
      )
      --===== Strip leading and trailing spaces, only return non-empty results.
      SELECT ItemNumber = ROW_NUMBER() OVER (ORDER BY (SELECT NULL)),
        LTRIM(RTRIM(r.Item)) AS Item
      FROM cteResults r
      WHERE LEN(r.Item) > 0
    ;

    /******* END OF FILE *******/

  • Also, I've done a performance comparison of DelimitedSplit8K vs. DelimitedSplit8K_LEAD, given my use case and minor code changes noted in previous posts above.

    DelimitedSplit8K:

    SET STATISTICS TIME ON

    SELECT src.FACILITY_STAY_EPISODE_ID
      ,CAST(ds.ItemNumber AS SMALLINT) AS ItemNumber
      ,CAST(ds.Item AS VARCHAR(8))  AS Item
    FROM trn.DIAGNOSIS src
    CROSS APPLY
    master.dbo.fn_MoH_DelimitedSplit8K('|',
      CONCAT(src.diagnosis_code_p,'|',src.diagnosis_code_sec)
    ) ds
    WHERE ds.Item IS NOT NULL

    SQL Server parse and compile time:
     CPU time = 0 ms, elapsed time = 6 ms.

    SQL Server Execution Times:
     CPU time = 0 ms, elapsed time = 0 ms.

    (17277002 rows affected)

    SQL Server Execution Times:
     CPU time = 3927157 ms, elapsed time = 4867563 ms. (Which is 01:21:08)

    DelimitedSplit8K_LEAD:

    SET STATISTICS TIME ON

    SELECT src.FACILITY_STAY_EPISODE_ID
      ,CAST(ds.ItemNumber AS SMALLINT) AS ItemNumber
      ,CAST(ds.Item AS VARCHAR(8))  AS Item
    FROM trn.DIAGNOSIS src
    CROSS APPLY
    master.dbo.fn_MoH_DelimitedSplit8K_LEAD('|',
      CONCAT(src.diagnosis_code_p,'|',src.diagnosis_code_sec)
    ) ds
    WHERE ds.Item IS NOT NULL

    SQL Server parse and compile time:
     CPU time = 11 ms, elapsed time = 11 ms.

    SQL Server Execution Times:
     CPU time = 0 ms, elapsed time = 0 ms.

    (17277002 rows affected)

    SQL Server Execution Times:
     CPU time = 906594 ms, elapsed time = 1832653 ms. (Which is 00:30:33)

    Hope this helps...

  • Scott In Sydney - Tuesday, March 20, 2018 5:05 PM


    Hi Scott

    Just an initial thought that might be completely off but thought I should mention first up.

    Can you run a bunch of batches in parallel?

    If you can monopolize the resources on the server then ...

    Run the original SQL 2012 LEAD split (DelimitedSplit8K_LEAD) for each column concurrently.

    For example run a bunch of these concurrently:

    insert into tmp<field one>
    select PrimaryKey, Ordinal, Item from <table> cross apply dbo.DelimitedSplit8K_LEAD(<field one>,'|')
    delete tmp<field one> where Item is null
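
    A slightly fuller sketch of one such batch (PrimaryKey, SourceTable, FieldOne and tmp_FieldOne are placeholders to substitute; SELECT ... INTO avoids pre-creating the staging table and the delete-after step):

    -- One batch per delimited column, each run in its own session.
    SELECT src.PrimaryKey,
        ds.ItemNumber AS Ordinal,
        ds.Item
    INTO tmp_FieldOne
    FROM dbo.SourceTable src
    CROSS APPLY dbo.DelimitedSplit8K_LEAD(src.FieldOne, '|') ds
    WHERE ds.Item IS NOT NULL;   -- filter up front instead of deleting after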

    If you monitor your resource consumption on the server as you add more of these and measure overall time, that may be the simplest way to get the improvement.

    Then once it is complete, recombine the results into the final required shape?

    I think with that number of records you should be able to get that processing time down a lot.

    Will have more of a look tonight when I get off.

    Thanks
    Bevan

  • bevanward - Wednesday, March 21, 2018 5:22 AM


    Hi Scott,
    can you post the code of the functions used please?
    I may have missed them if you've posted them earlier - struggling with the connection, so I cannot see them.

    😎

  • Eirikur Eiriksson - Wednesday, March 21, 2018 9:02 AM


    Hi Eirikur

    Firstly, thanks for your article Reaping the benefits of the Window functions in T-SQL! Jeff pointed it out to me after I had a go at writing a set-based split function, and since then I have had a great education in taking it all much further.

    My main needs in adapting what you have written are:
    1. Multiple character delimiters
    2. Text longer than varchar(8000)

    I was wondering if you could have a look at what I have come up with, given your experience. It is already significantly faster than what I have had in use, but I would appreciate any comments if you get a chance.
    Many thanks
    Bevan

    use tempdb
    go

    drop function if exists DelimitedSplit_Max_Multi_LAG
    go

    create function [dbo].[DelimitedSplit_Max_Multi_LAG]

    /*
    This script is based on a modification of an excellent script by Eirikur Eiriksson
    http://www.sqlservercentral.com/articles/SQL+Server+2012/106903/

    And commentary and excellent work from Jeff Moden
    http://www.sqlservercentral.com/articles/Tally+Table/72993/

    Author:        Bevan Ward
    Date:        20-Feb-2018
    Purpose:    Table function that takes an nvarchar(max) string and an nvarchar(1000) delimiter and splits the string into records on the delimiter boundaries. If there are no instances of the delimiter then the string is returned as a single ordinal.
    Dependency:    SQL2012+
    Based on:
    Version:    1

    Syntax:        select * from dbo.DelimitedSplit_Max_Multi_LAG('<input string>','<delimiter>')
    Example:    select * from dbo.DelimitedSplit_Max_Multi_LAG('Apple,Orange,Pear,Grape',',')
                select a.name, b.ItemNumber, b.Item from sys.objects a cross apply dbo.DelimitedSplit_Max_Multi_LAG(name,'_') b

    Pseudo code:
        1. Find character position of delimiters - right edge as position + length of delimiter, except for the first position
        2. Use LEAD to pull up the next sequence position and substring the piece from the string
        3. Return this result with a position number

    Known issues: overlapping delimiters are not catered for - for example, a delimiter of '..' applied to a string that includes '...'

    Form of output:
    +------------+------+
    | ItemNumber | Item |
    +------------+------+
    | 1          | <>   |
    +------------+------+
    | n          | <>   |
    +------------+------+

    */

    --===== Define I/O parameters
       (@pString NVARCHAR(max), @pDelimiter NVARCHAR(1000))
    RETURNS TABLE WITH SCHEMABINDING AS
    RETURN
    --===== "Inline" CTE Driven "Tally Table†produces values from 0 up to 100,000,000...
      -- enough to cover a big file
    WITH E1(N) AS (
          SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
          SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
          SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
          ),         --10E+1 or 10 rows
       E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows
       E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows
       E6(N) AS (SELECT 1 FROM E4 a, E4 b), --10E+6 or 1,000,000
       E8(N) AS (SELECT 1 FROM E6 a, E6 b), --10E+8 or 100,000,000
         
    cteTally(N) AS (--==== This provides the "zero base" and limits the number of rows right up front
           -- for both a performance gain and prevention of accidental "overruns"
          SELECT 0 UNION ALL
          SELECT TOP (DATALENGTH(ISNULL(@pString,1))/2) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E8
          ),
    cteStart(N1) AS (--==== Find right edge of delimiter
          SELECT t.N+case when t.N = 0 then 0 else datalength(@pDelimiter)/2 end
           FROM cteTally t
          WHERE (SUBSTRING(@pString,t.N,datalength(@pDelimiter)/2) = @pDelimiter OR t.N = 0)
          )
    --===== Do the actual split. The ISNULL/NULLIF combo handles the length for the final element when no delimiter is found.
    SELECT ItemNumber = ROW_NUMBER() OVER(ORDER BY s.N1),
       Item = SUBSTRING(@pString,s.N1,ISNULL(NULLIF((LEAD(s.N1,1,1) OVER (ORDER BY s.N1) - case when s.N1=1 or count(*) over (partition by (select null)) = ROW_NUMBER() OVER(ORDER BY s.N1) then 1 else (datalength(@pDelimiter)/2) end),0)-s.N1,datalength(@pString)/2))
     FROM cteStart s
    ;
    GO

  • bevanward - Wednesday, March 21, 2018 5:22 AM


    Hi Bevan,

    Thanks for the idea.  I suppose I could embed the separate queries into separate data sources in SSDT vs. using a view, and run those separate queries in parallel on SSDT.  For various reasons, I like embedding the logic in views and using those views as a data source in SSDT where possible.  But that doesn't mean I'm married to that approach, esp. for performance gains.

    I've had limited success using temp tables in SSDT, but I think there are options I can use to make this possible (persistent connections).  Of course, I could just create permanent tables in my database and then drop them.

    Having said that, I'm quite happy with the performance gains from Eirikur's code.

    I've had good success in SSDT using the Balanced Data Distributor in other jobs, but I've never used it with a view, especially with a complex view such as mine.  I wonder if I could use the BDD with my current view to achieve parallel processing?  One thing to consider is that the BDD doesn't guarantee any order in the target table, but that shouldn't be an issue in my processing.

    When I ran a single query for performance testing, it ran in 30 mins (see above post).  When I ran the data load in my Dev environment overnight, the entire view / data load ran in 32 mins.  So I'm wondering if splitting out the queries into parallel processes will gain me much?

    Regards,
    Scott

  • Eirikur Eiriksson - Wednesday, March 21, 2018 9:02 AM


    Hi Eirikur,

    Thanks for your code contributions to the community.

    My code changes are here:  https://www.sqlservercentral.com/Forums/Dialogs/PostDetails.aspx?PostID=1928604
    You may also be interested in:  https://www.sqlservercentral.com/Forums/Dialogs/PostDetails.aspx?PostID=1928608

    Let me know if you still have problems seeing the code.

    The crux of my changes is:

     ...

      cteStart(N1) AS (
      --==== This returns N+1 (starting position of each "element" just once for each delimiter)
      SELECT t.N+1
      FROM cteTally t
      WHERE (SUBSTRING(@pString,t.N,1) = @pDelimiter OR t.N = 0)
    ),
    cteResults AS (
      --===== Do the actual split. The ISNULL/NULLIF combo handles the length for the final element when no delimiter is found.
      SELECT Item = SUBSTRING(@pString,s.N1,ISNULL(NULLIF((LEAD(s.N1,1,1) OVER (ORDER BY s.N1) - 1),0)-s.N1,8000))
      FROM cteStart s
    )
    --===== Strip leading and trailing spaces, only return non-empty results.
    SELECT ItemNumber = ROW_NUMBER() OVER (ORDER BY (SELECT NULL)),
      LTRIM(RTRIM(r.Item)) AS Item
    FROM cteResults r
    WHERE LEN(r.Item) > 0

    In summary, I've added an extra cteResults, moved the ROW_NUMBER() derivation to the final select statement, added a WHERE clause to only return non-blank tokens, and stripped leading/trailing spaces from the results.

    It appears to work correctly.  But, it is critical for my processing that ROW_NUMBER() corresponds to the order of the tokens in the delimited string.  This appears to be the case when spot checking source vs. target data.
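
    If guaranteed ordering ever becomes a concern, one option (a sketch based on the excerpt above) is to carry the start position N1 through the CTE and number the final rows by it, since ORDER BY (SELECT NULL) makes no ordering promise:

    cteResults AS (
      --===== Carry the start position through so the final numbering
      -- is tied to each token's position in the string.
      SELECT N1 = s.N1,
        Item = SUBSTRING(@pString,s.N1,ISNULL(NULLIF((LEAD(s.N1,1,1) OVER (ORDER BY s.N1) - 1),0)-s.N1,8000))
      FROM cteStart s
    )
    SELECT ItemNumber = ROW_NUMBER() OVER (ORDER BY r.N1),
      LTRIM(RTRIM(r.Item)) AS Item
    FROM cteResults r
    WHERE LEN(r.Item) > 0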

  • Hi Guys,

    and thanks for the nice comments - I wouldn't want to take too much credit for this as it is a community (SSC, or SQL Server Community) effort.

    I have some suggestions:

    1. (n)varchar(max) can hold up to (2^31) - 1 or 2,147,483,647 bytes (half that many characters for nvarchar), so the tally table must be able to produce that number of rows; an optimal construct would use 15^8, something like

    WITH T(N) AS (SELECT X.N FROM (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) X(N))
    , NUMS(N) AS (SELECT 0 UNION ALL
         SELECT TOP (DATALENGTH(ISNULL(@pString,1))/2) ROW_NUMBER() OVER (ORDER BY @@VERSION)
         FROM T T1,T T2,T T3,T T4,T T5,T T6,T T7,T T8)

    2. Consider using a single character token as a delimiter and replace the multi-character delimiter with the token before passing it to the function. This is straightforward, as one can encapsulate the single-character-token function in another function (a minimal sketch follows after this list).

    3. One of the reasons the LEAD/LAG versions perform so well is that they utilize the special worktable, but it has limits. If the number of rows exceeds that limit, then the benefits are minimal or even negative.
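
    A minimal sketch of suggestion 2 (the wrapper name and the CHAR(7) token are assumptions, and the original (@pString, @pDelimiter) parameter order of DelimitedSplit8K_LEAD is assumed):

    -- Collapse the multi-character delimiter to a single-character token
    -- that cannot occur in the data, then reuse the 1-character splitter.
    -- REPLACE scans left to right without overlap, which also sidesteps
    -- the overlapping-delimiter issue discussed earlier in the thread.
    CREATE FUNCTION dbo.DelimitedSplitMulti_LEAD
       (@pString VARCHAR(8000), @pDelimiter VARCHAR(100))
    RETURNS TABLE WITH SCHEMABINDING AS
    RETURN
    SELECT ds.ItemNumber, ds.Item
    FROM dbo.DelimitedSplit8K_LEAD(REPLACE(@pString,@pDelimiter,CHAR(7)),CHAR(7)) ds;
    GO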

    😎

    I've done some extensive work on this and even have an article draft with code and a test harness somewhere. Will try to find it and post back when I have the chance.


  • Eirikur Eiriksson - Thursday, March 22, 2018 4:14 AM


    Hi Eirikur

    Thanks for this - most appreciated.

    I'll update the tally table to your new method - the ORDER BY @@VERSION is something I haven't seen before, but I am sure it is better than ORDER BY (SELECT NULL).

    The replacement with a single delimiter prior to running makes a lot of sense, as it will remove the need to test for overlapping delimiters too.

    The way I use it means that I'm not sure of the data size, so I'll have to start doing some more complete testing once I've made these changes.

    Thanks again - the way you have collated, documented and enhanced the community work is excellent and appreciated by many!
    Bevan
