Split column value to rows.

  • Jacob Wilkins

    One Orange Chip

    Points: 27886

    Eirikur Eiriksson - Friday, June 2, 2017 8:38 AM

    Jacob Wilkins - Friday, June 2, 2017 8:18 AM

    Eirikur Eiriksson - Friday, June 2, 2017 2:22 AM

    Be careful with that order by clause, it will significantly slow down the query!
    😎

    Indeed.

    I kept it included because very strictly speaking there is no guarantee of correctness of the query without it.

    To guarantee the query is correct, you have to know that the N returned for a string of length 3 will be 1,2, and 3.

    Anything else would yield incorrect results.

    As you did, I noticed that I seemed to get correct results every time without it, but without the ORDER BY, there's just no guarantee you'll get the right N from the APPLY.

    Perhaps the engine will actually return them in the desired order under every possible circumstance, but I'm a paranoid DBA,so I couldn't bring myself to make that assumption 🙂

    From a set based perspective, the order of appearance is irrelevant, the result set will be correct with or without the order by clause as the N will always correspond correctly to the position of the character returned. In fact, when using MAXDOP 1, the output is always appearing in the correct order although as you said, there is no guarantee that will always be so.
    😎

    A greater culprit is not using the TOP operator, especially when handling greater variety of input length, the constant scan operator will feed the join operator maximum number of rows for every iteration.

    It's not the presentation order I'm talking about; it's actually the correctness of the query.

    We're returning some values for N from the APPLY, and using them in the SUBSTRING function. 

    Let's say we're dealing with a category with LEN of 3.

    It matters very much whether the three N returned are {1,2,3} or {2,4,5} or something else.

    If the three N returned are anything but {1,2,3}, the results returned for that ID and category will not be correct.

    To actually guarantee that you return only N from 1 to LEN(category), and subsequently get each character of the category from the SUBSTRING, you need the ORDER BY.

    Cheers!

  • Ed Wagner

    SSC Guru

    Points: 286982

    Eirikur Eiriksson - Friday, June 2, 2017 3:23 AM

    Using the LEN function in the WHERE clause is a killer here, slows the query down several times, much more efficient to use the TOP operator within the numbers CTE. The reason for this is that when using the former, the server will generate the full output of the CTE and then filter is applied afterwards whilst the TOP operator will limit the number generated.
    😎

    Eirikur, you sure hit the nail on the head there.  After tearing into it a little more, the LEN in the WHERE clause just kills it.

    It looks like using the ITVF instead of the table-value constructor in the CROSS APPLY makes a real difference.  Here's another variation to run through your mill.  My timing is meaningless because we have completely different hardware.

    The plan looks horrible because of the function (it always looks that way), but it performs well.

    INSERT INTO @times(task,d) VALUES ('TallyN #3', GETDATE());

    SELECT @id = ID,
           @n = t.N,
           @char = SUBSTRING(d.Category, t.N, 1)
    FROM #test d
      CROSS APPLY (SELECT TOP (LEN(d.Category)) N
                    FROM util.dbo.TallyN(5)) t (N);
          
    INSERT INTO @times(task,d) VALUES ('TallyN #3', GETDATE());

  • Jacob Wilkins

    One Orange Chip

    Points: 27886

    Ok, so I finally got a chance to look at this again.

    There are a couple ways of modifying the strictly-speaking-unsafe queries to make them safe that don't result in too much of a slowdown.

    One is by using ROW_NUMBER() to create the values in the APPLY, and the other is just to add a WHERE N<=LEN(Category) to query in the APPLY.

    The latter is slightly but consistently the faster of the two on my machine.

    One interesting thing of note is that using OUTER APPLY instead of CROSS APPLY is consistently good for an improvement of about 300ms or so across these similar solutions.

    Here are those modified queries, along with a modified TallyN #3 (modified to get the modest speed bump from OUTER APPLY, and a MAXDOP 1 added, since it also likes going parallel), in a form to be used with Eirikur's test harness:

    INSERT INTO @TIMER(T_TXT) VALUES ('Jacob #3');

    SELECT
    @INT_BUCKET = SSC.SSC_ID
    ,@INT_BUCKET = NM.N
    ,@CHAR_BUCKET = SUBSTRING(SSC.CATEGORY,NM.N,1)
    FROM dbo.TBL_SAMPLE_SPLIT_CHARS SSC
    OUTER APPLY (SELECT TOP(LEN(SSC.CATEGORY)) N=ROW_NUMBER() OVER (ORDER BY (SELECT 1)) FROM (VALUES (0),(0),(0),(0),(0)) X(N)) AS NM
    OPTION (MAXDOP 1);

    INSERT INTO @TIMER(T_TXT) VALUES ('Jacob #3');

    INSERT INTO @TIMER(T_TXT) VALUES ('Jacob #4');

    SELECT @INT_BUCKET = SSC.SSC_ID,
     @INT_BUCKET = t.n,
     @CHAR_BUCKET = SUBSTRING(SSC.CATEGORY, t.N, 1)
    FROM dbo.TBL_SAMPLE_SPLIT_CHARS SSC
     OUTER APPLY
     (
     SELECT TOP (LEN(category)) n
     FROM (VALUES(1),(2),(3),(4),(5))n(n)
     WHERE n<=LEN(category)
     )t(n)
    OPTION (MAXDOP 1);

    INSERT INTO @TIMER(T_TXT) VALUES ('Jacob #4');

    INSERT INTO @TIMER(T_TXT) VALUES ('TallyN #3');

    SELECT @INT_BUCKET = SSC.SSC_ID,
     @INT_BUCKET = t.n,
     @CHAR_BUCKET = SUBSTRING(SSC.CATEGORY, t.N, 1)
    FROM dbo.TBL_SAMPLE_SPLIT_CHARS SSC
    OUTER APPLY (SELECT TOP (LEN(SSC.Category)) N
          FROM TallyN(5)) t (N)
    OPTION (MAXDOP 1);

    INSERT INTO @TIMER(T_TXT) VALUES ('TallyN #3');

    And some representative results on my system, with Jacob #1 and Jacob #2 removed (#1 was much slower than these, and the #2 Eirikur provided wasn't all that different from Eirikur's, so removing them helped get in more test runs):


    T_TXT                DURATION
    -------------------- -----------
    DRY RUN              526000
    EE OUTER APPLY       2550000
    EE CROSS APPLY       2853000
    Jacob #4             3273000
    Jacob #3             3650000
    TallyN #3            8630000:

  • Ed Wagner

    SSC Guru

    Points: 286982

    I don't know I'm doing differently than you guys, but my times are a whole lot different.


    Run              Time
    ---------------  -----
    EE Outer Apply   1026
    Jacob #4         1383
    TallyN #3        2356
    Jacob #1         4756

    And yes, I've already checked my row counts to make sure.

  • Eirikur Eiriksson

    SSC Guru

    Points: 182438

    Ed Wagner - Friday, June 2, 2017 7:00 PM

    I don't know I'm doing differently than you guys, but my times are a whole lot different.


    Run              Time
    ---------------  -----
    EE Outer Apply   1026
    Jacob #4         1383
    TallyN #3        2356
    Jacob #1         4756

    And yes, I've already checked my row counts to make sure.

    Ed, the difference is that you are using milliseconds in the final timer result query whereas we are using microseconds😉
    😎

  • Eirikur Eiriksson

    SSC Guru

    Points: 182438

    Ed Wagner - Thursday, June 1, 2017 7:28 PM

    I see I forgot to define dbo.TallyN.  It's Itzik's zero-read tally ITVF.  I use it so much I sometimes think it's a part of SQL Server.  It isn't, but it should be.

    ALTER FUNCTION dbo.TallyN(@N Bigint) RETURNS TABLE WITH SCHEMABINDING
    AS
    --Credit: This function was written by Itzik Ben-Gan at http://sqlmag.com/sql-server/virtual-auxiliary-table-numbers
    RETURN WITH level0 AS (
    SELECT 0 AS g UNION ALL SELECT 0),           --2
    level1 AS (SELECT 0 AS g FROM level0 AS a CROSS JOIN level0 AS b), --2^2 = 4
    level2 AS (SELECT 0 AS g FROM level1 AS a CROSS JOIN level1 AS b), --4^2 = 16
    level3 AS (SELECT 0 AS g FROM level2 AS a CROSS JOIN level2 AS b), --16^2 = 256
    level4 AS (SELECT 0 AS g FROM level3 AS a CROSS JOIN level3 AS b), --256^2 = 65536
    level5 AS (SELECT 0 AS g FROM level4 AS a CROSS JOIN level4 AS b), --65536^2 = 4294967296
    Tally AS (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS n FROM level5)
    SELECT TOP (@N) N
    FROM Tally
    ORDER BY N;

    If I'm working with varchar(8000) or less then I normally use a simpler version which will output the maximum of 10K rows. It tends to run quite a lot faster than the full blown version.
    😎

    CREATE OR ALTER FUNCTION dbo.NUMGEN10K
    (
      @MAX_N INT
    )
    RETURNS TABLE
    WITH SCHEMABINDING
    AS
    RETURN
    WITH T(N) AS (SELECT N FROM (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) AS X(N))
    SELECT TOP(@MAX_N) ROW_NUMBER() OVER (ORDER BY @@VERSION) AS N
    FROM T T1,T T2,T T3,T T4
    GO

  • Ed Wagner

    SSC Guru

    Points: 286982

    Eirikur Eiriksson - Saturday, June 3, 2017 1:57 AM

    Ed Wagner - Friday, June 2, 2017 7:00 PM

    I don't know I'm doing differently than you guys, but my times are a whole lot different.


    Run              Time
    ---------------  -----
    EE Outer Apply   1026
    Jacob #4         1383
    TallyN #3        2356
    Jacob #1         4756

    And yes, I've already checked my row counts to make sure.

    Ed, the difference is that you are using milliseconds in the final timer result query whereas we are using microseconds😉
    😎

    Yeah, but it's more than that.  Even after dividing your times, mine are at least half of yours.  Sometimes they're really different and the rank is different.  Your EE OUTER APPLY wins in either case, but the times are proportionally different from each other.  I've learned something from the whole discussion, so I think that having fun has been worth it - the differences are what bugs me.  You know, I'm content to just let it go, but I will play with it a bit more on my machine. 😉

  • Jacob Wilkins

    One Orange Chip

    Points: 27886

    You have me curious, Ed. 🙂

    Having said that, it looks like the main difference is the ratio of Jacob #1 to the others that is different in your results. Everything else looks approximately right (full disclaimer, I'm eyeballing that).

    Mind sharing the full test script you're using?

    Cheers!

  • Ed Wagner

    SSC Guru

    Points: 286982

    Jacob Wilkins - Tuesday, June 6, 2017 9:18 AM

    You have me curious, Ed. 🙂

    Having said that, it looks like the main difference is the ratio of Jacob #1 to the others that is different in your results. Everything else looks approximately right (full disclaimer, I'm eyeballing that).

    Mind sharing the full test script you're using?

    Cheers!

    Not at all.  I don't have it here at work, but it should still be in an SSMS window at home.  And yes, it's saved for future playing.

  • sql_learner29

    SSCrazy Eights

    Points: 9107

    @Joe Torre 
    Thanks a lot Joe..

    Your code works like a charm..
    You are a genius..

    Thanks again..

    [font="Comic Sans MS"]
    ---------------------------------------------------

    Thanks [/font]

  • Joe Torre

    SSChampion

    Points: 10247

    Glad it helped
  • sgmunson

    SSC Guru

    Points: 110459

    Just for gits and shiggles, I figured I'd try this on for size, and try another option based on Jeff Moden's DelimitedSplit8K function.

    Here's the function with the delimiter removed and everything stripped out that didn't need to be there:
    USE [LOCAL_DB]
    GO

    IF OBJECT_ID(N'dbo.NonDelimitedSplit8K', N'IF') IS NOT NULL
        BEGIN
        DROP FUNCTION dbo.NonDelimitedSplit8K;
        END;

    GO

    /****** Object: UserDefinedFunction [dbo].[DelimitedSplit8K]  Script Date: 6/21/2017 9:50:34 AM ******/
    SET ANSI_NULLS ON
    GO

    SET QUOTED_IDENTIFIER ON
    GO

    CREATE FUNCTION dbo.NonDelimitedSplit8K (
        @pString varchar(5)
    )
    RETURNS TABLE WITH SCHEMABINDING AS
    RETURN
    WITH E1(N) AS (
             --10E+1 or 10 rows
        SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
        SELECT 1 UNION ALL SELECT 1 --UNION ALL SELECT 1 UNION ALL
        --SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
    ),
        cteTally(N) AS (--==== This provides the "base" CTE and limits the number of rows right up front
           -- for both a performance gain and prevention of accidental "overruns"
            SELECT TOP (ISNULL(DATALENGTH(@pString),0)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
            FROM E1
    )
    --===== Do the actual split. The ISNULL/NULLIF combo handles the length for the final element when no delimiter is found.
    SELECT    ItemNumber = ROW_NUMBER() OVER(ORDER BY l.N),
            Item   = SUBSTRING(@pString, l.N, 1)
    FROM cteTally l;
    GO

    Now here's the test rig:
    USE LOCAL_DB;
    GO

    SET NOCOUNT ON;
    DBCC FREEPROCCACHE;
    /*
    IF OBJECT_ID(N'dbo.TEST', N'U') IS NOT NULL
        BEGIN
        DROP TABLE dbo.TEST;
        END;
    GO

    CREATE TABLE dbo.TEST (
        ID int NOT NULL CONSTRAINT PK_TEST_ID PRIMARY KEY CLUSTERED,
        CATEGORY varchar(10),
        CAT_LEN AS LEN(CATEGORY)
    );
    */
    DECLARE @SAMPLE_SIZE AS int            = 2000000;
    DECLARE @MAX_CHARS AS int            = 5;
    DECLARE @INT_BUCKET AS int            = 0;
    DECLARE @CHAR_BUCKET AS varchar(10) = '';
    DECLARE @BATCH_ID AS int            = 1;
    DECLARE @TIMER AS TABLE (
        T_TXT varchar(30) NOT NULL,
        T_TS datetime2(7) NOT NULL DEFAULT (SYSDATETIME()),
        T_BATCH int NOT NULL
    );
    /*
    INSERT INTO TEST (ID, CATEGORY)
        VALUES    (1,'5HW6'),
                (2,'256'),
                (3,'57'),
                (4,'3'),
                (5,'4578'),
                (6,'256HW');

    SELECT *
    FROM dbo.TEST;
    */
    /*
    WITH T(N) AS (

            SELECT X.N
            FROM (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) X(N)
    ),
        NUMS(N) AS (

            SELECT TOP(@SAMPLE_SIZE) ROW_NUMBER() OVER (ORDER BY @@VERSION) AS N
            FROM T T1,T T2,T T3,T T4,T T5,T T6,T T7,T T8,T T9
    )
    INSERT INTO dbo.TEST (ID, CATEGORY)
    SELECT
        NM.N,
        SUBSTRING(REPLACE(CONVERT(varchar(36), NEWID(), 0), CHAR(45), CHAR(65)), 1, (ABS(CHECKSUM(NEWID())) % @MAX_CHARS) + 1)
    FROM NUMS AS NM;
    */

    WHILE @BATCH_ID <= 10
        BEGIN
        INSERT INTO @TIMER(T_TXT, T_BATCH) VALUES('NON-DELIMITED-SPLIT8K', @BATCH_ID);
        SELECT @INT_BUCKET = T.ID, @CHAR_BUCKET = T.CATEGORY, @CHAR_BUCKET = S.Item
        FROM dbo.TEST AS T
            OUTER APPLY dbo.NonDelimitedSplit8K(T.CATEGORY) AS S
        INSERT INTO @TIMER(T_TXT, T_BATCH) VALUES('NON-DELIMITED-SPLIT8K', @BATCH_ID);

        INSERT INTO @TIMER(T_TXT, T_BATCH) VALUES('EE OUTER APPLY', @BATCH_ID);
        SELECT @INT_BUCKET = T.ID, @CHAR_BUCKET = T.CATEGORY, @CHAR_BUCKET = SUBSTRING(T.CATEGORY, S.N, 1)
        FROM dbo.TEST AS T
            OUTER APPLY (SELECT TOP (LEN(T.CATEGORY)) X.N FROM (VALUES (1),(2),(3),(4),(5)) AS X(N)) AS S
        INSERT INTO @TIMER(T_TXT, T_BATCH) VALUES('EE OUTER APPLY', @BATCH_ID);

        INSERT INTO @TIMER(T_TXT, T_BATCH) VALUES('EE OUTER APPLY SGM', @BATCH_ID);
        SELECT @INT_BUCKET = T.ID, @CHAR_BUCKET = T.CATEGORY, @CHAR_BUCKET = SUBSTRING(T.CATEGORY, S.N, 1)
        FROM dbo.TEST AS T
            OUTER APPLY (SELECT TOP (T.CAT_LEN) X.N FROM (VALUES (1),(2),(3),(4),(5)) AS X(N)) AS S
        INSERT INTO @TIMER(T_TXT, T_BATCH) VALUES('EE OUTER APPLY SGM', @BATCH_ID);

        SELECT @BATCH_ID = @BATCH_ID + 1;

        END;

    SELECT
        T.T_BATCH,
        T.T_TXT,
        DATEDIFF(MICROSECOND, MIN(T.T_TS), MAX(T.T_TS)) AS DURATION
    FROM @TIMER AS T
    GROUP BY T_BATCH, T.T_TXT
    ORDER BY T_BATCH, DURATION ASC;

    --DROP TABLE TEST;
    --TRUNCATE TABLE dbo.TEST;

    And finally, the results:


    T_BATCH    T_TXT    DURATION
    1    NON-DELIMITED-SPLIT8K    15600
    1    EE OUTER APPLY    46800
    1    EE OUTER APPLY SGM    46800
    2    EE OUTER APPLY    46800
    2    NON-DELIMITED-SPLIT8K    46800
    2    EE OUTER APPLY SGM    62400
    3    NON-DELIMITED-SPLIT8K    62400
    3    EE OUTER APPLY    63200
    3    EE OUTER APPLY SGM    100600
    4    EE OUTER APPLY SGM    93600
    4    NON-DELIMITED-SPLIT8K    93600
    4    EE OUTER APPLY    109200
    5    EE OUTER APPLY    109200
    5    NON-DELIMITED-SPLIT8K    109200
    5    EE OUTER APPLY SGM    124800
    6    EE OUTER APPLY    124800
    6    NON-DELIMITED-SPLIT8K    124800
    6    EE OUTER APPLY SGM    140400
    7    NON-DELIMITED-SPLIT8K    140400
    7    EE OUTER APPLY SGM    156000
    7    EE OUTER APPLY    156000
    8    EE OUTER APPLY    171600
    8    NON-DELIMITED-SPLIT8K    171600
    8    EE OUTER APPLY SGM    187200
    9    NON-DELIMITED-SPLIT8K    187200
    9    EE OUTER APPLY    202800
    9    EE OUTER APPLY SGM    249600
    10    EE OUTER APPLY SGM    218400
    10    EE OUTER APPLY    218400
    10    NON-DELIMITED-SPLIT8K    218400

    Just not sure why I get such large differences in the timings.

    Steve
    ‌(aka sgmunson)
    ‌:) 🙂 🙂
    Health & Nutrition

Viewing 12 posts - 16 through 27 (of 27 total)

You must be logged in to reply to this topic. Login to reply