detect if previous record is higher or lower?

  • Jeff, I'd like to take you up on your offer regarding the Quirky Update. I'd appreciate some sample code for this. Many thanks for your patience; have a great weekend.

    kind regards

    Mick

  • I'll try to hammer out an example tonight after work, Mick.  Sorry for the delay in responding.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden - Tuesday, June 5, 2018 7:50 AM

    I'll try to hammer out an example tonight after work, Mick.  Sorry for the delay in responding.

    That's brilliant, thank you Jeff

  • Ugh!  I'm going to have to take it back.  I'm assuming that you want the results to return to the screen.  If that's the case, then the "Quirky Update" is no faster than David's fine code and also has the disadvantages of needing an extra column and being unsupported.  With that in mind, I'm not going to post the code for it because I don't want someone who might not be able to support it to use it when there's a very good supported method available (David's fine code).

    I will, however, post the test data generator I used for my tests in case someone wants to race David's code.

    --=====================================================================================================================
    --      Create a million row test table with 1000 names each having 1000 dates and each of those having an
    --      [AMOUNT of JOBS] ranging from 1 to 10.
    --      Not to worry, this only takes 2 to 4 seconds to execute on my little i5 laptop.
    --=====================================================================================================================
    -- DROP TABLE dbo.Test -- Commented out for safety
    --===== Create the table.
     CREATE TABLE dbo.Test
            (
             [Name]             VARCHAR(50)     NOT NULL
            ,[Date]             SMALLDATETIME   NOT NULL
            ,[AMOUNT of JOBS]   INT             NOT NULL
            )
    ;
    --===== Add the quintessential Clustered Index
     CREATE UNIQUE CLUSTERED INDEX PK_Test ON dbo.Test ([Name],[Date])
    ;
    --===== Populate the test table with random constrained test data.
       WITH cteTally AS
    (--==== Generate a simple thousand row source of sequential numbers.
     SELECT TOP 1000
            N = CONVERT(INT,ROW_NUMBER() OVER (ORDER BY (SELECT NULL)))
       FROM      sys.all_columns ac1
      CROSS JOIN sys.all_columns ac2
    )--===== Creates a cross join of a thousand names with a thousand dates (1 million rows) and assigns
          -- a random number from 1 to 10 to each and dumps it into the test table.
     INSERT INTO dbo.Test WITH (TABLOCK)
            (Name, [Date], [AMOUNT of JOBS])
     SELECT  Name               = 'Person'+RIGHT(t1.N+10000,4) --Simulated names
            ,[Date]             = DATEADD(dd,t2.N-1,'2015')
            ,[AMOUNT of JOBS]   = ABS(CHECKSUM(NEWID())%10)+1
       FROM      cteTally t1
      CROSS JOIN cteTally t2
    ;
     SELECT * FROM dbo.Test;
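
    If you want to sanity-check the generated data before racing anything, a couple of quick queries will confirm the row count and the roughly even 1-to-10 spread (these queries are an addition here, not part of the original generator):

    --===== Quick sanity checks on the generated test data.
     SELECT TotalRows = COUNT(*) FROM dbo.Test;        -- expect 1,000,000

     SELECT  [AMOUNT of JOBS]
            ,Rows = COUNT(*)                           -- expect roughly 100,000 per value
       FROM dbo.Test
      GROUP BY [AMOUNT of JOBS]
      ORDER BY [AMOUNT of JOBS]
    ;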

    --Jeff Moden



  • I recommend using a straight self join rather than the CTE as the CTE will have both sort operators and a hash match in the plan.
    😎

    Self join example

    USE TEEST;
    GO
    SET NOCOUNT ON;

    IF OBJECT_ID(N'dbo.TBL_TEST_PREV') IS NOT NULL DROP TABLE dbo.TBL_TEST_PREV;

    CREATE TABLE dbo.TBL_TEST_PREV
    (
      [Name] VARCHAR(50)
     ,[Date] smalldatetime
     ,[AMOUNT of JOBS] Int
    )
    ;
    INSERT INTO dbo.TBL_TEST_PREV([Name],[Date],[AMOUNT of JOBS])
    VALUES
     ('John','12/jan/2017',3)
    ,('John','13/jan/2017',4)
    ,('John','14/jan/2017',2)
    ,('John','15/jan/2017',4)
    ,('John','16/jan/2017',4)
    ,('John','17/jan/2017',1);

    SELECT
     TP.Name
    ,TP.Date
    ,TP.[AMOUNT of JOBS] AS CURR_JOBS
    ,CASE SIGN(TP.[AMOUNT of JOBS]  - ISNULL(PP.[AMOUNT of JOBS],TP.[AMOUNT of JOBS]))
       WHEN -1 THEN 'DOWN'
       WHEN 0 THEN 'STATIC'
       WHEN 1 THEN 'UP'
     END AS DIRECTION
    FROM    dbo.TBL_TEST_PREV TP
    LEFT OUTER JOIN dbo.TBL_TEST_PREV PP
    ON     TP.Name     = PP.Name
    AND     TP.Date     = PP.Date + 1
    ;
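
    For the six sample rows, that query should return something like the following; the first row has no previous day, so the ISNULL makes SIGN compare the value against itself and yields 'STATIC':

    Name  Date                 CURR_JOBS  DIRECTION
    ----  -------------------  ---------  ---------
    John  2017-01-12 00:00:00  3          STATIC
    John  2017-01-13 00:00:00  4          UP
    John  2017-01-14 00:00:00  2          DOWN
    John  2017-01-15 00:00:00  4          UP
    John  2017-01-16 00:00:00  4          STATIC
    John  2017-01-17 00:00:00  1          DOWN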

    Plugging Jeff's fine data sample into a simple test harness will show the difference

    USE TEEST;
    GO
    SET NOCOUNT ON;

    -- /*
    --=====================================================================================================================
    --  Create a million row test table with 1000 names each having 1000 dates and each of those having an
    --  [AMOUNT of JOBS] ranging from 1 to 10.
    --  Not to worry, this only takes 2 to 4 seconds to execute on my little i5 laptop.
    --=====================================================================================================================
    -- DROP TABLE dbo.Test -- Commented out for safety
    --===== Create the table.
    CREATE TABLE dbo.Test
       (
       [Name]     VARCHAR(50)  NOT NULL
       ,[Date]     SMALLDATETIME NOT NULL
       ,[AMOUNT of JOBS] INT     NOT NULL
       )
    ;
    --===== Add the quintessential Clustered Index
    CREATE UNIQUE CLUSTERED INDEX PK_Test ON dbo.Test ([Name],[Date])
    ;
    --===== Populate the test table with random constrained test data.
     WITH cteTally AS
    (--==== Generate a simple thousand row source of sequential numbers.
    SELECT TOP 1000
       N = CONVERT(INT,ROW_NUMBER() OVER (ORDER BY (SELECT NULL)))
     FROM  sys.all_columns ac1
    CROSS JOIN sys.all_columns ac2
    )--===== Creates a cross join of a thousand names with a thousand dates (1 million rows) and assigns
      -- a random number from 1 to 10 to each and dumps it into the test table.
    INSERT INTO dbo.Test WITH (TABLOCK)
       (Name, [Date], [AMOUNT of JOBS])
    SELECT Name     = 'Person'+RIGHT(t1.N+10000,4) --Simulated names
       ,[Date]     = DATEADD(dd,t2.N-1,'2015')
       ,[AMOUNT of JOBS] = ABS(CHECKSUM(NEWID())%10)+1
     FROM  cteTally t1
    CROSS JOIN cteTally t2
    ;
    -- */
    DECLARE @CHAR_BUCKET  VARCHAR(50)  = '';
    DECLARE @DATE_BUCKET  SMALLDATETIME = GETDATE();
    DECLARE @INT_BUCKET  INT     = 0;
    DECLARE @STR_BUCKET  VARCHAR(10)  = '';
    DECLARE @TRES TABLE
    (
    TT_TS DATETIME2(7)  NOT NULL DEFAULT (SYSDATETIME())
    ,TT_TXT VARCHAR(50)  NOT NULL
    );

    INSERT INTO @TRES(TT_TXT) VALUES('DRY RUN');
    SELECT
     @CHAR_BUCKET = T.Name
    ,@DATE_BUCKET = T.Date
    ,@INT_BUCKET = T.[AMOUNT of JOBS]
    FROM dbo.Test T;
    INSERT INTO @TRES(TT_TXT) VALUES('DRY RUN');

    INSERT INTO @TRES(TT_TXT) VALUES('SELF JOIN');
    SELECT
     @CHAR_BUCKET = TP.Name
    ,@DATE_BUCKET = TP.Date
    ,@INT_BUCKET = TP.[AMOUNT of JOBS] -- AS CURR_JOBS
    ,@STR_BUCKET = CASE SIGN(TP.[AMOUNT of JOBS]  - ISNULL(PP.[AMOUNT of JOBS],TP.[AMOUNT of JOBS]))
       WHEN -1 THEN 'DOWN'
       WHEN 0 THEN 'STATIC'
       WHEN 1 THEN 'UP'
     END --AS DIRECTION
    FROM    dbo.Test TP
    LEFT OUTER JOIN dbo.Test PP
    ON     TP.Name     = PP.Name
    AND     TP.Date     = PP.Date + 1
    ;
    INSERT INTO @TRES(TT_TXT) VALUES('SELF JOIN');

    INSERT INTO @TRES(TT_TXT) VALUES('CTE JOIN');
    ;WITH cte ([Name],[Date],[AMOUNT of JOBS],RowNum) AS (
    SELECT [Name],[Date],[AMOUNT of JOBS],
    ROW_NUMBER() OVER (PARTITION BY [Name] ORDER BY [Date] ASC)
    FROM dbo.Test
    )
    SELECT
      @CHAR_BUCKET = cte.[Name]
     ,@DATE_BUCKET = cte.[Date]
     ,@INT_BUCKET = cte.[AMOUNT of JOBS],
      @STR_BUCKET = CASE
      WHEN prev.[AMOUNT of JOBS] IS NULL THEN 'Static'
      WHEN prev.[AMOUNT of JOBS] = cte.[AMOUNT of JOBS] THEN 'Static'
      WHEN prev.[AMOUNT of JOBS] < cte.[AMOUNT of JOBS] THEN 'Up'
      ELSE 'Down'
      END
    FROM cte
    LEFT JOIN cte prev ON prev.[Name] = cte.[Name] AND prev.RowNum = cte.RowNum-1;
    INSERT INTO @TRES(TT_TXT) VALUES('CTE JOIN');

    SELECT
    T.TT_TXT
    ,DATEDIFF(MICROSECOND,MIN(T.TT_TS),MAX(T.TT_TS)) AS DURATION
    FROM @TRES T
    GROUP BY T.TT_TXT
    ORDER BY DURATION;

    Results

    TT_TXT     DURATION
    ---------  --------
    DRY RUN      176776
    SELF JOIN   1250797
    CTE JOIN    2585184

  • The only problem with using the date in such a fashion is that the dates must be certain and contiguous, without any gaps or duplication.  Of course, there are rabbit holes for all code.

    --Jeff Moden



  • Jeff Moden - Wednesday, June 6, 2018 6:17 AM

    The only problem with using the date in such a fashion is that the dates must be certain and contiguous, without any gaps or duplication.  Of course, there are rabbit holes for all code.

    As always Jeff, you never run out of spanners 😉
    My take on it is that if there are gaps, then those should be regarded as restarts; that is, the value returned would be 'STATIC'.
    😎

    BTW, I did also test a computed column but, since the function cannot be made deterministic, the column cannot be persisted and the performance is too poor to even post the code. Further, in this case, the LAG window function slows down drastically after the internal limit is reached and ends up slower than the self-join. An obvious alternative would be to add a normal column and insert the previous value on the insert.
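
    Since LAG is mentioned above but not shown, here is a minimal sketch of that approach against Jeff's dbo.Test table (SQL Server 2012 and later). It treats a missing previous day (a first row or a date gap) as a restart and reports 'STATIC', per the convention above; this is a sketch only, not a contender raced in the harness:

     WITH ctePrev AS
    (--==== Pick up the previous row's date and value per name, in date order.
     SELECT  [Name]
            ,[Date]
            ,[AMOUNT of JOBS]
            ,PrevDate = LAG([Date])           OVER (PARTITION BY [Name] ORDER BY [Date])
            ,PrevJobs = LAG([AMOUNT of JOBS]) OVER (PARTITION BY [Name] ORDER BY [Date])
       FROM dbo.Test
    )
     SELECT  [Name]
            ,[Date]
            ,[AMOUNT of JOBS]
            ,DIRECTION = CASE
                         WHEN PrevDate IS NULL
                           OR PrevDate <> DATEADD(dd,-1,[Date]) THEN 'STATIC' -- restart on first row or gap
                         WHEN PrevJobs < [AMOUNT of JOBS]       THEN 'UP'
                         WHEN PrevJobs > [AMOUNT of JOBS]       THEN 'DOWN'
                         ELSE 'STATIC'
                         END
       FROM ctePrev
    ;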

  • David Burrows - Wednesday, May 23, 2018 2:03 AM

    Performance could be an issue here and Jeff's suggestion of the Quirky Update would probably be the best.
    However, here is a possibility:

    WITH cte ([Name],[Date],[AMOUNT of JOBS],RowNum) AS (
     SELECT [Name],[Date],[AMOUNT of JOBS],
      ROW_NUMBER() OVER (PARTITION BY [Name] ORDER BY [Date] ASC)
     FROM test
    )
    SELECT cte.[Name],cte.[Date],cte.[AMOUNT of JOBS],
     CASE
      WHEN prev.[AMOUNT of JOBS] IS NULL THEN 'Static'
      WHEN prev.[AMOUNT of JOBS] = cte.[AMOUNT of JOBS] THEN 'Static'
      WHEN prev.[AMOUNT of JOBS] < cte.[AMOUNT of JOBS] THEN 'Up'
      ELSE 'Down'
      END
    FROM cte
    LEFT JOIN cte prev ON prev.[Name] = cte.[Name] AND prev.RowNum = cte.RowNum-1;

    This is how we do attribution gap analysis for the most part. We do this on billions of records easily enough with CTEs, or even with just a JOIN (SELECT ROW_NUMBER() ...) AS b ON b.num = num - 1 as the joined table. While not the most optimized approach, it works easily enough to offset record positions. These days we do this entire process in a data lake (Azure), as it's pretty fast to create lots of datasets and join them using U-SQL.

    Missing data is easily solved with a proper date dimension table if you're doing this based on dates. This is accomplished simply by joining to your date dimension table and flagging rows with a missing flag in the event you have gaps in the date ranges; the record still gets joined and assigned a row number, just with no related attributes or metrics (see the sketch after this post). Similar approaches can be created for other sequential data.

    In the cases where I use this, we don't care about gaps in the data, though, and we have tons of duplicates (which, in the CTE case, do not cause us issues). We are measuring the time difference, down to the microsecond, from the record that comes after the prior record in consumer journeys with ads. This allows us to see how much time passes between certain ads (certain tactics) on the way either to a conversion or a non-conversion.

    Alternatively, the fastest approach for us (using a Tally table would probably be nice) with the above method is to create a copy of the dataset and index the joined keys. This ultimately is the final approach on extremely large datasets. But bear in mind, I did this originally in MPP, where duplication of data has to happen in most use cases. Now I do this in the data lake, where we have the scalability to create multiple schemas in a single query and slice/dice/chop/etc. on the fly per query. Works out nicely.
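
    A minimal sketch of the date dimension idea above; the dbo.DimDate table, its DateValue column, and the date range are assumed names for illustration, with one row per calendar date:

    --===== Flag dates that have no matching row (dbo.DimDate is a hypothetical calendar table).
         -- For per-name gap analysis, first CROSS JOIN the distinct names to the calendar.
     SELECT  D.DateValue
            ,T.[Name]
            ,T.[AMOUNT of JOBS]
            ,IsMissing = CASE WHEN T.[Name] IS NULL THEN 1 ELSE 0 END
       FROM dbo.DimDate D
       LEFT JOIN dbo.Test T
         ON T.[Date] = D.DateValue
      WHERE D.DateValue BETWEEN '2015-01-01' AND '2017-12-31'
    ;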

  • Eirikur Eiriksson - Wednesday, June 6, 2018 6:56 AM

    Jeff Moden - Wednesday, June 6, 2018 6:17 AM

    The only problem with using the date in such a fashion is that the dates must be certain and contiguous, without any gaps or duplication.  Of course, there are rabbit holes for all code.

    As always Jeff, you never run out of spanners 😉
    My take on it is that if there are gaps, then those should be regarded as restarts; that is, the value returned would be 'STATIC'.
    😎

    BTW, I did also test a computed column but, since the function cannot be made deterministic, the column cannot be persisted and the performance is too poor to even post the code. Further, in this case, the LAG window function slows down drastically after the internal limit is reached and ends up slower than the self-join. An obvious alternative would be to add a normal column and insert the previous value on the insert.

    Yes.  That'll work.  They just need to know that would happen.  I wonder what they'd want to do for someone that made a sale only every other day, though.  It's not a problem of the code.  It's a problem of what they expect and that's actually the only thing I wanted to point out to them (not you... you get it).

    --Jeff Moden



  • Jeff Moden - Thursday, June 7, 2018 6:50 AM

    Eirikur Eiriksson - Wednesday, June 6, 2018 6:56 AM

    Jeff Moden - Wednesday, June 6, 2018 6:17 AM

    The only problem with using the date in such a fashion is that the dates must be certain and contiguous, without any gaps or duplication.  Of course, there are rabbit holes for all code.

    As always Jeff, you never run out of spanners 😉
    My take on it is that if there are gaps, then those should be regarded as restarts; that is, the value returned would be 'STATIC'.
    😎

    BTW, I did also test a computed column but, since the function cannot be made deterministic, the column cannot be persisted and the performance is too poor to even post the code. Further, in this case, the LAG window function slows down drastically after the internal limit is reached and ends up slower than the self-join. An obvious alternative would be to add a normal column and insert the previous value on the insert.

    Yes.  That'll work.  They just need to know that would happen.  I wonder what they'd want to do for someone that made a sale only every other day, though.  It's not a problem of the code.  It's a problem of what they expect and that's actually the only thing I wanted to point out to them (not you... you get it).

    Good point as always, Jeff. The question to the OP would be how to deal with gaps or missing dates: if it is "sales by day" related, are missing dates zero sales, which means a down followed by an up, or are they "closed for business" dates which should be ignored? If it is the latter, then I would go with a filtering-on-the-sequence type of approach. The inherent problem with ROW_NUMBER is that if the correct index isn't in place, it will introduce a sort operator into the execution plan. Even with the index, there will still be a rather hefty cost for the HASH MATCH join, whose resource cost is directly related to the cardinality of the set. If it reaches a resource limit, it will spill to tempdb and bomb out the performance.
    😎

    Further, the ROW_NUMBER approach degrades in performance when the internal cardinality limit is reached, pushing the set from a highly optimised temp table structure to a physical temp table with all its locks and latches.
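
    For what it's worth, that sort can usually be avoided by keying an index on the window function's PARTITION BY column(s) followed by its ORDER BY column(s). A general sketch with a hypothetical index name follows; it isn't needed on dbo.Test itself, which is already clustered on ([Name],[Date]):

    --===== General sketch: key on the PARTITION BY then ORDER BY columns and INCLUDE the
         -- remaining referenced columns so ROW_NUMBER() can stream the rows without a sort.
         -- Hypothetical name; redundant on dbo.Test, which is clustered on ([Name],[Date]).
     CREATE NONCLUSTERED INDEX IX_Test_Name_Date
         ON dbo.Test ([Name],[Date])
            INCLUDE ([AMOUNT of JOBS])
    ;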
