Islands And Gaps - I'm Stuck...Again

  • I feel like I've solved this type of problem before, and yet I'm stuck and need some expert help. Here is some DDL so you can see a relevant snippet of what I'm working with:
    CREATE TABLE #Scratch
    (
        PayeeID CHAR(3) NOT NULL,
        HRAttribute VARCHAR(25) NOT NULL,
        ValueText VARCHAR(25) NOT NULL,
        StartDate DATE NOT NULL,
    )
    GO

    INSERT INTO #Scratch (PayeeID, HRAttribute, ValueText, StartDate)
    SELECT '123', 'SUPERVISOR ID', '111', '09/29/2014' UNION ALL
    SELECT '123', 'SUPERVISOR ID', '111', '10/07/2014' UNION ALL
    SELECT '123', 'SUPERVISOR ID', '111', '04/01/2015' UNION ALL
    SELECT '123', 'SUPERVISOR ID', '222', '09/25/2015' UNION ALL
    SELECT '123', 'SUPERVISOR ID', '333', '12/22/2015' UNION ALL
    SELECT '123', 'SUPERVISOR ID', '333', '04/01/2016' UNION ALL
    SELECT '123', 'SUPERVISOR ID', '222', '10/05/2016' UNION ALL
    SELECT '123', 'SUPERVISOR ID', '333', '01/09/2017' UNION ALL
    SELECT '123', 'SUPERVISOR ID', '333', '03/01/2017' UNION ALL
    SELECT '123', 'SUPERVISOR ID', '333', '10/01/2017' UNION ALL
    SELECT '123', 'SUPERVISOR ID', '333', '01/01/2018' UNION ALL
    SELECT '123', 'SUPERVISOR ID', '333', '03/01/2018' UNION ALL
    SELECT '123', 'SUPERVISOR ID', '222', '06/29/2018' UNION ALL
    SELECT '123', 'SUPERVISOR ID', '222', '07/12/2018'
    GO

    From there I basically want to identify "islands" of data--that is, contiguous groups of rows where PayeeID, HRAttribute, and ValueText are identical, when sorted by StartDate. Here's the SQL I have so far, which seems to almost be working but not quite:
    SELECT
        *,
        RN1 = ROW_NUMBER() OVER(PARTITION BY PayeeID, HRAttribute, ValueText ORDER BY StartDate),
        RN2 = ROW_NUMBER() OVER(PARTITION BY PayeeID, HRAttribute ORDER BY StartDate),
        GROUP_NUMBER = ROW_NUMBER() OVER(PARTITION BY PayeeID, HRAttribute ORDER BY StartDate) - ROW_NUMBER() OVER(PARTITION BY PayeeID, HRAttribute, ValueText ORDER BY StartDate)
    FROM #Scratch
    ORDER BY StartDate
    ;

    PayeeID HRAttribute  ValueText StartDate RN1      RN2      GROUP_NUMBER
    -----------------------------------------------------------------------------
    123 SUPERVISOR ID 111 2014-09-29 1       1       0
    123 SUPERVISOR ID 111 2014-10-07 2       2       0
    123 SUPERVISOR ID 111 2015-04-01 3       3       0
    123 SUPERVISOR ID 222 2015-09-25 1       4       3
    123 SUPERVISOR ID 333 2015-12-22 1       5       4
    123 SUPERVISOR ID 333 2016-04-01 2       6       4
    123 SUPERVISOR ID 222 2016-10-05 2       7       5
    123 SUPERVISOR ID 333 2017-01-09 3       8       5
    123 SUPERVISOR ID 333 2017-03-01 4       9       5
    123 SUPERVISOR ID 333 2017-10-01 5       10       5
    123 SUPERVISOR ID 333 2018-01-01 6       11       5
    123 SUPERVISOR ID 333 2018-03-01 7       12       5
    123 SUPERVISOR ID 222 2018-06-29 3       13       10
    123 SUPERVISOR ID 222 2018-07-12 4       14       10

    As you can see, my column GROUP_NUMBER would work fine to identify "islands" if ValueText never reverted to a value it previously held. But in this case, value can change to a new value and then back to a previous value. On 10/05/2016, when this Payee's Supervisor ID changes back to '222', I want RN1 to reset to 1, but instead it continues to 2. I understand why that is, the ROW_NUMBER code is easy for me to comprehend, but I'm not sure what to change to get the desired results. Thoughts?


    "If I had been drinking out of that toilet, I might have been killed." -Ace Ventura

  • Dang.  I've run into this problem before and resolved it but I've misplaced my notes on how I fixed it and so doing this totally from memory, which might not be all that it's cracked up to be. 😀

    The following works for what you have and, IIRC, will maintain the temporal order because of the values of R1-R2 but I don't have the time to test all of the scenarios that I tested with before and so I might not be recalling correctly. The ",FinalGrouping" column is what you're after.  I left all the other columns in place just so you can see how it works.  What people forget is that R1-R2 is a grouping for each (in this case) ValueText and so you need that extra grouping using DENSE_RANK to resolve what looks like a tie but actually isn't.


    WITH cte AS
    (
     SELECT *
            ,R1 = ROW_NUMBER() OVER (PARTITION BY PayeeID, HRAttribute            ORDER BY StartDate, ValueText)
            ,R2 = ROW_NUMBER() OVER (PARTITION BY PayeeID, HRAttribute, ValueText ORDER BY StartDate, ValueText)
    FROM #Scratch
    )
     SELECT *
            ,R1R2 = R1-R2
            ,FinalGrouping = DENSE_RANK() OVER (PARTITION BY PayeeID, HRAttribute ORDER BY R1-R2,ValueText)
       FROM cte
      ORDER BY PayeeID, HRAttribute, StartDate, ValueText
    ;

    Please do test the be-jeebers out of it and forgive me if I've remembered incorrectly.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".
    "Dear Lord... I'm a DBA so please give me patience because, if you give me strength, I'm going to need bail money too!"

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Awesome, thank you Jeff! I will test this out tomorrow and post back!


    "If I had been drinking out of that toilet, I might have been killed." -Ace Ventura

  • Here is a more efficient way as it eliminates 3 out of 4 sort operators found in the ROW_NUMBER solution's execution plan.
    😎

    USE TEEST;
    GO
    SET NOCOUNT ON;

    IF OBJECT_ID(N'tempdb..#Scratch') IS NOT NULL DROP TABLE #Scratch;
    CREATE TABLE #Scratch
    (
      PayeeID CHAR(3) NOT NULL,
      HRAttribute VARCHAR(25) NOT NULL,
      ValueText VARCHAR(25) NOT NULL,
      StartDate DATE NOT NULL,
    )

    INSERT INTO #Scratch (PayeeID, HRAttribute, ValueText, StartDate)
    SELECT '123', 'SUPERVISOR ID', '111', '09/29/2014' UNION ALL
    SELECT '123', 'SUPERVISOR ID', '111', '10/07/2014' UNION ALL
    SELECT '123', 'SUPERVISOR ID', '111', '04/01/2015' UNION ALL
    SELECT '123', 'SUPERVISOR ID', '222', '09/25/2015' UNION ALL
    SELECT '123', 'SUPERVISOR ID', '333', '12/22/2015' UNION ALL
    SELECT '123', 'SUPERVISOR ID', '333', '04/01/2016' UNION ALL
    SELECT '123', 'SUPERVISOR ID', '222', '10/05/2016' UNION ALL
    SELECT '123', 'SUPERVISOR ID', '333', '01/09/2017' UNION ALL
    SELECT '123', 'SUPERVISOR ID', '333', '03/01/2017' UNION ALL
    SELECT '123', 'SUPERVISOR ID', '333', '10/01/2017' UNION ALL
    SELECT '123', 'SUPERVISOR ID', '444', '10/02/2017' UNION ALL
    SELECT '123', 'SUPERVISOR ID', '333', '01/01/2018' UNION ALL
    SELECT '123', 'SUPERVISOR ID', '333', '03/01/2018' UNION ALL
    SELECT '123', 'SUPERVISOR ID', '222', '06/29/2018' UNION ALL
    SELECT '123', 'SUPERVISOR ID', '222', '07/12/2018'
    ;
    ;WITH GROUPED_EMP AS
    (
      SELECT
       CASE
        WHEN SC.ValueText = LAG(SC.ValueText,1,SC.ValueText) OVER
                  (
                   PARTITION BY SC.PayeeID
                       ,SC.HRAttribute
                   ORDER BY  SC.[StartDate] ASC
                  ) THEN 0
        ELSE 1
       END AS GRP_FLG
       ,SC.PayeeID
       ,SC.HRAttribute
       ,SC.ValueText
       ,SC.StartDate
      FROM  #Scratch  SC
    )
    SELECT
      GE.PayeeID
     ,GE.HRAttribute
     ,GE.ValueText
     ,GE.StartDate
     ,SUM(GE.GRP_FLG) OVER
       (
        PARTITION BY GE.PayeeID
            ,GE.HRAttribute
        ORDER BY  GE.StartDate ASC
        ROWS BETWEEN UNBOUNDED PRECEDING
          AND  CURRENT ROW
       ) AS GRP_NO
    FROM  GROUPED_EMP  GE;

    Output

    PayeeID HRAttribute  ValueText StartDate GRP_NO
    ------- -------------- ---------- ---------- -------
    123  SUPERVISOR ID 111   2014-09-29 0
    123  SUPERVISOR ID 111   2014-10-07 0
    123  SUPERVISOR ID 111   2015-04-01 0
    123  SUPERVISOR ID 222   2015-09-25 1
    123  SUPERVISOR ID 333   2015-12-22 2
    123  SUPERVISOR ID 333   2016-04-01 2
    123  SUPERVISOR ID 222   2016-10-05 3
    123  SUPERVISOR ID 333   2017-01-09 4
    123  SUPERVISOR ID 333   2017-03-01 4
    123  SUPERVISOR ID 333   2017-10-01 4
    123  SUPERVISOR ID 444   2017-10-02 5
    123  SUPERVISOR ID 333   2018-01-01 6
    123  SUPERVISOR ID 333   2018-03-01 6
    123  SUPERVISOR ID 222   2018-06-29 7
    123  SUPERVISOR ID 222   2018-07-12 7

    An added bonus is that it increments the Group Number in chronological order.

  • I completely missed that this was posted on a 2012 forum.  LAG is definitely the way to go for this.  Thanks, Eirikur.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".
    "Dear Lord... I'm a DBA so please give me patience because, if you give me strength, I'm going to need bail money too!"

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • To add to Eirikur's good code... You'll want to add, what Itzik Ben-Gan has dubbed, a POC index.
    If you check the execution plan for the code that Eirikur supplied, you'll notice that the most expensive operation is the initial sort caused by LAG function.
    Adding the following index will completely eliminate the need for that sort.

    CREATE INDEX ix_ScratchPayeeID_HRAttribute_StartDate ON #Scratch (
        PayeeID,
        HRAttribute,
        StartDate
        )
    INCLUDE (
        ValueText
        );

  • Thanks everyone! Jason, just a quick note to say that my actual source table is a denormalized mess. I did an UNPIVOT to get it into the above shape, so it's just in a CTE at that point, not a temp table. The temp table was more for purposes of easily consumable DDL for this post. But I will consider materializing it into a temp table so I can add that index if need be. Will post another update later. Thanks again to Eirikur and Jeff for the tips. 

    Quick forum question while I'm here: how do I post tabular data like Eirikur? I didn't know how to do that for my initial post but I'd like to go back and add it for future reference.


    "If I had been drinking out of that toilet, I might have been killed." -Ace Ventura

  • Jason A. Long - Thursday, July 26, 2018 8:47 AM

    To add to Eirikur's good code... You'll want to add, what Itzik Ben-Gan has dubbed, a POC index.
    If you check the execution plan for the code that Eirikur supplied, you'll notice that the most expensive operation is the initial sort caused by LAG function.
    Adding the following index will completely eliminate the need for that sort.

    CREATE INDEX ix_ScratchPayeeID_HRAttribute_StartDate ON #Scratch (
        PayeeID,
        HRAttribute,
        StartDate
        )
    INCLUDE (
        ValueText
        );

    Thanks Jason, good point, thought of adding the Partition Order Covering (POC) index addition but ran out of time this morning.
    😎 

    Here is Brent Ozar's brilliant write-up on the subject.

  • autoexcrement - Thursday, July 26, 2018 9:14 AM

    Thanks everyone! Jason, just a quick note to say that my actual source table is a denormalized mess. I did an UNPIVOT to get it into the above shape, so it's just in a CTE at that point, not a temp table. The temp table was more for purposes of easily consumable DDL for this post. But I will consider materializing it into a temp table so I can add that index if need be. Will post another update later. Thanks again to Eirikur and Jeff for the tips. 

    Quick forum question while I'm here: how do I post tabular data like Eirikur? I didn't know how to do that for my initial post but I'd like to go back and add it for future reference.

    You are very welcome.
    😎

    I use the method of "results to text" in SSMS, chop off the trailing things with columnar editing (the ALT KEY and mouse select), paste it into a Plain IF code section and finally format it in a fixed width font. Not perfect but serves the purpose.
    My annoyance is that there is no "PRE" type IF code which would make it better.

  • Wow, never used results to text, and never knew about "alt + select"! Thanks. Still, getting any semblance of clean formatting in my post took ages. Hope it's helpful to someone. 🙂


    "If I had been drinking out of that toilet, I might have been killed." -Ace Ventura

  • autoexcrement - Thursday, July 26, 2018 9:14 AM

    Thanks everyone! Jason, just a quick note to say that my actual source table is a denormalized mess. I did an UNPIVOT to get it into the above shape, so it's just in a CTE at that point, not a temp table. The temp table was more for purposes of easily consumable DDL for this post. But I will consider materializing it into a temp table so I can add that index if need be. Will post another update later. Thanks again to Eirikur and Jeff for the tips. 

    Quick forum question while I'm here: how do I post tabular data like Eirikur? I didn't know how to do that for my initial post but I'd like to go back and add it for future reference.

    It's tough to make specific reccomendations without knowing what the source table actually looks like.or how this fits into the rest of the query.
    Basically there are two options... #1 keep it as is. #2 is to dump the UNPIVOT results into a #temp table and index that.
    There are positives and negatives to both...
    Using a temp table has the advantages of simplifying the orerall query by breaking the unpivot and gaps & islands into separate commands. It also allows SQL Server to create stastics for the #temp table and it would allow you to create the afformentioned POC index.
    The downside, is that creating temp tables will cause a recompolition when placed in a stored procedure . The impact of this type of recompolition can usually be negated by simply moving all DDL statements to the very beginning of the procedure.
    Also, it's entirely possible that the time needed create the POC index may completly negate the benefite of having in place for the gaps & islands command.

    My gut says that the upsides of using a #temp table would outweight the downsides but the only way to know for sure is to write it both ways and see which one performs better.

  • I came to the same conclusion. 🙂 I dumped the UNPIVOT results into a temp table and indexed it as you suggested.


    "If I had been drinking out of that toilet, I might have been killed." -Ace Ventura

  • autoexcrement - Thursday, July 26, 2018 12:15 PM

    I came to the same conclusion. 🙂 I dumped the UNPIVOT results into a temp table and indexed it as you suggested.

    Nice. Thanks for the feedback.

Viewing 13 posts - 1 through 13 (of 13 total)

You must be logged in to reply to this topic. Login to reply