Advanced grouping for gaps and islands

  • Hi,

    I'm trying to figure out how to group a set of data based on how many days a person is sick. The problem is a gaps and island issue combined with running streak, with a twist.

    The records are either of type Work or Sick and are only saved on it's registered date. So there are gaps and islands. I think it's easies to explain based of an example from SQL Fiddle

    A group of consecutive rows of type Sick are considered a group if there's not more than 5 days of work between two rows of Sick. Or if there's not more than 7 days between without any Work between. If there's Work between two rows of type Sick the number of sickdays are only incremented by "missing" days between a row of Sick until next row of Work (see date 2022-03-11).

    If I do this example in Excel it's easy. But there it's possible to reference the calculated column for previous row to easily increment. The formula then looks like this (I've manually added 0 to previous row before a new group by type Sick:

    Column AColumn BColumn C
    2022-03-01Work0
    2022-03-03Sick=IF(A3-A2>5;0;IF(B3="Sick"; IF(B2="Sick";A3-A2+C2;C2+1);IF(B2="Sick";C2+A3-1-A2;C2)))

    In SQL I've solved my problem using Recursive CTE, but that takes way to long. I've manged do narrow it down to roughly 3 seconds per person, but Im dealing with hundreds of thousands of rows so that solution is not doable.

    Is it possible to achive the numbering showing same as GoalNbr using SQL technique for gaps and islands (lag, lead, row_number, dense_rank and so on), and how would that solution look?

    Regards,

    Alfred

    SQL Code (if link gets broken)

    create table sample
    (
    RegDate date
    , Type varchar(10)
    , GoalNbr int
    )

    insert into sample values ('2022-02-15', 'Work', null)
    insert into sample values ('2022-02-17', 'Work', null)
    insert into sample values ('2022-02-18', 'Work', null)
    insert into sample values ('2022-02-21', 'Work', null)
    insert into sample values ('2022-02-22', 'Work', null)
    insert into sample values ('2022-02-23', 'Work', null)
    insert into sample values ('2022-02-24', 'Sick', 1)
    insert into sample values ('2022-02-25', 'Work', null)
    insert into sample values ('2022-02-26', 'Work', null)
    insert into sample values ('2022-02-27', 'Work', null)
    insert into sample values ('2022-02-28', 'Work', null)
    insert into sample values ('2022-03-01', 'Work', null)
    insert into sample values ('2022-03-03', 'Sick', 1)
    insert into sample values ('2022-03-07', 'Sick', 5)
    insert into sample values ('2022-03-08', 'Sick', 6)
    insert into sample values ('2022-03-10', 'Work', null)
    insert into sample values ('2022-03-11', 'Sick', 8)
    insert into sample values ('2022-03-14', 'Work', null)
    insert into sample values ('2022-03-15', 'Work', null)
    insert into sample values ('2022-03-17', 'Sick', 11)
    insert into sample values ('2022-03-18', 'Work', null)
    insert into sample values ('2022-03-19', 'Work', null)
    insert into sample values ('2022-03-20', 'Work', null)
    insert into sample values ('2022-03-21', 'Sick', 12)
    insert into sample values ('2022-03-22', 'Work', null)
    insert into sample values ('2022-03-24', 'Sick', 13)
    insert into sample values ('2022-03-25', 'Work', null)
    insert into sample values ('2022-03-28', 'Work', null)
    insert into sample values ('2022-03-29', 'Sick', 14)
    insert into sample values ('2022-03-31', 'Sick', 15)
    insert into sample values ('2022-04-01', 'Sick', 16)
    insert into sample values ('2022-04-04', 'Work', null)
    insert into sample values ('2022-04-05', 'Work', null)
    insert into sample values ('2022-04-07', 'Work', null)

    select * from sample order by RegDate
  • I don't understand how you get to those results, can you attach the spreadsheet with some data and the formula?

  • As you have not included any examples of the 7 day condition in the test data I am going to ignore it.

    What I have done works with the test data and should give you an idea of how to achieve your result.

    WITH Boundaries
    AS
    (
    SELECT RegDate, [Type], GoalNbr
    ,CASE
    WHEN DATEDIFF(day, COALESCE(MAX(CASE WHEN [Type] = 'Sick' THEN RegDate ELSE '1900' END) OVER (ORDER BY RegDate ROWS BETWEEN 7 PRECEDING AND 1 PRECEDING), '1900'), RegDATE) > 6
    AND [Type] = 'Sick'
    THEN 1
    ELSE 0
    END AS Boundary5
    ,CASE
    WHEN LAG([Type]) OVER (ORDER BY RegDate) = 'Sick'
    THEN (DATEDIFF(day, LAG(RegDate) OVER (ORDER BY RegDate), RegDate)) - 1
    ELSE 0
    END AS Days2Add
    FROM [sample]
    )
    ,Grps
    AS
    (
    SELECT RegDate, [Type], GoalNbr
    ,SUM(Boundary5) OVER (ORDER BY RegDate) AS Grp
    ,Days2Add
    FROM Boundaries
    )
    ,RNs
    AS
    (
    SELECT RegDate, [Type], GoalNbr, Grp, Days2Add
    ,ROW_NUMBER() OVER (PARTITION BY Grp, [Type] ORDER BY RegDate) AS rn
    FROM Grps
    )
    SELECT RegDate, [Type], GoalNbr
    ,CASE
    WHEN [Type] = 'Sick'
    THEN rn + SUM(Days2Add) OVER (PARTITION BY Grp ORDER BY RegDate)
    END AS CalcValue
    FROM RNs
    ORDER BY RegDate;
  • I've done some testing and when I added partition by PersonId and Group7 it gives me exactly what I want.

    The trick with datediff  (Boundary5 in CTE Boundaries) was what I missed. Very clever solution

    Thank you so much Ken ❤️

Viewing 4 posts - 1 through 3 (of 3 total)

You must be logged in to reply to this topic. Login to reply