Islands and Gaps Question

  • Hey guys, feeling kinda dumb I can't figure this one out on my own. 🙁 Hopefully you can advise. Here's some sample data:

    IF OBJECT_ID('tempdb..#EMP_TEST') IS NOT NULL
        DROP TABLE #EMP_TEST
    GO

    CREATE TABLE #EMP_TEST
    (
        EID SMALLINT NOT NULL,
        Stat CHAR(1) NOT NULL,
        StartDate DATE NOT NULL,
        EndDate DATE NOT NULL
    )
    GO

    INSERT INTO #EMP_TEST
    (
        EID,
        Stat,
        StartDate,
        EndDate
    )
    SELECT 12345, 'A', '20111215', '20120101' UNION ALL
    SELECT 12345, 'A', '20120101', '20120118' UNION ALL
    SELECT 12345, 'A', '20120118', '20120201' UNION ALL
    SELECT 12345, 'A', '20120201', '20120321' UNION ALL
    SELECT 12345, 'A', '20120321', '20120401' UNION ALL
    SELECT 12345, 'A', '20120401', '20121109' UNION ALL
    SELECT 12345, 'L', '20121109', '20130101' UNION ALL
    SELECT 12345, 'L', '20130101', '20130103' UNION ALL
    SELECT 12345, 'A', '20130103', '20130203' UNION ALL
    SELECT 12345, 'R', '20130203', '20141210'
    ;

    This is a table that stores employee status data. A = Active, L = Leave, etc. What I need to do is consolidate these contiguous blocks into single rows. So, expected output would look like this:

    EIDStatStartDateEndDate
    12345A2011121520121109
    12345L2012110920130103
    12345A2013010320130203
    12345R2013020320141210

    The issue that's throwing me is that the person goes from A to L and back to A. So if I try to partition by EID, STAT in any type of windowing function, I get all screwd up. What is the sexy solution to this problem? Dazzle me please, SQL wizards!


    "If I had been drinking out of that toilet, I might have been killed." -Ace Ventura

  • It requires two different ROW_NUMBER()s.

    ;
    WITH StatusGroups AS
    (
        SELECT *
        ,    Grp = ROW_NUMBER() OVER(PARTITION BY EID ORDER BY StartDate, Stat) - ROW_NUMBER() OVER(PARTITION BY EID, Stat ORDER BY StartDate)
        FROM #EMP_TEST
    )
        SELECT EID, Stat, MIN(StartDate) AS StartDate, MAX(EndDate) AS EndDate
        FROM StatusGroups
        GROUP BY EID, Stat, Grp
        ORDER BY EID, StartDate
    ;

    Drew

    J. Drew Allen
    Business Intelligence Analyst
    Philadelphia, PA

  • DREW = TOO SEXY


    "If I had been drinking out of that toilet, I might have been killed." -Ace Ventura

  • autoexcrement - Friday, December 15, 2017 12:46 PM

    DREW = TOO SEXY

    what results would you expect for this set of data?

    INSERT INTO #EMP_TEST
    (
      EID,
      Stat,
      StartDate,
      EndDate
    )
    SELECT 12345, 'A', '20111215', '20120101' UNION ALL
    SELECT 12345, 'A', '20120101', '20120118' UNION ALL
    SELECT 12345, 'A', '20120118', '20120201' UNION ALL
    SELECT 12345, 'A', '20120201', '20120321' UNION ALL
    SELECT 12345, 'A', '20120321', '20120401' UNION ALL
    SELECT 12345, 'A', '20120401', '20121109' UNION ALL
    SELECT 12345, 'L', '20121109', '20130101' UNION ALL
    SELECT 12345, 'L', '20130101', '20130103' UNION ALL
    SELECT 12345, 'A', '20130103', '20130203' UNION ALL
    SELECT 12345, 'T', '20120203', '20120208'

    ________________________________________________________________
    you can lead a user to data....but you cannot make them think
    and remember....every day is a school day

  • take a look at this

  • J Livingston SQL - Friday, December 15, 2017 1:01 PM

    autoexcrement - Friday, December 15, 2017 12:46 PM

    DREW = TOO SEXY

    what results would you expect for this set of data?

    INSERT INTO #EMP_TEST
    (
      EID,
      Stat,
      StartDate,
      EndDate
    )
    SELECT 12345, 'A', '20111215', '20120101' UNION ALL
    SELECT 12345, 'A', '20120101', '20120118' UNION ALL
    SELECT 12345, 'A', '20120118', '20120201' UNION ALL
    SELECT 12345, 'A', '20120201', '20120321' UNION ALL
    SELECT 12345, 'A', '20120321', '20120401' UNION ALL
    SELECT 12345, 'A', '20120401', '20121109' UNION ALL
    SELECT 12345, 'L', '20121109', '20130101' UNION ALL
    SELECT 12345, 'L', '20130101', '20130103' UNION ALL
    SELECT 12345, 'A', '20130103', '20130203' UNION ALL
    SELECT 12345, 'T', '20120203', '20120208'

    My code was written with the assumption that the ranges do not overlap, since an overlap would indicate contradictory statuses.  That is, an employee cannot both be active and terminated at the same time.  The fact that this data contains such an overlap indicates that there is a data quality issue, not an issue with the code.

    EDIT: I should also not that my code assumes that the statuses are contiguous.  If there are gaps between statuses, then my code will not produce the correct results.

    Drew

    J. Drew Allen
    Business Intelligence Analyst
    Philadelphia, PA

  • drew.allen - Friday, December 15, 2017 1:18 PM

    J Livingston SQL - Friday, December 15, 2017 1:01 PM

    autoexcrement - Friday, December 15, 2017 12:46 PM

    DREW = TOO SEXY

    what results would you expect for this set of data?

    INSERT INTO #EMP_TEST
    (
      EID,
      Stat,
      StartDate,
      EndDate
    )
    SELECT 12345, 'A', '20111215', '20120101' UNION ALL
    SELECT 12345, 'A', '20120101', '20120118' UNION ALL
    SELECT 12345, 'A', '20120118', '20120201' UNION ALL
    SELECT 12345, 'A', '20120201', '20120321' UNION ALL
    SELECT 12345, 'A', '20120321', '20120401' UNION ALL
    SELECT 12345, 'A', '20120401', '20121109' UNION ALL
    SELECT 12345, 'L', '20121109', '20130101' UNION ALL
    SELECT 12345, 'L', '20130101', '20130103' UNION ALL
    SELECT 12345, 'A', '20130103', '20130203' UNION ALL
    SELECT 12345, 'T', '20120203', '20120208'

    My code was written with the assumption that the ranges do not overlap, since an overlap would indicate contradictory statuses.  That is, an employee cannot both be active and terminated at the same time.  The fact that this data contains such an overlap indicates that there is a data quality issue, not an issue with the code.

    Drew

    absolutely agree Drew and  it certainly was not my intention to question your code...I just wanted to ensure that the OP understood that overlaps (for what ever reason) will cause problems.

    ________________________________________________________________
    you can lead a user to data....but you cannot make them think
    and remember....every day is a school day

  • Yes, for the sake of this thread, it's safe to assume no overlaps. But good question. Thank you guys!


    "If I had been drinking out of that toilet, I might have been killed." -Ace Ventura

  • Joe Torre - Friday, December 15, 2017 1:07 PM

    take a look at this

    This article depends on data where the records are for one unit (day), but the data above uses ranges instead of a record for every unit.  You could convert the data above to one record per unit (day), but it would probably be less efficient than the code I supplied above, although it would account for gaps where my code assumes contiguous statuses.  However, if there are not contiguous statuses there is another method that will produce contiguous statuses that would probably still be more efficient.

    Drew

    J. Drew Allen
    Business Intelligence Analyst
    Philadelphia, PA


  • WITH
       emp
       (
        EID,
        Stat,
        StartDate,
        EndDate
       )
    AS
       (SELECT 12345, 'A', '20111215', '20120101' UNION ALL
        SELECT 12345, 'A', '20120101', '20120118' UNION ALL
        SELECT 12345, 'A', '20120118', '20120201' UNION ALL
        SELECT 12345, 'A', '20120201', '20120321' UNION ALL
        SELECT 12345, 'A', '20120321', '20120401' UNION ALL
        SELECT 12345, 'A', '20120401', '20121109' UNION ALL
        SELECT 12345, 'L', '20121109', '20130101' UNION ALL
        SELECT 12345, 'L', '20130101', '20130103' UNION ALL
        SELECT 12345, 'A', '20130103', '20130203' UNION ALL
        SELECT 12345, 'R', '20130203', '20141210'
       ),
    workCte
       ( GapStart
       , GapEnd
       , GapDays
       )
    AS
       (SELECT Lag(emp.StartDate) OVER (PARTITION BY emp.EID ORDER BY emp.EID, emp.StartDate)
          , emp.StartDate
          , DateDiff(dd, Lag(emp.StartDate) OVER (PARTITION BY emp.EID ORDER BY emp.EID, emp.StartDate), emp.StartDate)-1
        FROM emp
       ) ,
       Cal
          (
           dd
          ) AS
          (
           SELECT DateAdd(dd, n.number, (SELECT Min(emp.StartDate) FROM emp))
           FROM master.dbo.spt_values n
           WHERE n.type='p'
           AND n.number<DateDiff(dd,(SELECT Max(emp.EndDate) FROM emp),(SELECT Min(emp.StartDate) FROM emp))
          )   
    SELECT Coalesce(workCte.GapStart, 0)  GapStart
         , Coalesce(workCte.GapEnd, 0)    GapEnd
         , Coalesce(workCte.GapDays, 0)   GapDays
    FROM workCte
    ORDER BY workCte.GapStart;

Viewing 10 posts - 1 through 9 (of 9 total)

You must be logged in to reply to this topic. Login to reply