Gaps and islands sql : Condense upstream data

  • Hi ,

    There is data from upstream of different buses usage with StartDateKey and EndDateKey . There is need to condense data as shown in scenario 1 and 2 .

    In scenario 1 , Datekey sequence is followed , where EndDatekey of current Row is previous day of StartDateKey in next row of same bus .

    In Scenario 2 , DateKey sequence is followed but there is gap of n days (n>1) (Highlighted in Colors)

    Following Query is working fine for scenario 1 .

    DROP table #temp

    CREATE table #temp (SysKey int identity(1,1) ,BusName varchar(9) ,BusType varchar(6) ,StartDateKey int ,EndDatekey int )

    Insert into #temp (BusName,BusType,StartDateKey,EndDatekey)

    Select 'A' ,'AC',20130101,20130112 union ALL

    Select 'A' ,'AC', 20130113 ,20130115 union ALL

    Select 'A' ,'AC', 20130116 ,20130118 Union ALL

    Select 'B' , 'NON-AC', 20130119 ,20130121 union ALL

    Select 'B' , 'NON-AC', 20130122 ,20130124 Union ALL

    Select 'A' ,'AC', 20130125, NULL

    ;WITH SequencedData AS (

    SELECT

    SysKey, BusName, BusType, StartDateKey, EndDatekey,

    seq = SysKey - ROW_NUMBER() OVER(PARTITION BY BusName ORDER BY StartDateKey)

    FROM #temp

    )

    SELECT

    SysKey = MAX(SysKey),

    BusName, BusType,

    StartDate = MIN(StartDateKey),

    EndDate = MAX(EndDatekey)

    FROM SequencedData

    GROUP BY BusName, BusType, seq

    ORDER BY MAX(SysKey)

    How to tweak above query to handle both scenario 1 and scenario 2

    Thanks

    Surya Sunil

  • do a web search for: itzik ben-gan gaps and islands

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • do you need SysKey?

    I'm testing out this scenario where the later date is earlier in the insert and syskey for the later date will have a lower value.

    The max(syskey) requirement is making me spin in my chair.

    Can only seem to make something close with a recursive CTE.

    CREATE table #temp2 (SysKey int identity(1,1) ,BusName varchar(9) ,BusType varchar(6) ,StartDateKey Datetime ,EndDatekey Datetime)

    SET DATEFORMAT YMD

    Insert into #temp2 (BusName,BusType,StartDateKey,EndDatekey)

    Select 'A' ,'AC','20130108','20130109' union ALL

    Select 'A' ,'AC','20130101','20130105' union ALL

    Select 'A' ,'AC','20130106','20130107' union ALL

    Select 'A' ,'AC', '20130117' ,'20130118' Union ALL

    Select 'A' ,'AC', '20130113' ,'20130115' union ALL

    Select 'B' , 'NON-AC', '20130119' ,'20130121' union ALL

    Select 'B' , 'NON-AC', '20130122' ,'20130124' Union ALL

    Select 'A' ,'AC', '20130125', NULL

    Catch-all queries done right [/url]
    Gail Shaw's Performance Blog[/url]

  • is the sample data you have provided already been processed to create a "pivot/crosstab"?

    why are you storing dates as integer?

    ________________________________________________________________
    you can lead a user to data....but you cannot make them think
    and remember....every day is a school day

  • maybe.....

    EDIT ::: maybe not.......further testing required me thinks !

    ________________________________________________________________
    you can lead a user to data....but you cannot make them think
    and remember....every day is a school day

  • Hi ,

    Thanks for looking into post .

    Dates are integer in the upstream , which I need to consolidate for Datawarehouse purpose .

  • Hi ,

    MAX(Syskey) is not required . I made it , when i was doing hit and trail .

    Thanks

    Surya Sunil

  • sunil.mvs (4/27/2016)


    Hi ,

    MAX(Syskey) is not required . I made it , when i was doing hit and trail .

    Thanks

    Surya Sunil

    Tried this, seems to work, but I made all the dates into datetime.

    If you want it to work with int, you may need to hack this to make it convert on the fly since using an int, it wont work when going from one month into another month cos it does a data add of one day and joins.

    DROP table #temp

    CREATE table #temp (SysKey int identity(1,1) ,BusName varchar(9) ,BusType varchar(6) ,StartDateKey Datetime ,EndDatekey Datetime)

    SET DATEFORMAT YMD

    Insert into #temp (BusName,BusType,StartDateKey,EndDatekey)

    Select 'A' ,'AC','20130101','20130112' union ALL

    Select 'A' ,'AC', '20130113' ,'20130115' union ALL

    Select 'A' ,'AC', '20130117' ,'20130118' Union ALL

    Select 'B' , 'NON-AC', '20130119' ,'20130121' union ALL

    Select 'B' , 'NON-AC', '20130122' ,'20130124' Union ALL

    Select 'A' ,'AC', '20130125', NULL

    ;WITH recCTE AS (

    SELECT BusName,BusType,StartDateKey,EndDatekey,

    Convert(varchar(9), '') BusName2, Convert(varchar(6), '') BusType2, CONVERT(DateTime, Null) StartDateKey2, CONVERT(DateTime, Null) EndDatekey2,

    0 as [Level], CONVERT(DateTime, StartDateKey) startd, CONVERT(DateTime, EndDatekey) endd --, CONVERT(DateTime, StartDateKey) minDate

    FROM #temp

    UNION ALL

    SELECT

    A.BusName, A.BusType,A.StartDateKey,A.EndDatekey,

    B.BusName, B.BusType, B.StartDateKey, B.EndDatekey,

    [Level] + 1, ISNULL(B.StartDateKey2,B.StartDateKey), A.EndDatekey --,CASE WHEN B.startd < minDate THEN B.startd ELSE A.StartDateKey END

    FROM #temp AS A

    INNER JOIN recCTE AS B ON A.StartDateKey = B.EndDatekey + 1 AND A.Busname = B.busName AND a.bustype = b.bustype

    )

    select BusName, BusType,startd,endd from (

    SELECT startd,endd,BusName, BusType,RANK() OVER (PARTITION BY startd, BusName, BusType ORDER BY Level DESC) rid1,

    RANK() OVER (PARTITION BY endD, BusName, BusType ORDER BY Level DESC) RID2

    FROM recCTE AA )window

    Where RID1*RID2=1

    order by Startd,Busname,BusType

    Catch-all queries done right [/url]
    Gail Shaw's Performance Blog[/url]

  • would you ever have a situation where enddate of one row is the same as startdate of another row?

    eg

    DROP table #temp

    CREATE table #temp (SysKey int identity(1,1) ,BusName varchar(9) ,BusType varchar(6) ,StartDateKey Datetime ,EndDatekey Datetime)

    SET DATEFORMAT YMD

    Insert into #temp (BusName,BusType,StartDateKey,EndDatekey)

    Select 'A' ,'AC','20130101','20130105' union ALL

    Select 'A' ,'AC', '20130105' ,'20130110' union ALL

    Select 'A' ,'AC', '20130110' ,'20130115' union ALL

    Select 'A' ,'AC', '20130117' ,'20130118' Union ALL

    Select 'B' , 'NON-AC', '20130119' ,'20130121' union ALL

    Select 'B' , 'NON-AC', '20130122' ,'20130124' Union ALL

    Select 'A' ,'AC', '20130125', NULL

    ________________________________________________________________
    you can lead a user to data....but you cannot make them think
    and remember....every day is a school day

  • MadAdmin (4/28/2016)


    sunil.mvs (4/27/2016)


    Hi ,

    MAX(Syskey) is not required . I made it , when i was doing hit and trail .

    Thanks

    Surya Sunil

    Tried this, seems to work, but I made all the dates into datetime.

    If you want it to work with int, you may need to hack this to make it convert on the fly since using an int, it wont work when going from one month into another month cos it does a data add of one day and joins.

    DROP table #temp

    CREATE table #temp (SysKey int identity(1,1) ,BusName varchar(9) ,BusType varchar(6) ,StartDateKey Datetime ,EndDatekey Datetime)

    SET DATEFORMAT YMD

    Insert into #temp (BusName,BusType,StartDateKey,EndDatekey)

    Select 'A' ,'AC','20130101','20130112' union ALL

    Select 'A' ,'AC', '20130113' ,'20130115' union ALL

    Select 'A' ,'AC', '20130117' ,'20130118' Union ALL

    Select 'B' , 'NON-AC', '20130119' ,'20130121' union ALL

    Select 'B' , 'NON-AC', '20130122' ,'20130124' Union ALL

    Select 'A' ,'AC', '20130125', NULL

    ;WITH recCTE AS (

    SELECT BusName,BusType,StartDateKey,EndDatekey,

    Convert(varchar(9), '') BusName2, Convert(varchar(6), '') BusType2, CONVERT(DateTime, Null) StartDateKey2, CONVERT(DateTime, Null) EndDatekey2,

    0 as [Level], CONVERT(DateTime, StartDateKey) startd, CONVERT(DateTime, EndDatekey) endd --, CONVERT(DateTime, StartDateKey) minDate

    FROM #temp

    UNION ALL

    SELECT

    A.BusName, A.BusType,A.StartDateKey,A.EndDatekey,

    B.BusName, B.BusType, B.StartDateKey, B.EndDatekey,

    [Level] + 1, ISNULL(B.StartDateKey2,B.StartDateKey), A.EndDatekey --,CASE WHEN B.startd < minDate THEN B.startd ELSE A.StartDateKey END

    FROM #temp AS A

    INNER JOIN recCTE AS B ON A.StartDateKey = B.EndDatekey + 1 AND A.Busname = B.busName AND a.bustype = b.bustype

    )

    select BusName, BusType,startd,endd from (

    SELECT startd,endd,BusName, BusType,RANK() OVER (PARTITION BY startd, BusName, BusType ORDER BY Level DESC) rid1,

    RANK() OVER (PARTITION BY endD, BusName, BusType ORDER BY Level DESC) RID2

    FROM recCTE AA )window

    Where RID1*RID2=1

    order by Startd,Busname,BusType

    not getting the expected results for the following

    Insert into #temp (BusName,BusType,StartDateKey,EndDatekey)

    Select 'A' ,'AC', '20130401', '20130402' union ALL

    Select 'A' ,'AC', '20130403' ,'20130404' union ALL

    Select 'A' ,'AC', '20130405' ,'20130406' union ALL

    Select 'A' ,'AC', '20130407' ,'20130408' Union ALL

    Select 'A' ,'AC', '20130409' ,'20130410' Union ALL

    Select 'A' ,'AC', '20130411' ,'20130430' ;

    +-----------------------------------------------------------------------+

    ¦ BusName ¦ BusType ¦ startd ¦ endd ¦

    ¦---------+---------+-------------------------+-------------------------¦

    ¦ A ¦ AC ¦ 2013-04-01 00:00:00.000 ¦ 2013-04-06 00:00:00.000 ¦

    ¦ A ¦ AC ¦ 2013-04-03 00:00:00.000 ¦ 2013-04-08 00:00:00.000 ¦

    ¦ A ¦ AC ¦ 2013-04-05 00:00:00.000 ¦ 2013-04-10 00:00:00.000 ¦

    ¦ A ¦ AC ¦ 2013-04-07 00:00:00.000 ¦ 2013-04-30 00:00:00.000 ¦

    +-----------------------------------------------------------------------+

    --EDIT : assume it should be like this

    +-----------------------------------------------------------------------+

    ¦ BusName ¦ BusType ¦ sdate ¦ edate ¦

    ¦---------+---------+-------------------------+-------------------------¦

    ¦ A ¦ AC ¦ 2013-04-01 00:00:00.000 ¦ 2013-04-30 00:00:00.000 ¦

    +-----------------------------------------------------------------------+

    ________________________________________________________________
    you can lead a user to data....but you cannot make them think
    and remember....every day is a school day

  • J Livingston SQL (4/28/2016)

    not getting the expected results for the following

    [/code]

    Hi

    Yes seems to not work in this case, strangely. I am guessing cos the CTE is making multiple roots to the trees in the joins.

    Is there any other non recrusive CTE method you can think of to solve this or maybe add an easy change onto the query?

    I did this purely to avoid using a cursor .(and hopefully the mention of a cursor could make other readers look into solving this type of problem as well.)

    It is a fun problem.

    Catch-all queries done right [/url]
    Gail Shaw's Performance Blog[/url]

  • maybe.....

    CREATE table #temp (SysKey int identity(1,1) ,BusName varchar(9) ,BusType varchar(6) ,StartDateKey Datetime ,EndDatekey Datetime)

    Insert into #temp (BusName,BusType,StartDateKey,EndDatekey)

    Select 'A' ,'AC', '20130401', '20130402' union ALL

    Select 'A' ,'AC', '20130403' ,'20130404' union ALL

    Select 'A' ,'AC', '20130405' ,'20130406' union ALL

    Select 'A' ,'AC', '20130407' ,'20130408' Union ALL

    Select 'A' ,'AC', '20130409' ,'20130410' Union ALL

    Select 'A' ,'AC', '20130411' ,'20130430' Union ALL

    Select 'B' ,'BC', '20130401', '20130401' union ALL

    Select 'B' ,'BC', '20130401' ,'20130404' union ALL

    Select 'B' ,'BC', '20130404' ,'20130406' union ALL

    Select 'B' ,'BC', '20130410' ,'20130420' ;

    WITH cte as (

    SELECT BusName,

    BusType,

    StartDateKey,

    EndDatekey,

    CASE

    WHEN DATEDIFF(day, LAG(enddatekey, 1, StartDateKey) OVER(PARTITION BY BusName, BusType ORDER BY Startdatekey ), StartDateKey) < 2

    THEN NULL

    ELSE ROW_NUMBER() OVER(PARTITION BY BusName, BusType ORDER BY Startdatekey )

    END sortkey,

    ROW_NUMBER() OVER(PARTITION BY BusName, BusType ORDER BY Startdatekey ) rn

    FROM #temp

    )

    ,

    cte2 as (

    SELECT BusName,

    BusType,

    StartDateKey,

    EndDatekey,

    CAST(

    SUBSTRING(

    MAX( CAST(rn AS BINARY(4)) + CAST(sortkey AS BINARY(4)) )

    OVER( PARTITION BY BusName, BusType ORDER BY rn ROWS UNBOUNDED PRECEDING ),

    5, 4)

    AS INT) AS lastval

    FROM cte

    )

    SELECT BusName,

    BusType,

    MIN(StartDateKey) AS sdate,

    MAX(EndDatekey) AS edate

    FROM cte2

    GROUP BY BusName, BusType, lastval

    ORDER BY BusName, BusType, sdate

    DROP TABLE #temp

    ________________________________________________________________
    you can lead a user to data....but you cannot make them think
    and remember....every day is a school day

  • Here's another way, starting with the dates as INT as in the OP's situation.

    Sample Data:

    CREATE table #temp (SysKey int identity(1,1) ,BusName varchar(9) ,BusType varchar(6) ,StartDateKey int ,EndDatekey int)

    Insert into #temp (BusName,BusType,StartDateKey,EndDatekey)

    Select 'A' ,'AC', 20130401, 20130402 union ALL

    Select 'A' ,'AC', 20130403 ,20130404 union ALL

    Select 'A' ,'AC', 20130405, 20130406 union ALL

    Select 'A' ,'AC', 20130407 ,20130408 Union ALL

    Select 'A' ,'AC', 20130409 ,20130410 Union ALL

    Select 'A' ,'AC', 20130411 ,20130430 Union ALL

    Select 'B' ,'BC', 20130401, 20130401 union ALL

    Select 'B' ,'BC', 20130401 ,20130404 union ALL

    Select 'B' ,'BC', 20130404 ,20130406 union ALL

    Select 'B' ,'BC', 20130410 ,20130420 ;

    Solution:

    WITH convert_to_date AS

    (

    SELECT BusName,

    BusType,

    StartDateKey=CAST(CAST(StartDateKey AS CHAR(8)) AS DATE),

    EndDateKey=CAST(CAST(EndDateKey AS CHAR(8)) AS DATE)

    FROM #temp

    ),

    covered_dates AS

    (

    SELECT *

    FROM convert_to_date

    CROSS APPLY

    (

    SELECT TOP(DATEDIFF(dd,StartDateKey,EndDateKey)+1)

    covered_date=DATEADD(dd,ROW_NUMBER() OVER (ORDER BY (SELECT NULL))-1,StartDateKey)

    FROM (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0))n1(n) --An on-the-fly numbers table

    CROSS JOIN

    (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0))n2(n)

    CROSS JOIN

    (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0))n3(n)

    CROSS JOIN

    (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0))n4(n)

    )x

    ),

    distinct_numbered_covered_dates AS

    (

    SELECT *,RN=ROW_NUMBER() OVER (PARTITION BY BusName ORDER BY covered_Date ASC)

    FROM (

    SELECT DISTINCT BusName,BusType,covered_date

    FROM covered_dates

    )x

    )

    SELECT BusName,

    BusType,

    StartDateKey=MIN(covered_date),

    EndDateKey=MAX(covered_date)

    FROM distinct_numbered_covered_dates

    GROUP BY BusName,BusType,DATEADD(dd,-RN,covered_date);

    The approach is this:

    1) Convert integers to dates

    2) For each row in the data, use CROSS APPLY to generate a list of the dates covered by that interval

    3) Since some ranges might overlap and cover the same dates, get distinct combinations of bus and covered dates

    4) Number the covered dates in ascending order, partitioned by bus

    5) Islands of consecutive dates will have the feature that subtracting ROW_NUMBER days from the covered_date will result in the same date, so just get MIN and MAX grouped by that expression.

    Cheers!

  • Hi Jacob.....

    on a larger data set I am getting

    A TOP or FETCH clause contains an invalid value.

    any ideas please?

    ________________________________________________________________
    you can lead a user to data....but you cannot make them think
    and remember....every day is a school day

  • J Livingston SQL (4/29/2016)


    Hi Jacob.....

    on a larger data set I am getting

    A TOP or FETCH clause contains an invalid value.

    any ideas please?

    Ah, most likely you have a NULL for the start or end date. Just have to modify to handle NULLs appropriately.

    Cheers!

Viewing 15 posts - 1 through 15 (of 24 total)

You must be logged in to reply to this topic. Login to reply