enumerating gaps between islands, ideas?

  • Hi all,

    I have a system log with NULL gaps between a sequence of numbers...see "BEFORE" sample below.

    The number of gaps between the Sequence_ID's are arbitrary, but generally less then 50 records.

    I'd like enumerate the gaps to produce the "AFTER" result, but do it with a single query or view, not through procedures.

    I've been playing with windowed functions and groupings with no success. I'm guessing it'll need some recursive CTE logic, but I haven't been able to figure it out the correct loop.

    Anyone have any ideas?

    Thanks!

    BEFORE:

    PK_IDSequence_ID

    1035586935587

    3035586234 NULL

    8355585 NULL

    1235584 NULL

    4675583 35583

    4035582 NULL

    6035382 NULL

    1435581 NULL

    2035580 NULL

    3435553 35563

    6603589 NULL

    9475559 35552

    AFTER:

    PK_IDSequence_ID

    1035586935587

    3035586234 3

    8355585 2

    1235584 1

    4675583 35583

    4035582 4

    6035382 3

    1435581 2

    2035580 1

    3435553 35563

    6603589 1

    9475559 35552

  • Quick thought, no need to use recursion or loops, rather use the LAG window function. Here is a quick sample code with some test data.

    😎

    USE tempdb;

    GO

    DECLARE @TEST_DATA TABLE

    (

    TD_ID INT IDENTITY(1,1) NOT NULL PRIMARY KEY CLUSTERED

    ,TD_NUM INT NOT NULL

    );

    DECLARE @SAMPLE_SIZE INT = 10;

    ;WITH T(N) AS (SELECT N FROM ( VALUES (NULL),(NULL),(NULL),(NULL),(NULL)

    ,(NULL),(NULL),(NULL),(NULL),(NULL)) AS X(N))

    ,NUMS(N) AS(SELECT TOP(@SAMPLE_SIZE) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS N

    FROM T T1, T T2, T T3, T T4, T T5, T T6, T T7)

    INSERT INTO @TEST_DATA (TD_NUM)

    SELECT

    ABS(FLOOR(CHECKSUM(NEWID()) / (RAND() * 10)))

    FROM NUMS NM;

    SELECT

    TD.TD_ID

    ,TD.TD_NUM

    ,TD.TD_NUM - LAG(TD.TD_NUM) OVER (ORDER BY TD.TD_ID) AS GAP

    FROM @TEST_DATA TD;

    Results

    TD_ID TD_NUM GAP

    ----------- ----------- ----------

    1 467441025 NULL

    2 336642282 -130798743

    3 336717627 75345

    4 355014646 18297019

    5 52905649 -302108997

    6 303559379 250653730

    7 316539004 12979625

    8 181767044 -134771960

    9 521113123 339346079

    10 392579077 -128534046

  • Thanks, I'll take a look.

    Realized my sample data was bad. The PK will be in increasing order, which might help, but it'll have same gaps, like the Sequence_ID.

    Here's better sample code:

    DECLARE @system_log TABLE(

    PK_ID int PRIMARY KEY

    ,Sequence_ID int null

    )

    INSERT @system_log(

    PK_ID

    ,Sequence_ID

    )VALUES

    (1035590, 35587),

    (1035589, NULL),

    (1035586, NULL),

    (1035585, NULL),

    (1035584, NULL),

    (1035583, 35583),

    (1035582, NULL),

    (1035581, NULL),

    (1035579, NULL),

    (1035578, 35553),

    (1035554, NULL),

    (1035550, 35550)

    SELECT

    *

    FROM @system_log

    ORDER by PK_ID desc

    Result:

    PK_IDSequence_ID

    103559035587

    1035589NULL

    1035586NULL

    1035585NULL

    1035584NULL

    103558335583

    1035582NULL

    1035581NULL

    1035579NULL

    103557835553

    1035554NULL

    103555035550

    Expected Result:

    PK_IDSequence_ID

    103559035587

    10355894

    10355863

    10355852

    10355841

    103558335583

    10355823

    10355812

    10355791

    103557835553

    10355541

    103555035550

  • Here is another code example using your data sample

    😎

    USE tempdb;

    GO

    ;WITH system_log(PK_ID,Sequence_ID) AS

    ( SELECT PK_ID,Sequence_ID FROM

    (VALUES

    (1035590, 35587),

    (1035589, NULL),

    (1035586, NULL),

    (1035585, NULL),

    (1035584, NULL),

    (1035583, 35583),

    (1035582, NULL),

    (1035581, NULL),

    (1035579, NULL),

    (1035578, 35553),

    (1035554, NULL),

    (1035550, 35550)) AS X(PK_ID,Sequence_ID)

    )

    SELECT

    SL.PK_ID

    ,LEAD(SL.PK_ID,1,SL.PK_ID) OVER

    (

    ORDER BY SL.PK_ID

    ) AS LEAD_PK_ID

    ,LAG(SL.PK_ID,1,SL.PK_ID) OVER

    (

    ORDER BY SL.PK_ID

    ) AS LAG_PK_ID

    ,ISNULL(SL.Sequence_ID,LEAD(SL.PK_ID,1,SL.PK_ID) OVER

    (

    ORDER BY SL.PK_ID

    ) - SL.PK_ID) AS FORWARD_GAP

    ,ISNULL(SL.Sequence_ID,SL.PK_ID - LAG(SL.PK_ID,1,SL.PK_ID) OVER

    (

    ORDER BY SL.PK_ID

    ) ) AS REVERSE_GAP

    FROM system_log SL;

    Results

    PK_ID LEAD_PK_ID LAG_PK_ID FORWARD_GAP REVERSE_GAP

    ----------- ----------- ----------- ----------- -----------

    1035550 1035554 1035550 35550 35550

    1035554 1035578 1035550 24 4

    1035578 1035579 1035554 35553 35553

    1035579 1035581 1035578 2 1

    1035581 1035582 1035579 1 2

    1035582 1035583 1035581 1 1

    1035583 1035584 1035582 35583 35583

    1035584 1035585 1035583 1 1

    1035585 1035586 1035584 1 1

    1035586 1035589 1035585 3 1

    1035589 1035590 1035586 1 3

    1035590 1035590 1035589 35587 35587

  • If you Binoogle Itzik Ben-Gan gaps and islands TSQL all of the top hits are relevant to your need.

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • Thanks for the tip. I've actually recently purchased "Microsoft SQL Server 2012 High-Performance T-SQL Using Window Functions" written by Itzik Ben-Gan.

    I think the main problem is the data set doesn't have a predictable Window to partition against, so I can't use Window Functions w/o have a way to traverse the data.

    I'm working on a query using recursive CTE and LEAD/LAG. Looking at using a CROSS APPLY or FOR XML subqueries to generate the "Window"

    I might break down and write a procedural solution. Still really want this as a single query.

  • Here is a potential solution. It works with the sample data provided.

    DECLARE @system_log TABLE(

    PK_ID int PRIMARY KEY

    ,Sequence_ID int null

    )

    INSERT @system_log(

    PK_ID

    ,Sequence_ID

    )VALUES

    (1035590, 35587),

    (1035589, NULL),

    (1035586, NULL),

    (1035585, NULL),

    (1035584, NULL),

    (1035583, 35583),

    (1035582, NULL),

    (1035581, NULL),

    (1035579, NULL),

    (1035578, 35553),

    (1035554, NULL),

    (1035550, 35550);

    WITH BaseData as (

    SELECT

    PK_ID,

    Sequence_ID,

    rn3 = row_number() over (order by PK_ID) - row_number() over (partition by case when Sequence_ID is null then 0 else 1 end order by PK_ID)

    FROM

    @system_log

    ), InertimData as (

    select

    bd1.PK_ID,

    bd1.Sequence_ID,

    cn = row_number() over (partition by case when bd1.Sequence_ID is null then 0 else 1 end, bd1.rn3 order by bd1.PK_ID)

    from

    BaseData bd1

    )

    select

    PK_ID,

    coalesce(Sequence_ID, cn) Sequence_ID

    from

    InertimData

    order by

    PK_ID desc;

  • Not exactly a traditional gaps and islands problem but fun nevertheless. Here's a relatively simple alternative.

    DECLARE @system_log TABLE(

    PK_ID int PRIMARY KEY

    ,Sequence_ID int null

    )

    INSERT @system_log(

    PK_ID

    ,Sequence_ID

    )VALUES

    (1035590, 35587),

    (1035589, NULL),

    (1035586, NULL),

    (1035585, NULL),

    (1035584, NULL),

    (1035583, 35583),

    (1035582, NULL),

    (1035581, NULL),

    (1035579, NULL),

    (1035578, 35553),

    (1035554, NULL),

    (1035552, 35551),

    (1035550, 35550)

    SELECT PK_ID, Sequence_ID=CASE WHEN Sequence_ID IS NOT NULL THEN Sequence_ID ELSE

    ROW_NUMBER() OVER (PARTITION BY rn2 ORDER BY PK_ID)-1 END

    FROM

    (

    SELECT PK_ID, Sequence_ID

    ,rn2=MAX(Sequence_ID) OVER (ORDER BY PK_ID ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)

    FROM @system_log

    ) a

    ORDER by PK_ID desc;


    My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]

    My thought question: Have you ever been told that your query runs too fast?

    My advice:
    INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
    The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

    Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
    Since random numbers are too important to be left to chance, let's generate some![/url]
    Learn to understand recursive CTEs by example.[/url]
    [url url=http://www.sqlservercentral.com/articles/St

  • Awesome, thanks dwain! Very elegant solution. I should be able to transfer this logic to the actual data quite easily.

  • Thanks Lynn as well. Love having alternatives in the toolbelt.

  • dwain.c (9/1/2014)


    Not exactly a traditional gaps and islands problem but fun nevertheless. Here's a relatively simple alternative.

    DECLARE @system_log TABLE(

    PK_ID int PRIMARY KEY

    ,Sequence_ID int null

    )

    INSERT @system_log(

    PK_ID

    ,Sequence_ID

    )VALUES

    (1035590, 35587),

    (1035589, NULL),

    (1035586, NULL),

    (1035585, NULL),

    (1035584, NULL),

    (1035583, 35583),

    (1035582, NULL),

    (1035581, NULL),

    (1035579, NULL),

    (1035578, 35553),

    (1035554, NULL),

    (1035552, 35551),

    (1035550, 35550)

    SELECT PK_ID, Sequence_ID=CASE WHEN Sequence_ID IS NOT NULL THEN Sequence_ID ELSE

    ROW_NUMBER() OVER (PARTITION BY rn2 ORDER BY PK_ID)-1 END

    FROM

    (

    SELECT PK_ID, Sequence_ID

    ,rn2=MAX(Sequence_ID) OVER (ORDER BY PK_ID ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)

    FROM @system_log

    ) a

    ORDER by PK_ID desc;

    Dwain, I like yours as it appears to be more efficient than mine. I rewrote it slightly to make it cleaner in my opinion.

    DECLARE @system_log TABLE(

    PK_ID int PRIMARY KEY

    ,Sequence_ID int null

    )

    INSERT @system_log(

    PK_ID

    ,Sequence_ID

    )VALUES

    (1035590, 35587),

    (1035589, NULL),

    (1035586, NULL),

    (1035585, NULL),

    (1035584, NULL),

    (1035583, 35583),

    (1035582, NULL),

    (1035581, NULL),

    (1035579, NULL),

    (1035578, 35553),

    (1035554, NULL),

    (1035550, 35550);

    WITH BaseData as (

    SELECT

    PK_ID,

    Sequence_ID,

    rn2 = MAX(Sequence_ID) OVER (ORDER BY PK_ID ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)

    FROM

    @system_log

    )

    SELECT

    PK_ID,

    Sequence_ID = CASE WHEN Sequence_ID IS NOT NULL

    THEN Sequence_ID

    ELSE ROW_NUMBER() OVER (PARTITION BY rn2 ORDER BY PK_ID) - 1

    END

    FROM

    BaseData

    ORDER BY

    PK_ID DESC;

  • Lynn Pettis (9/1/2014)


    dwain.c (9/1/2014)


    Not exactly a traditional gaps and islands problem but fun nevertheless. Here's a relatively simple alternative.

    DECLARE @system_log TABLE(

    PK_ID int PRIMARY KEY

    ,Sequence_ID int null

    )

    INSERT @system_log(

    PK_ID

    ,Sequence_ID

    )VALUES

    (1035590, 35587),

    (1035589, NULL),

    (1035586, NULL),

    (1035585, NULL),

    (1035584, NULL),

    (1035583, 35583),

    (1035582, NULL),

    (1035581, NULL),

    (1035579, NULL),

    (1035578, 35553),

    (1035554, NULL),

    (1035552, 35551),

    (1035550, 35550)

    SELECT PK_ID, Sequence_ID=CASE WHEN Sequence_ID IS NOT NULL THEN Sequence_ID ELSE

    ROW_NUMBER() OVER (PARTITION BY rn2 ORDER BY PK_ID)-1 END

    FROM

    (

    SELECT PK_ID, Sequence_ID

    ,rn2=MAX(Sequence_ID) OVER (ORDER BY PK_ID ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)

    FROM @system_log

    ) a

    ORDER by PK_ID desc;

    Dwain, I like yours as it appears to be more efficient than mine. I rewrote it slightly to make it cleaner in my opinion.

    DECLARE @system_log TABLE(

    PK_ID int PRIMARY KEY

    ,Sequence_ID int null

    )

    INSERT @system_log(

    PK_ID

    ,Sequence_ID

    )VALUES

    (1035590, 35587),

    (1035589, NULL),

    (1035586, NULL),

    (1035585, NULL),

    (1035584, NULL),

    (1035583, 35583),

    (1035582, NULL),

    (1035581, NULL),

    (1035579, NULL),

    (1035578, 35553),

    (1035554, NULL),

    (1035550, 35550);

    WITH BaseData as (

    SELECT

    PK_ID,

    Sequence_ID,

    rn2 = MAX(Sequence_ID) OVER (ORDER BY PK_ID ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)

    FROM

    @system_log

    )

    SELECT

    PK_ID,

    Sequence_ID = CASE WHEN Sequence_ID IS NOT NULL

    THEN Sequence_ID

    ELSE ROW_NUMBER() OVER (PARTITION BY rn2 ORDER BY PK_ID) - 1

    END

    FROM

    BaseData

    ORDER BY

    PK_ID DESC;

    Thanks Lynn. Guess I skipped the last part of Make it Work, Make it Fast, Make it Pretty[/url]

    🙂


    My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]

    My thought question: Have you ever been told that your query runs too fast?

    My advice:
    INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
    The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

    Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
    Since random numbers are too important to be left to chance, let's generate some![/url]
    Learn to understand recursive CTEs by example.[/url]
    [url url=http://www.sqlservercentral.com/articles/St

  • Dwain,

    Actually, I just wanted to understand the pieces. I really need to learn the extensions to the windowing functions in 2012.

Viewing 13 posts - 1 through 12 (of 12 total)

You must be logged in to reply to this topic. Login to reply