Finding Gaps in a Sequential Number Sequence

  • Comments posted to this topic are about the content posted at http://www.sqlservercentral.com/columnists/slasham/findinggapsinasequentialnumbersequence.asp

  • If you just need to identify IF a list of numbers is in sequence, but not where the gaps are, I believe you can do it in a single pass of the table.  (Warning -- to understand this algorithm, some high school math is required!)

    -- Mathematical way of identifying if a list of numbers is sequential

    -- Based on formula:

    -- M + (M+1) + (M+2) + ... + (M+n-1) = n*M + n*(n-1)/2

    -- Calling the left side S (for Sum), and rearranging yields:

    -- n2 + (2*M - 1)*n - 2*S = 0

    -- Then the quadratic formula produces:

    -- n = ( (1-2*M) +- sqrt( (2*M -1)2 + 8*S) )/2

    -- I believe it can be shown that a list of numbers is a sequence if

    -- and only if the value in the sqrt() above is a perfect square.

    -- The code below is based on that assumption

    --

    -- Mike Arney, 4/3/2006

    set

    nocount on

    create

    table Numbers (id int not null)

    insert

    Numbers values (6)

    insert

    Numbers values (7)

    insert

    Numbers values (8)

    insert

    Numbers values (9)

    insert

    Numbers values (10)

    insert

    Numbers values (11)

    insert

    Numbers values (12)

    -- insert Numbers values (14) -- uncomment for testing

    declare

    @Sum bigint, @Count bigint, @min-2 bigint

    select

    @Sum = sum(convert(bigint, id)), @Count = count(*), @min-2 = min(id) from Numbers

    declare

    @sqrt float

    select

    @sqrt = sqrt( power((2*@min-2 - 1), 2) + 8*@Sum)

    if

    @sqrt = floor(@sqrt) select 'Sequence' else select 'Not Sequence'

    go

    drop

    table Numbers

    go

     

  • A self join solution:

    select a.SeqNumber, max(b.SeqNumber)

    from #SequenceTable a join #SequenceTable b on a.SeqNumber > b.SeqNumber

    group by a.SeqNumber

    having a.SeqNumber - max(b.SeqNumber) > 1

  • This way is a few hundred times faster (on large datasets)...NOTE:the details are missing but it still identifies the key holes in the sequence, except beiging and end which are obvious...?

    SELECT s1.SeqNumber

    FROM #SequenceTable s1

    LEFT JOIN #SequenceTable s2

    ON s1.SeqNumber = s2.SeqNumber -1

    WHERE s2.SeqNumber IS NULL

     

     

     

  • I agree, the join on "where Seq2.SeqNumber < Seq1.SeqNumber" will be a killer with large tables.

    This version uses a simple a = b-1 join to find the gaps, then a subquery to find the last value before the gap.  On 1 million rows it ran in 1/4 the elapsed time and 1/13 the CPU time of the original in the article.

    select

    LastSeqNumber, NextSeqNumber

        , FirstAvailable = LastSeqNumber + 1

        , LastAvailable = NextSeqNumber - 1

        , NumbersAvailable = NextSeqNumber - (LastSeqNumber + 1)

    from (

        select (SELECT TOP 1 SeqNumber FROM #SequenceTable WHERE SeqNumber < a.SeqNumber ORDER BY SeqNumber DESC) as LastSeqNumber,

        a.SeqNumber as NextSeqNumber

        from #SequenceTable a

        left join #SequenceTable b on a.SeqNumber = b.SeqNumber + 1

        where b.SeqNumber IS NULL

    ) a

    order by LastSeqNumber

     

  • Nice solution!

    I would like to contribute too. If we only want to find IF there is a gap in the sequence we only need to verify that COUNT(*) + MIN(ID) = MAX(ID), right?

    In that case, if we have an index on the ID columns we only need a quick scan on it with very little additional CPU+Memory usage with:

    create table Numbers (id int not null)

    insert Numbers values (6)

    insert Numbers values (7)

    insert Numbers values (8)

    -- insert Numbers values (14) -- uncomment for testing

    -- DELETE Numbers WHERE id = 14

    SELECT CASE WHEN COUNT(*) + MIN(ID) - 1 = MAX(ID) THEN 'Sequence' ELSE 'Not Sequence' END FROM NUMBERS

    go

    drop table Numbers

  • Thanks all for the great examples of other ways to get the same results.  It is interesting to see these different methods and the understandings of how SQL works under the hood so to speak, with different timings on the various methods.  For me timing was not a big issue, and my 31 second run over a table of 3.5 million rows seemed fair.  Timing becomes more significant if a process is being repetatively and frequently run. 

     

    Over my dataset, Scott Coleman’s method ran in 25 seconds, being significantly slower than his own reported difference, indicating other factors come into play.

     

    Brendan Smith’s method ran in 10 seconds, and although I accept this is faster, it does not do the computes for the additional rows in my output set, so is not a reliable comparison, but looks promising.

     

    Mike Gress’s example did not complete, I bombed it off at over 14 minutes with no results (sorry Mike).

     

    I didn’t try Hanslindgren’s method as it is not my database, so I was unable to add indexes to it. 

     

    To Mike Arney, my high school maths was never that good, but I am sure there will be some that can work this one out from your example.

     

    Joe Celko’s second method also did not complete, and I bombed this off at over 14 minutes with no results, and Joe’s first method took less than a second and produced no results (sorry Joe).

     

    Looking deeper at my source dataset, the timing differences I get to your own results may be due to the extreme number of gaps.  My data set produced 1.1 million rows (of identified gaps).  Also within my data set are sequence number duplications, which may have contributed to the failure of some of the scripts.

     

    Thanks you all for your feedback and suggestions.

     

    Stephen

  • Hi Stephen!

    Thank you, it is not always you get such an exhausting feedback of how things turn out when you post replies and hope you can help.

    Hanslindgren

  • Hi,

    I'm a latecomer to the party and didn't realize there had already been healthy debate, but here's what I came up with.

    Looking through the previous submissions above, it's the same concept as Brendan's, but extended to support the full functionality of the original post.

    For the original example (working on an unoptimized temp table) the time taken seems to be approx 0.25 secs/1000 records, so with 10000 records I'm getting something like 3 seconds run time. With the code below there seems to be no increase in time taken at all (around 0.25 secs for anything from 0 to 10000 records), but this is probably because for small datasets the main delay below is not due to the work itself, but rather the creation of the temp table.

    I'd be interested to see if anyone can make something faster! (ignoring the fact that I could probably have made it a table variable...)

    Thanks,

    Tao

     

    CREATE TABLE #StartsAndEnds (EventID Int, EventType NVarChar(20), IdentityTracker INT IDENTITY (1,1))

    INSERT INTO #StartsAndEnds (EventID, EventType)

    SELECT SeqNumber, Type

    FROM (

     SELECT Seq1.SeqNumber, 'Start' AS Type

     FROM #SequenceTable Seq1

     LEFT JOIN #SequenceTable Seq2 ON Seq2.SeqNumber = Seq1.SeqNumber + 1

     WHERE Seq2.SeqNumber Is Null

     

     UNION ALL

     

     SELECT Seq1.SeqNumber, 'End' AS Type

     FROM #SequenceTable Seq1

     LEFT JOIN #SequenceTable Seq2 ON Seq2.SeqNumber = Seq1.SeqNumber - 1

     WHERE Seq2.SeqNumber Is Null

    ) AS PingPong

    ORDER BY SeqNumber

    SELECT IsNull(FL1.EventID, 0) AS StartGap,

     FL2.EventID As EndGap,

     IsNull(FL1.EventID, 0) + 1 AS FirstAvailable,

     FL2.EventID - 1 AS LastAvailable,

     FL2.EventID - IsNull(FL1.EventID, 0) - 1 As AvailableCount

    FROM #StartsAndEnds FL2

    LEFT JOIN #StartsAndEnds FL1 ON FL2.IdentityTracker = FL1.IdentityTracker + 1

    WHERE FL2.IdentityTracker % 2 = 1

    DROP TABLE #StartsAndEnds

    http://poorsql.com for T-SQL formatting: free as in speech, free as in beer, free to run in SSMS or on your version control server - free however you want it.

  • Hi Tao, did you post the correct version of your code?

    I copied your code above and did a find/replace on #SequenceTable and SeqNumber swapping these for my table name and sequence field respectively, then ran it.  It was certainly quick, very impressive, 16 seconds to produce 142874 identified breaks in the sequence from a table containing 3,827,625 rows. 

    First few rows of output are as below

    StartGapEndGapFirstAvailableLastAvailableAvailableCount
    0NULL1NULLNULL
    0NULL1NULLNULL
    0NULL1NULLNULL
    0NULL1NULLNULL
    0NULL1NULLNULL
    0NULL1NULLNULL
    582246210758225621063882
    780258008678026800852060

    My original query over the same table takes 44 seconds and produces the following first few rows.

    StartGapEndGapFirstAvailableLastAvailableAvailableCount
    02111
    206242062620625206251
    520915209352092520921
    555185552055519555191
    582225822458223582231
    6210778025621087802415917
    800868008880087800871
    801608016280161801611

    I've checked the results and my query is showing the correct gaps.  Your result line showing 58224 to 62107 represents my end gap result line 5 above to my start gap result line 6 above.

    Love to see your revision as certainly looks like it will be very fast.

    Stephen

  • Thanks for your all excellent solutions. Much appreciated!

  • Mike Arney (4/3/2006)


    If you just need to identify IF a list of numbers is in sequence, but not where the gaps are, I believe you can do it in a single pass of the table. (Warning -- to understand this algorithm, some high school math is required!)<FONT color=#008000 size=2>

    Or some grade school one:

    SELECT CASE WHEN COUNT(Id) = MAX(Id) - MIN(Id) + 1 THEN 'In sequence' ELSE 'Not sequence' END FROM yourtable

    😀

  • Here is another thing i found and modified a little.

    Takes under a second for 1 millon rows and returns gaps size sorted 😉

    Change tables and fields surrounded by []

    ----------------------------------------

    SELECT

    LastSysId + 1 AS GapFrom,

    NextSysId - 1 AS GapTo,

    NextSysId - (LastSysId + 1) AS GapSize

    FROM

    (

    SELECT

    (

    SELECT TOP 1

    [SysId]

    FROM

    [yourtable]

    WHERE

    [SysId] < a.[SysId]

    ORDER BY

    [SysId] DESC

    ) AS LastSysId,

    a.[SysId] AS NextSysId

    FROM

    [yourtable] AS a

    LEFT JOIN

    [yourtable] AS b ON a.[SysId] = b.[SysId] + 1

    WHERE

    b.[SysId] IS NULL

    ) AS a

    WHERE

    LastSysId IS NOT NULL

    ORDER BY

    3 DESC

  • Martin Grape (7/2/2011)


    Here is another thing i found and modified a little.

    Takes under a second for 1 millon rows and returns gaps size sorted 😉

    Change tables and fields surrounded by []

    ----------------------------------------

    SELECT

    LastSysId + 1 AS GapFrom,

    NextSysId - 1 AS GapTo,

    NextSysId - (LastSysId + 1) AS GapSize

    FROM

    (

    SELECT

    (

    SELECT TOP 1

    [SysId]

    FROM

    [yourtable]

    WHERE

    [SysId] < a.[SysId]

    ORDER BY

    [SysId] DESC

    ) AS LastSysId,

    a.[SysId] AS NextSysId

    FROM

    [yourtable] AS a

    LEFT JOIN

    [kicks_medlem_20110630] AS b ON a.[SysId] = b.[SysId] + 1

    WHERE

    b.[SysId] IS NULL

    ) AS a

    WHERE

    LastSysId IS NOT NULL

    ORDER BY

    3 DESC

    Nicely done and I can verify the speed. The only problem is that the code doesn't recognize missing "SysID" of 1. In fact, it doesn't recognize any gap from 1 to x-1 if all the rows between 1 to x-1 are missing. That could be OK for some folks. For me, it's usually not. It's neither wrong nor right. "It Depends" on what it's being used for.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • @jeff

    Here is a little stored procedure i made out of this that will return a list of gaps including missing 1-x (or 0-x or whatever-x) or

    return the first available number i a sequence. Njoy :hehe:

    -- Drop procedure if it exists

    IF EXISTS (SELECT * FROM [sys].[objects] WHERE OBJECT_ID = OBJECT_ID(N'[dbo].[FindGap]') AND [type] IN (N'P', N'PC')) DROP PROCEDURE [dbo].[FindGap]

    GO

    -- Create procedure

    CREATE PROCEDURE [dbo].[FindGap]

    @strFindGapTable VARCHAR(254),

    @strFindGapField VARCHAR(254),

    @binFindGapLower BIGINT = 1,

    @intFindGapMode INT = 0

    AS

    -- +---------------------------------------------------------------------------+

    -- | |

    -- | Type : SQL Procedure |

    -- | Name : FindGap |

    -- | Version : 1.1 |

    -- | Creator : Martin Grape, ActionBase, +46 73 391 88 89 |

    -- | Info : Returns first available numer in a field of sequence or |

    -- | : displays a list of available gaps and number of gaps. |

    -- | Input : @strFindGapTable -> Name of table to run on |

    -- | : @strFindGapField -> Name of field |

    -- | : @strFindGapLower -> Lower value to find gap from |

    -- | : @strFindGapMode -> 0=All gaps, 1=Next available number |

    -- | Example : EXEC FindGap 'yourtable', 'sysid', 1, 1 |

    -- | History : 2011-07-03 Created v1.0 |

    -- | : 2011-07-03 Added check for all numbers in sequence v1.1 |

    -- | : |

    -- | |

    -- +---------------------------------------------------------------------------+

    -- Declare variables

    DECLARE @strFindGapSQL AS NVARCHAR(2000);

    DECLARE @strFindGapPar AS NVARCHAR(2000);

    DECLARE @binFindGapMin AS BIGINT;

    DECLARE @binFindGapMax AS BIGINT;

    DECLARE @binFindGapCount AS BIGINT;

    DECLARE @binFindGapFirst AS BIGINT;

    -- Make SQL to find min and max values

    SET @strFindGapSQL = N'SELECT @binMinVal = MIN(' + @strFindGapField + '), @binMaxVal = MAX(' + @strFindGapField + '), @binCount = COUNT(*) FROM ' + @strFindGapTable;

    SET @strFindGapPar = N'@binMinVal BIGINT OUTPUT, @binMaxVal BIGINT OUTPUT, @binCount BIGINT OUTPUT';

    -- Get values into variables

    EXEC SP_EXECUTESQL @strFindGapSQL, @strFindGapPar, @binFindGapMin OUTPUT, @binFindGapMax OUTPUT, @binFindGapCount OUTPUT;

    -- Check mode

    IF @intFindGapMode = 1

    -- Next available sequence number

    BEGIN

    -- Check if min is bigger than supplied lower, then take that value

    IF @binFindGapMin > @binFindGapLower

    BEGIN

    RETURN @binFindGapLower;

    END

    -- Check if max is smaller than supplied lower, then take that value

    ELSE IF @binFindGapMax < @binFindGapLower

    BEGIN

    RETURN @binFindGapLower;

    END

    -- Check if max is the same as supplied lower, then return + 1

    ELSE IF @binFindGapMax = @binFindGapLower

    BEGIN

    RETURN @binFindGapLower + 1;

    END

    -- Check if no existing gaps, return higest + 1

    ELSE IF @binFindGapCount = @binFindGapMax - @binFindGapMin + 1

    BEGIN

    RETURN @binFindGapMax + 1

    END

    -- Get first available number

    ELSE

    BEGIN

    -- Create SQL to get next value

    SET @strFindGapSQL = N'SELECT TOP 1 @binMinVal = LastSysId + 1 FROM (SELECT (SELECT TOP 1 ' + @strFindGapField + ' FROM ' + @strFindGapTable + ' WHERE ' + @strFindGapField + ' < a.' + @strFindGapField + ' ORDER BY ' + @strFindGapField + ' DESC) AS LastSysId, a.' + @strFindGapField + ' AS NextSysId FROM ' + @strFindGapTable + ' AS a LEFT JOIN ' + @strFindGapTable + ' AS b ON a.' + @strFindGapField + ' = b.' + @strFindGapField + ' + 1 WHERE b.' + @strFindGapField + ' IS NULL) AS a WHERE LastSysId IS NOT NULL ORDER BY 1';

    SET @strFindGapPar = N'@binMinVal BIGINT OUTPUT';

    -- Get value into variable

    EXEC SP_EXECUTESQL @strFindGapSQL, @strFindGapPar, @binFindGapFirst OUTPUT;

    -- Return value

    RETURN @binFindGapFirst;

    END

    END

    ELSE

    -- All available sequence numbers as a list

    BEGIN

    -- Check if first value is bigger than supplied lower, then add a row

    IF @binFindGapLower < @binFindGapMin

    BEGIN

    -- Add a union from supplied lower to the first actual value

    SET @strFindGapSQL = N'SELECT ' + CAST(@binFindGapLower AS NVARCHAR) + ' AS GapFrom, ' + CAST(@binFindGapMin - 1 AS NVARCHAR) + ' AS GapTo, ' + CAST(@binFindGapMin - @binFindGapLower AS NVARCHAR) + ' AS GapSize UNION ';

    END

    ELSE

    BEGIN

    -- Reset variable

    SET @strFindGapSQL = '';

    END

    -- Check if to add

    SET @strFindGapSQL = @strFindGapSQL + 'SELECT LastSysId + 1 AS GapFrom, NextSysId - 1 AS GapTo, NextSysId - (LastSysId + 1) AS GapSize FROM (SELECT (SELECT TOP 1 ' + @strFindGapField + ' FROM ' + @strFindGapTable + ' WHERE ' + @strFindGapField + ' < a.' + @strFindGapField + ' ORDER BY ' + @strFindGapField + ' DESC) AS LastSysId, a.' + @strFindGapField + ' AS NextSysId FROM ' + @strFindGapTable + ' AS a LEFT JOIN ' + @strFindGapTable + ' AS b ON a.' + @strFindGapField + ' = b.SysId + 1 WHERE b.' + @strFindGapField + ' IS NULL) AS a WHERE LastSysId IS NOT NULL ORDER BY 1';

    -- Run the select

    EXEC (@strFindGapSQL);

    -- Return also the rowcount

    RETURN @@ROWCOUNT;

    END

    -- +------------------+

    -- | END OF PROCEDURE |

    -- +------------------+

    GO

    -- +---------+

    -- | Testing |

    -- +---------+

    -- Drop test table if it existst

    IF EXISTS (SELECT * FROM [sys].[objects] WHERE OBJECT_ID = OBJECT_ID(N'[dbo].[yourtable]') AND [type] IN (N'U')) DROP TABLE [dbo].[yourtable]

    GO

    -- Add dummy data

    SELECT 2 AS SysId INTO yourtable UNION

    SELECT 6 AS SysId UNION

    SELECT 8 AS SysId UNION

    SELECT 9 AS SysId UNION

    SELECT 11 AS SysId;

    -- Example 1: Adds a row with next available number starting on 0 , displays the result table and selects next available number

    DECLARE @binFindGapReturn BIGINT;

    EXEC @binFindGapReturn = FindGap 'yourtable', 'sysid', 0, 1;

    INSERT INTO yourtable (SysId) SELECT @binFindGapReturn

    SELECT * FROM yourtable ORDER BY 1

    EXEC @binFindGapReturn = FindGap 'yourtable', 'sysid', 0, 1;

    SELECT @binFindGapReturn AS NextValue;

    GO

    -- Example 2: Returns all the gaps starting on 1 as a resultset and returns number of gaps

    DECLARE @binFindGapReturn BIGINT;

    EXEC @binFindGapReturn = FindGap 'yourtable', 'sysid', 1, 2;

    SELECT @binFindGapReturn AS NoOfGaps;

    GO

    -- Example 3: Fill missing sequences to get max + 1

    INSERT INTO yourtable SELECT 1

    INSERT INTO yourtable SELECT 3

    INSERT INTO yourtable SELECT 4

    INSERT INTO yourtable SELECT 5

    INSERT INTO yourtable SELECT 7

    INSERT INTO yourtable SELECT 10

    DECLARE @binFindGapReturn BIGINT;

    EXEC @binFindGapReturn = FindGap 'yourtable', 'sysid', 0, 1;

    SELECT @binFindGapReturn AS NextValue;

    GO

    -- Drop test table if it existst

    IF EXISTS (SELECT * FROM [sys].[objects] WHERE OBJECT_ID = OBJECT_ID(N'[dbo].[yourtable]') AND [type] IN (N'U')) DROP TABLE [dbo].[yourtable]

    GO

Viewing 15 posts - 1 through 15 (of 17 total)

You must be logged in to reply to this topic. Login to reply