Listing gaps in numeric sequences

  • I have a database with products and their (unique) serial numbers. I would like a list of the serial numbers not present, or the serial numbers that border the gaps.

    For instance, consider a list of:

    1, 2, 3, 5, 6, 7, 8, 9

    What ways could I discover "4" or "3,5"?

    I'm thinking of creating a table with the complete numerical sequence and performing an OUTER JOIN (which has it's own side benefits), but there has to be a more elegant solution.

    Ideas?

  • There is... assuming that your serial numbers are actually integers...

    SELECT GapStart = (SELECT ISNULL(MAX(b.SerialNumber),0)+1

    FROM #yourtable b

    WHERE b.SerialNumber< a.SerialNumber),

    GapEnd = SerialNumber- 1

    FROM #yourtable a

    WHERE a.SerialNumber- 1 NOT IN (SELECT SerialNumberFROM #yourtable)

    AND a.SerialNumber- 1 > 0

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Here's another way

    WITH CTE AS (

    SELECT SerialNumber,

    ROW_NUMBER() OVER(ORDER BY SerialNumber) AS rn

    FROM #yourtable)

    SELECT s.SerialNumber+1 AS GapStart,

    e.SerialNumber-1 AS GapEnd

    FROM CTE s

    INNER JOIN CTE e ON s.rn+1=e.rn

    AND s.SerialNumber+1<e.SerialNumber

    ____________________________________________________

    Deja View - The strange feeling that somewhere, sometime you've optimised this query before

    How to get the best help on a forum

    http://www.sqlservercentral.com/articles/Best+Practices/61537
  • Nice.... just be a bit careful... haven't tested it but it looks like if the serial numbers are large, that'll take a while to resolve.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Why go for the hard way, and what's so non-elegant about an outer join..?

    To find the actual gaps - ie the numbers that's really missing, an outer join and a numbers table will do very nicely...

    selectn.number as 'missingProdNumbers'

    frommyNumbersTable n

    left join myProductsTable p

    onn.number = p.productNumber

    wherep.productNumber is null

    Very straightforward, and very simple. 🙂

    /Kenneth

  • Absolutely agree... if the numbers table is large enough... if it's not, then you can do something like the following...

    ... Also, in 2k5, you really don't need a recursive CTE to create a large list of numbers... the following will create a list of numbers with a million rows in about 844 milliseconds... and, it allows you some range control over what the numbers will be...

    SET Statistics TIME ON

    DECLARE @Bitbucket INT --Just for testing Tally creation speed... remove this for prod

    --===== Declare some local variables that could be parameters in a proc

    DECLARE @StartNumber INT

    DECLARE @EndNumber INT

    --===== Set those "parameters" for demonstration purposes

    SET @StartNumber = 1000000 --Inclusive

    SET @EndNumber = 2000000 --Inclusive

    ; WITH cTally AS

    (-----------------------------------------------------------------------------

    --==== High performance CTE equivalent of a Tally or Numbers table

    SELECT TOP (@EndNumber-@StartNumber+1)

    ROW_NUMBER() OVER (ORDER BY t1.Object_ID) +@StartNumber -1 AS N

    FROM Master.sys.ALL_Columns t1

    CROSS JOIN Master.sys.ALL_Columns t2

    )-----------------------------------------------------------------------------

    SELECT @Bitbucket = N FROM cTally --Do your outer join with table being checked here

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • And, I haven't done the comparison lately, but I believe the first method I showed will be the pants off most outer join methods...

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • ahem...

    With a large enough set of numbers - wouldn't you want to be able to leverage an index on your tally table? Although I conceptually like the CTE - it essentially returns an unindexed heap of a temp table. Again - great for smallish stuff, but surely it doesn't match performance of a "real" tally table?

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • Sounds like we might need a test to find out....

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • I was thinking the same thing... 😉

    Also, Matt... your the Ninja on Regex... is there anyway to use Regex for such a thing as this import?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • I don't think it would be efficient. It would be something like a recursive back-reference, and then you'd be doing math on string values, etc... A tally table is both much simpler and (gut feeling) will steamroll right over it.

    I will put some thought into it though.

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • create table dbo.SerialNumbers (ID int primary key)

    go

    insert into dbo.serialnumbers

    select number

    from common.dbo.numbers

    where number between 1 and 100

    go

    delete from dbo.serialnumbers

    where id in (55, 56, 71, 80, 99)

    Opened a separate connection.

    set statistics time on

    set statistics io on

    SELECT GapStart =

    (SELECT ISNULL(MAX(b.ID),0)+1

    FROM dbo.serialnumbers b

    WHERE b.ID< a.ID),

    GapEnd = ID- 1

    FROM dbo.serialnumbers a

    WHERE a.ID - 1 NOT IN

    (SELECT ID

    FROM dbo.serialnumbers)

    AND a.ID - 1 > 0

    Estimated Cost of the final query: 0.0163148

    SQL Server Execution Times:

    CPU time = 0 ms, elapsed time = 1 ms.

    (4 row(s) affected)

    Table 'SerialNumbers'. Scan count 6, logical reads 12, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:

    CPU time = 0 ms, elapsed time = 1 ms.

    Next query (separate connection):

    select number

    from common.dbo.numbers

    left outer join dbo.serialnumbers

    on number = id

    where number between 1 and 100

    and id is null

    Estimated Cost: 0.0128453

    SQL Server Execution Times:

    CPU time = 0 ms, elapsed time = 1 ms.

    (5 row(s) affected)

    Table 'SerialNumbers'. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Table 'Numbers'. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:

    CPU time = 0 ms, elapsed time = 1 ms.

    Conclusion:

    The join to a Numbers table performs better, because of less reads. But the margin is tiny on such a small number set.

    Increased the table to 10,000 rows, deleted another five semi-random records.

    Cost on first query rises to: 0.108881; run time increases to 11 ms, Scan count 10, logical reads 54

    Cost on numbers table query rises to: 0.012861; run time still 1 ms, Scan count 1, logical reads 2

    On more complex tables, with more rows, the costs would be affected, but the numbers table join would still be less expensive.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • Decided to also try:

    select number

    from common.dbo.numbers

    left outer join dbo.serialnumbers

    on number = id

    where number >=

    (select min(id)

    from dbo.serialnumbers)

    and number <=

    (select max(id)

    from dbo.serialnumbers)

    and id is null

    Because that way I'm not assuming I already know the range of numbers to test.

    Still using 10,000 as top ID in dbo.SerialNumbers, still missing 10 semi-random rows.

    Cost increases to: 0.097948

    SQL Server parse and compile time:

    CPU time = 11 ms, elapsed time = 11 ms.

    SQL Server Execution Times:

    CPU time = 0 ms, elapsed time = 1 ms.

    SQL Server Execution Times:

    CPU time = 0 ms, elapsed time = 1 ms.

    (10 row(s) affected)

    Table 'SerialNumbers'. Scan count 3, logical reads 23, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Table 'Numbers'. Scan count 1, logical reads 19, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:

    CPU time = 16 ms, elapsed time = 9 ms.

    Still lower cost than the more complex query, but significantly higher.

    select number

    from common.dbo.numbers

    left outer join dbo.serialnumbers

    on number = id

    where number >= 1

    and number <=

    (select max(id)

    from dbo.serialnumbers)

    and id is null

    Cost: 0.0868469 (lower)

    SQL Server parse and compile time:

    CPU time = 1 ms, elapsed time = 1 ms.

    (10 row(s) affected)

    Table 'SerialNumbers'. Scan count 2, logical reads 21, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Table 'Numbers'. Scan count 1, logical reads 19, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:

    CPU time = 0 ms, elapsed time = 8 ms.

    Since the first query assumed that the lowest number was 1, this version does too (more fair). Cost goes down, as well as reads and execution time. Still lower cost than the more complex query.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • The numbers in the cost isn't the most significant thing to compare with in this case.

    What you want to look at is the scan count, where scan count = 1 is the most optimal you can get.

    /Kenneth

  • I gotta agree with Kenneth... and, I'll throw in that CPU time matters as well. I thought the example code I posted that didn't use a Tally table would really do the trick... it does if you wanna drive the disk nuts 😛

    Instead of messing around with a 100 or even 10,000 rows, I thought I'd put the two methods to a real life test with 6 Million rows. Takes a couple of minutes to setup... here's the setup code for both the Monster Tally table and the Monster Serial Number table...


    [font="Courier New"]--drop&nbsptable&nbsptally

    --=====&nbspSetup&nbspfor&nbspspeed&nbspand&nbspto&nbspprevent&nbspblocking

    &nbsp&nbsp&nbsp&nbspSET&nbspNOCOUNT&nbspON

    &nbsp&nbsp&nbsp&nbspSET&nbspTRANSACTION&nbspISOLATION&nbspLEVEL&nbspREAD&nbspUNCOMMITTED

    --=====&nbspCreate&nbspa&nbsp6&nbspmillion&nbsprow&nbspTally&nbsptable...

    --=====&nbspCreate&nbspand&nbsppopulate&nbspthe&nbspTally&nbsptable&nbspon&nbspthe&nbspfly

    &nbspSELECT&nbspTOP&nbsp6000000

    &nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbspIDENTITY(INT,1,1)&nbspAS&nbspN

    &nbsp&nbsp&nbspINTO&nbspdbo.Tally

    &nbsp&nbsp&nbspFROM&nbspMaster.dbo.SysColumns&nbspsc1,

    &nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbspMaster.dbo.SysColumns&nbspsc2

    --=====&nbspAdd&nbspa&nbspPrimary&nbspKey&nbspto&nbspmaximize&nbspperformance

    &nbsp&nbspALTER&nbspTABLE&nbspdbo.Tally

    &nbsp&nbsp&nbsp&nbspADD&nbspCONSTRAINT&nbspPK_Tally_N&nbsp

    &nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbspPRIMARY&nbspKEY&nbspCLUSTERED&nbsp(N)&nbspWITH&nbspFILLFACTOR&nbsp=&nbsp100

    --=============================================================================

    --&nbsp&nbsp&nbsp&nbsp&nbsp&nbspCreate&nbspan&nbspexperimental&nbsptable&nbspto&nbspsimulate&nbspthe&nbsptable&nbspbeing&nbspexamined

    --&nbsp&nbsp&nbsp&nbsp&nbsp&nbspAgain...&nbsp6&nbspmillion&nbsprows...

    --=============================================================================

    --=====&nbspCreate&nbspthe&nbspexperimental&nbsptemp&nbsptable&nbspand&nbsppopulate&nbspwith&nbspSerial&nbsp#'s&nbspon&nbspthe&nbspfly

    &nbsp&nbsp&nbsp&nbsp&nbsp--&nbspThis&nbspworks&nbspbecause&nbspSysColumns&nbspalways&nbsphas&nbspat&nbspleast&nbsp4000&nbspentries

    &nbsp&nbsp&nbsp&nbsp&nbsp--&nbspeven&nbspin&nbspa&nbspnew&nbspdatabase&nbspand&nbsp4000*4000&nbsp=&nbsp16,000,000

    &nbspSELECT&nbspTOP&nbsp6000000&nbspSerialNumber&nbsp=&nbspIDENTITY(INT,&nbsp1,&nbsp1)

    &nbsp&nbsp&nbspINTO&nbspdbo.yourtable

    &nbsp&nbsp&nbspFROM&nbspMaster.dbo.SysColumns&nbspsc1,

    &nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbspMaster.dbo.SysColumns&nbspsc2

    --&nbsp--=====&nbspLike&nbspany&nbspgood&nbsptable,&nbspour&nbspexperimental&nbsptable&nbspneeds&nbspa&nbspPrimary&nbspKey

    &nbsp&nbspALTER&nbspTABLE&nbspdbo.yourtable

    &nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbspADD&nbspPRIMARY&nbspKEY&nbspCLUSTERED&nbsp(SerialNumber)

    &nbsp&nbsp&nbsp&nbsp&nbsp--&nbspThis&nbspdeletes&nbspa&nbsp"monster"&nbsprange&nbspjust&nbspto&nbspsee&nbsphow&nbspit's&nbsphandled.

    &nbspDELETE&nbspdbo.yourtable

    &nbsp&nbspWHERE&nbspSerialNumber&nbspBETWEEN&nbsp5000000&nbspAND&nbsp5500000

    &nbsp&nbsp&nbsp&nbsp&nbsp--&nbspThis&nbspdeletes&nbspevery&nbspthird&nbsprow

    &nbspDELETE&nbspdbo.yourtable

    &nbsp&nbspWHERE&nbspSerialNumber&nbsp%3&nbsp=&nbsp0

    [/font]


    ... here's the two methods running one right after the other with IO and CPU times turned on...


    [font="Courier New"]--=============================================================================

    --&nbsp&nbsp&nbsp&nbsp&nbsp&nbspTest&nbspthe&nbsptwo&nbspmethods&nbspwith&nbspIO&nbspand&nbspCPU&nbspmeasurements&nbspturned&nbspon

    --=============================================================================

    SET&nbspSTATISTICS&nbspIO&nbspON

    SET&nbspSTATISTICS&nbspTIME&nbspON

    --=====&nbspCalculated&nbspGaps

    &nbspSELECT&nbspGapStart&nbsp=&nbsp(SELECT&nbspISNULL(MAX(b.SerialNumber),0)+1

    &nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbspFROM&nbspdbo.yourtable&nbspb&nbsp

    &nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbspWHERE&nbspb.SerialNumber&nbsp<&nbspa.SerialNumber),

    &nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbspGapEnd&nbsp=&nbspSerialNumber&nbsp-&nbsp1&nbsp&nbsp

    &nbsp&nbsp&nbspFROM&nbspdbo.yourtable&nbspa

    &nbsp&nbspWHERE&nbspa.SerialNumber&nbsp-&nbsp1&nbspNOT&nbspIN&nbsp(SELECT&nbspSerialNumber&nbspFROM&nbspdbo.yourtable)

    &nbsp&nbsp&nbsp&nbspAND&nbspa.SerialNumber&nbsp-&nbsp1&nbsp>&nbsp0

    PRINT&nbspREPLICATE('=',100)

    --=====&nbspTally&nbsptable&nbspgaps

    &nbspSELECT&nbspt.N

    &nbsp&nbsp&nbspFROM&nbspdbo.Tally&nbspt&nbspWITH&nbsp(NOLOCK)

    &nbsp&nbsp&nbspLEFT&nbspOUTER&nbspJOIN&nbspdbo.yourtable&nbspy

    &nbsp&nbsp&nbsp&nbsp&nbspON&nbspt.N&nbsp=&nbspy.SerialNumber

    &nbsp&nbspWHERE&nbspy.SerialNumber&nbspIS&nbspNULL

    SET&nbspSTATISTICS&nbspIO&nbspOFF

    SET&nbspSTATISTICS&nbspTIME&nbspOFF

    [/font]


    ... and here's the results... Tally table wins by a mile!

    ====================================================================================================

    SQL Server Execution Times:

    CPU time = 0 ms, elapsed time = 0 ms.

    Table 'yourtable'. Scan count 1833335, logical reads 5520639, physical reads 0, read-ahead reads 0.

    SQL Server Execution Times:

    CPU time = 28297 ms, elapsed time = 44224 ms.

    ====================================================================================================

    SQL Server Execution Times:

    CPU time = 0 ms, elapsed time = 0 ms.

    Table 'Tally'. Scan count 1, logical reads 9649, physical reads 1, read-ahead reads 8293.

    Table 'yourtable'. Scan count 1, logical reads 8846, physical reads 0, read-ahead reads 0.

    SQL Server Execution Times:

    CPU time = 5578 ms, elapsed time = 18606 ms.

    SQL Server Execution Times:

    CPU time = 0 ms, elapsed time = 0 ms.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 15 posts - 1 through 15 (of 17 total)

You must be logged in to reply to this topic. Login to reply