Finding Sequences (Need help ASAP)

  • Please skip to http://www.sqlservercentral.com/Forums/FindPost1000643.aspx for updated requirements.

    I have a request from management and am at a complete roadblock. I am a DBA and this is much more developer/programmer logic than I am used to.

    Scenario:

    Using the below numbers I need to find each unique sequence sets. The data is a unique column in my actual db stored as a varchar (255) field. (Can't change that)

    Below is some sample data.

    create table gaps

    (id varchar(50))

    insert into gaps

    SELECT 'PPCI000000001' UNION ALL

    SELECT 'PPCI000000002' UNION ALL

    SELECT 'PPCI000000003' UNION ALL

    SELECT 'PPCI000000005' UNION ALL

    SELECT 'PPCI000000010' UNION ALL

    SELECT 'PPCI000000011' UNION ALL

    SELECT 'PPCI000000012' UNION ALL

    SELECT 'PPCI000000013' UNION ALL

    SELECT 'PPCI000000014' UNION ALL

    SELECT 'PPCI000000015' UNION ALL

    SELECT 'PPCI000000016' UNION ALL

    SELECT 'PPCI000000017' UNION ALL

    SELECT 'PPCI000000100' UNION ALL

    SELECT 'PPCI000000101' UNION ALL

    SELECT 'PPCI000000102' UNION ALL

    SELECT 'PPCI000000103' UNION ALL

    SELECT 'PPCI000000104' UNION ALL

    SELECT 'PPCI000000105' UNION ALL

    SELECT 'PPCI000000106' UNION ALL

    SELECT 'PPCI000000167' UNION ALL

    SELECT 'PPCI000000197'

    What I need to do is identify the sequences. For example an ideal output using the sample data would be

    PPCI000000001 - PPCI000000003

    PPCI000000005

    PPCI000000010 - PPCI000000017

    PPCI000000100 - PPCI000000106

    PPCI000000167

    PPCI000000197

    *It would also be nice to know the 'missing' numbers in the sequence. for example an ideal output would be

    PPCI000000004

    PPCI000000006 - PPCI000000009

    PPCI000000018 - PPCI000000099

    PPCI000000107 - PPCI0000000166

    PPCI000000168 - PPCI0000000196

    Ideally I would be able to do this with a single query.

    ***In this example I only included text with a fixed length character prefix (PPCI) our actual data may have a prefix of 1,2,3,4,5, or 6 characters. The column is a unique field so there will not be any duplicates.

    Any help is much appreciated, I dont even know where to start as this is a bit more programming logic than I am used too.

  • So far this works using the substring statement. But has alot of hard coding. Although this solves my current problem, it doesn't help with other future needs.

    select 'PPCI' + substring(l.id,5,9) as start

    ,'PPCI' +

    (

    select substring(min(a.id),5,9) as id

    from gaps as a

    left outer join gaps as b

    on substring(a.id,5,9) = substring(b.id,5,9) - 1

    where b.id is null

    and substring(a.id,5,9) >= substring(l.id,5,9)

    ) as [end]

    from gaps as l

    left outer join gaps as r

    on substring(r.id,5,9) = substring(l.id,5,9) - 1

    where r.id is null;

    But it has 1 big problem.

    1. The text can be a variable amount from 1-6+ characters. I have it hardcoded for this example, but it needs to be dynamic.

    Any way to make those sections dynamic?

  • Craig,

    Is it safe to assume that this field will alway end in numerics? If so, I may have a simple solution for you regarding the variable front of the field.


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions[/url] | Forum Netiquette
    For index/tuning help, follow these directions.[/url] |Tally Tables[/url]

    Twitter: @AnyWayDBA

  • There are multiple articles at SSC that may be of your interest:

    - Generating Missing Dates and Numbers By Jacob Sebastian http://www.sqlservercentral.com/articles/Datetime+Manipulation/61822/

    - The "Numbers" or "Tally" Table: What it is and how it replaces a loop. By Jeff Moden http://www.sqlservercentral.com/articles/T-SQL/62867/

    Maybe a computed column may help out , if this action is more that a single shot solution.

    Johan

    Learn to play, play to learn !

    Dont drive faster than your guardian angel can fly ...
    but keeping both feet on the ground wont get you anywhere :w00t:

    - How to post Performance Problems
    - How to post data/code to get the best help[/url]

    - How to prevent a sore throat after hours of presenting ppt

    press F1 for solution, press shift+F1 for urgent solution 😀

    Need a bit of Powershell? How about this

    Who am I ? Sometimes this is me but most of the time this is me

  • Craig Farrell (10/6/2010)


    Craig,

    Is it safe to assume that this field will alway end in numerics? If so, I may have a simple solution for you regarding the variable front of the field.

    Yep. They always end in a numeric character. Looking forward to your response. My query is terribly slow on my 260,000 row table.

  • craig-404139 (10/6/2010)


    Craig Farrell (10/6/2010)


    Craig,

    Is it safe to assume that this field will alway end in numerics? If so, I may have a simple solution for you regarding the variable front of the field.

    Yep. They always end in a numeric character. Looking forward to your response. My query is terribly slow on my 260,000 row table.

    260k? Hrm, let me build a proper test harness before I point you up the wrong road. 🙂

    One last confirmation before I build said harness: We're looking at records like:

    ABC0001

    ABCDEF00002

    ABCD003

    ABCDEF000006

    AB7

    ...

    ETC

    Where there is absolutely no conformity? Are we dealing with a standard length with no conformity to the length of the front in regards to the #alpha chars?

    Sorry, just trying to nail down the exact data situation.


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions[/url] | Forum Netiquette
    For index/tuning help, follow these directions.[/url] |Tally Tables[/url]

    Twitter: @AnyWayDBA

  • Craig Farrell (10/6/2010)


    craig-404139 (10/6/2010)


    Craig Farrell (10/6/2010)


    Craig,

    Is it safe to assume that this field will alway end in numerics? If so, I may have a simple solution for you regarding the variable front of the field.

    Yep. They always end in a numeric character. Looking forward to your response. My query is terribly slow on my 260,000 row table.

    260k? Hrm, let me build a proper test harness before I point you up the wrong road. 🙂

    One last confirmation before I build said harness: We're looking at records like:

    ABC0001

    ABCDEF00002

    ABCD003

    ABCDEF000006

    AB7

    ...

    ETC

    Where there is absolutely no conformity? Are we dealing with a standard length with no conformity to the length of the front in regards to the #alpha chars?

    Sorry, just trying to nail down the exact data situation.

    In theory yes. These are received legal documents and are bates stamped (identifier we are referring too) using a different prefix depending on the producing party. So your above example is quit possible. The total length of the numbers can/will vary as well. In this particular case they are all 13 total characters (4 prefix text + 9 numerals) and all have a PPCI prefix. But I am trying to write this script so I can run it against all cases. Some of which have a uniform prefix and uniform character count throughout, others which have a variable text prefix and variable numerical count. One thing though, all identifiers with the same prefix will have the same total character count.

    Some tables may have 1 million+ records. I understand whatever solution we find will tax the system since it's checking each id, but it's a necessary process to identify ranges of produced documents for auditing purposes.

    I appreciate your time with this. I feel I made some good progress with my above script, but am having trouble moving away from so much hard coding. FYI it's taking almost 30 minutes to run on my system.

  • I've made some assumptions in my possible solution. If my assumptions are wrong, you may well be able to tweak the query to get it working for you.

    I have assumed:

    1) that the prefix never contains numeric characters [0-9].

    2) that the numeric part of the id column always contains exactly 9 numeric characters [0-9] and can always be converted to an int.

    3) that you want the results partitioned by the non-numeric prefix

    If there is a clustered index on the id field this query should be quite performant.

    ;WITH cteParse(id, prefix, num, rn) AS (

    SELECT

    id,

    SUBSTRING(id, 1, PATINDEX('%[0-9]%', id) - 1),

    CAST(SUBSTRING(id, PATINDEX('%[0-9]%', id), 9) AS int),

    ROW_NUMBER() OVER (ORDER BY id)

    FROM gaps

    )

    SELECT

    prefix + REPLACE(STR(MIN(num), 9), ' ', '0') AS [From],

    CASE WHEN (COUNT(*) > 1) THEN

    prefix + REPLACE(STR(MAX(num), 9), ' ', '0')

    ELSE '' END AS [To]

    FROM cteParse

    GROUP BY prefix, num - rn

    ORDER BY prefix, num - rn

  • Edit: Sorry for the multi-edit. Something didn't play nicely with the code= statements when I tried to first post this.

    The Test Harness:

    SELECT TOP 1000000

    IDENTITY(INT, 1,1) AS N

    INTOTally

    FROM

    Master.dbo.SysColumns sc1,

    Master.dbo.syscolumns sc2

    ALTER TABLE Tally

    ADD CONSTRAINT PK_Tally_N

    PRIMARY KEY CLUSTERED(N) with FILLFACTOR = 100

    GRANT SELECT, REFERENCES on TALLY TO PUBLIC

    select max(n) from Tally

    CREATE TABLE #TestHarness

    (TextIDVARCHAR(200) NOT NULL)

    --TRUNCATE TABLE #TestHarness

    INSERT INTO #TestHarness

    SELECT

    REPLICATE( 'A', FLOOR( (N/10)%5) + 1) -- VAriable length starter

    + REPLICATE( '0', N%7+1 ) -- Variable # of 0's

    + CAST( N AS VARCHAR(15))

    FROM

    Tally

    WHERE

    N%5 >1

    AND N < 500000

    The PatternFinder simplified: (I get 2 second results from this on a desktop server)

    select --top 200

    DISTINCT

    LEFT( TextID, MAX( N) ) AS Pattern

    FROM

    #TestHarness AS th,

    Tally AS T

    where

    N < LEN(TextID) AND

    isnumeric( SUBSTRING( TextID, N, 1)) = 0

    GROUP BY

    TextID

    The patternfinder used to strip to Patterns and Numerics (8 seconds with full row return):

    select

    TextID,

    LEFT( TextID, MAX( N) ) AS Pattern,

    CAST( RIGHT( TextID, LEN( TextID) - MAX( N)) AS BIGINT) AS NumPiece

    FROM

    #TestHarness AS th,

    Tally AS T

    where

    N < LEN(TextID) AND

    isnumeric( SUBSTRING( TextID, N, 1)) = 0

    GROUP BY

    TextID

    Now, back to finding the gapping, which I'm not sure of directly. Hopefully anyone else working on this will find this useful.

    Of note: The patternfinder backs up into the pattern, so it locates the last letter, and won't break on numerics in the pattern. You've also now got an integer to work from. This happens very quickly.


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions[/url] | Forum Netiquette
    For index/tuning help, follow these directions.[/url] |Tally Tables[/url]

    Twitter: @AnyWayDBA

  • ^^ Thanks. I'll start chewing through that and seeing if it fits. At first glance I have no idea what it's doing. Like I mentioned I've never had to deal with this detail of coding. Appreciate your time.

  • craig-404139 (10/6/2010)


    ^^ Thanks. I'll start chewing through that and seeing if it fits. At first glance I have no idea what it's doing. Like I mentioned I've never had to deal with this detail of coding. Appreciate your time.

    Heh, whoops, sorry, I'll explain a little more.

    The test harness does nothing more then build a million row tally table and then create a bunch of records with variable length varchar patterns at the beginning and then variable #'s of 0s before a number.

    The Patternfinder uses the same tally table to locate the largest (MAX(N)) non-numeric character position in the TextID, or your codes. This works because anything else after it is numeric.

    So, now that we know the final position of the pattern, or varchar prefix, we take that off (Pattern column), and we convert the rest to numeric (NumPiece).

    The thing that speeds this up is the tally table. It's an amazing Anti-RBAR component that sped up an old method I had. The Test Harness, btw, builds out a 300,000 row test case to work against.


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions[/url] | Forum Netiquette
    For index/tuning help, follow these directions.[/url] |Tally Tables[/url]

    Twitter: @AnyWayDBA

  • And for my last trick...

    This ran in about 3:15 on my system

    -- First, separate the numbers and their prefixes into usable data.

    select

    TextID,

    LEFT( TextID, MAX( N) ) AS Pattern,

    CAST( RIGHT( TextID, LEN( TextID) - MAX( N)) AS BIGINT) AS NumPiece

    INTO

    #FindGaps

    FROM

    #TestHarness AS th,

    Tally AS T

    where

    N < LEN(TextID) AND

    isnumeric( SUBSTRING( TextID, N, 1)) = 0

    GROUP BY

    TextID

    -- Give us a fighting chance for optimization...

    CREATE CLUSTERED INDEX idx_FindGap ON #FindGaps ( Pattern, NumPiece)

    -- Gaptable will store the list of Gaps and their start/end #s

    CREATE TABLE #GapTable

    (PatternVARCHAR(100),

    GapStartBIGINT,

    GapEndBIGINT)

    CREATE CLUSTERED INDEX idx_GapTable ON #GapTable ( Pattern, GapStart)

    -- This is an optimization table to take over work from a subquery

    CREATE TABLE #EndGap

    (Pattern VARCHAR(100),

    GapEnd BIGINT)

    CREATE CLUSTERED INDEX idx_EndGap ON #EndGap ( Pattern, GapEnd)

    -- Why the cursor? Because I couldn't figure out a way to go against each

    -- pattern without making the triangle join even worse...

    DECLARE @Pattern VARCHAR(100)

    DECLARE PLoop CURSOR

    FOR SELECT DISTINCT Pattern

    FROM #FindGaps

    OPEN PLoop

    FETCH NEXT FROM PLoop INTO @Pattern

    WHILE @@FETCH_STATUS = 0

    BEGIN

    -- Find the beginning of each gap. It needs an entry to exist before it

    -- considers a gap to have started, so if the first record is 5, it doesn't

    -- care about 1-4.

    INSERT INTO #GapTable

    (Pattern, GapStart)

    SELECT

    @Pattern,

    N AS GapStartNum

    FROM

    Tally AS t

    LEFT JOIN

    #FindGaps AS fg

    ONt.N = fg.NumPiece

    AND Pattern = @Pattern

    WHERE

    fg.NumPiece IS NULL

    AND EXISTS( SELECT 1 FROM #FindGaps WHERE Pattern = @Pattern and NumPiece = N-1)

    ORDER BY

    t.N

    -- This could be used as a subquery, but I got better performance as a separate table.

    -- It works like above, identifying the end of each gap.

    INSERT INTO #EndGap

    (Pattern, GapEnd)

    SELECT

    @Pattern AS Pattern,

    N AS GapEnd

    FROM

    Tally AS t

    LEFT JOIN

    #FindGaps AS fg

    ONt.N = fg.NumPiece

    AND Pattern = @Pattern

    WHERE

    fg.NumPiece IS NULL

    AND EXISTS( SELECT 1 FROM #FindGaps WHERE Pattern = @Pattern and NumPiece = N+1)

    FETCH NEXT FROM PLoop INTO @Pattern

    END

    -- Cleanup

    CLOSE PLoop

    DEALLOCATE PLoop

    -- Inbound triangle join, I know... I know... It's the time eater, too, just can't see a way out.

    -- The logic here will find the nearest ending gap to the start point, by pattern/prefix.

    UPDATE gt

    SETGapEnd = drvEnds2.GapEnd

    FROM

    #GapTable AS gt

    JOIN

    (SELECT

    gt.Pattern,

    gt.GapStart,

    MIN( eg.GapEnd) AS GapEnd

    FROM

    #EndGap AS eg

    JOIN

    #GapTable AS gt

    ON eg.Pattern = gt.Pattern

    WHERE

    gt.GapStart < eg.GapEnd

    GROUP BY

    gt.Pattern,

    gt.GapStart

    ) AS drvEnds2

    ONgt.Pattern = drvEnds2.Pattern

    AND gt.GapStart = drvEnds2.GapStart

    SELECT * FROM #GapTable

    DROP TABLE #FindGaps

    DROP TABLE #GapTable

    DROP TABLE #EndGap


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions[/url] | Forum Netiquette
    For index/tuning help, follow these directions.[/url] |Tally Tables[/url]

    Twitter: @AnyWayDBA

  • craig-404139 (10/6/2010)


    Craig Farrell (10/6/2010)


    craig-404139 (10/6/2010)


    Craig Farrell (10/6/2010)


    Craig,

    Is it safe to assume that this field will alway end in numerics? If so, I may have a simple solution for you regarding the variable front of the field.

    Yep. They always end in a numeric character. Looking forward to your response. My query is terribly slow on my 260,000 row table.

    260k? Hrm, let me build a proper test harness before I point you up the wrong road. 🙂

    One last confirmation before I build said harness: We're looking at records like:

    ABC0001

    ABCDEF00002

    ABCD003

    ABCDEF000006

    AB7

    ...

    ETC

    Where there is absolutely no conformity? Are we dealing with a standard length with no conformity to the length of the front in regards to the #alpha chars?

    Sorry, just trying to nail down the exact data situation.

    In theory yes. These are received legal documents and are bates stamped (identifier we are referring too) using a different prefix depending on the producing party. So your above example is quit possible. The total length of the numbers can/will vary as well. In this particular case they are all 13 total characters (4 prefix text + 9 numerals) and all have a PPCI prefix. But I am trying to write this script so I can run it against all cases. Some of which have a uniform prefix and uniform character count throughout, others which have a variable text prefix and variable numerical count. One thing though, all identifiers with the same prefix will have the same total character count.

    Some tables may have 1 million+ records. I understand whatever solution we find will tax the system since it's checking each id, but it's a necessary process to identify ranges of produced documents for auditing purposes.

    I appreciate your time with this. I feel I made some good progress with my above script, but am having trouble moving away from so much hard coding. FYI it's taking almost 30 minutes to run on my system.

    Question: when a prefix changes, do the numbers restart at 1? Or do they continue?

    (I've got some real nifty, high-speed code (thanks to Jeff!) to quickly identify gaps - but it's designed to work on numbers. Will take some mods to work with the prefix, but should be simple.)

    Wayne
    Microsoft Certified Master: SQL Server 2008
    Author - SQL Server T-SQL Recipes


    If you can't explain to another person how the code that you're copying from the internet works, then DON'T USE IT on a production system! After all, you will be the one supporting it!
    Links:
    For better assistance in answering your questions
    Performance Problems
    Common date/time routines
    Understanding and Using APPLY Part 1 & Part 2

  • WayneS (10/6/2010)


    craig-404139 (10/6/2010)


    Question: when a prefix changes, do the numbers restart at 1? Or do they continue?

    (I've got some real nifty, high-speed code (thanks to Jeff!) to quickly identify gaps - but it's designed to work on numbers. Will take some mods to work with the prefix, but should be simple.)

    Hey Wayne, if you can modify to match the Pattern/NumPiece code above I'd be curious to see that result. The Pattern splitter is pretty zippy and will return a group/numeric resultset.


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions[/url] | Forum Netiquette
    For index/tuning help, follow these directions.[/url] |Tally Tables[/url]

    Twitter: @AnyWayDBA

  • Craig Farrell (10/6/2010)


    Edit: Sorry for the multi-edit. Something didn't play nicely with the code= statements when I tried to first post this.

    The Test Harness:

    SELECT TOP 1000000

    IDENTITY(INT, 1,1) AS N

    INTOTally

    FROM

    Master.dbo.SysColumns sc1,

    Master.dbo.syscolumns sc2

    ALTER TABLE Tally

    ADD CONSTRAINT PK_Tally_N

    PRIMARY KEY CLUSTERED(N) with FILLFACTOR = 100

    GRANT SELECT, REFERENCES on TALLY TO PUBLIC

    select max(n) from Tally

    CREATE TABLE #TestHarness

    (TextIDVARCHAR(200) NOT NULL)

    --TRUNCATE TABLE #TestHarness

    INSERT INTO #TestHarness

    SELECT

    REPLICATE( 'A', FLOOR( (N/10)%5) + 1) -- VAriable length starter

    + REPLICATE( '0', N%7+1 ) -- Variable # of 0's

    + CAST( N AS VARCHAR(15))

    FROM

    Tally

    WHERE

    N%5 >1

    AND N < 500000

    Craig Farrell - can you modify your TestHarness to meet this stipulation:

    craig-404139 (10/6/2010)


    One thing though, all identifiers with the same prefix will have the same total character count.

    Craig Farrell (10/6/2010)


    The PatternFinder simplified: (I get 2 second results from this on a desktop server)

    select --top 200

    DISTINCT

    LEFT( TextID, MAX( N) ) AS Pattern

    FROM

    #TestHarness AS th,

    Tally AS T

    where

    N < LEN(TextID) AND

    isnumeric( SUBSTRING( TextID, N, 1)) = 0

    GROUP BY

    TextID

    The fastest gap detection code that I've ever seen.... doesn't use a tally table. (Though it is an obvious thing to use!)

    If you can make the TestHarness be the same for each prefix... I've got something good for you! 😉

    Wayne
    Microsoft Certified Master: SQL Server 2008
    Author - SQL Server T-SQL Recipes


    If you can't explain to another person how the code that you're copying from the internet works, then DON'T USE IT on a production system! After all, you will be the one supporting it!
    Links:
    For better assistance in answering your questions
    Performance Problems
    Common date/time routines
    Understanding and Using APPLY Part 1 & Part 2

Viewing 15 posts - 1 through 15 (of 91 total)

You must be logged in to reply to this topic. Login to reply