Finding duplicate values in a very large table. SELECT ROW_NUMBER() OVER too slow

  • This query finds duplicate values and inserts the results into another table, using the basic structure shown below. The table will grow from 35 million to probably 65 million records over the next year. It's causing tempdb waits to spike to as much as 14,000 milliseconds per IO for about an hour each night. We need a better approach to this.

    insert into .......
    SELECT ROW_NUMBER() OVER (
        ORDER BY ....  -- orders by two varchar data type fields
    )
    ......
    FROM SUMMARY_TABLE AS Summary02
    CROSS JOIN SUMMARY_TABLE AS Summary13
    WHERE

  • Couldn't you just do a COUNT/HAVING approach?

    SELECT Field1, Field2, COUNT(*) AS Records
    FROM SUMMARY_TABLE
    GROUP BY Field1, Field2
    HAVING COUNT(*) > 1

    That will identify the multiple occurrences, and from there you can resolve them as needed.
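    For example, assuming Field1 and Field2 stand in for the two varchar columns being compared (names illustrative), the aggregate can be joined back to the base table to pull out every duplicated row rather than just the key values:

    ```sql
    -- Join the duplicate keys back to the base table to see the full rows.
    SELECT s.*
    FROM SUMMARY_TABLE AS s
    JOIN (
        SELECT Field1, Field2
        FROM SUMMARY_TABLE
        GROUP BY Field1, Field2
        HAVING COUNT(*) > 1
    ) AS dupes
        ON dupes.Field1 = s.Field1
       AND dupes.Field2 = s.Field2;
    ```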

    --------------------------------------------------------------------------
    When you realize you've dug yourself into a hole....Step 1...stop digging.

  • I tested yours with our info on a test box and it ran in 1.5 minutes instead of 17. But I'll have to see what the developers are trying to accomplish. Since most of the SQL is object-generated, I usually just report the culprits as bugs via our ticketing system.

  • Well, let's hope it helps. Based on the numbers you reported back, they should at least look at other "options".


  • I guess I'd have to ask... why do they allow the "dupes" to continue?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • I'm with you on that one. I didn't want to clutter up the quick answer, but that is a great question. I think I'm afraid to hear the answer, though.


  • I'm not sure. These are car loans, so it might be a case where a VIN (vehicle identification number) exists twice: one record is a paid-off loan still in the database, and the other is the same VIN showing on a newer loan.

    I'm more familiar with our legacy system, where a VIN can exist twice but anything beyond the primary active record has a special status indicating it is in some sort of error condition. In other words, there is a unique index on VIN excepting that error status flag. Files from clients (banks) and state motor vehicle offices can arrive with a VIN that already exists; simply refusing to load those files where the VIN was a duplicate would create its own problems.
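    As an aside: on SQL Server 2005, where filtered indexes aren't available, a "unique index on VIN excepting the error flag" can be approximated with an indexed view. A minimal sketch, assuming a hypothetical Loans table with vin and status columns and 'E' as the error flag:

    ```sql
    -- Rows in the error condition are excluded from the view,
    -- so the unique index only constrains the remaining rows.
    CREATE VIEW dbo.UniqueActiveVin
    WITH SCHEMABINDING AS
    SELECT vin
    FROM dbo.Loans
    WHERE status <> 'E';
    GO
    CREATE UNIQUE CLUSTERED INDEX uq_active_vin
        ON dbo.UniqueActiveVin (vin);
    ```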

  • Sounds like historical records without dates or status flags 🙂

    Anyway, I hope the new version works well for you. A non-clustered index on those two fields should dramatically speed up the query as well (though I don't know how adversely it would impact your INSERTs/UPDATEs/DELETEs). Definitely test it if you decide to create an index.
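    A minimal sketch of the suggested index, again assuming Field1 and Field2 stand in for the two varchar columns:

    ```sql
    CREATE NONCLUSTERED INDEX ix_Summary_Field1_Field2
        ON dbo.SUMMARY_TABLE (Field1, Field2);
    ```

    With this in place, the GROUP BY/HAVING query can be satisfied by an ordered scan of the index instead of a sort spilling into tempdb.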


  • It will be up to the developers to decide if they're going to revise anything, unless Systems decides to "insist". I've done a lot of googling on ROW_NUMBER and have yet to see anyone else using it with a cross join of a huge table against a copy of itself.

  • Let us know how it goes!


  • Indianrock (2/23/2011)


    This query finds duplicate values and inserts the results into another table using the basic structure shown below.

    An alternative solution would be to create insert, update, and delete triggers on the summary table to maintain a list of primary key values in the table where there are duplicates for the two character fields. This will add a small overhead to data-changing operations on the table, but finding duplicates would be trivial, and practically instantaneous. As an alternative to the triggers, you could also consider an indexed view (with a filtered non-clustered index in 2008).
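    A rough sketch of the trigger idea (all object and column names here are hypothetical; Field1/Field2 stand in for the two character fields, PK for the table's primary key):

    ```sql
    -- Side table holding the PKs that currently have duplicates.
    CREATE TABLE dbo.SummaryDuplicates (PK INTEGER PRIMARY KEY);
    GO
    CREATE TRIGGER trg_Summary_TrackDupes
    ON dbo.SUMMARY_TABLE
    AFTER INSERT, UPDATE, DELETE
    AS
    BEGIN
        SET NOCOUNT ON;

        -- Key combinations touched by this statement.
        SELECT Field1, Field2
        INTO #touched
        FROM (SELECT Field1, Field2 FROM inserted
              UNION
              SELECT Field1, Field2 FROM deleted) AS t;

        -- Drop tracked PKs whose base row is gone or whose key was touched...
        DELETE sd
        FROM dbo.SummaryDuplicates AS sd
        LEFT JOIN dbo.SUMMARY_TABLE AS s ON s.PK = sd.PK
        LEFT JOIN #touched AS k
            ON k.Field1 = s.Field1 AND k.Field2 = s.Field2
        WHERE s.PK IS NULL
           OR k.Field1 IS NOT NULL;

        -- ...then re-add every PK whose touched key now occurs more than once.
        INSERT dbo.SummaryDuplicates (PK)
        SELECT s.PK
        FROM dbo.SUMMARY_TABLE AS s
        JOIN #touched AS k
            ON k.Field1 = s.Field1 AND k.Field2 = s.Field2
        WHERE EXISTS (SELECT 1 FROM dbo.SUMMARY_TABLE AS s2
                      WHERE s2.Field1 = s.Field1
                        AND s2.Field2 = s.Field2
                        AND s2.PK <> s.PK);
    END;
    ```

    The nightly job then becomes a trivial read of dbo.SummaryDuplicates, at the cost of a little extra work on every data change.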

  • We're on sql 2005 Enterprise and do have some indexed views in use. How would that work for this problem?

  • Indianrock (2/25/2011)


    We're on sql 2005 Enterprise and do have some indexed views in use. How would that work for this problem?

    I think I know where Paul's going with that (I'm never quite sure, I'm still playing catchup to his knowledge... 🙂 ), but since you're in 2k5 you can't use a filtered index, so it's a moot point.


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions | Forum Netiquette
    For index/tuning help, follow these directions. | Tally Tables

    Twitter: @AnyWayDBA

  • Indianrock (2/25/2011)


    We're on sql 2005 Enterprise and do have some indexed views in use. How would that work for this problem?

    I'll post an example in a second.

  • CREATE TABLE dbo.Summary
    (
        PK     INTEGER PRIMARY KEY,
        group1 INTEGER NOT NULL,
        group2 INTEGER NOT NULL,
        value  INTEGER NOT NULL
    );
    GO
    INSERT dbo.Summary VALUES (1, 1, 1, 100);
    INSERT dbo.Summary VALUES (2, 1, 2, 200);
    INSERT dbo.Summary VALUES (3, 2, 3, 300);
    INSERT dbo.Summary VALUES (4, 2, 4, 400);
    INSERT dbo.Summary VALUES (5, 2, 3, 500);
    INSERT dbo.Summary VALUES (6, 3, 5, 600);
    INSERT dbo.Summary VALUES (7, 3, 5, 700);
    INSERT dbo.Summary VALUES (8, 4, 5, 800);
    INSERT dbo.Summary VALUES (9, 5, 5, 900);
    GO
    CREATE INDEX nc1 ON dbo.Summary (group1, group2);
    GO
    -- View definition
    CREATE VIEW dbo.IV WITH SCHEMABINDING AS
    SELECT group1, group2, COUNT_BIG(*) AS row_count
    FROM dbo.Summary
    GROUP BY group1, group2;
    GO
    -- Index it
    CREATE UNIQUE CLUSTERED INDEX cuq ON dbo.IV (group1, group2);
    CREATE NONCLUSTERED INDEX nc1 ON dbo.IV (row_count);
    GO
    -- NOEXPAND hint required for non-Enterprise SKUs
    SELECT IV.group1, IV.group2
    FROM dbo.IV AS IV WITH (NOEXPAND)
    WHERE row_count > 1;
    GO
    -- In Enterprise SKUs, this query seeks on the indexed view
    SELECT group1, group2, COUNT_BIG(*) AS row_count
    FROM dbo.Summary
    GROUP BY group1, group2
    HAVING COUNT_BIG(*) > 1;
    GO
    DROP VIEW dbo.IV;
    DROP TABLE dbo.Summary;
