Another Duplicate removal question

  • We are still using SQL 2000, so my question is about SQL 2000.

    Problem statement

    We receive a comma separated file about once every month.

    We import that into a SQL table. Once it is imported we run various scripts to compare the imported data to a database we have. When we have matches based on certain criteria the results are inserted into a results table.

    Despite trying to prevent duplicates from getting inserted into the “results” table we still get duplicates because of one or another of the required criteria.

    In an effort to remove certain duplicates, we assign a “Score” to each record and we flag the duplicates with a 1.

    What I would like is some help with the following.

    1.If there is a duplicate record then remove all but the one with the highest score.

    2.If there are duplicate records and they all have the same score, then remove all duplicates except one, preferably the one with the most data in the row.

    In the “results” table there is a primary key called ID.

    There is a unique id in the source file called ScrubID that gets inserted into the “results” table as well.

    I used this script, which I modified from one on this site, to delete records with the same score.

    DELETE FROM results WHERE ID >

    (SELECT MIN(ID) FROM results b where

    results.ScrubID=b.ScrubID AND

    results.score=b.Score )

    I have tried to modify the above script to keep the higest score and it seems no matter what I try it does not work.

    Please see modification below.

    DELETE FROM results WHERE ID >

    (SELECT MIN(ID) FROM results b where

    results.ScrubID=b.ScrubID AND

    results.score > b.Score )

    I will try to provide some samples in a little while.

    Thanks for any help and or guidance.

    Gary

  • I can't help with requirement #2 because I don't know what your data or table looks like and don't know what exactly you mean by "preferably the one with the most data in the row".

    There are two fairly easy ways to accomplish what you ask. They will outperform takes turns outperforming each other depending on how many duplicates you have per ScrubID and what the indexes on the table are.

    Here are the two different methods. I didn't test them because you didn't post any readily consumable data but you should get the idea. Both will handle virtually any number of duplicate ScrubID's but, like I said previously, will work at different performance levels based on how many dupes there are for each ScrubID.

    DELETE lo

    FROM dbo.Results lo

    JOIN dbo.Results hi

    ON lo.ScrubID = hi.ScrubID

    AND lo.Score < hi.Score

    ;

    WITH

    cteEnumerateDupes AS

    (

    SELECT SortOrder = ROW_NUMBER() OVER (PARTITION BY ScrubID ORDER BY Score DESC),

    ID --You probably won't need this but it gives people the nice warm fuzzies.

    FROM dbo.Results

    )

    DELETE cteEnumarateDupes

    WHERE SortOrder > 1

    ;

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (8/24/2013)


    I can't help with requirement #2 because I don't know what your data or table looks like and don't know what exactly you mean by "preferably the one with the most data in the row".

    There are two fairly easy ways to accomplish what you ask. They will outperform takes turns outperforming each other depending on how many duplicates you have per ScrubID and what the indexes on the table are.

    Here are the two different methods. I didn't test them because you didn't post any readily consumable data but you should get the idea. Both will handle virtually any number of duplicate ScrubID's but, like I said previously, will work at different performance levels based on how many dupes there are for each ScrubID.

    DELETE lo

    FROM dbo.Results lo

    JOIN dbo.Results hi

    ON lo.ScrubID = hi.ScrubID

    AND lo.Score < hi.Score

    ;

    WITH

    cteEnumerateDupes AS

    (

    SELECT SortOrder = ROW_NUMBER() OVER (PARTITION BY ScrubID ORDER BY Score DESC),

    ID --You probably won't need this but it gives people the nice warm fuzzies.

    FROM dbo.Results

    )

    DELETE cteEnumarateDupes

    WHERE SortOrder > 1

    ;

    Just one problem with what you provided, Jeff. The OP needs a solution that will work with SQL Server 2000.

  • If you like using aggregates (all those MINs suggest that) and are not too concerned by the idea of using two queries instead of one, and that "having the most data" can be changed into "having the lowest ID" as suggested by your code, something that will work in SQL 2000 is

    delete results R where not exists (

    select 1 from (

    select A.ScrubID, max(A.Score) as Score from results A group by ScrubID

    ) MX

    where MX.Score = R.Score and MX.ScrubId = R.ScrubID); -- gets rid of low scores

    delete results R where R.ID not in (select min(A.ID) from results A group by scrubID) ; -- gets rid of duplicates with high IDs

    Four years ago, I might have worked out how to do it in a single query in SQL 2000, but years of being spoilt by the neat new features in SQL 2008 and its successors have made me less willing (and probably less able, too) to write really convoluted queries.

    Actually, Jeff's first suggestion does the same as my first query, and will probably do it a lot more efficiently unless there's some very unlikely (and rather silly) indexing.

    edit: spelling :blush:

    Tom

  • Lynn Pettis (8/25/2013)


    Jeff Moden (8/24/2013)


    I can't help with requirement #2 because I don't know what your data or table looks like and don't know what exactly you mean by "preferably the one with the most data in the row".

    There are two fairly easy ways to accomplish what you ask. They will outperform takes turns outperforming each other depending on how many duplicates you have per ScrubID and what the indexes on the table are.

    Here are the two different methods. I didn't test them because you didn't post any readily consumable data but you should get the idea. Both will handle virtually any number of duplicate ScrubID's but, like I said previously, will work at different performance levels based on how many dupes there are for each ScrubID.

    DELETE lo

    FROM dbo.Results lo

    JOIN dbo.Results hi

    ON lo.ScrubID = hi.ScrubID

    AND lo.Score < hi.Score

    ;

    WITH

    cteEnumerateDupes AS

    (

    SELECT SortOrder = ROW_NUMBER() OVER (PARTITION BY ScrubID ORDER BY Score DESC),

    ID --You probably won't need this but it gives people the nice warm fuzzies.

    FROM dbo.Results

    )

    DELETE cteEnumarateDupes

    WHERE SortOrder > 1

    ;

    Just one problem with what you provided, Jeff. The OP needs a solution that will work with SQL Server 2000.

    In that case, the first solution I provided will work.

    Good to "see" you around, Lynn.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • 2. If there are duplicate records and they all have the same score, then remove all duplicates except one, preferably the one with the most data in the row.

    If you post the schema for this table I can quickly give you a script to do this in SQL 2000. I used to write jobs that had to clean tables with no unique ID's. Should be easy to use a join and write a clean up script for the one you have.

Viewing 6 posts - 1 through 5 (of 5 total)

You must be logged in to reply to this topic. Login to reply