Click here to monitor SSC
SQLServerCentral is supported by Redgate
 
Log in  ::  Register  ::  Not logged in
 
 
 


Another Duplicate removal question


Another Duplicate removal question

Author
Message
GF
GF
SSC-Addicted
SSC-Addicted (470 reputation)SSC-Addicted (470 reputation)SSC-Addicted (470 reputation)SSC-Addicted (470 reputation)SSC-Addicted (470 reputation)SSC-Addicted (470 reputation)SSC-Addicted (470 reputation)SSC-Addicted (470 reputation)

Group: General Forum Members
Points: 470 Visits: 985
We are still using SQL 2000, so my question is about SQL 2000.

Problem statement

We receive a comma separated file about once every month.

We import that into a SQL table. Once it is imported we run various scripts to compare the imported data to a database we have. When we have matches based on certain criteria the results are inserted into a results table.
Despite trying to prevent duplicates from getting inserted into the “results” table we still get duplicates because of one or another of the required criteria.

In an effort to remove certain duplicates, we assign a “Score” to each record and we flag the duplicates with a 1.

What I would like is some help with the following.
1.   If there is a duplicate record then remove all but the one with the highest score.
2.   If there are duplicate records and they all have the same score, then remove all duplicates except one, preferably the one with the most data in the row.

In the “results” table there is a primary key called ID.

There is a unique id in the source file called ScrubID that gets inserted into the “results” table as well.

I used this script, which I modified from one on this site, to delete records with the same score.

DELETE FROM results WHERE ID >

(SELECT MIN(ID) FROM results b where

results.ScrubID=b.ScrubID AND

results.score=b.Score )

I have tried to modify the above script to keep the higest score and it seems no matter what I try it does not work.

Please see modification below.

DELETE FROM results WHERE ID >

(SELECT MIN(ID) FROM results b where

results.ScrubID=b.ScrubID AND

results.score > b.Score )


I will try to provide some samples in a little while.

Thanks for any help and or guidance.

Gary



Jeff Moden
Jeff Moden
SSC-Forever
SSC-Forever (44K reputation)SSC-Forever (44K reputation)SSC-Forever (44K reputation)SSC-Forever (44K reputation)SSC-Forever (44K reputation)SSC-Forever (44K reputation)SSC-Forever (44K reputation)SSC-Forever (44K reputation)

Group: General Forum Members
Points: 44970 Visits: 39864
I can't help with requirement #2 because I don't know what your data or table looks like and don't know what exactly you mean by "preferably the one with the most data in the row".

There are two fairly easy ways to accomplish what you ask. They will outperform takes turns outperforming each other depending on how many duplicates you have per ScrubID and what the indexes on the table are.

Here are the two different methods. I didn't test them because you didn't post any readily consumable data but you should get the idea. Both will handle virtually any number of duplicate ScrubID's but, like I said previously, will work at different performance levels based on how many dupes there are for each ScrubID.


DELETE lo
FROM dbo.Results lo
JOIN dbo.Results hi
ON lo.ScrubID = hi.ScrubID
AND lo.Score < hi.Score
;

WITH
cteEnumerateDupes AS
(
SELECT SortOrder = ROW_NUMBER() OVER (PARTITION BY ScrubID ORDER BY Score DESC),
ID --You probably won't need this but it gives people the nice warm fuzzies.
FROM dbo.Results
)
DELETE cteEnumarateDupes
WHERE SortOrder > 1
;




--Jeff Moden

RBAR is pronounced ree-bar and is a Modenism for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
     Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column.
Although they tell us that they want it real bad, our primary goal is to ensure that we dont actually give it to them that way.
Although change is inevitable, change for the better is usually not.
Just because you can do something in PowerShell, doesnt mean you should. Wink

Helpful Links:
How to post code problems
How to post performance problems
Forum FAQs
Lynn Pettis
Lynn Pettis
SSC-Insane
SSC-Insane (24K reputation)SSC-Insane (24K reputation)SSC-Insane (24K reputation)SSC-Insane (24K reputation)SSC-Insane (24K reputation)SSC-Insane (24K reputation)SSC-Insane (24K reputation)SSC-Insane (24K reputation)

Group: General Forum Members
Points: 24167 Visits: 37929
Jeff Moden (8/24/2013)
I can't help with requirement #2 because I don't know what your data or table looks like and don't know what exactly you mean by "preferably the one with the most data in the row".

There are two fairly easy ways to accomplish what you ask. They will outperform takes turns outperforming each other depending on how many duplicates you have per ScrubID and what the indexes on the table are.

Here are the two different methods. I didn't test them because you didn't post any readily consumable data but you should get the idea. Both will handle virtually any number of duplicate ScrubID's but, like I said previously, will work at different performance levels based on how many dupes there are for each ScrubID.


DELETE lo
FROM dbo.Results lo
JOIN dbo.Results hi
ON lo.ScrubID = hi.ScrubID
AND lo.Score < hi.Score
;

WITH
cteEnumerateDupes AS
(
SELECT SortOrder = ROW_NUMBER() OVER (PARTITION BY ScrubID ORDER BY Score DESC),
ID --You probably won't need this but it gives people the nice warm fuzzies.
FROM dbo.Results
)
DELETE cteEnumarateDupes
WHERE SortOrder > 1
;




Just one problem with what you provided, Jeff. The OP needs a solution that will work with SQL Server 2000.

Cool
Lynn Pettis

For better assistance in answering your questions, click here
For tips to get better help with Performance Problems, click here
For Running Totals and its variations, click here or when working with partitioned tables
For more about Tally Tables, click here
For more about Cross Tabs and Pivots, click here and here
Managing Transaction Logs

SQL Musings from the Desert Fountain Valley SQL (My Mirror Blog)
TomThomson
TomThomson
SSChampion
SSChampion (10K reputation)SSChampion (10K reputation)SSChampion (10K reputation)SSChampion (10K reputation)SSChampion (10K reputation)SSChampion (10K reputation)SSChampion (10K reputation)SSChampion (10K reputation)

Group: General Forum Members
Points: 10696 Visits: 11991
If you like using aggregates (all those MINs suggest that) and are not too concerned by the idea of using two queries instead of one, and that "having the most data" can be changed into "having the lowest ID" as suggested by your code, something that will work in SQL 2000 is

delete results R where not exists (
select 1 from (
select A.ScrubID, max(A.Score) as Score from results A group by ScrubID
) MX
where MX.Score = R.Score and MX.ScrubId = R.ScrubID); -- gets rid of low scores
delete results R where R.ID not in (select min(A.ID) from results A group by scrubID) ; -- gets rid of duplicates with high IDs


Four years ago, I might have worked out how to do it in a single query in SQL 2000, but years of being spoilt by the neat new features in SQL 2008 and its successors have made me less willing (and probably less able, too) to write really convoluted queries.

Actually, Jeff's first suggestion does the same as my first query, and will probably do it a lot more efficiently unless there's some very unlikely (and rather silly) indexing.




edit: spelling Blush

Tom

Jeff Moden
Jeff Moden
SSC-Forever
SSC-Forever (44K reputation)SSC-Forever (44K reputation)SSC-Forever (44K reputation)SSC-Forever (44K reputation)SSC-Forever (44K reputation)SSC-Forever (44K reputation)SSC-Forever (44K reputation)SSC-Forever (44K reputation)

Group: General Forum Members
Points: 44970 Visits: 39864
Lynn Pettis (8/25/2013)
Jeff Moden (8/24/2013)
I can't help with requirement #2 because I don't know what your data or table looks like and don't know what exactly you mean by "preferably the one with the most data in the row".

There are two fairly easy ways to accomplish what you ask. They will outperform takes turns outperforming each other depending on how many duplicates you have per ScrubID and what the indexes on the table are.

Here are the two different methods. I didn't test them because you didn't post any readily consumable data but you should get the idea. Both will handle virtually any number of duplicate ScrubID's but, like I said previously, will work at different performance levels based on how many dupes there are for each ScrubID.


DELETE lo
FROM dbo.Results lo
JOIN dbo.Results hi
ON lo.ScrubID = hi.ScrubID
AND lo.Score < hi.Score
;

WITH
cteEnumerateDupes AS
(
SELECT SortOrder = ROW_NUMBER() OVER (PARTITION BY ScrubID ORDER BY Score DESC),
ID --You probably won't need this but it gives people the nice warm fuzzies.
FROM dbo.Results
)
DELETE cteEnumarateDupes
WHERE SortOrder > 1
;




Just one problem with what you provided, Jeff. The OP needs a solution that will work with SQL Server 2000.


In that case, the first solution I provided will work.

Good to "see" you around, Lynn.

--Jeff Moden

RBAR is pronounced ree-bar and is a Modenism for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
     Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column.
Although they tell us that they want it real bad, our primary goal is to ensure that we dont actually give it to them that way.
Although change is inevitable, change for the better is usually not.
Just because you can do something in PowerShell, doesnt mean you should. Wink

Helpful Links:
How to post code problems
How to post performance problems
Forum FAQs
PHYData DBA
PHYData DBA
Mr or Mrs. 500
Mr or Mrs. 500 (556 reputation)Mr or Mrs. 500 (556 reputation)Mr or Mrs. 500 (556 reputation)Mr or Mrs. 500 (556 reputation)Mr or Mrs. 500 (556 reputation)Mr or Mrs. 500 (556 reputation)Mr or Mrs. 500 (556 reputation)Mr or Mrs. 500 (556 reputation)

Group: General Forum Members
Points: 556 Visits: 533
2. If there are duplicate records and they all have the same score, then remove all duplicates except one, preferably the one with the most data in the row.


If you post the schema for this table I can quickly give you a script to do this in SQL 2000. I used to write jobs that had to clean tables with no unique ID's. Should be easy to use a join and write a clean up script for the one you have.
Go


Permissions

You can't post new topics.
You can't post topic replies.
You can't post new polls.
You can't post replies to polls.
You can't edit your own topics.
You can't delete your own topics.
You can't edit other topics.
You can't delete other topics.
You can't edit your own posts.
You can't edit other posts.
You can't delete your own posts.
You can't delete other posts.
You can't post events.
You can't edit your own events.
You can't edit other events.
You can't delete your own events.
You can't delete other events.
You can't send private messages.
You can't send emails.
You can read topics.
You can't vote in polls.
You can't upload attachments.
You can download attachments.
You can't post HTML code.
You can't edit HTML code.
You can't post IFCode.
You can't post JavaScript.
You can post emoticons.
You can't post or upload images.

Select a forum

































































































































































SQLServerCentral


Search