Another Duplicate removal question
Posted Saturday, August 24, 2013 12:41 PM
We are still using SQL 2000, so my question is about SQL 2000.

Problem statement

We receive a comma separated file about once every month.

We import that into a SQL table. Once it is imported we run various scripts to compare the imported data to a database we have. When we have matches based on certain criteria the results are inserted into a results table.
Despite trying to prevent duplicates from getting inserted into the “results” table, we still get duplicates because one or another of the required criteria differs between otherwise-matching rows.

In an effort to remove certain duplicates, we assign a “Score” to each record and we flag the duplicates with a 1.

What I would like is some help with the following.
1. If there is a duplicate record then remove all but the one with the highest score.
2. If there are duplicate records and they all have the same score, then remove all duplicates except one, preferably the one with the most data in the row.

In the “results” table there is a primary key called ID.

There is a unique id in the source file called ScrubID that gets inserted into the “results” table as well.

I used this script, which I modified from one on this site, to delete records with the same score.

DELETE FROM results
WHERE ID > (SELECT MIN(ID)
            FROM results b
            WHERE results.ScrubID = b.ScrubID
              AND results.Score = b.Score)

I have tried to modify the above script to keep the highest score, but no matter what I try it does not work.

Please see modification below.

DELETE FROM results
WHERE ID > (SELECT MIN(ID)
            FROM results b
            WHERE results.ScrubID = b.ScrubID
              AND results.Score > b.Score)
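A note on why the modification can't work as intended: with `results.Score > b.Score`, the inner MIN(ID) is taken over the rows with a *lower* score, so the comparison points the wrong way; and for the top-scoring row of a ScrubID no such rows exist, the subquery returns NULL, and the WHERE is never true. One SQL 2000-compatible sketch (untested, and assuming only the ID/ScrubID/Score columns described above) is to remove lower scores first and then break ties with the original script:

```sql
-- Sketch only, not tested against real data; assumes columns ID, ScrubID, Score.
-- Step 1: remove every row that has a higher-scoring row with the same ScrubID.
DELETE FROM results
WHERE EXISTS (SELECT 1
              FROM results b
              WHERE b.ScrubID = results.ScrubID
                AND b.Score   > results.Score);

-- Step 2: among the remaining (now tied) rows, keep the lowest ID per ScrubID.
-- This is just the original working script, which finishes the job once the
-- lower scores are gone.
DELETE FROM results
WHERE ID > (SELECT MIN(ID)
            FROM results b
            WHERE b.ScrubID = results.ScrubID
              AND b.Score   = results.Score);
```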


I will try to provide some samples in a little while.

Thanks for any help and or guidance.

Gary



Post #1488142
Posted Saturday, August 24, 2013 6:02 PM


I can't help with requirement #2 because I don't know what your data or table looks like and don't know what exactly you mean by "preferably the one with the most data in the row".

There are two fairly easy ways to accomplish what you ask. They take turns outperforming each other depending on how many duplicates you have per ScrubID and what the indexes on the table are.

Here are the two different methods. I didn't test them because you didn't post any readily consumable data but you should get the idea. Both will handle virtually any number of duplicate ScrubID's but, like I said previously, will work at different performance levels based on how many dupes there are for each ScrubID.

DELETE lo
FROM dbo.Results lo
JOIN dbo.Results hi
  ON  lo.ScrubID = hi.ScrubID
  AND lo.Score < hi.Score
;

WITH cteEnumerateDupes AS
(
 SELECT SortOrder = ROW_NUMBER() OVER (PARTITION BY ScrubID ORDER BY Score DESC),
        ID --You probably won't need this but it gives people the nice warm fuzzies.
   FROM dbo.Results
)
DELETE cteEnumerateDupes
WHERE SortOrder > 1
;
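To make the behavior concrete, here is a tiny invented data set (made up for illustration, not from the thread) traced through the join method:

```sql
-- Invented illustration; the values are not the OP's data.
-- Before:  ID  ScrubID  Score
--           1   100      5
--           2   100      9
--           3   100      9
--           4   200      7
DELETE lo
FROM dbo.Results lo
JOIN dbo.Results hi
  ON  lo.ScrubID = hi.ScrubID
  AND lo.Score   < hi.Score;
-- After: IDs 2, 3, and 4 remain. Note the score tie (IDs 2 and 3) survives,
-- so a second pass on MIN(ID) per ScrubID is still needed. The ROW_NUMBER
-- method removes the tie in one pass, but ROW_NUMBER requires SQL Server
-- 2005 or later.
```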




--Jeff Moden
"RBAR is pronounced "ree-bar" and is a "Modenism" for "Row-By-Agonizing-Row".

First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column."

"Change is inevitable. Change for the better is not." -- 04 August 2013
(play on words) "Just because you CAN do something in T-SQL, doesn't mean you SHOULDN'T." --22 Aug 2013

Post #1488177
Posted Sunday, August 25, 2013 3:15 AM


Jeff Moden (8/24/2013)
I can't help with requirement #2 because I don't know what your data or table looks like and don't know what exactly you mean by "preferably the one with the most data in the row".

There are two fairly easy ways to accomplish what you ask. They take turns outperforming each other depending on how many duplicates you have per ScrubID and what the indexes on the table are.

Here are the two different methods. I didn't test them because you didn't post any readily consumable data but you should get the idea. Both will handle virtually any number of duplicate ScrubID's but, like I said previously, will work at different performance levels based on how many dupes there are for each ScrubID.

 DELETE lo
FROM dbo.Results lo
JOIN dbo.Results hi
ON lo.ScrubID = hi.ScrubID
AND lo.Score < hi.Score
;

WITH
cteEnumerateDupes AS
(
SELECT SortOrder = ROW_NUMBER() OVER (PARTITION BY ScrubID ORDER BY Score DESC),
ID --You probably won't need this but it gives people the nice warm fuzzies.
FROM dbo.Results
)
DELETE cteEnumerateDupes
WHERE SortOrder > 1
;




Just one problem with what you provided, Jeff. The OP needs a solution that will work with SQL Server 2000.



Lynn Pettis


SQL Musings from the Desert Fountain Valley SQL (My Mirror Blog)
Post #1488197
Posted Sunday, August 25, 2013 10:57 AM


If you like using aggregates (all those MINs suggest that), aren't too concerned by using two queries instead of one, and are willing to let "having the most data" become "having the lowest ID" as your code suggests, something that will work in SQL 2000 is:
DELETE R
FROM results R
WHERE NOT EXISTS (
    SELECT 1
    FROM (SELECT A.ScrubID, MAX(A.Score) AS Score
          FROM results A
          GROUP BY A.ScrubID) MX
    WHERE MX.Score = R.Score
      AND MX.ScrubID = R.ScrubID
);  -- gets rid of low scores

DELETE R
FROM results R
WHERE R.ID NOT IN (SELECT MIN(A.ID) FROM results A GROUP BY A.ScrubID);  -- gets rid of duplicates with high IDs

Four years ago, I might have worked out how to do it in a single query in SQL 2000, but years of being spoilt by the neat new features in SQL 2008 and its successors have made me less willing (and probably less able, too) to write really convoluted queries.

Actually, Jeff's first suggestion does the same as my first query, and will probably do it a lot more efficiently unless there's some very unlikely (and rather silly) indexing.




edit: spelling


Tom
Post #1488216
Posted Sunday, August 25, 2013 2:08 PM


Lynn Pettis (8/25/2013)
Jeff Moden (8/24/2013)
...

Just one problem with what you provided, Jeff. The OP needs a solution that will work with SQL Server 2000.


In that case, the first solution I provided will work.

Good to "see" you around, Lynn.


Post #1488234
Posted Monday, August 26, 2013 9:15 AM


2. If there are duplicate records and they all have the same score, then remove all duplicates except one, preferably the one with the most data in the row.


If you post the schema for this table I can quickly give you a script to do this in SQL 2000. I used to write jobs that had to clean tables with no unique IDs. It should be easy to use a join and write a cleanup script for the one you have.
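Until the schema is posted, here is one hedged sketch of how "most data in the row" might be approximated in SQL 2000 by counting non-NULL columns. ColA, ColB, and ColC are invented placeholder names, not the real columns, and the approach is untested:

```sql
-- Hypothetical sketch: ColA, ColB, ColC stand in for the real (unposted) columns.
-- Keeps, per ScrubID, only the rows with the maximum count of non-NULL columns;
-- a further MIN(ID) pass would still be needed if several rows tie.
DELETE R
FROM results R
JOIN (SELECT ScrubID,
             MAX(  CASE WHEN ColA IS NULL THEN 0 ELSE 1 END
                 + CASE WHEN ColB IS NULL THEN 0 ELSE 1 END
                 + CASE WHEN ColC IS NULL THEN 0 ELSE 1 END) AS MaxFilled
      FROM results
      GROUP BY ScrubID) M
  ON M.ScrubID = R.ScrubID
WHERE (  CASE WHEN R.ColA IS NULL THEN 0 ELSE 1 END
       + CASE WHEN R.ColB IS NULL THEN 0 ELSE 1 END
       + CASE WHEN R.ColC IS NULL THEN 0 ELSE 1 END) < M.MaxFilled;
```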
Post #1488425