Finding duplicates in a table with 20 million records

Question

Finding duplicates in a table with 20 million records

S_Kumar_S

SSCrazy Eights

Points: 9037
More actions
March 16, 2022 at 11:13 am

#4000489

HI All
I have a huge table with 20M records with around 70 columns. I need to check if a combination of 30 columns has duplicate values. Use of Group by takes too long and is very heavy on tempdb.
I was thinking to concatenate the 30 columns and do a CHECKSUM and then group by just that column. But seems like CHECKSUM will not guarantee a unique value for a distinct row. Means 2 rows with different values of 30 columns might have same CHECKSUM.
Use of ROW_NUMBER() with PARTITION BY will require ORDER BY which again will be very slow.
Any other idea on it , how can it be done in best possible way?
Online Trainer For SQL DBA and Developer @RedBushTechnologies with 18 yrs exp.

Viewing 4 posts - 1 through 4 (of 4 total)

You must be logged in to reply to this topic. Login to reply

Phil Parkin SSC Guru Points: 247202 More actions · Answer 1

Maybe use checksum() as suggested & push results to a separate 'duplicate suspects' table, then refine your search using the data in the new table?

S_Kumar_S SSCrazy Eights Points: 9037 More actions · Answer 2

Thanks Phil. Seems like a good idea...will give it a try

Online Trainer For SQL DBA and Developer @RedBushTechnologies with 18 yrs exp.

frederico_fonseca SSCoach Points: 16290 More actions · Answer 3

use a construct like the following

select *
from tablename
cross join (select hashbytes('SHA1', concat(field1
                           , '|' , field2
                           , '|' , convert(varchar(27), field3, 121) -- for date/datetime fields to ensure always same format
                           ) as hashkey
           ) keys
inner join (select keys.hashkey
            from tablename 
            cross join (select hashbytes('SHA1', concat(field1
                                       , '|' , field2
                                       , '|' , convert(varchar(27), field3, 121) -- for date/datetime fields to ensure always same format
                                       ) as hashkey
                       ) keys
            group by haskey
            having count(*) > 1
           ) dups
where dups.hashkey = keys.hashkey

if it is still slow do a pass onto a temporary table where you insert primary key of the table in question and the hash.

and then do the above code over that temp table to identify the pk's of records with issue