Finding duplicates in a table with 20 million records

  • HI All

    I have a huge table with 20M records with around 70 columns. I need to check if a combination of 30 columns has duplicate values. Use of Group by takes too long and is very heavy on tempdb.

    I was thinking to concatenate the 30 columns and do a CHECKSUM and then group by just that column. But seems like CHECKSUM will not guarantee a unique value for a distinct row. Means 2 rows with different values of 30 columns might have same CHECKSUM.

    Use of ROW_NUMBER() with PARTITION BY will require ORDER BY which again will be very slow.

    Any other idea on it , how can it be done in best possible way?

    Online Trainer For SQL DBA and Developer @RedBushTechnologies with 18 yrs exp.

  • Maybe use checksum() as suggested & push results to a separate 'duplicate suspects' table, then refine your search using the data in the new table?

    If you haven't even tried to resolve your issue, please don't expect the hard-working volunteers here to waste their time providing links to answers which you could easily have found yourself.

  • Thanks Phil. Seems like a good idea...will give it a try

    Online Trainer For SQL DBA and Developer @RedBushTechnologies with 18 yrs exp.

  • use a construct like the following

    select *
    from tablename
    cross join (select hashbytes('SHA1', concat(field1
    , '|' , field2
    , '|' , convert(varchar(27), field3, 121) -- for date/datetime fields to ensure always same format
    ) as hashkey
    ) keys
    inner join (select keys.hashkey
    from tablename
    cross join (select hashbytes('SHA1', concat(field1
    , '|' , field2
    , '|' , convert(varchar(27), field3, 121) -- for date/datetime fields to ensure always same format
    ) as hashkey
    ) keys
    group by haskey
    having count(*) > 1
    ) dups
    where dups.hashkey = keys.hashkey

    if it is still slow do a pass onto a temporary table where you insert primary key of the table in question and the hash.

    and then do the above code over that temp table to identify the pk's of records with issue

Viewing 4 posts - 1 through 4 (of 4 total)

You must be logged in to reply to this topic. Login to reply