March 16, 2022 at 11:13 am
Hi All,
I have a huge table with 20M records and around 70 columns. I need to check whether a combination of 30 columns has duplicate values. Using GROUP BY on those columns takes too long and is very heavy on tempdb.
I was thinking of concatenating the 30 columns, computing a CHECKSUM, and then grouping by just that one value. But it seems CHECKSUM does not guarantee a unique value for a distinct row, meaning two rows with different values in the 30 columns could end up with the same CHECKSUM.
Using ROW_NUMBER() with PARTITION BY would require an ORDER BY, which again would be very slow.
Any other ideas on how this can be done in the best possible way?
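For reference, this is the shape of the query that is too slow (column and table names are placeholders for the real 30 columns):

select col1, col2, col3 /* ...all 30 columns... */, count(*) as cnt
from dbo.BigTable
group by col1, col2, col3 /* ...all 30 columns... */
having count(*) > 1;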
Online Trainer For SQL DBA and Developer @RedBushTechnologies with 18 yrs exp.
March 16, 2022 at 11:29 am
Maybe use checksum() as suggested & push results to a separate 'duplicate suspects' table, then refine your search using the data in the new table?
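Roughly this shape, as a sketch (table and column names are placeholders; because CHECKSUM can collide, the second step has to re-compare the actual columns):

-- step 1: bucket rows by CHECKSUM of the 30 columns, keeping only buckets with more than one row
select checksum(col1, col2, col3 /* ...all 30 columns... */) as chk
into #duplicate_suspects
from dbo.BigTable
group by checksum(col1, col2, col3 /* ...all 30 columns... */)
having count(*) > 1;

-- step 2: compare the real columns, but only for rows whose checksum landed in a suspect bucket
select t.col1, t.col2, t.col3 /* ...all 30 columns... */, count(*) as cnt
from dbo.BigTable t
join #duplicate_suspects s
  on s.chk = checksum(t.col1, t.col2, t.col3 /* ...all 30 columns... */)
group by t.col1, t.col2, t.col3 /* ...all 30 columns... */
having count(*) > 1;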
March 16, 2022 at 11:41 am
Thanks Phil. Seems like a good idea...will give it a try
Online Trainer For SQL DBA and Developer @RedBushTechnologies with 18 yrs exp.
March 16, 2022 at 11:41 am
Use a construct like the following:

select *
from tablename
cross apply (select hashbytes('SHA1', concat(field1
                     , '|', field2
                     , '|', convert(varchar(27), field3, 121) -- for date/datetime fields, to ensure always the same format
                     )) as hashkey
            ) keys -- cross apply (not cross join) so the derived table can reference tablename's columns
inner join (select keys.hashkey
            from tablename
            cross apply (select hashbytes('SHA1', concat(field1
                                 , '|', field2
                                 , '|', convert(varchar(27), field3, 121) -- for date/datetime fields, to ensure always the same format
                                 )) as hashkey
                        ) keys
            group by keys.hashkey
            having count(*) > 1
           ) dups
  on dups.hashkey = keys.hashkey
If it is still slow, do a first pass into a temporary table where you insert the primary key of the table in question together with the hash, and then run the query above over that temp table to identify the PKs of the rows with issues.
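A sketch of that two-pass variant, assuming the table has a single-column primary key called id (field and table names are placeholders; note HASHBYTES input is limited to 8,000 bytes before SQL Server 2016):

-- pass 1: compute the hash once per row and store it alongside the primary key
select id,
       hashbytes('SHA1', concat(field1
              , '|', field2
              , '|', convert(varchar(27), field3, 121))) as hashkey
into #hashes
from tablename;

create index ix_hashes_hashkey on #hashes (hashkey);

-- pass 2: the PKs whose hash occurs more than once are the duplicate candidates
select h.id, h.hashkey
from #hashes h
join (select hashkey
      from #hashes
      group by hashkey
      having count(*) > 1) d
  on d.hashkey = h.hashkey
order by h.hashkey;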