sqldriver (10/25/2014)
I was trying a similar approach, but couldn't get the results right. :blush:Upper/lowercase is throwing things off, I think.
100 Musk Ox for you, sir.
SET STATISTICS IO ON
SET STATISTICS TIME ON;
WITH dupe1
AS ( SELECT --ROW_NUMBER() OVER (PARTITION BY ORDER BY ) AS RN ,
a.Business_pk ,
a.BusinessName ,
a.[Address] ,
a.Phone ,
HASHBYTES('MD5',
a.BusinessName + a.[Address] + a.Phone) AS [HashPit]
FROM TestBusiness a
),
dupe2
AS ( SELECT * ,
ROW_NUMBER() OVER ( PARTITION BY [HashPit] ORDER BY ( SELECT
NULL
) ) AS RN
FROM dupe1
)
SELECT d2.Business_pk ,
d2.BusinessName ,
d2.[Address] ,
d2.Phone
FROM dupe2 d2
WHERE d2.RN > 1
You are on the right track there, a simple solution would be to "UPPER" the values before hashing. SHA1 is quite a lot cheaper computationally than MD5 and should be more than fit for purpose, gives a better distribution (less collisions) than binary_checksum but the latter is at least in the order of one magnitude faster, a little bit of a juggle there.
😎