Home Forums SQL Server 2008 T-SQL (SS2K8) Rank duplicates, but only rows involved in duplicates RE: Rank duplicates, but only rows involved in duplicates

  • sqldriver (10/25/2014)


    I was trying a similar approach, but couldn't get the results right. :blush:

    Upper/lowercase is throwing things off, I think.

    100 Musk Ox for you, sir.

    SET STATISTICS IO ON

    SET STATISTICS TIME ON;

    WITH dupe1

    AS ( SELECT --ROW_NUMBER() OVER (PARTITION BY ORDER BY ) AS RN ,

    a.Business_pk ,

    a.BusinessName ,

    a.[Address] ,

    a.Phone ,

    HASHBYTES('MD5',

    a.BusinessName + a.[Address] + a.Phone) AS [HashPit]

    FROM TestBusiness a

    ),

    dupe2

    AS ( SELECT * ,

    ROW_NUMBER() OVER ( PARTITION BY [HashPit] ORDER BY ( SELECT

    NULL

    ) ) AS RN

    FROM dupe1

    )

    SELECT d2.Business_pk ,

    d2.BusinessName ,

    d2.[Address] ,

    d2.Phone

    FROM dupe2 d2

    WHERE d2.RN > 1

    You are on the right track there, a simple solution would be to "UPPER" the values before hashing. SHA1 is quite a lot cheaper computationally than MD5 and should be more than fit for purpose, gives a better distribution (less collisions) than binary_checksum but the latter is at least in the order of one magnitude faster, a little bit of a juggle there.

    😎