Eirikur Eiriksson (10/25/2014)
An alternative solution which doesn't require any joining. It scales well for large sets but has a weakness of returning false positives although fairly unlikely with this type of data. As a bonus, the B_CHS (binary_checksum) column is a "distinct group name".😎
;WITH BASE_DATA AS
(
SELECT
BINARY_CHECKSUM(B.Phone,CHAR(124), B.Address) AS B_CHS
,B.Business_pk
FROM dbo.TestBusiness B
)
SELECT
BD.B_CHS
,BD.Business_pk
INTO #BD_SORT
FROM BASE_DATA BD;
CREATE UNIQUE NONCLUSTERED INDEX TMP_BD_SORT_DD ON #BD_SORT ( B_CHS ASC, Business_pk ASC);
;WITH BS_SORT AS
(
SELECT
BS.Business_pk
,ROW_NUMBER() OVER (PARTITION BY BS.B_CHS ORDER BY BS.Business_pk ) AS BS_RID
FROM #BD_SORT BS
)
SELECT * FROM BS_SORT WHERE BS_RID > 1;
DROP TABLE #BD_SORT;
I was trying a similar approach, but couldn't get the results right. :blush:
Upper/lowercase is throwing things off, I think.
100 Musk Ox for you, sir.
SET STATISTICS IO ON
SET STATISTICS TIME ON;
WITH dupe1
AS ( SELECT --ROW_NUMBER() OVER (PARTITION BY ORDER BY ) AS RN ,
a.Business_pk ,
a.BusinessName ,
a.[Address] ,
a.Phone ,
HASHBYTES('MD5',
a.BusinessName + a.[Address] + a.Phone) AS [HashPit]
FROM TestBusiness a
),
dupe2
AS ( SELECT * ,
ROW_NUMBER() OVER ( PARTITION BY [HashPit] ORDER BY ( SELECT
NULL
) ) AS RN
FROM dupe1
)
SELECT d2.Business_pk ,
d2.BusinessName ,
d2.[Address] ,
d2.Phone
FROM dupe2 d2
WHERE d2.RN > 1