Cross joining one table to find duplicates

Question

Cross joining one table to find duplicates

d.aarts

SSC Veteran

Points: 232
More actions
October 2, 2012 at 1:19 am

#267558

I have been working on cleaning a database with addresses in an attempt to dedupe it. There's a lot of pollution and duplicates in the table. Due to efficiency we would like to trace those down.
To find the duplicates inside my table i use a function which uses Jaro-Winkler algorithm to find duplicates. The most efficient query i could compose for this solution is a cross join and cross apply the function to find duplicates. However, the downside is, that each record appears Left and Right in the resultset. The immediate consequence is, that this list cannot be cleanly used to dedupe.
I've tried running a T-SQL query with a cursor instead, to exclude those records which were used on the left side, but that unfortunately didn't work for me. I was changing data in the table while the cursor moved over it. I attempted the same with a temp table but the results and performance were so horrible, it wouldn't execute over the entire 200k records.
Normally one can place some sort of exclusion in the where. But since the where is entirely based on percentages from the outcome of a crossapply that doesn't work too well.
Is there a cleaner way to handle this query for deduping? Right now my 230.000 records turn into a 600.000 comparison result (given the state of the addresses, that is not unlikely, however the resultset isn't exactly 'clean').
SELECT alfa.RecordID as A_ID,beta.RecordID B_ID,
alfa.bedrijfsnaam as A_naam, beta.bedrijfsnaam as B_Naam,
alfa.zoekcode as A_zoekcode, beta.zoekcode as B_zoekcode, Naamscore.score,
alfa.bezoek_adres as A_Bezoek_Adres, beta.Bezoek_Adres as B_Bezoek_Adres, AdresScore.score,
alfa.Bezoek_Plaats as A_Bezoek_Plaats, beta.Bezoek_Plaats as B_Bezoek_Plaats, PlaatsScore.score,
alfa.Post_Adres as A_Post_Adres, beta.Post_Adres as B_Post_Adres, PostAdresScore.score,
alfa.Post_Plaats as A_Post_Plaats, beta.Post_Plaats as B_Post_Plaats,PostPlaatsScore.score
FROM zeno_bedrijven alfa
cross join zeno_bedrijven beta
cross apply (select score from CompareJaroWinkler(alfa.zoekcode,beta.zoekcode)) as NaamScore
cross apply (select score from comparejarowinkler(alfa.bezoek_adres,beta.bezoek_adres)) as AdresScore
cross apply (select score from comparejarowinkler(alfa.bezoek_plaats,beta.bezoek_plaats)) as PlaatsScore
cross apply (select score from comparejarowinkler(alfa.Post_Adres,beta.Post_adres)) as PostAdresScore
cross apply (select score from comparejarowinkler(alfa.Post_Plaats,beta.Post_Plaats)) as PostPlaatsScore
WHERE alfa.recordid <> beta.recordID
and alfa.zoekcode is not null
and beta.zoekcode is not null
and Naamscore.score > 0.8
and (
(
(AdresScore.score > 0.8 or AdresScore.score is null)
and (PlaatsScore.score > 0.8 or PlaatsScore.score is null)
)
or (
(PostAdresScore.score > 0.8 or PostAdresScore.score is null)
and (PostPlaatsScore.score > 0.8 or PostPlaatsScore.score is null)
)
)
Example of the results:
604497716142Bodinco BVBodinco BVBODINCOBODINCO
604497604498Bodinco BVBodinco BVBODINCOBODINCO
604498716142Bodinco BVBodinco BVBODINCOBODINCO
604498604497Bodinco BVBodinco BVBODINCOBODINCO
Obviously, i would prefer not having the 604498 and 716142 appearing on the left side in the query.

Viewing 3 posts - 1 through 3 (of 3 total)

You must be logged in to reply to this topic. Login to reply

ChrisM@Work SSC Guru Points: 186127 More actions · Answer 1

d.aarts (10/2/2012)
I have been working on cleaning a database with addresses in an attempt to dedupe it. There's a lot of pollution and duplicates in the table. Due to efficiency we would like to trace those down.
To find the duplicates inside my table i use a function which uses Jaro-Winkler algorithm to find duplicates. The most efficient query i could compose for this solution is a cross join and cross apply the function to find duplicates. However, the downside is, that each record appears Left and Right in the resultset. The immediate consequence is, that this list cannot be cleanly used to dedupe.
<<snip>>

Exactly the same issue is addressed here.

Do you have the code for the JaroWinkler function? What sort of performance are you getting?

^{“Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw}

For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

d.aarts SSC Veteran Points: 232 More actions · Answer 2

Hello ChrisM,

Thanks for your reply. I'll see if i can adept the code for cleaning up my ID's.

As for the Jaro-Winkler. What i did, was adopt the SimMetrics DLL into SQL Server. It isn't quite as hard as I thought it would be. ( I attached the compiled project, it however uses .net 4.0 ). The SimMetrics project has several text-comparison functions. (for installation, just adept the install_sql script and run it).

The performance i get from the DLL calculation is about 0.1 second per record. On cross joining 230k records to itself, the output took approximately 18 hours. What I did as well, which drastically improved the resulting comparison, is creating a list of so called 'SEARCHWORDS'. I took the company names and cleaned them up drastically. This cleansing reduced comparison time from 3 days to just over 18 hours.

Some interesting tidbits:

This Procedure splits every company name into different words and then counts the occurence of those words. In the table there's space for synonyms, so you can replace words with some else, or add in an empty string. This list is used as the basis for cleaning up the company names for comparison. I went with everything with over 30 occurences, but it depends a little on how your data is structured.

procedure [dbo].[Synoniem_Lijst] (@MyQUERY varchar(255))

AS

BEGIN

CREATE TABLE #Tempdb (

Word VARCHAR(255)

)

CREATE TABLE Synonims

(

Words NVARCHAR(255),

Wcount INT,

Synonim NVARCHAR(255)

)

CREATE TABLE #Countdb (

Wcount INT,

Word NVARCHAR(255)

)

-- Do not return rowcount to improve performance with multiple queries

SET NOCOUNT ON

-- declare cursor

DECLARE @colvalue NVARCHAR(255)

insert into #Tempdb

EXEC (@MyQuery)

-- load cursor

DECLARE load_cursor CURSOR FOR

SELECT word

FROM #Tempdb

-- open cursor and load resultset

OPEN load_cursor

FETCH NEXT FROM load_cursor INTO @colvalue

WHILE @@FETCH_STATUS = 0

BEGIN

INSERT INTO #Countdb

select * from fn_split2(' ',@colvalue)

FETCH NEXT FROM load_cursor INTO @colvalue

END

BEGIN

INSERT INTO Synonims (Words, Wcount)

SELECT Word, Count(Word) FROM #Countdb GROUP BY Word

END

CLOSE load_cursor

DEALLOCATE load_cursor

DROP TABLE #Tempdb

DROP TABLE #Countdb

RETURN (select * from synonims)

END

And the Function which creates the searchcode (cleaned up company name). Bedrijfsnaam=Companyname, Zoek_code=Created SearchString used for comparison by the algorithm. This is an adeptation of the synonym replace code i found on the net.

ALTER PROCEDURE [dbo].[Zoek_Code]

AS

BEGIN

SET NOCOUNT ON

DECLARE @word VARCHAR(50),

@position INT,

@ZoekCode VARCHAR(500),

@oldBedrijfsnaam VARCHAR(500),

@newWord VARCHAR(50),

@Bedrijfsnaam VARCHAR(500),

@RecordID NVARCHAR(255)

DECLARE load_cursor CURSOR FOR

SELECT idcode, name FROM MyTable

OPEN load_cursor

FETCH NEXT FROM load_cursor INTO @RecordID, @Bedrijfsnaam

WHILE @@FETCH_STATUS = 0

BEGIN

SET @oldBedrijfsnaam = @Bedrijfsnaam

SET @Bedrijfsnaam = LTRIM(RTRIM(@Bedrijfsnaam))

SET @ZoekCode = @Bedrijfsnaam

SET @position = CHARINDEX(' ', @Bedrijfsnaam, 1)

BEGIN

WHILE @position > 0

BEGIN

SET @word = LTRIM(RTRIM(LEFT(@Bedrijfsnaam, @position - 1)))

IF @word <> ''

BEGIN

SELECT @newWord = NULL

SELECT @newWord = synonym FROM Synonyms WHERE words = @word

IF @newWord IS NOT NULL

BEGIN

SET @ZoekCode = REPLACE(@ZoekCode, @word, @newWord)

END

SET @Bedrijfsnaam = RIGHT(@Bedrijfsnaam, LEN(@Bedrijfsnaam) - @position)

SET @position = CHARINDEX(' ', @Bedrijfsnaam, 1)

END

SET @word = @Bedrijfsnaam

SELECT @newWord = NULL

SELECT @newWord = synonym FROM Synonyms WHERE words = @word

IF @newWord IS NOT NULL

SET @ZoekCode = REPLACE(@ZoekCode, @Bedrijfsnaam, @newWord)

END

--IF @oldBedrijfsnaam <> @ZoekCode

BEGIN

SET @ZoekCode = REPLACE(@ZoekCode, ' ', '')

SET @ZoekCode = REPLACE(@ZoekCode, '.', '')

SET @ZoekCode = REPLACE(@ZoekCode, ',', '')

SET @ZoekCode = REPLACE(@ZoekCode, '-', '')

UPDATE MyTable set zoekcode=UPPER(@ZoekCode) where idcode=@RecordID

END

FETCH NEXT FROM load_cursor INTO @RecordID, @Bedrijfsnaam

END

CLOSE load_cursor

DEALLOCATE load_cursor

RETURN (SELECT * FROM MyTable)

END

Cross joining one table to find duplicates

Cookies on SQLServerCentral