Cross joining one table to find duplicates

  • I have been working on cleaning a database with addresses in an attempt to dedupe it. There's a lot of pollution and duplicates in the table. Due to efficiency we would like to trace those down.

    To find the duplicates inside my table i use a function which uses Jaro-Winkler algorithm to find duplicates. The most efficient query i could compose for this solution is a cross join and cross apply the function to find duplicates. However, the downside is, that each record appears Left and Right in the resultset. The immediate consequence is, that this list cannot be cleanly used to dedupe.

    I've tried running a T-SQL query with a cursor instead, to exclude those records which were used on the left side, but that unfortunately didn't work for me. I was changing data in the table while the cursor moved over it. I attempted the same with a temp table but the results and performance were so horrible, it wouldn't execute over the entire 200k records.

    Normally one can place some sort of exclusion in the where. But since the where is entirely based on percentages from the outcome of a crossapply that doesn't work too well.

    Is there a cleaner way to handle this query for deduping? Right now my 230.000 records turn into a 600.000 comparison result (given the state of the addresses, that is not unlikely, however the resultset isn't exactly 'clean').

    SELECT alfa.RecordID as A_ID,beta.RecordID B_ID,

    alfa.bedrijfsnaam as A_naam, beta.bedrijfsnaam as B_Naam,

    alfa.zoekcode as A_zoekcode, beta.zoekcode as B_zoekcode, Naamscore.score,

    alfa.bezoek_adres as A_Bezoek_Adres, beta.Bezoek_Adres as B_Bezoek_Adres, AdresScore.score,

    alfa.Bezoek_Plaats as A_Bezoek_Plaats, beta.Bezoek_Plaats as B_Bezoek_Plaats, PlaatsScore.score,

    alfa.Post_Adres as A_Post_Adres, beta.Post_Adres as B_Post_Adres, PostAdresScore.score,

    alfa.Post_Plaats as A_Post_Plaats, beta.Post_Plaats as B_Post_Plaats,PostPlaatsScore.score

    FROM zeno_bedrijven alfa

    cross join zeno_bedrijven beta

    cross apply (select score from CompareJaroWinkler(alfa.zoekcode,beta.zoekcode)) as NaamScore

    cross apply (select score from comparejarowinkler(alfa.bezoek_adres,beta.bezoek_adres)) as AdresScore

    cross apply (select score from comparejarowinkler(alfa.bezoek_plaats,beta.bezoek_plaats)) as PlaatsScore

    cross apply (select score from comparejarowinkler(alfa.Post_Adres,beta.Post_adres)) as PostAdresScore

    cross apply (select score from comparejarowinkler(alfa.Post_Plaats,beta.Post_Plaats)) as PostPlaatsScore

    WHERE alfa.recordid <> beta.recordID

    and alfa.zoekcode is not null

    and beta.zoekcode is not null

    and Naamscore.score > 0.8

    and (

    (

    (AdresScore.score > 0.8 or AdresScore.score is null)

    and (PlaatsScore.score > 0.8 or PlaatsScore.score is null)

    )

    or (

    (PostAdresScore.score > 0.8 or PostAdresScore.score is null)

    and (PostPlaatsScore.score > 0.8 or PostPlaatsScore.score is null)

    )

    )

    Example of the results:

    604497716142Bodinco BVBodinco BVBODINCOBODINCO

    604497604498Bodinco BVBodinco BVBODINCOBODINCO

    604498716142Bodinco BVBodinco BVBODINCOBODINCO

    604498604497Bodinco BVBodinco BVBODINCOBODINCO

    Obviously, i would prefer not having the 604498 and 716142 appearing on the left side in the query.

  • d.aarts (10/2/2012)


    I have been working on cleaning a database with addresses in an attempt to dedupe it. There's a lot of pollution and duplicates in the table. Due to efficiency we would like to trace those down.

    To find the duplicates inside my table i use a function which uses Jaro-Winkler algorithm to find duplicates. The most efficient query i could compose for this solution is a cross join and cross apply the function to find duplicates. However, the downside is, that each record appears Left and Right in the resultset. The immediate consequence is, that this list cannot be cleanly used to dedupe.

    <<snip>>

    Exactly the same issue is addressed here.

    Do you have the code for the JaroWinkler function? What sort of performance are you getting?

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • Hello ChrisM,

    Thanks for your reply. I'll see if i can adept the code for cleaning up my ID's.

    As for the Jaro-Winkler. What i did, was adopt the SimMetrics DLL into SQL Server. It isn't quite as hard as I thought it would be. ( I attached the compiled project, it however uses .net 4.0 ). The SimMetrics project has several text-comparison functions. (for installation, just adept the install_sql script and run it).

    The performance i get from the DLL calculation is about 0.1 second per record. On cross joining 230k records to itself, the output took approximately 18 hours. What I did as well, which drastically improved the resulting comparison, is creating a list of so called 'SEARCHWORDS'. I took the company names and cleaned them up drastically. This cleansing reduced comparison time from 3 days to just over 18 hours.

    Some interesting tidbits:

    This Procedure splits every company name into different words and then counts the occurence of those words. In the table there's space for synonyms, so you can replace words with some else, or add in an empty string. This list is used as the basis for cleaning up the company names for comparison. I went with everything with over 30 occurences, but it depends a little on how your data is structured.

    procedure [dbo].[Synoniem_Lijst] (@MyQUERY varchar(255))

    AS

    BEGIN

    CREATE TABLE #Tempdb (

    Word VARCHAR(255)

    )

    CREATE TABLE Synonims

    (

    Words NVARCHAR(255),

    Wcount INT,

    Synonim NVARCHAR(255)

    )

    CREATE TABLE #Countdb (

    Wcount INT,

    Word NVARCHAR(255)

    )

    -- Do not return rowcount to improve performance with multiple queries

    SET NOCOUNT ON

    -- declare cursor

    DECLARE @colvalue NVARCHAR(255)

    insert into #Tempdb

    EXEC (@MyQuery)

    -- load cursor

    DECLARE load_cursor CURSOR FOR

    SELECT word

    FROM #Tempdb

    -- open cursor and load resultset

    OPEN load_cursor

    FETCH NEXT FROM load_cursor INTO @colvalue

    WHILE @@FETCH_STATUS = 0

    BEGIN

    INSERT INTO #Countdb

    select * from fn_split2(' ',@colvalue)

    FETCH NEXT FROM load_cursor INTO @colvalue

    END

    BEGIN

    INSERT INTO Synonims (Words, Wcount)

    SELECT Word, Count(Word) FROM #Countdb GROUP BY Word

    END

    CLOSE load_cursor

    DEALLOCATE load_cursor

    DROP TABLE #Tempdb

    DROP TABLE #Countdb

    RETURN (select * from synonims)

    END

    And the Function which creates the searchcode (cleaned up company name). Bedrijfsnaam=Companyname, Zoek_code=Created SearchString used for comparison by the algorithm. This is an adeptation of the synonym replace code i found on the net.

    ALTER PROCEDURE [dbo].[Zoek_Code]

    AS

    BEGIN

    SET NOCOUNT ON

    DECLARE @word VARCHAR(50),

    @position INT,

    @ZoekCode VARCHAR(500),

    @oldBedrijfsnaam VARCHAR(500),

    @newWord VARCHAR(50),

    @Bedrijfsnaam VARCHAR(500),

    @RecordID NVARCHAR(255)

    DECLARE load_cursor CURSOR FOR

    SELECT idcode, name FROM MyTable

    OPEN load_cursor

    FETCH NEXT FROM load_cursor INTO @RecordID, @Bedrijfsnaam

    WHILE @@FETCH_STATUS = 0

    BEGIN

    SET @oldBedrijfsnaam = @Bedrijfsnaam

    SET @Bedrijfsnaam = LTRIM(RTRIM(@Bedrijfsnaam))

    SET @ZoekCode = @Bedrijfsnaam

    SET @position = CHARINDEX(' ', @Bedrijfsnaam, 1)

    BEGIN

    WHILE @position > 0

    BEGIN

    SET @word = LTRIM(RTRIM(LEFT(@Bedrijfsnaam, @position - 1)))

    IF @word <> ''

    BEGIN

    SELECT @newWord = NULL

    SELECT @newWord = synonym FROM Synonyms WHERE words = @word

    IF @newWord IS NOT NULL

    BEGIN

    SET @ZoekCode = REPLACE(@ZoekCode, @word, @newWord)

    END

    END

    SET @Bedrijfsnaam = RIGHT(@Bedrijfsnaam, LEN(@Bedrijfsnaam) - @position)

    SET @position = CHARINDEX(' ', @Bedrijfsnaam, 1)

    END

    SET @word = @Bedrijfsnaam

    SELECT @newWord = NULL

    SELECT @newWord = synonym FROM Synonyms WHERE words = @word

    IF @newWord IS NOT NULL

    SET @ZoekCode = REPLACE(@ZoekCode, @Bedrijfsnaam, @newWord)

    END

    --IF @oldBedrijfsnaam <> @ZoekCode

    BEGIN

    SET @ZoekCode = REPLACE(@ZoekCode, ' ', '')

    SET @ZoekCode = REPLACE(@ZoekCode, '.', '')

    SET @ZoekCode = REPLACE(@ZoekCode, ',', '')

    SET @ZoekCode = REPLACE(@ZoekCode, '-', '')

    UPDATE MyTable set zoekcode=UPPER(@ZoekCode) where idcode=@RecordID

    END

    FETCH NEXT FROM load_cursor INTO @RecordID, @Bedrijfsnaam

    END

    CLOSE load_cursor

    DEALLOCATE load_cursor

    RETURN (SELECT * FROM MyTable)

    END

Viewing 3 posts - 1 through 3 (of 3 total)

You must be logged in to reply to this topic. Login to reply