Click here to monitor SSC
SQLServerCentral is supported by Red Gate Software Ltd.
 
Log in  ::  Register  ::  Not logged in
 
 
 
        
Home       Members    Calendar    Who's On


Add to briefcase

Cross joining one table to find duplicates Expand / Collapse
Author
Message
Posted Tuesday, October 02, 2012 1:19 AM
Forum Newbie

Forum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum Newbie

Group: General Forum Members
Last Login: Wednesday, April 17, 2013 7:41 AM
Points: 2, Visits: 28
I have been working on cleaning a database with addresses in an attempt to dedupe it. There's a lot of pollution and duplicates in the table. Due to efficiency we would like to trace those down.

To find the duplicates inside my table i use a function which uses Jaro-Winkler algorithm to find duplicates. The most efficient query i could compose for this solution is a cross join and cross apply the function to find duplicates. However, the downside is, that each record appears Left and Right in the resultset. The immediate consequence is, that this list cannot be cleanly used to dedupe.

I've tried running a T-SQL query with a cursor instead, to exclude those records which were used on the left side, but that unfortunately didn't work for me. I was changing data in the table while the cursor moved over it. I attempted the same with a temp table but the results and performance were so horrible, it wouldn't execute over the entire 200k records.
Normally one can place some sort of exclusion in the where. But since the where is entirely based on percentages from the outcome of a crossapply that doesn't work too well.

Is there a cleaner way to handle this query for deduping? Right now my 230.000 records turn into a 600.000 comparison result (given the state of the addresses, that is not unlikely, however the resultset isn't exactly 'clean').

SELECT alfa.RecordID as A_ID,beta.RecordID B_ID,
alfa.bedrijfsnaam as A_naam, beta.bedrijfsnaam as B_Naam,
alfa.zoekcode as A_zoekcode, beta.zoekcode as B_zoekcode, Naamscore.score,
alfa.bezoek_adres as A_Bezoek_Adres, beta.Bezoek_Adres as B_Bezoek_Adres, AdresScore.score,
alfa.Bezoek_Plaats as A_Bezoek_Plaats, beta.Bezoek_Plaats as B_Bezoek_Plaats, PlaatsScore.score,
alfa.Post_Adres as A_Post_Adres, beta.Post_Adres as B_Post_Adres, PostAdresScore.score,
alfa.Post_Plaats as A_Post_Plaats, beta.Post_Plaats as B_Post_Plaats,PostPlaatsScore.score

FROM zeno_bedrijven alfa
cross join zeno_bedrijven beta
cross apply (select score from CompareJaroWinkler(alfa.zoekcode,beta.zoekcode)) as NaamScore
cross apply (select score from comparejarowinkler(alfa.bezoek_adres,beta.bezoek_adres)) as AdresScore
cross apply (select score from comparejarowinkler(alfa.bezoek_plaats,beta.bezoek_plaats)) as PlaatsScore
cross apply (select score from comparejarowinkler(alfa.Post_Adres,beta.Post_adres)) as PostAdresScore
cross apply (select score from comparejarowinkler(alfa.Post_Plaats,beta.Post_Plaats)) as PostPlaatsScore

WHERE alfa.recordid <> beta.recordID
and alfa.zoekcode is not null
and beta.zoekcode is not null
and Naamscore.score > 0.8
and (
(
(AdresScore.score > 0.8 or AdresScore.score is null)
and (PlaatsScore.score > 0.8 or PlaatsScore.score is null)
)
or (
(PostAdresScore.score > 0.8 or PostAdresScore.score is null)
and (PostPlaatsScore.score > 0.8 or PostPlaatsScore.score is null)
)
)

Example of the results:
604497 716142 Bodinco BV Bodinco BV BODINCO BODINCO
604497 604498 Bodinco BV Bodinco BV BODINCO BODINCO
604498 716142 Bodinco BV Bodinco BV BODINCO BODINCO
604498 604497 Bodinco BV Bodinco BV BODINCO BODINCO

Obviously, i would prefer not having the 604498 and 716142 appearing on the left side in the query.
Post #1366876
Posted Tuesday, October 02, 2012 2:15 AM


SSCertifiable

SSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiable

Group: General Forum Members
Last Login: Today @ 9:27 AM
Points: 5,618, Visits: 10,990
d.aarts (10/2/2012)
I have been working on cleaning a database with addresses in an attempt to dedupe it. There's a lot of pollution and duplicates in the table. Due to efficiency we would like to trace those down.

To find the duplicates inside my table i use a function which uses Jaro-Winkler algorithm to find duplicates. The most efficient query i could compose for this solution is a cross join and cross apply the function to find duplicates. However, the downside is, that each record appears Left and Right in the resultset. The immediate consequence is, that this list cannot be cleanly used to dedupe.

<<snip>>


Exactly the same issue is addressed here.

Do you have the code for the JaroWinkler function? What sort of performance are you getting?


“Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
Exploring Recursive CTEs by Example Dwain Camps
Post #1366894
Posted Tuesday, October 02, 2012 2:57 AM
Forum Newbie

Forum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum Newbie

Group: General Forum Members
Last Login: Wednesday, April 17, 2013 7:41 AM
Points: 2, Visits: 28
Hello ChrisM,

Thanks for your reply. I'll see if i can adept the code for cleaning up my ID's.

As for the Jaro-Winkler. What i did, was adopt the SimMetrics DLL into SQL Server. It isn't quite as hard as I thought it would be. ( I attached the compiled project, it however uses .net 4.0 ). The SimMetrics project has several text-comparison functions. (for installation, just adept the install_sql script and run it).

The performance i get from the DLL calculation is about 0.1 second per record. On cross joining 230k records to itself, the output took approximately 18 hours. What I did as well, which drastically improved the resulting comparison, is creating a list of so called 'SEARCHWORDS'. I took the company names and cleaned them up drastically. This cleansing reduced comparison time from 3 days to just over 18 hours.

Some interesting tidbits:
This Procedure splits every company name into different words and then counts the occurence of those words. In the table there's space for synonyms, so you can replace words with some else, or add in an empty string. This list is used as the basis for cleaning up the company names for comparison. I went with everything with over 30 occurences, but it depends a little on how your data is structured.

procedure [dbo].[Synoniem_Lijst] (@MyQUERY varchar(255))

AS
BEGIN

CREATE TABLE #Tempdb (
Word VARCHAR(255)
)

CREATE TABLE Synonims
(
Words NVARCHAR(255),
Wcount INT,
Synonim NVARCHAR(255)
)

CREATE TABLE #Countdb (
Wcount INT,
Word NVARCHAR(255)
)

-- Do not return rowcount to improve performance with multiple queries
SET NOCOUNT ON

-- declare cursor
DECLARE @colvalue NVARCHAR(255)

insert into #Tempdb
EXEC (@MyQuery)

-- load cursor
DECLARE load_cursor CURSOR FOR
SELECT word
FROM #Tempdb

-- open cursor and load resultset
OPEN load_cursor
FETCH NEXT FROM load_cursor INTO @colvalue

WHILE @@FETCH_STATUS = 0
BEGIN

INSERT INTO #Countdb
select * from fn_split2(' ',@colvalue)

FETCH NEXT FROM load_cursor INTO @colvalue

END

BEGIN
INSERT INTO Synonims (Words, Wcount)
SELECT Word, Count(Word) FROM #Countdb GROUP BY Word
END

CLOSE load_cursor
DEALLOCATE load_cursor

DROP TABLE #Tempdb
DROP TABLE #Countdb

RETURN (select * from synonims)

END


And the Function which creates the searchcode (cleaned up company name). Bedrijfsnaam=Companyname, Zoek_code=Created SearchString used for comparison by the algorithm. This is an adeptation of the synonym replace code i found on the net.

ALTER PROCEDURE [dbo].[Zoek_Code]

AS
BEGIN

SET NOCOUNT ON

DECLARE @word VARCHAR(50),
@position INT,
@ZoekCode VARCHAR(500),
@oldBedrijfsnaam VARCHAR(500),
@newWord VARCHAR(50),
@Bedrijfsnaam VARCHAR(500),
@RecordID NVARCHAR(255)


DECLARE load_cursor CURSOR FOR
SELECT idcode, name FROM MyTable

OPEN load_cursor
FETCH NEXT FROM load_cursor INTO @RecordID, @Bedrijfsnaam

WHILE @@FETCH_STATUS = 0
BEGIN
SET @oldBedrijfsnaam = @Bedrijfsnaam
SET @Bedrijfsnaam = LTRIM(RTRIM(@Bedrijfsnaam))

SET @ZoekCode = @Bedrijfsnaam

SET @position = CHARINDEX(' ', @Bedrijfsnaam, 1)

BEGIN
WHILE @position > 0
BEGIN
SET @word = LTRIM(RTRIM(LEFT(@Bedrijfsnaam, @position - 1)))
IF @word <> ''
BEGIN

SELECT @newWord = NULL
SELECT @newWord = synonym FROM Synonyms WHERE words = @word
IF @newWord IS NOT NULL
BEGIN
SET @ZoekCode = REPLACE(@ZoekCode, @word, @newWord)
END
END
SET @Bedrijfsnaam = RIGHT(@Bedrijfsnaam, LEN(@Bedrijfsnaam) - @position)
SET @position = CHARINDEX(' ', @Bedrijfsnaam, 1)
END

SET @word = @Bedrijfsnaam
SELECT @newWord = NULL
SELECT @newWord = synonym FROM Synonyms WHERE words = @word
IF @newWord IS NOT NULL
SET @ZoekCode = REPLACE(@ZoekCode, @Bedrijfsnaam, @newWord)

END

--IF @oldBedrijfsnaam <> @ZoekCode
BEGIN
SET @ZoekCode = REPLACE(@ZoekCode, ' ', '')
SET @ZoekCode = REPLACE(@ZoekCode, '.', '')
SET @ZoekCode = REPLACE(@ZoekCode, ',', '')
SET @ZoekCode = REPLACE(@ZoekCode, '-', '')

UPDATE MyTable set zoekcode=UPPER(@ZoekCode) where idcode=@RecordID
END

FETCH NEXT FROM load_cursor INTO @RecordID, @Bedrijfsnaam

END

CLOSE load_cursor
DEALLOCATE load_cursor

RETURN (SELECT * FROM MyTable)

END



  Post Attachments 
SimMetrics.zip (1 view, 37.02 KB)
Post #1366909
« Prev Topic | Next Topic »

Add to briefcase

Permissions Expand / Collapse