|
|
|
Forum Newbie
      
Group: General Forum Members
Last Login: Wednesday, April 17, 2013 7:41 AM
Points: 2,
Visits: 28
|
|
I have been working on cleaning a database with addresses in an attempt to dedupe it. There's a lot of pollution and duplicates in the table. Due to efficiency we would like to trace those down.
To find the duplicates inside my table i use a function which uses Jaro-Winkler algorithm to find duplicates. The most efficient query i could compose for this solution is a cross join and cross apply the function to find duplicates. However, the downside is, that each record appears Left and Right in the resultset. The immediate consequence is, that this list cannot be cleanly used to dedupe.
I've tried running a T-SQL query with a cursor instead, to exclude those records which were used on the left side, but that unfortunately didn't work for me. I was changing data in the table while the cursor moved over it. I attempted the same with a temp table but the results and performance were so horrible, it wouldn't execute over the entire 200k records. Normally one can place some sort of exclusion in the where. But since the where is entirely based on percentages from the outcome of a crossapply that doesn't work too well.
Is there a cleaner way to handle this query for deduping? Right now my 230.000 records turn into a 600.000 comparison result (given the state of the addresses, that is not unlikely, however the resultset isn't exactly 'clean').
SELECT alfa.RecordID as A_ID,beta.RecordID B_ID, alfa.bedrijfsnaam as A_naam, beta.bedrijfsnaam as B_Naam, alfa.zoekcode as A_zoekcode, beta.zoekcode as B_zoekcode, Naamscore.score, alfa.bezoek_adres as A_Bezoek_Adres, beta.Bezoek_Adres as B_Bezoek_Adres, AdresScore.score, alfa.Bezoek_Plaats as A_Bezoek_Plaats, beta.Bezoek_Plaats as B_Bezoek_Plaats, PlaatsScore.score, alfa.Post_Adres as A_Post_Adres, beta.Post_Adres as B_Post_Adres, PostAdresScore.score, alfa.Post_Plaats as A_Post_Plaats, beta.Post_Plaats as B_Post_Plaats,PostPlaatsScore.score FROM zeno_bedrijven alfa cross join zeno_bedrijven beta cross apply (select score from CompareJaroWinkler(alfa.zoekcode,beta.zoekcode)) as NaamScore cross apply (select score from comparejarowinkler(alfa.bezoek_adres,beta.bezoek_adres)) as AdresScore cross apply (select score from comparejarowinkler(alfa.bezoek_plaats,beta.bezoek_plaats)) as PlaatsScore cross apply (select score from comparejarowinkler(alfa.Post_Adres,beta.Post_adres)) as PostAdresScore cross apply (select score from comparejarowinkler(alfa.Post_Plaats,beta.Post_Plaats)) as PostPlaatsScore
WHERE alfa.recordid <> beta.recordID and alfa.zoekcode is not null and beta.zoekcode is not null and Naamscore.score > 0.8 and ( ( (AdresScore.score > 0.8 or AdresScore.score is null) and (PlaatsScore.score > 0.8 or PlaatsScore.score is null) ) or ( (PostAdresScore.score > 0.8 or PostAdresScore.score is null) and (PostPlaatsScore.score > 0.8 or PostPlaatsScore.score is null) ) )
Example of the results: 604497 716142 Bodinco BV Bodinco BV BODINCO BODINCO 604497 604498 Bodinco BV Bodinco BV BODINCO BODINCO 604498 716142 Bodinco BV Bodinco BV BODINCO BODINCO 604498 604497 Bodinco BV Bodinco BV BODINCO BODINCO
Obviously, i would prefer not having the 604498 and 716142 appearing on the left side in the query.
|
|
|
|
|
SSCertifiable
       
Group: General Forum Members
Last Login: Today @ 9:27 AM
Points: 5,618,
Visits: 10,990
|
|
d.aarts (10/2/2012) I have been working on cleaning a database with addresses in an attempt to dedupe it. There's a lot of pollution and duplicates in the table. Due to efficiency we would like to trace those down.
To find the duplicates inside my table i use a function which uses Jaro-Winkler algorithm to find duplicates. The most efficient query i could compose for this solution is a cross join and cross apply the function to find duplicates. However, the downside is, that each record appears Left and Right in the resultset. The immediate consequence is, that this list cannot be cleanly used to dedupe.
<<snip>>
Exactly the same issue is addressed here.
Do you have the code for the JaroWinkler function? What sort of performance are you getting?
“Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw
For fast, accurate and documented assistance in answering your questions, please read this article. Understanding and using APPLY, (I) and (II) Paul White Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden Exploring Recursive CTEs by Example Dwain Camps
|
|
|
|
|
Forum Newbie
      
Group: General Forum Members
Last Login: Wednesday, April 17, 2013 7:41 AM
Points: 2,
Visits: 28
|
|
Hello ChrisM,
Thanks for your reply. I'll see if i can adept the code for cleaning up my ID's.
As for the Jaro-Winkler. What i did, was adopt the SimMetrics DLL into SQL Server. It isn't quite as hard as I thought it would be. ( I attached the compiled project, it however uses .net 4.0 ). The SimMetrics project has several text-comparison functions. (for installation, just adept the install_sql script and run it).
The performance i get from the DLL calculation is about 0.1 second per record. On cross joining 230k records to itself, the output took approximately 18 hours. What I did as well, which drastically improved the resulting comparison, is creating a list of so called 'SEARCHWORDS'. I took the company names and cleaned them up drastically. This cleansing reduced comparison time from 3 days to just over 18 hours.
Some interesting tidbits: This Procedure splits every company name into different words and then counts the occurence of those words. In the table there's space for synonyms, so you can replace words with some else, or add in an empty string. This list is used as the basis for cleaning up the company names for comparison. I went with everything with over 30 occurences, but it depends a little on how your data is structured.
procedure [dbo].[Synoniem_Lijst] (@MyQUERY varchar(255))
AS BEGIN
CREATE TABLE #Tempdb ( Word VARCHAR(255) )
CREATE TABLE Synonims ( Words NVARCHAR(255), Wcount INT, Synonim NVARCHAR(255) )
CREATE TABLE #Countdb ( Wcount INT, Word NVARCHAR(255) )
-- Do not return rowcount to improve performance with multiple queries SET NOCOUNT ON
-- declare cursor DECLARE @colvalue NVARCHAR(255)
insert into #Tempdb EXEC (@MyQuery)
-- load cursor DECLARE load_cursor CURSOR FOR SELECT word FROM #Tempdb
-- open cursor and load resultset OPEN load_cursor FETCH NEXT FROM load_cursor INTO @colvalue
WHILE @@FETCH_STATUS = 0 BEGIN
INSERT INTO #Countdb select * from fn_split2(' ',@colvalue)
FETCH NEXT FROM load_cursor INTO @colvalue
END
BEGIN INSERT INTO Synonims (Words, Wcount) SELECT Word, Count(Word) FROM #Countdb GROUP BY Word END
CLOSE load_cursor DEALLOCATE load_cursor
DROP TABLE #Tempdb DROP TABLE #Countdb
RETURN (select * from synonims)
END
And the Function which creates the searchcode (cleaned up company name). Bedrijfsnaam=Companyname, Zoek_code=Created SearchString used for comparison by the algorithm. This is an adeptation of the synonym replace code i found on the net.
ALTER PROCEDURE [dbo].[Zoek_Code]
AS BEGIN
SET NOCOUNT ON
DECLARE @word VARCHAR(50), @position INT, @ZoekCode VARCHAR(500), @oldBedrijfsnaam VARCHAR(500), @newWord VARCHAR(50), @Bedrijfsnaam VARCHAR(500), @RecordID NVARCHAR(255)
DECLARE load_cursor CURSOR FOR SELECT idcode, name FROM MyTable
OPEN load_cursor FETCH NEXT FROM load_cursor INTO @RecordID, @Bedrijfsnaam
WHILE @@FETCH_STATUS = 0 BEGIN SET @oldBedrijfsnaam = @Bedrijfsnaam SET @Bedrijfsnaam = LTRIM(RTRIM(@Bedrijfsnaam)) SET @ZoekCode = @Bedrijfsnaam
SET @position = CHARINDEX(' ', @Bedrijfsnaam, 1)
BEGIN WHILE @position > 0 BEGIN SET @word = LTRIM(RTRIM(LEFT(@Bedrijfsnaam, @position - 1))) IF @word <> '' BEGIN SELECT @newWord = NULL SELECT @newWord = synonym FROM Synonyms WHERE words = @word IF @newWord IS NOT NULL BEGIN SET @ZoekCode = REPLACE(@ZoekCode, @word, @newWord) END END SET @Bedrijfsnaam = RIGHT(@Bedrijfsnaam, LEN(@Bedrijfsnaam) - @position) SET @position = CHARINDEX(' ', @Bedrijfsnaam, 1) END SET @word = @Bedrijfsnaam SELECT @newWord = NULL SELECT @newWord = synonym FROM Synonyms WHERE words = @word IF @newWord IS NOT NULL SET @ZoekCode = REPLACE(@ZoekCode, @Bedrijfsnaam, @newWord) END --IF @oldBedrijfsnaam <> @ZoekCode BEGIN SET @ZoekCode = REPLACE(@ZoekCode, ' ', '') SET @ZoekCode = REPLACE(@ZoekCode, '.', '') SET @ZoekCode = REPLACE(@ZoekCode, ',', '') SET @ZoekCode = REPLACE(@ZoekCode, '-', '')
UPDATE MyTable set zoekcode=UPPER(@ZoekCode) where idcode=@RecordID END
FETCH NEXT FROM load_cursor INTO @RecordID, @Bedrijfsnaam
END
CLOSE load_cursor DEALLOCATE load_cursor
RETURN (SELECT * FROM MyTable)
END
|
|
|
|