October 2, 2012 at 1:19 am
I have been working on cleaning a database with addresses in an attempt to dedupe it. There's a lot of pollution and duplicates in the table. Due to efficiency we would like to trace those down.
To find the duplicates inside my table i use a function which uses Jaro-Winkler algorithm to find duplicates. The most efficient query i could compose for this solution is a cross join and cross apply the function to find duplicates. However, the downside is, that each record appears Left and Right in the resultset. The immediate consequence is, that this list cannot be cleanly used to dedupe.
I've tried running a T-SQL query with a cursor instead, to exclude those records which were used on the left side, but that unfortunately didn't work for me. I was changing data in the table while the cursor moved over it. I attempted the same with a temp table but the results and performance were so horrible, it wouldn't execute over the entire 200k records.
Normally one can place some sort of exclusion in the where. But since the where is entirely based on percentages from the outcome of a crossapply that doesn't work too well.
Is there a cleaner way to handle this query for deduping? Right now my 230.000 records turn into a 600.000 comparison result (given the state of the addresses, that is not unlikely, however the resultset isn't exactly 'clean').
SELECT alfa.RecordID as A_ID,beta.RecordID B_ID,
alfa.bedrijfsnaam as A_naam, beta.bedrijfsnaam as B_Naam,
alfa.zoekcode as A_zoekcode, beta.zoekcode as B_zoekcode, Naamscore.score,
alfa.bezoek_adres as A_Bezoek_Adres, beta.Bezoek_Adres as B_Bezoek_Adres, AdresScore.score,
alfa.Bezoek_Plaats as A_Bezoek_Plaats, beta.Bezoek_Plaats as B_Bezoek_Plaats, PlaatsScore.score,
alfa.Post_Adres as A_Post_Adres, beta.Post_Adres as B_Post_Adres, PostAdresScore.score,
alfa.Post_Plaats as A_Post_Plaats, beta.Post_Plaats as B_Post_Plaats,PostPlaatsScore.score
FROM zeno_bedrijven alfa
cross join zeno_bedrijven beta
cross apply (select score from CompareJaroWinkler(alfa.zoekcode,beta.zoekcode)) as NaamScore
cross apply (select score from comparejarowinkler(alfa.bezoek_adres,beta.bezoek_adres)) as AdresScore
cross apply (select score from comparejarowinkler(alfa.bezoek_plaats,beta.bezoek_plaats)) as PlaatsScore
cross apply (select score from comparejarowinkler(alfa.Post_Adres,beta.Post_adres)) as PostAdresScore
cross apply (select score from comparejarowinkler(alfa.Post_Plaats,beta.Post_Plaats)) as PostPlaatsScore
WHERE alfa.recordid <> beta.recordID
and alfa.zoekcode is not null
and beta.zoekcode is not null
and Naamscore.score > 0.8
and (
(
(AdresScore.score > 0.8 or AdresScore.score is null)
and (PlaatsScore.score > 0.8 or PlaatsScore.score is null)
)
or (
(PostAdresScore.score > 0.8 or PostAdresScore.score is null)
and (PostPlaatsScore.score > 0.8 or PostPlaatsScore.score is null)
)
)
Example of the results:
604497716142Bodinco BVBodinco BVBODINCOBODINCO
604497604498Bodinco BVBodinco BVBODINCOBODINCO
604498716142Bodinco BVBodinco BVBODINCOBODINCO
604498604497Bodinco BVBodinco BVBODINCOBODINCO
Obviously, i would prefer not having the 604498 and 716142 appearing on the left side in the query.
October 2, 2012 at 2:15 am
d.aarts (10/2/2012)
I have been working on cleaning a database with addresses in an attempt to dedupe it. There's a lot of pollution and duplicates in the table. Due to efficiency we would like to trace those down.To find the duplicates inside my table i use a function which uses Jaro-Winkler algorithm to find duplicates. The most efficient query i could compose for this solution is a cross join and cross apply the function to find duplicates. However, the downside is, that each record appears Left and Right in the resultset. The immediate consequence is, that this list cannot be cleanly used to dedupe.
<<snip>>
Exactly the same issue is addressed here.
Do you have the code for the JaroWinkler function? What sort of performance are you getting?
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
October 2, 2012 at 2:57 am
Hello ChrisM,
Thanks for your reply. I'll see if i can adept the code for cleaning up my ID's.
As for the Jaro-Winkler. What i did, was adopt the SimMetrics DLL into SQL Server. It isn't quite as hard as I thought it would be. ( I attached the compiled project, it however uses .net 4.0 ). The SimMetrics project has several text-comparison functions. (for installation, just adept the install_sql script and run it).
The performance i get from the DLL calculation is about 0.1 second per record. On cross joining 230k records to itself, the output took approximately 18 hours. What I did as well, which drastically improved the resulting comparison, is creating a list of so called 'SEARCHWORDS'. I took the company names and cleaned them up drastically. This cleansing reduced comparison time from 3 days to just over 18 hours.
Some interesting tidbits:
This Procedure splits every company name into different words and then counts the occurence of those words. In the table there's space for synonyms, so you can replace words with some else, or add in an empty string. This list is used as the basis for cleaning up the company names for comparison. I went with everything with over 30 occurences, but it depends a little on how your data is structured.
procedure [dbo].[Synoniem_Lijst] (@MyQUERY varchar(255))
AS
BEGIN
CREATE TABLE #Tempdb (
Word VARCHAR(255)
)
CREATE TABLE Synonims
(
Words NVARCHAR(255),
Wcount INT,
Synonim NVARCHAR(255)
)
CREATE TABLE #Countdb (
Wcount INT,
Word NVARCHAR(255)
)
-- Do not return rowcount to improve performance with multiple queries
SET NOCOUNT ON
-- declare cursor
DECLARE @colvalue NVARCHAR(255)
insert into #Tempdb
EXEC (@MyQuery)
-- load cursor
DECLARE load_cursor CURSOR FOR
SELECT word
FROM #Tempdb
-- open cursor and load resultset
OPEN load_cursor
FETCH NEXT FROM load_cursor INTO @colvalue
WHILE @@FETCH_STATUS = 0
BEGIN
INSERT INTO #Countdb
select * from fn_split2(' ',@colvalue)
FETCH NEXT FROM load_cursor INTO @colvalue
END
BEGIN
INSERT INTO Synonims (Words, Wcount)
SELECT Word, Count(Word) FROM #Countdb GROUP BY Word
END
CLOSE load_cursor
DEALLOCATE load_cursor
DROP TABLE #Tempdb
DROP TABLE #Countdb
RETURN (select * from synonims)
END
And the Function which creates the searchcode (cleaned up company name). Bedrijfsnaam=Companyname, Zoek_code=Created SearchString used for comparison by the algorithm. This is an adeptation of the synonym replace code i found on the net.
ALTER PROCEDURE [dbo].[Zoek_Code]
AS
BEGIN
SET NOCOUNT ON
DECLARE @word VARCHAR(50),
@position INT,
@ZoekCode VARCHAR(500),
@oldBedrijfsnaam VARCHAR(500),
@newWord VARCHAR(50),
@Bedrijfsnaam VARCHAR(500),
@RecordID NVARCHAR(255)
DECLARE load_cursor CURSOR FOR
SELECT idcode, name FROM MyTable
OPEN load_cursor
FETCH NEXT FROM load_cursor INTO @RecordID, @Bedrijfsnaam
WHILE @@FETCH_STATUS = 0
BEGIN
SET @oldBedrijfsnaam = @Bedrijfsnaam
SET @Bedrijfsnaam = LTRIM(RTRIM(@Bedrijfsnaam))
SET @ZoekCode = @Bedrijfsnaam
SET @position = CHARINDEX(' ', @Bedrijfsnaam, 1)
BEGIN
WHILE @position > 0
BEGIN
SET @word = LTRIM(RTRIM(LEFT(@Bedrijfsnaam, @position - 1)))
IF @word <> ''
BEGIN
SELECT @newWord = NULL
SELECT @newWord = synonym FROM Synonyms WHERE words = @word
IF @newWord IS NOT NULL
BEGIN
SET @ZoekCode = REPLACE(@ZoekCode, @word, @newWord)
END
END
SET @Bedrijfsnaam = RIGHT(@Bedrijfsnaam, LEN(@Bedrijfsnaam) - @position)
SET @position = CHARINDEX(' ', @Bedrijfsnaam, 1)
END
SET @word = @Bedrijfsnaam
SELECT @newWord = NULL
SELECT @newWord = synonym FROM Synonyms WHERE words = @word
IF @newWord IS NOT NULL
SET @ZoekCode = REPLACE(@ZoekCode, @Bedrijfsnaam, @newWord)
END
--IF @oldBedrijfsnaam <> @ZoekCode
BEGIN
SET @ZoekCode = REPLACE(@ZoekCode, ' ', '')
SET @ZoekCode = REPLACE(@ZoekCode, '.', '')
SET @ZoekCode = REPLACE(@ZoekCode, ',', '')
SET @ZoekCode = REPLACE(@ZoekCode, '-', '')
UPDATE MyTable set zoekcode=UPPER(@ZoekCode) where idcode=@RecordID
END
FETCH NEXT FROM load_cursor INTO @RecordID, @Bedrijfsnaam
END
CLOSE load_cursor
DEALLOCATE load_cursor
RETURN (SELECT * FROM MyTable)
END
Viewing 3 posts - 1 through 3 (of 3 total)
You must be logged in to reply to this topic. Login to reply
This website stores cookies on your computer.
These cookies are used to improve your website experience and provide more personalized services to you, both on this website and through other media.
To find out more about the cookies we use, see our Privacy Policy