RE: Record linkage advice required– Long read (YOU HAVE BEEN WARNED!)

SSC Guru

Points: 186127

July 31, 2012 at 5:28 am

#1519626

Abu Dina (7/31/2012)
Hi Chris, I have some questions below. Any advice would be much appreciated. In your original post you said:
Try it out. There's always an element of tuning with this type of dedupe. The key things here are:
1. Get some kind of solid matching working as efficiently as possible. In this case, it's postcode and master_id.
The master_id is the unique identifier for each record so I'm not sure what you mean to be honest.

You want to join the table to itself by the most efficient means possible, where the masterid in the two tables is different and the postcode is the same. Indexing on postcode and masterid does this for you. However, you're looking at quite a few columns in each row. A clustered index or an index covering all of the rows being examined will do the trick.

The merge script I'm trying to rebuild contains the following for Pass 1:
•left(a.Surname,5)=left(b.Surname,5)
•left(a.forename,1) = left(B.forename,1)
•a.postcode=b.postcode
•left(A.companynameny,8) = left(B.companyname,8)
•left(A.address1,8) = left(B.address1,8)
Your function and example works great. Maybe I'm just stupid but I just don't see how I can use this to replace the passes system used by the current merge script we have.

The passes system isn't very efficient - it's a pile of scripts which someone used to identify dupes run sequentially. Why not find out exactly how closely matched each column is (surname, forename, title etc) then set a cutoff point for each one?

I'm not looking for someone to do the dirty work for me so please don't take this the wrong way! I just need some guidance so I know I'm on the right track.
Should I create a new table with match ratios for each field then do the filtering on the combination of ratios?

No, it can all be done in a single step

I'm really confused!
2. You're left with serious numbercrunching over a scan of the reference table and seeks to the target. Do as much work as possible as efficiently as you can. The function is set up as an iTVF so it can parallelise, and you can get it to compare multiple values in the same query.
But from the example select I don't see how I can compare more than one field at a time or am I missing something?

Here's how to compare several columns at once:

SELECT r.*, '#' AS '#', t.master_id,

t.Surname, Fuzzysurname.Result, Fuzzysurname.MatchRatio,

t.Address1, FuzzyAddress1.Result, FuzzyAddress1.MatchRatio,

t.Title, FuzzyTitle.Result, FuzzyTitle.MatchRatio,

t.CompanyName, FuzzyCompanyName.Result, FuzzyCompanyName.MatchRatio

FROM dbo.PreMerge r

INNER JOIN dbo.PreMerge t

ON t.postcode = r.postcode

AND t.master_id > r.master_id

CROSS APPLY [dbo].[FuzzyMatch_iTVF2k5] (r.Surname, t.Surname) Fuzzysurname

CROSS APPLY [dbo].[FuzzyMatch_iTVF2k5] (r.Address1, t.Address1) FuzzyAddress1

CROSS APPLY [dbo].[FuzzyMatch_iTVF2k5] (r.Title, t.Title) FuzzyTitle

CROSS APPLY [dbo].[FuzzyMatch_iTVF2k5] (r.CompanyName, t.CompanyName) FuzzyCompanyName

WHERE Fuzzysurname.MatchRatio > 55

AND FuzzyAddress1.MatchRatio > 50

ORDER BY r.master_id

^{“Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw}

For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden