Click here to monitor SSC
SQLServerCentral is supported by Red Gate Software Ltd.
 
Log in  ::  Register  ::  Not logged in
 
 
 
        
Home       Members    Calendar    Who's On


Add to briefcase ««123»»

Record linkage advice required– Long read (YOU HAVE BEEN WARNED!) Expand / Collapse
Author
Message
Posted Tuesday, July 31, 2012 3:35 AM


Say Hey Kid

Say Hey KidSay Hey KidSay Hey KidSay Hey KidSay Hey KidSay Hey KidSay Hey KidSay Hey Kid

Group: General Forum Members
Last Login: Tuesday, August 26, 2014 3:26 AM
Points: 708, Visits: 3,287
Thanks Chris. You're my new best friend :)

I'm going to spend the rest of today on this and will report back in the afternoon.

Once again, I really appreciate your efforts!


---------------------------------------------------------


It takes a minimal capacity for rational thought to see that the corporate 'free press' is a structurally irrational and biased, and extremely violent, system of elite propaganda.
David Edwards - Media lens

Society has varying and conflicting interests; what is called objectivity is the disguise of one of these interests - that of neutrality. But neutrality is a fiction in an unneutral world. There are victims, there are executioners, and there are bystanders... and the 'objectivity' of the bystander calls for inaction while other heads fall.
Howard Zinn
Post #1337753
Posted Tuesday, July 31, 2012 4:57 AM


Say Hey Kid

Say Hey KidSay Hey KidSay Hey KidSay Hey KidSay Hey KidSay Hey KidSay Hey KidSay Hey Kid

Group: General Forum Members
Last Login: Tuesday, August 26, 2014 3:26 AM
Points: 708, Visits: 3,287
Hi Chris, I have some questions below. Any advice would be much appreciated. In your original post you said:


Try it out. There's always an element of tuning with this type of dedupe. The key things here are:

1. Get some kind of solid matching working as efficiently as possible. In this case, it's postcode and master_id.


The master_id is the unique identifier for each record so I'm not sure what you mean to be honest.

The merge script I'm trying to rebuild contains the following for Pass 1:

• left(a.Surname,5)=left(b.Surname,5)
• left(a.forename,1) = left(B.forename,1)
• a.postcode=b.postcode
• left(A.companynameny,8) = left(B.companyname,8)
• left(A.address1,8) = left(B.address1,8)

Your function and example works great. Maybe I'm just stupid but I just don't see how I can use this to replace the passes system used by the current merge script we have.

I'm not looking for someone to do the dirty work for me so please don't take this the wrong way! I just need some guidance so I know I'm on the right track.

Should I create a new table with match ratios for each field then do the filtering on the combination of ratios?

I'm really confused!

2. You're left with serious numbercrunching over a scan of the reference table and seeks to the target. Do as much work as possible as efficiently as you can. The function is set up as an iTVF so it can parallelise, and you can get it to compare multiple values in the same query.


But from the example select I don't see how I can compare more than one field at a time or am I missing something?


---------------------------------------------------------


It takes a minimal capacity for rational thought to see that the corporate 'free press' is a structurally irrational and biased, and extremely violent, system of elite propaganda.
David Edwards - Media lens

Society has varying and conflicting interests; what is called objectivity is the disguise of one of these interests - that of neutrality. But neutrality is a fiction in an unneutral world. There are victims, there are executioners, and there are bystanders... and the 'objectivity' of the bystander calls for inaction while other heads fall.
Howard Zinn
Post #1337783
Posted Tuesday, July 31, 2012 5:28 AM


SSCertifiable

SSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiable

Group: General Forum Members
Last Login: Yesterday @ 9:46 AM
Points: 7,224, Visits: 13,696
Abu Dina (7/31/2012)
Hi Chris, I have some questions below. Any advice would be much appreciated. In your original post you said:


Try it out. There's always an element of tuning with this type of dedupe. The key things here are:

1. Get some kind of solid matching working as efficiently as possible. In this case, it's postcode and master_id.


The master_id is the unique identifier for each record so I'm not sure what you mean to be honest.

You want to join the table to itself by the most efficient means possible, where the masterid in the two tables is different and the postcode is the same. Indexing on postcode and masterid does this for you. However, you're looking at quite a few columns in each row. A clustered index or an index covering all of the rows being examined will do the trick.



The merge script I'm trying to rebuild contains the following for Pass 1:

• left(a.Surname,5)=left(b.Surname,5)
• left(a.forename,1) = left(B.forename,1)
• a.postcode=b.postcode
• left(A.companynameny,8) = left(B.companyname,8)
• left(A.address1,8) = left(B.address1,8)

Your function and example works great. Maybe I'm just stupid but I just don't see how I can use this to replace the passes system used by the current merge script we have.


The passes system isn't very efficient - it's a pile of scripts which someone used to identify dupes run sequentially. Why not find out exactly how closely matched each column is (surname, forename, title etc) then set a cutoff point for each one?


I'm not looking for someone to do the dirty work for me so please don't take this the wrong way! I just need some guidance so I know I'm on the right track.

Should I create a new table with match ratios for each field then do the filtering on the combination of ratios?
No, it can all be done in a single step


I'm really confused!

2. You're left with serious numbercrunching over a scan of the reference table and seeks to the target. Do as much work as possible as efficiently as you can. The function is set up as an iTVF so it can parallelise, and you can get it to compare multiple values in the same query.


But from the example select I don't see how I can compare more than one field at a time or am I missing something?


Here's how to compare several columns at once:
SELECT r.*, '#' AS '#', t.master_id, 
t.Surname, Fuzzysurname.Result, Fuzzysurname.MatchRatio,
t.Address1, FuzzyAddress1.Result, FuzzyAddress1.MatchRatio,
t.Title, FuzzyTitle.Result, FuzzyTitle.MatchRatio,
t.CompanyName, FuzzyCompanyName.Result, FuzzyCompanyName.MatchRatio
FROM dbo.PreMerge r
INNER JOIN dbo.PreMerge t
ON t.postcode = r.postcode
AND t.master_id > r.master_id
CROSS APPLY [dbo].[FuzzyMatch_iTVF2k5] (r.Surname, t.Surname) Fuzzysurname
CROSS APPLY [dbo].[FuzzyMatch_iTVF2k5] (r.Address1, t.Address1) FuzzyAddress1
CROSS APPLY [dbo].[FuzzyMatch_iTVF2k5] (r.Title, t.Title) FuzzyTitle
CROSS APPLY [dbo].[FuzzyMatch_iTVF2k5] (r.CompanyName, t.CompanyName) FuzzyCompanyName
WHERE Fuzzysurname.MatchRatio > 55
AND FuzzyAddress1.MatchRatio > 50
ORDER BY r.master_id



“Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
Exploring Recursive CTEs by Example Dwain Camps
Post #1337800
Posted Tuesday, July 31, 2012 8:20 AM


Say Hey Kid

Say Hey KidSay Hey KidSay Hey KidSay Hey KidSay Hey KidSay Hey KidSay Hey KidSay Hey Kid

Group: General Forum Members
Last Login: Tuesday, August 26, 2014 3:26 AM
Points: 708, Visits: 3,287
Thanks Chris.

Just to confirm. Is the FuzzyMatch_iTVF2k5 is a full implementation of http://research.microsoft.com/pubs/75996/bm_sigmod03.pdf ?




---------------------------------------------------------


It takes a minimal capacity for rational thought to see that the corporate 'free press' is a structurally irrational and biased, and extremely violent, system of elite propaganda.
David Edwards - Media lens

Society has varying and conflicting interests; what is called objectivity is the disguise of one of these interests - that of neutrality. But neutrality is a fiction in an unneutral world. There are victims, there are executioners, and there are bystanders... and the 'objectivity' of the bystander calls for inaction while other heads fall.
Howard Zinn
Post #1337925
Posted Tuesday, July 31, 2012 8:25 AM


SSCertifiable

SSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiable

Group: General Forum Members
Last Login: Yesterday @ 9:46 AM
Points: 7,224, Visits: 13,696
Abu Dina (7/31/2012)
Thanks Chris.

Just to confirm. Is the FuzzyMatch_iTVF2k5 is a full implementation of http://research.microsoft.com/pubs/75996/bm_sigmod03.pdf ?



No - it's the closest online reference I could find at the time. The iTVF uses the token-matching method which is fairly well understood and quite widely used but folks appear reluctant to publish anything about it - or at least they did around 7 years ago when I first wrote this.


“Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
Exploring Recursive CTEs by Example Dwain Camps
Post #1337938
Posted Tuesday, July 31, 2012 9:50 AM


Say Hey Kid

Say Hey KidSay Hey KidSay Hey KidSay Hey KidSay Hey KidSay Hey KidSay Hey KidSay Hey Kid

Group: General Forum Members
Last Login: Tuesday, August 26, 2014 3:26 AM
Points: 708, Visits: 3,287
I see.

Another question if you don't mind (you're going to hate me once this little project of mine is done! but if you're anywhere near the northwest then I'd be happy to repay your kindness with drinks of your choice lol).

How can Willis & Lilley have a match ratio = 60 ?! I've been tweaking the surname setting and I've settled on 54 for now but this is a strange result or is this one of many inevitable false positives I will always encounter with any record set?

Thanks.


---------------------------------------------------------


It takes a minimal capacity for rational thought to see that the corporate 'free press' is a structurally irrational and biased, and extremely violent, system of elite propaganda.
David Edwards - Media lens

Society has varying and conflicting interests; what is called objectivity is the disguise of one of these interests - that of neutrality. But neutrality is a fiction in an unneutral world. There are victims, there are executioners, and there are bystanders... and the 'objectivity' of the bystander calls for inaction while other heads fall.
Howard Zinn
Post #1338007
Posted Tuesday, July 31, 2012 10:05 AM


SSCertifiable

SSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiable

Group: General Forum Members
Last Login: Yesterday @ 9:46 AM
Points: 7,224, Visits: 13,696
Abu Dina (7/31/2012)
I see.

Another question if you don't mind (you're going to hate me once this little project of mine is done! but if you're anywhere near the northwest then I'd be happy to repay your kindness with drinks of your choice lol).
Northwest UK? Sure!
Keep coming with the questions, there are plenty of good folks here to lend a hand.


How can Willis & Lilley have a match ratio = 60 ?! I've been tweaking the surname setting and I've settled on 54 for now but this is a strange result or is this one of many inevitable false positives I will always encounter with any record set?

Thanks.

That's because I made a mistake playing with the function yesterday
Here are the 2-char overlapping tokens from 'Willis' which are matched in 'Lilley':
il, ll, li

Here's the fixed function:
-- =============================================
-- Author: Chris Morris
-- Create date: 30/07/2012
-- =============================================
ALTER FUNCTION [dbo].[FuzzyMatch_iTVF2k5]
(@Reference VARCHAR(100) = NULL,
@Target VARCHAR(100) = NULL)
RETURNS table WITH SCHEMABINDING
AS

-- See also http://research.microsoft.com/pubs/75996/bm_sigmod03.pdf

RETURN
SELECT d.Result, MatchRatio = CAST(CASE
WHEN d.Result = 1 THEN 100
WHEN d.Result = 3 THEN DATALENGTH(@Target)*100.00/DATALENGTH(@Reference)
WHEN d.Result = 4 THEN DATALENGTH(@Reference)*100.00/DATALENGTH(@Target)
WHEN d.Result = 5 THEN
(SELECT MatchPC = CAST((100.00*COUNT(*)/(DATALENGTH(@Reference)-2.00)) AS DECIMAL(5,2))
* CASE WHEN DATALENGTH(@Target) > DATALENGTH(@Reference) THEN DATALENGTH(@Reference)*1.00 / DATALENGTH(@Target)*1.00 ELSE 1 END
FROM (
SELECT n = t1.n+t2.n
FROM (VALUES (0),(10),(20),(30),(40),(50),(60),(70),(80),(90)) t1(n)
CROSS JOIN (VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10)) t2(n)
WHERE t1.n+t2.n <= DATALENGTH(@Reference)
) Tally
WHERE CHARINDEX(SUBSTRING(@Reference, Tally.n, 3), @Target) > 0
AND DATALENGTH(SUBSTRING(@Reference, Tally.n, 3)) = 3)
WHEN d.Result = 6 THEN
(SELECT MatchPC = CAST((100.00*COUNT(*)/(DATALENGTH(@Reference)-0.00)) AS DECIMAL(5,2))
* CASE WHEN DATALENGTH(@Target) > DATALENGTH(@Reference) THEN DATALENGTH(@Reference)*1.00 / DATALENGTH(@Target)*1.00 ELSE 1 END
FROM (
SELECT n = t1.n
FROM (VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10)) t1(n)
WHERE t1.n <= DATALENGTH(@Reference)
) Tally
WHERE CHARINDEX(SUBSTRING(@Reference, Tally.n, 2), @Target) > 0
AND DATALENGTH(SUBSTRING(@Reference, Tally.n, 2)) = 2)
ELSE NULL
END AS DECIMAL(5,2))

FROM (
SELECT Result = CASE
WHEN @Reference = @Target THEN 1
WHEN @Reference IS NULL OR @Target IS NULL THEN 2
WHEN @Reference LIKE '%'+@Target+'%' THEN 3
WHEN @Target LIKE '%'+@Reference+'%' THEN 4
WHEN DATALENGTH(@Reference) > 7 AND DATALENGTH(@Target) > 7 THEN 5
WHEN DATALENGTH(@Reference) > 2 AND DATALENGTH(@Target) > 2 THEN 6 -- SOUNDEX may work better
ELSE 7
END
) d



“Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
Exploring Recursive CTEs by Example Dwain Camps
Post #1338021
Posted Tuesday, July 31, 2012 10:17 AM


Say Hey Kid

Say Hey KidSay Hey KidSay Hey KidSay Hey KidSay Hey KidSay Hey KidSay Hey KidSay Hey Kid

Group: General Forum Members
Last Login: Tuesday, August 26, 2014 3:26 AM
Points: 708, Visits: 3,287
Northwest UK yes :) I will PM you.. seriously you deserve drinks for all this help!!!!!

I've made the change and the function is working better now. :)

Going to play around with this more and will report back again tomorrow.

Once again, thanks so much for your efforts I really appreciate it.

Have a good evening.

p.s.
Sorry I must have PMed you 3 times already by mistake. The sent folder doesn't get updated with sent messages then


---------------------------------------------------------


It takes a minimal capacity for rational thought to see that the corporate 'free press' is a structurally irrational and biased, and extremely violent, system of elite propaganda.
David Edwards - Media lens

Society has varying and conflicting interests; what is called objectivity is the disguise of one of these interests - that of neutrality. But neutrality is a fiction in an unneutral world. There are victims, there are executioners, and there are bystanders... and the 'objectivity' of the bystander calls for inaction while other heads fall.
Howard Zinn
Post #1338026
Posted Wednesday, August 1, 2012 1:42 AM


SSCertifiable

SSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiable

Group: General Forum Members
Last Login: Yesterday @ 9:46 AM
Points: 7,224, Visits: 13,696
Abu Dina (7/31/2012)
Northwest UK yes :) I will PM you.. seriously you deserve drinks for all this help!!!!!

I've made the change and the function is working better now. :)

Going to play around with this more and will report back again tomorrow.

Once again, thanks so much for your efforts I really appreciate it.

Have a good evening.

p.s.
Sorry I must have PMed you 3 times already by mistake. The sent folder doesn't get updated with sent messages then


We're all volunteers here, but thanks!
When you're ready you will probably need an explanation of how this function works - a good place to start would be one of Jeff Moden's tally table articles.


“Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
Exploring Recursive CTEs by Example Dwain Camps
Post #1338344
Posted Thursday, August 2, 2012 9:02 AM


Say Hey Kid

Say Hey KidSay Hey KidSay Hey KidSay Hey KidSay Hey KidSay Hey KidSay Hey KidSay Hey Kid

Group: General Forum Members
Last Login: Tuesday, August 26, 2014 3:26 AM
Points: 708, Visits: 3,287
Hi Chris,

I've been trying to work out a stragey for my merging but I'm struggling a little.

The merge stored procedure will accept two parameters, the source table and the merge criteria

1) Merge by Contact
2) Merge by Business
3) Merge by Address

Maybe I've started with the wrong one but I'm trying to work out the Merge by Address but I just can't get it right.

As mentioned previoulsy, the source table will always be in the following format:

create table dbo.PreMerge (
master_id int,
Title varchar(20),
Forename varchar(50),
Surname varchar(50),
CompanyName varchar(200),
CompanyName2 varchar(200),
Address1 varchar(200),
Address2 varchar(200),
Address3 varchar(200),
Address4 varchar(200),
Town varchar(100),
County varchar(100),
Postcode varchar(20),
Hierarchy int
)

I've applied the iTVF to get a match ratio for each column in the following sample data.

insert into PreMerge (master_id, Title, Forename, Surname, CompanyName, Address1, Address2, Town, Postcode, Hierarchy) 
select 1547300, 'Mr', 'E', 'Glowver', NULL, '21 Harcourt Crescent', 'Nuthall', 'Nottingham', 'NG14 6BT', '20100309' union all

select 1581378, 'Mrs', 'C E', 'Glowver', NULL, '21 Harcourt Crescent', NULL, 'Nottingham', 'NG14 6BT', '20100309' union all

select 43196681, NULL, NULL, 'Glowveer', NULL, '21 Harcourt', 'Crescent', NULL, 'NG14 6BT', '20120501' union all

select 43196681, 'Mr', 'C J E', 'Glowser', NULL, 'Harcourt', '21 Harcourt Crescent', NULL, 'NG14 6BT', '20120501' union all
select 64273, 'Mrs', 'A', 'Tomlinson', NULL, '13 Harwood New Road', NULL, 'BLACKBURN', 'BB4 6QS', 2 union all
select 74575, 'Mr', 'M', 'Tomlinson', NULL, '13 Harwood New Road Great Harw', 'Great Harwood', 'BLACKBURN', 'BB4 6QS', 2
go

if object_id('PostMergeTest') > 0
drop table dbo.PostMergeTest
go

SELECT
r.master_id as retained_master_id,
t.master_id as deleted_master_id,

'#' AS '#',

isnull(FuzzyTitle.MatchRatio, 0) as Title_MatchRatio,
isnull(FuzzyForename.MatchRatio, 0) as Forename_MatchRatio,
isnull(FuzzySurname.MatchRatio, 0) as Surname_MatchRatio,
isnull(FuzzyAddress1.MatchRatio, 0) as Address1_MatchRatio,
isnull(FuzzyTown.MatchRatio, 0) as Town_MatchRatio,

'#' AS '##',

r.title as retained_title,
r.forename as retained_forename,
r.surname as retained_surname,
r.Address1 as retained_Address1,
r.Address2 as retained_Address2,
r.Town as retained_Town,

'#' AS '###',

t.title as deleted_title,
t.forename as deleted_forename,
t.Surname as deleted_surname,
t.Address1 as deleted_Address1,
t.Address2 as deleted_Address2,
t.Town as deleted_Town

into dbo.PostMergeTest

FROM dbo.PreMerge r
INNER JOIN dbo.PreMerge t
ON t.postcode = r.postcode
AND ((r.hierarchy<t.hierarchy) or (r.hierarchy=t.hierarchy and r.master_id>t.master_id))
-------------------------------------------------------------------------------------------------------------------------------------
CROSS APPLY [dbo].[FuzzyMatch_iTVF2k5] (r.Title, t.Title) FuzzyTitle -- Title
CROSS APPLY [dbo].[FuzzyMatch_iTVF2k5] (r.Forename, t.Forename) FuzzyForename -- Forename
CROSS APPLY [dbo].[FuzzyMatch_iTVF2k5] (r.Surname, t.Surname) FuzzySurname -- Surname
CROSS APPLY [dbo].[FuzzyMatch_iTVF2k5] (r.address1, t.address2) FuzzyAddress12 -- Address1&Address2
CROSS APPLY [dbo].[FuzzyMatch_iTVF2k5] (r.address1, t.Address1) FuzzyAddress1 -- Address1
CROSS APPLY [dbo].[FuzzyMatch_iTVF2k5] (r.Town, t.Town) FuzzyTown -- Town
--------------------------------------------------------------------------------------------------------------------------------------
order by Fuzzysurname.MatchRatio
go


I then created a new table to hold the duplicate IDs:

-- Match Level 1 = Surname, Address1 100% match

insert into dbo.FinalSiteDuplicates (retained_master_id, deleted_master_id, MatchLevel)
select
retained_master_id,
deleted_master_id,
1
from PostMergeTest
where (Surname_MatchRatio > 99 and Address1_MatchRatio > 99)
.
.
.
.
.

This find one duplicate but I've been trying different ration numbers and I jst can't get it to work.

am I doing this whole thing wrong? Is there a better way to do this?

Thanks in advance.


---------------------------------------------------------


It takes a minimal capacity for rational thought to see that the corporate 'free press' is a structurally irrational and biased, and extremely violent, system of elite propaganda.
David Edwards - Media lens

Society has varying and conflicting interests; what is called objectivity is the disguise of one of these interests - that of neutrality. But neutrality is a fiction in an unneutral world. There are victims, there are executioners, and there are bystanders... and the 'objectivity' of the bystander calls for inaction while other heads fall.
Howard Zinn
Post #1339257
« Prev Topic | Next Topic »

Add to briefcase ««123»»

Permissions Expand / Collapse