SQL Clone
SQLServerCentral is supported by Redgate
 
Log in  ::  Register  ::  Not logged in
 
 
 


Record linkage advice required– Long read (YOU HAVE BEEN WARNED!)


Record linkage advice required– Long read (YOU HAVE BEEN WARNED!)

Author
Message
Abu Dina
Abu Dina
Ten Centuries
Ten Centuries (1.2K reputation)Ten Centuries (1.2K reputation)Ten Centuries (1.2K reputation)Ten Centuries (1.2K reputation)Ten Centuries (1.2K reputation)Ten Centuries (1.2K reputation)Ten Centuries (1.2K reputation)Ten Centuries (1.2K reputation)

Group: General Forum Members
Points: 1225 Visits: 3323
Thanks Chris. You're my new best friend Smile

I'm going to spend the rest of today on this and will report back in the afternoon.

Once again, I really appreciate your efforts!

---------------------------------------------------------


It takes a minimal capacity for rational thought to see that the corporate 'free press' is a structurally irrational and biased, and extremely violent, system of elite propaganda.
David Edwards - Media lens

Society has varying and conflicting interests; what is called objectivity is the disguise of one of these interests - that of neutrality. But neutrality is a fiction in an unneutral world. There are victims, there are executioners, and there are bystanders... and the 'objectivity' of the bystander calls for inaction while other heads fall.
Howard Zinn
Abu Dina
Abu Dina
Ten Centuries
Ten Centuries (1.2K reputation)Ten Centuries (1.2K reputation)Ten Centuries (1.2K reputation)Ten Centuries (1.2K reputation)Ten Centuries (1.2K reputation)Ten Centuries (1.2K reputation)Ten Centuries (1.2K reputation)Ten Centuries (1.2K reputation)

Group: General Forum Members
Points: 1225 Visits: 3323
Hi Chris, I have some questions below. Any advice would be much appreciated. In your original post you said:


Try it out. There's always an element of tuning with this type of dedupe. The key things here are:

1. Get some kind of solid matching working as efficiently as possible. In this case, it's postcode and master_id.


The master_id is the unique identifier for each record so I'm not sure what you mean to be honest.

The merge script I'm trying to rebuild contains the following for Pass 1:

• left(a.Surname,5)=left(b.Surname,5)
• left(a.forename,1) = left(B.forename,1)
• a.postcode=b.postcode
• left(A.companynameny,8) = left(B.companyname,8)
• left(A.address1,8) = left(B.address1,8)

Your function and example works great. Maybe I'm just stupid but I just don't see how I can use this to replace the passes system used by the current merge script we have.

I'm not looking for someone to do the dirty work for me so please don't take this the wrong way! I just need some guidance so I know I'm on the right track.

Should I create a new table with match ratios for each field then do the filtering on the combination of ratios?

I'm really confused!

2. You're left with serious numbercrunching over a scan of the reference table and seeks to the target. Do as much work as possible as efficiently as you can. The function is set up as an iTVF so it can parallelise, and you can get it to compare multiple values in the same query.


But from the example select I don't see how I can compare more than one field at a time or am I missing something?

---------------------------------------------------------


It takes a minimal capacity for rational thought to see that the corporate 'free press' is a structurally irrational and biased, and extremely violent, system of elite propaganda.
David Edwards - Media lens

Society has varying and conflicting interests; what is called objectivity is the disguise of one of these interests - that of neutrality. But neutrality is a fiction in an unneutral world. There are victims, there are executioners, and there are bystanders... and the 'objectivity' of the bystander calls for inaction while other heads fall.
Howard Zinn
ChrisM@Work
ChrisM@Work
SSCoach
SSCoach (16K reputation)SSCoach (16K reputation)SSCoach (16K reputation)SSCoach (16K reputation)SSCoach (16K reputation)SSCoach (16K reputation)SSCoach (16K reputation)SSCoach (16K reputation)

Group: General Forum Members
Points: 16105 Visits: 19534
Abu Dina (7/31/2012)
Hi Chris, I have some questions below. Any advice would be much appreciated. In your original post you said:


Try it out. There's always an element of tuning with this type of dedupe. The key things here are:

1. Get some kind of solid matching working as efficiently as possible. In this case, it's postcode and master_id.


The master_id is the unique identifier for each record so I'm not sure what you mean to be honest.

You want to join the table to itself by the most efficient means possible, where the masterid in the two tables is different and the postcode is the same. Indexing on postcode and masterid does this for you. However, you're looking at quite a few columns in each row. A clustered index or an index covering all of the rows being examined will do the trick.



The merge script I'm trying to rebuild contains the following for Pass 1:

• left(a.Surname,5)=left(b.Surname,5)
• left(a.forename,1) = left(B.forename,1)
• a.postcode=b.postcode
• left(A.companynameny,8) = left(B.companyname,8)
• left(A.address1,8) = left(B.address1,8)

Your function and example works great. Maybe I'm just stupid but I just don't see how I can use this to replace the passes system used by the current merge script we have.


The passes system isn't very efficient - it's a pile of scripts which someone used to identify dupes run sequentially. Why not find out exactly how closely matched each column is (surname, forename, title etc) then set a cutoff point for each one?


I'm not looking for someone to do the dirty work for me so please don't take this the wrong way! I just need some guidance so I know I'm on the right track.

Should I create a new table with match ratios for each field then do the filtering on the combination of ratios?
No, it can all be done in a single step


I'm really confused!

2. You're left with serious numbercrunching over a scan of the reference table and seeks to the target. Do as much work as possible as efficiently as you can. The function is set up as an iTVF so it can parallelise, and you can get it to compare multiple values in the same query.


But from the example select I don't see how I can compare more than one field at a time or am I missing something?


Here's how to compare several columns at once:
SELECT r.*, '#' AS '#', t.master_id, 
t.Surname, Fuzzysurname.Result, Fuzzysurname.MatchRatio,
t.Address1, FuzzyAddress1.Result, FuzzyAddress1.MatchRatio,
t.Title, FuzzyTitle.Result, FuzzyTitle.MatchRatio,
t.CompanyName, FuzzyCompanyName.Result, FuzzyCompanyName.MatchRatio
FROM dbo.PreMerge r
INNER JOIN dbo.PreMerge t
ON t.postcode = r.postcode
AND t.master_id > r.master_id
CROSS APPLY [dbo].[FuzzyMatch_iTVF2k5] (r.Surname, t.Surname) Fuzzysurname
CROSS APPLY [dbo].[FuzzyMatch_iTVF2k5] (r.Address1, t.Address1) FuzzyAddress1
CROSS APPLY [dbo].[FuzzyMatch_iTVF2k5] (r.Title, t.Title) FuzzyTitle
CROSS APPLY [dbo].[FuzzyMatch_iTVF2k5] (r.CompanyName, t.CompanyName) FuzzyCompanyName
WHERE Fuzzysurname.MatchRatio > 55
AND FuzzyAddress1.MatchRatio > 50
ORDER BY r.master_id



“Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
Exploring Recursive CTEs by Example Dwain Camps
Abu Dina
Abu Dina
Ten Centuries
Ten Centuries (1.2K reputation)Ten Centuries (1.2K reputation)Ten Centuries (1.2K reputation)Ten Centuries (1.2K reputation)Ten Centuries (1.2K reputation)Ten Centuries (1.2K reputation)Ten Centuries (1.2K reputation)Ten Centuries (1.2K reputation)

Group: General Forum Members
Points: 1225 Visits: 3323
Thanks Chris.

Just to confirm. Is the FuzzyMatch_iTVF2k5 is a full implementation of http://research.microsoft.com/pubs/75996/bm_sigmod03.pdf ?

---------------------------------------------------------


It takes a minimal capacity for rational thought to see that the corporate 'free press' is a structurally irrational and biased, and extremely violent, system of elite propaganda.
David Edwards - Media lens

Society has varying and conflicting interests; what is called objectivity is the disguise of one of these interests - that of neutrality. But neutrality is a fiction in an unneutral world. There are victims, there are executioners, and there are bystanders... and the 'objectivity' of the bystander calls for inaction while other heads fall.
Howard Zinn
ChrisM@Work
ChrisM@Work
SSCoach
SSCoach (16K reputation)SSCoach (16K reputation)SSCoach (16K reputation)SSCoach (16K reputation)SSCoach (16K reputation)SSCoach (16K reputation)SSCoach (16K reputation)SSCoach (16K reputation)

Group: General Forum Members
Points: 16105 Visits: 19534
Abu Dina (7/31/2012)
Thanks Chris.

Just to confirm. Is the FuzzyMatch_iTVF2k5 is a full implementation of http://research.microsoft.com/pubs/75996/bm_sigmod03.pdf ?



No - it's the closest online reference I could find at the time. The iTVF uses the token-matching method which is fairly well understood and quite widely used but folks appear reluctant to publish anything about it - or at least they did around 7 years ago when I first wrote this.

“Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
Exploring Recursive CTEs by Example Dwain Camps
Abu Dina
Abu Dina
Ten Centuries
Ten Centuries (1.2K reputation)Ten Centuries (1.2K reputation)Ten Centuries (1.2K reputation)Ten Centuries (1.2K reputation)Ten Centuries (1.2K reputation)Ten Centuries (1.2K reputation)Ten Centuries (1.2K reputation)Ten Centuries (1.2K reputation)

Group: General Forum Members
Points: 1225 Visits: 3323
I see.

Another question if you don't mind (you're going to hate me once this little project of mine is done! but if you're anywhere near the northwest then I'd be happy to repay your kindness with drinks of your choice lol).

How can Willis & Lilley have a match ratio = 60 ?! I've been tweaking the surname setting and I've settled on 54 for now but this is a strange result or is this one of many inevitable false positives I will always encounter with any record set?

Thanks.

---------------------------------------------------------


It takes a minimal capacity for rational thought to see that the corporate 'free press' is a structurally irrational and biased, and extremely violent, system of elite propaganda.
David Edwards - Media lens

Society has varying and conflicting interests; what is called objectivity is the disguise of one of these interests - that of neutrality. But neutrality is a fiction in an unneutral world. There are victims, there are executioners, and there are bystanders... and the 'objectivity' of the bystander calls for inaction while other heads fall.
Howard Zinn
ChrisM@Work
ChrisM@Work
SSCoach
SSCoach (16K reputation)SSCoach (16K reputation)SSCoach (16K reputation)SSCoach (16K reputation)SSCoach (16K reputation)SSCoach (16K reputation)SSCoach (16K reputation)SSCoach (16K reputation)

Group: General Forum Members
Points: 16105 Visits: 19534
Abu Dina (7/31/2012)
I see.

Another question if you don't mind (you're going to hate me once this little project of mine is done! but if you're anywhere near the northwest then I'd be happy to repay your kindness with drinks of your choice lol).
Northwest UK? Sure!
Keep coming with the questions, there are plenty of good folks here to lend a hand.


How can Willis & Lilley have a match ratio = 60 ?! I've been tweaking the surname setting and I've settled on 54 for now but this is a strange result or is this one of many inevitable false positives I will always encounter with any record set?

Thanks.

That's because I made a mistake playing with the function yesterday Blush
Here are the 2-char overlapping tokens from 'Willis' which are matched in 'Lilley':
il, ll, li

Here's the fixed function:
-- =============================================
-- Author: Chris Morris
-- Create date: 30/07/2012
-- =============================================
ALTER FUNCTION [dbo].[FuzzyMatch_iTVF2k5]
(@Reference VARCHAR(100) = NULL,
@Target VARCHAR(100) = NULL)
RETURNS table WITH SCHEMABINDING
AS

-- See also http://research.microsoft.com/pubs/75996/bm_sigmod03.pdf

RETURN
SELECT d.Result, MatchRatio = CAST(CASE
WHEN d.Result = 1 THEN 100
WHEN d.Result = 3 THEN DATALENGTH(@Target)*100.00/DATALENGTH(@Reference)
WHEN d.Result = 4 THEN DATALENGTH(@Reference)*100.00/DATALENGTH(@Target)
WHEN d.Result = 5 THEN
(SELECT MatchPC = CAST((100.00*COUNT(*)/(DATALENGTH(@Reference)-2.00)) AS DECIMAL(5,2))
* CASE WHEN DATALENGTH(@Target) > DATALENGTH(@Reference) THEN DATALENGTH(@Reference)*1.00 / DATALENGTH(@Target)*1.00 ELSE 1 END
FROM (
SELECT n = t1.n+t2.n
FROM (VALUES (0),(10),(20),(30),(40),(50),(60),(70),(80),(90)) t1(n)
CROSS JOIN (VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10)) t2(n)
WHERE t1.n+t2.n <= DATALENGTH(@Reference)
) Tally
WHERE CHARINDEX(SUBSTRING(@Reference, Tally.n, 3), @Target) > 0
AND DATALENGTH(SUBSTRING(@Reference, Tally.n, 3)) = 3)
WHEN d.Result = 6 THEN
(SELECT MatchPC = CAST((100.00*COUNT(*)/(DATALENGTH(@Reference)-0.00)) AS DECIMAL(5,2))
* CASE WHEN DATALENGTH(@Target) > DATALENGTH(@Reference) THEN DATALENGTH(@Reference)*1.00 / DATALENGTH(@Target)*1.00 ELSE 1 END
FROM (
SELECT n = t1.n
FROM (VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10)) t1(n)
WHERE t1.n <= DATALENGTH(@Reference)
) Tally
WHERE CHARINDEX(SUBSTRING(@Reference, Tally.n, 2), @Target) > 0
AND DATALENGTH(SUBSTRING(@Reference, Tally.n, 2)) = 2)
ELSE NULL
END AS DECIMAL(5,2))

FROM (
SELECT Result = CASE
WHEN @Reference = @Target THEN 1
WHEN @Reference IS NULL OR @Target IS NULL THEN 2
WHEN @Reference LIKE '%'+@Target+'%' THEN 3
WHEN @Target LIKE '%'+@Reference+'%' THEN 4
WHEN DATALENGTH(@Reference) > 7 AND DATALENGTH(@Target) > 7 THEN 5
WHEN DATALENGTH(@Reference) > 2 AND DATALENGTH(@Target) > 2 THEN 6 -- SOUNDEX may work better
ELSE 7
END
) d



“Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
Exploring Recursive CTEs by Example Dwain Camps
Abu Dina
Abu Dina
Ten Centuries
Ten Centuries (1.2K reputation)Ten Centuries (1.2K reputation)Ten Centuries (1.2K reputation)Ten Centuries (1.2K reputation)Ten Centuries (1.2K reputation)Ten Centuries (1.2K reputation)Ten Centuries (1.2K reputation)Ten Centuries (1.2K reputation)

Group: General Forum Members
Points: 1225 Visits: 3323
Northwest UK yes Smile I will PM you.. seriously you deserve drinks for all this help!!!!!

I've made the change and the function is working better now. Smile

Going to play around with this more and will report back again tomorrow.

Once again, thanks so much for your efforts I really appreciate it.

Have a good evening.

p.s.
Sorry I must have PMed you 3 times already by mistake. The sent folder doesn't get updated with sent messages then Rolleyes

---------------------------------------------------------


It takes a minimal capacity for rational thought to see that the corporate 'free press' is a structurally irrational and biased, and extremely violent, system of elite propaganda.
David Edwards - Media lens

Society has varying and conflicting interests; what is called objectivity is the disguise of one of these interests - that of neutrality. But neutrality is a fiction in an unneutral world. There are victims, there are executioners, and there are bystanders... and the 'objectivity' of the bystander calls for inaction while other heads fall.
Howard Zinn
ChrisM@Work
ChrisM@Work
SSCoach
SSCoach (16K reputation)SSCoach (16K reputation)SSCoach (16K reputation)SSCoach (16K reputation)SSCoach (16K reputation)SSCoach (16K reputation)SSCoach (16K reputation)SSCoach (16K reputation)

Group: General Forum Members
Points: 16105 Visits: 19534
Abu Dina (7/31/2012)
Northwest UK yes Smile I will PM you.. seriously you deserve drinks for all this help!!!!!

I've made the change and the function is working better now. Smile

Going to play around with this more and will report back again tomorrow.

Once again, thanks so much for your efforts I really appreciate it.

Have a good evening.

p.s.
Sorry I must have PMed you 3 times already by mistake. The sent folder doesn't get updated with sent messages then Rolleyes


We're all volunteers here, but thanks!
When you're ready you will probably need an explanation of how this function works - a good place to start would be one of Jeff Moden's tally table articles.

“Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
Exploring Recursive CTEs by Example Dwain Camps
Abu Dina
Abu Dina
Ten Centuries
Ten Centuries (1.2K reputation)Ten Centuries (1.2K reputation)Ten Centuries (1.2K reputation)Ten Centuries (1.2K reputation)Ten Centuries (1.2K reputation)Ten Centuries (1.2K reputation)Ten Centuries (1.2K reputation)Ten Centuries (1.2K reputation)

Group: General Forum Members
Points: 1225 Visits: 3323
Hi Chris,

I've been trying to work out a stragey for my merging but I'm struggling a little. Blush

The merge stored procedure will accept two parameters, the source table and the merge criteria

1) Merge by Contact
2) Merge by Business
3) Merge by Address

Maybe I've started with the wrong one but I'm trying to work out the Merge by Address but I just can't get it right.

As mentioned previoulsy, the source table will always be in the following format:

create table dbo.PreMerge (
master_id int,
Title varchar(20),
Forename varchar(50),
Surname varchar(50),
CompanyName varchar(200),
CompanyName2 varchar(200),
Address1 varchar(200),
Address2 varchar(200),
Address3 varchar(200),
Address4 varchar(200),
Town varchar(100),
County varchar(100),
Postcode varchar(20),
Hierarchy int
)



I've applied the iTVF to get a match ratio for each column in the following sample data.


insert into PreMerge (master_id, Title, Forename, Surname, CompanyName, Address1, Address2, Town, Postcode, Hierarchy)
select 1547300, 'Mr', 'E', 'Glowver', NULL, '21 Harcourt Crescent', 'Nuthall', 'Nottingham', 'NG14 6BT', '20100309' union all

select 1581378, 'Mrs', 'C E', 'Glowver', NULL, '21 Harcourt Crescent', NULL, 'Nottingham', 'NG14 6BT', '20100309' union all

select 43196681, NULL, NULL, 'Glowveer', NULL, '21 Harcourt', 'Crescent', NULL, 'NG14 6BT', '20120501' union all

select 43196681, 'Mr', 'C J E', 'Glowser', NULL, 'Harcourt', '21 Harcourt Crescent', NULL, 'NG14 6BT', '20120501' union all
select 64273, 'Mrs', 'A', 'Tomlinson', NULL, '13 Harwood New Road', NULL, 'BLACKBURN', 'BB4 6QS', 2 union all
select 74575, 'Mr', 'M', 'Tomlinson', NULL, '13 Harwood New Road Great Harw', 'Great Harwood', 'BLACKBURN', 'BB4 6QS', 2
go

if object_id('PostMergeTest') > 0
drop table dbo.PostMergeTest
go

SELECT
r.master_id as retained_master_id,
t.master_id as deleted_master_id,

'#' AS '#',

isnull(FuzzyTitle.MatchRatio, 0) as Title_MatchRatio,
isnull(FuzzyForename.MatchRatio, 0) as Forename_MatchRatio,
isnull(FuzzySurname.MatchRatio, 0) as Surname_MatchRatio,
isnull(FuzzyAddress1.MatchRatio, 0) as Address1_MatchRatio,
isnull(FuzzyTown.MatchRatio, 0) as Town_MatchRatio,

'#' AS '##',

r.title as retained_title,
r.forename as retained_forename,
r.surname as retained_surname,
r.Address1 as retained_Address1,
r.Address2 as retained_Address2,
r.Town as retained_Town,

'#' AS '###',

t.title as deleted_title,
t.forename as deleted_forename,
t.Surname as deleted_surname,
t.Address1 as deleted_Address1,
t.Address2 as deleted_Address2,
t.Town as deleted_Town

into dbo.PostMergeTest

FROM dbo.PreMerge r
INNER JOIN dbo.PreMerge t
ON t.postcode = r.postcode
AND ((r.hierarchy<t.hierarchy) or (r.hierarchy=t.hierarchy and r.master_id>t.master_id))
-------------------------------------------------------------------------------------------------------------------------------------
CROSS APPLY [dbo].[FuzzyMatch_iTVF2k5] (r.Title, t.Title) FuzzyTitle -- Title
CROSS APPLY [dbo].[FuzzyMatch_iTVF2k5] (r.Forename, t.Forename) FuzzyForename -- Forename
CROSS APPLY [dbo].[FuzzyMatch_iTVF2k5] (r.Surname, t.Surname) FuzzySurname -- Surname
CROSS APPLY [dbo].[FuzzyMatch_iTVF2k5] (r.address1, t.address2) FuzzyAddress12 -- Address1&Address2
CROSS APPLY [dbo].[FuzzyMatch_iTVF2k5] (r.address1, t.Address1) FuzzyAddress1 -- Address1
CROSS APPLY [dbo].[FuzzyMatch_iTVF2k5] (r.Town, t.Town) FuzzyTown -- Town
--------------------------------------------------------------------------------------------------------------------------------------
order by Fuzzysurname.MatchRatio
go




I then created a new table to hold the duplicate IDs:


-- Match Level 1 = Surname, Address1 100% match

insert into dbo.FinalSiteDuplicates (retained_master_id, deleted_master_id, MatchLevel)
select
retained_master_id,
deleted_master_id,
1
from PostMergeTest
where (Surname_MatchRatio > 99 and Address1_MatchRatio > 99)
.
.
.
.
.



This find one duplicate but I've been trying different ration numbers and I jst can't get it to work.

am I doing this whole thing wrong? Is there a better way to do this?

Thanks in advance.

---------------------------------------------------------


It takes a minimal capacity for rational thought to see that the corporate 'free press' is a structurally irrational and biased, and extremely violent, system of elite propaganda.
David Edwards - Media lens

Society has varying and conflicting interests; what is called objectivity is the disguise of one of these interests - that of neutrality. But neutrality is a fiction in an unneutral world. There are victims, there are executioners, and there are bystanders... and the 'objectivity' of the bystander calls for inaction while other heads fall.
Howard Zinn
Go


Permissions

You can't post new topics.
You can't post topic replies.
You can't post new polls.
You can't post replies to polls.
You can't edit your own topics.
You can't delete your own topics.
You can't edit other topics.
You can't delete other topics.
You can't edit your own posts.
You can't edit other posts.
You can't delete your own posts.
You can't delete other posts.
You can't post events.
You can't edit your own events.
You can't edit other events.
You can't delete your own events.
You can't delete other events.
You can't send private messages.
You can't send emails.
You can read topics.
You can't vote in polls.
You can't upload attachments.
You can download attachments.
You can't post HTML code.
You can't edit HTML code.
You can't post IFCode.
You can't post JavaScript.
You can post emoticons.
You can't post or upload images.

Select a forum

































































































































































SQLServerCentral


Search