Click here to monitor SSC
SQLServerCentral is supported by Red Gate Software Ltd.
 
Log in  ::  Register  ::  Not logged in
 
 
 
        
Home       Members    Calendar    Who's On


Add to briefcase 123»»»

Record linkage advice required– Long read (YOU HAVE BEEN WARNED!) Expand / Collapse
Author
Message
Posted Monday, July 30, 2012 10:10 AM


Say Hey Kid

Say Hey KidSay Hey KidSay Hey KidSay Hey KidSay Hey KidSay Hey KidSay Hey KidSay Hey Kid

Group: General Forum Members
Last Login: Tuesday, August 26, 2014 3:26 AM
Points: 708, Visits: 3,287
Hi,

I have the following table structure:

create table dbo.PreMerge (
master_id int,
Title varchar(20),
Forename varchar(50),
Surname varchar(50),
CompanyName varchar(200),
CompanyName2 varchar(200),
Address1 varchar(200),
Address2 varchar(200),
Address3 varchar(200),
Address4 varchar(200),
Town varchar(100),
County varchar(100),
Postcode varchar(20),
Hierarchy int
)

insert into PreMerge (master_id, Title, Forename, Surname, CompanyName, Address1, Address2, Town, Postcode, Hierarchy)
select 1547300, 'Mr', 'E', 'Glowver', NULL, '21 Harcourt Crescent', 'Nuthall', 'Nottingham', 'NG14 6BT', '20100309' union all
select 1581378, 'Mr', 'C', 'Wallis', NULL, '116 Lynch Gate', NULL, 'Lydbury North', 'SY7 8AE', '20100309' union all
select 43196681, 'Mrs', 'X', 'Collows', NULL, '14 Ferndown Road', 'Eltham', 'London', 'SE4 5JP', '20120501' union all
select 43198564, 'Mrs', 'Z', 'Brogdenson', NULL, '36 Rhodes Park', NULL, 'North Berwick', 'EH11 5QA', '20120501' union all
select 1581378, 'Mr', 'Y', 'Walls', NULL, '117 Lynch Gate', NULL, 'Lydbury South', 'SX7 8OE', '20100309' union all
select 1547300, 'Mr', 'P', 'Glowver', NULL, '21 Harcourt', 'Nuthall', 'Notts', 'NG14 6BT', '20100309'


This is a very small sample of the data that I'm going to be working with but the table columns will always be the same.

Assumptions:

1) Data hygiene will be performed by a separate process.
2) Addresses are all valid - No record will exist without a UK postcode

The method that had been previously developed as follows:

The PreMerge table is joined to itself with the left table A being the table with record we want to retain and the right table B will be the record we will want to remove as duplicate

so it will be something like this

select a.master_id as retained_id, b.master_id as Duplicate_id,
a.title as retained_title, b.title as dropped_title
a.forename as retained_forename, b.forename as dropped_forename
.
.
.
.
from PreMerge as a
inner join Premerge as b
on left(a.Surname,5)=left(b.Surname,5)
and left(a.forename,1) = left(B.forename,1)
and a.postcode=b.postcode
where ((a.hierarchy<b.hierarchy) or (a.hierarchy=b.hierarchy and a.MergeID>b.MergeID))
and left(A.companynameny,8) = left(B.companyname,8)
and left(A.address1,8) = left(B.address1,8)
and a.postcode is not null and b.postcode is not null
and a.address1 is not null and b.address1 is not null
and a.forename is not null and b.forename is not null
and a.companyname is not null and b.companyname is not null

This is what is called Pass 1. Now we have various other passes that get applied with each pass having a different merge criteria for example we could look for:

Pass 2:

• left(a.Surname,5)=left(b.Surname,5)
• left(a.forename,1) = left(B.forename,1)
• a.postcode=b.postcode
• left(A.address1,12) = left(B.address1,12)

Pass 3:

• left(a.Surname,5)=left(b.Surname,5)
• left(a.forename,1) = left(b.forename,1)
• a.postcode=b.postcode
• left(A.companyname,14) = left(B.companyname,14)

Pass 4:

• left(a.Surname,5)=left(b.Surname,5)
• a.title = b.title
• a.postcode=b.postcode
• left(A.companyname,12) =left(B.companyname,12)

There are more than 20 passes that I have identified within the hundreds of lines of code!
But almost all of the passes use a deterministic approach rather than a probabilistic one.

So you're wondering what I'm after?! Well I want some opinions about our current technique. I could list all the logic for the passes but I just want to know if I should scrap our technique and apply a probabilistic methodology instead?

I've read several articles from web including a couple from this excellent site but I'm none the wiser! I could do with some expert advice please!

Thanks for taking the time to read this!


---------------------------------------------------------


It takes a minimal capacity for rational thought to see that the corporate 'free press' is a structurally irrational and biased, and extremely violent, system of elite propaganda.
David Edwards - Media lens

Society has varying and conflicting interests; what is called objectivity is the disguise of one of these interests - that of neutrality. But neutrality is a fiction in an unneutral world. There are victims, there are executioners, and there are bystanders... and the 'objectivity' of the bystander calls for inaction while other heads fall.
Howard Zinn
Post #1337370
Posted Monday, July 30, 2012 10:14 AM


Ten Centuries

Ten CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen Centuries

Group: General Forum Members
Last Login: Thursday, August 21, 2014 1:54 PM
Points: 1,432, Visits: 3,229
The number of rows you will be dealing with will play a big part in which way you decide to go.

Either way, my goal would be to minimize the number of passes you have to make.




The probability of survival is inversely proportional to the angle of arrival.
Post #1337378
Posted Monday, July 30, 2012 10:20 AM


Say Hey Kid

Say Hey KidSay Hey KidSay Hey KidSay Hey KidSay Hey KidSay Hey KidSay Hey KidSay Hey Kid

Group: General Forum Members
Last Login: Tuesday, August 26, 2014 3:26 AM
Points: 708, Visits: 3,287
Data sets will vary but on average we're talking 2-4 million records.

---------------------------------------------------------


It takes a minimal capacity for rational thought to see that the corporate 'free press' is a structurally irrational and biased, and extremely violent, system of elite propaganda.
David Edwards - Media lens

Society has varying and conflicting interests; what is called objectivity is the disguise of one of these interests - that of neutrality. But neutrality is a fiction in an unneutral world. There are victims, there are executioners, and there are bystanders... and the 'objectivity' of the bystander calls for inaction while other heads fall.
Howard Zinn
Post #1337384
Posted Monday, July 30, 2012 12:30 PM


SSChampion

SSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampion

Group: General Forum Members
Last Login: Friday, August 29, 2014 2:32 PM
Points: 13,302, Visits: 12,168
Performance on this is going to be awful. You where clause is full of non-SARGable predicates.

It seems you have a number of duplicated checks in each pass too. You join on postcode but then also add NOT NULL for both tables in the where clause. You can't have NULL and = both be true. If table A OR table B is NULL the join will not return the row.

from PreMerge as a
inner join Premerge as b
on left(a.Surname,5)=left(b.Surname,5)
and left(a.forename,1) = left(B.forename,1)
and a.postcode=b.postcode
and left(A.companynameny,8) = left(B.companyname,8) --moved from the where clause to the join
and left(A.address1,8) = left(B.address1,8) --moved from the where clause to the join
where ((a.hierarchy < b.hierarchy) or (a.hierarchy = b.hierarchy and a.MergeID > b.MergeID))
--and a.postcode is not null and b.postcode is not null (this is redundant because of the join)
and a.address1 is not null
and b.address1 is not null
--and a.forename is not null and b.forename is not null (this is redundant because of the join)
and a.companyname is not null
and b.companyname is not null



_______________________________________________________________

Need help? Help us help you.

Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

Need to split a string? Try Jeff Moden's splitter.

Cross Tabs and Pivots, Part 1 – Converting Rows to Columns
Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs
Understanding and Using APPLY (Part 1)
Understanding and Using APPLY (Part 2)
Post #1337460
Posted Monday, July 30, 2012 1:12 PM


Say Hey Kid

Say Hey KidSay Hey KidSay Hey KidSay Hey KidSay Hey KidSay Hey KidSay Hey KidSay Hey Kid

Group: General Forum Members
Last Login: Tuesday, August 26, 2014 3:26 AM
Points: 708, Visits: 3,287
Thanks for this Sean. You are right about the badly written query but for now I'm trying to work out if the approach adopted by the previous developer is the right one or not.

Should I stick with the Pass logic he/she used and just optimize the code or take a different approach to resolving the duplicates?



---------------------------------------------------------


It takes a minimal capacity for rational thought to see that the corporate 'free press' is a structurally irrational and biased, and extremely violent, system of elite propaganda.
David Edwards - Media lens

Society has varying and conflicting interests; what is called objectivity is the disguise of one of these interests - that of neutrality. But neutrality is a fiction in an unneutral world. There are victims, there are executioners, and there are bystanders... and the 'objectivity' of the bystander calls for inaction while other heads fall.
Howard Zinn
Post #1337489
Posted Monday, July 30, 2012 1:46 PM


SSChampion

SSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampion

Group: General Forum Members
Last Login: Friday, August 29, 2014 2:32 PM
Points: 13,302, Visits: 12,168
Abu Dina (7/30/2012)
Thanks for this Sean. You are right about the badly written query but for now I'm trying to work out if the approach adopted by the previous developer is the right one or not.

Should I stick with the Pass logic he/she used and just optimize the code or take a different approach to resolving the duplicates?



Well I can't really say but if using the current approach means doing lots of string manipulation like this I would try to find another approach.

From what I can see I don't think you need to use multiple passes, just some cases statements in your where clause to help find rows in each "group" should do it.


_______________________________________________________________

Need help? Help us help you.

Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

Need to split a string? Try Jeff Moden's splitter.

Cross Tabs and Pivots, Part 1 – Converting Rows to Columns
Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs
Understanding and Using APPLY (Part 1)
Understanding and Using APPLY (Part 2)
Post #1337518
Posted Monday, July 30, 2012 2:45 PM
Ten Centuries

Ten CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen Centuries

Group: General Forum Members
Last Login: Today @ 1:20 PM
Points: 1,090, Visits: 6,543
Here's a method I've used with some success in the past. I've changed a few data elements in your sample set, so that the dupes no longer have the same key and Walls/Wallis have the same postcode so they should be dupe-checked. I've also added a clustered index to the table so the target table rows are picked up by seeks.

Here's the sample data script:
DROP table dbo.PreMerge
create table dbo.PreMerge (
master_id int,
Title varchar(20),
Forename varchar(50),
Surname varchar(50),
CompanyName varchar(200),
CompanyName2 varchar(200),
Address1 varchar(200),
Address2 varchar(200),
Address3 varchar(200),
Address4 varchar(200),
Town varchar(100),
County varchar(100),
Postcode varchar(20),
Hierarchy int
)

insert into PreMerge (master_id, Title, Forename, Surname, CompanyName, Address1, Address2, Town, Postcode, Hierarchy)
select 1547300, 'Mr', 'E', 'Glowver', NULL, '21 Harcourt Crescent', 'Nuthall', 'Nottingham', 'NG14 6BT', '20100309' union all
select 1581378, 'Mr', 'C', 'Wallis', NULL, '116 Lynch Gate', NULL, 'Lydbury North', 'SX7 8OE', '20100309' union all
select 43196681, 'Mrs', 'X', 'Collows', NULL, '14 Ferndown Road', 'Eltham', 'London', 'SE4 5JP', '20120501' union all
select 43198564, 'Mrs', 'Z', 'Brogdenson', NULL, '36 Rhodes Park', NULL, 'North Berwick', 'EH11 5QA', '20120501' union all
select 1581379, 'Mr', 'Y', 'Walls', NULL, '117 Lynch Gate', NULL, 'Lydbury South', 'SX7 8OE', '20100309' union all
select 1547310, 'Mr', 'P', 'Glowver', NULL, '21 Harcourt', 'Nuthall', 'Notts', 'NG14 6BT', '20100309'

CREATE CLUSTERED INDEX cx_PreMerge ON PreMerge (postcode,master_id)


Here's a fuzzy-matching inline table-valued function:

-- =============================================
-- Author: Chris Morris
-- Create date: 30/07/2012
-- =============================================
ALTER FUNCTION [dbo].[FuzzyMatch_iTVF]
(
@Reference VARCHAR(100),
@Target VARCHAR(100)
)
RETURNS table
AS

-- See also http://research.microsoft.com/pubs/75996/bm_sigmod03.pdf

RETURN
SELECT d.Result, MatchRatio = CAST(CASE
WHEN d.Result = 1 THEN 100
WHEN d.Result = 3 THEN DATALENGTH(@Target)*100.00/DATALENGTH(@Reference)
WHEN d.Result = 4 THEN DATALENGTH(@Reference)*100.00/DATALENGTH(@Target)
WHEN d.Result = 5 THEN
(SELECT MatchPC = CAST((100.00*COUNT(*)/(DATALENGTH(@Reference)-2.00)) AS DECIMAL(5,2))
* CASE WHEN DATALENGTH(@Target) > DATALENGTH(@Reference) THEN DATALENGTH(@Reference)*1.00 / DATALENGTH(@Target)*1.00 ELSE 1 END
FROM (
SELECT n = t1.n+t2.n
FROM (VALUES (0),(10),(20),(30),(40),(50),(60),(70),(80),(90)) t1(n)
CROSS JOIN (VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10)) t2(n)
WHERE t1.n+t2.n <= DATALENGTH(@Reference)
) Tally
WHERE CHARINDEX(SUBSTRING(@Reference, Tally.n, 3), @Target) > 0
AND DATALENGTH(SUBSTRING(@Reference, Tally.n, 3)) = 3)
WHEN d.Result = 6 THEN
(SELECT MatchPC = CAST((100.00*COUNT(*)/(DATALENGTH(@Reference)-1.00)) AS DECIMAL(5,2))
* CASE WHEN DATALENGTH(@Target) > DATALENGTH(@Reference) THEN DATALENGTH(@Reference)*1.00 / DATALENGTH(@Target)*1.00 ELSE 1 END
FROM (
SELECT n = t1.n
FROM (VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10)) t1(n)
WHERE t1.n <= DATALENGTH(@Reference)
) Tally
WHERE CHARINDEX(SUBSTRING(@Reference, Tally.n, 2), @Target) > 0
AND DATALENGTH(SUBSTRING(@Reference, Tally.n, 2)) = 2)
ELSE NULL
END AS DECIMAL(5,2))

FROM (
SELECT Result = CASE
WHEN @Reference = @Target THEN 1
WHEN @Reference IS NULL OR @Target IS NULL THEN 2
WHEN @Reference LIKE '%'+@Target+'%' THEN 3
WHEN @Target LIKE '%'+@Reference+'%' THEN 4
WHEN DATALENGTH(@Reference) > 7 AND DATALENGTH(@Target) > 7 THEN 5
WHEN DATALENGTH(@Reference) > 2 AND DATALENGTH(@Target) > 2 THEN 6 -- SOUNDEX may work better
ELSE 7
END
) d

Here's an example of usage:
SELECT r.*, '#' AS '#', t.master_id, t.Surname, Fuzzysurname.*
FROM dbo.PreMerge r
INNER JOIN dbo.PreMerge t ON t.postcode = r.postcode AND t.master_id < r.master_id
CROSS APPLY [dbo].[FuzzyMatch_iTVF] (r.Surname, t.Surname) Fuzzysurname
ORDER BY r.master_id


Try it out. There's always an element of tuning with this type of dedupe. The key things here are:
1. Get some kind of solid matching working as efficiently as possible. In this case, it's postcode and master_id.
This reduces the amount of work that has to be done straight away.
2. You're left with serious numbercrunching over a scan of the reference table and seeks to the target. Do as much work as possible as efficiently as you can. The function is set up as an iTVF so it can parallelise, and you can get it to compare multiple values in the same query.

Good luck, post back if you get stuck, most of all enjoy.







Low-hanging fruit picker and defender of the moggies





For better assistance in answering your questions, please read this.




Understanding and using APPLY, (I) and (II) Paul White

Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
Post #1337549
Posted Tuesday, July 31, 2012 2:23 AM


Say Hey Kid

Say Hey KidSay Hey KidSay Hey KidSay Hey KidSay Hey KidSay Hey KidSay Hey KidSay Hey Kid

Group: General Forum Members
Last Login: Tuesday, August 26, 2014 3:26 AM
Points: 708, Visits: 3,287
Thanks for this Chris but when I tried to create the function I get the following error message:

Msg 156, Level 15, State 1, Procedure FuzzyMatch_iTVF, Line 22
Incorrect syntax near the keyword 'VALUES'.

FROM (VALUES (0),(10),(20),(30),(40),(50),(60),(70),(80),(90)) t1(n)

I'm using SQL Server 2005 SP2

Any ideas?


---------------------------------------------------------


It takes a minimal capacity for rational thought to see that the corporate 'free press' is a structurally irrational and biased, and extremely violent, system of elite propaganda.
David Edwards - Media lens

Society has varying and conflicting interests; what is called objectivity is the disguise of one of these interests - that of neutrality. But neutrality is a fiction in an unneutral world. There are victims, there are executioners, and there are bystanders... and the 'objectivity' of the bystander calls for inaction while other heads fall.
Howard Zinn
Post #1337727
Posted Tuesday, July 31, 2012 3:13 AM


SSCertifiable

SSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiable

Group: General Forum Members
Last Login: Today @ 7:55 AM
Points: 7,230, Visits: 13,709
Abu Dina (7/31/2012)
Thanks for this Chris but when I tried to create the function I get the following error message:

Msg 156, Level 15, State 1, Procedure FuzzyMatch_iTVF, Line 22
Incorrect syntax near the keyword 'VALUES'.

FROM (VALUES (0),(10),(20),(30),(40),(50),(60),(70),(80),(90)) t1(n)

I'm using SQL Server 2005 SP2

Any ideas?


Sure thing, apologies but the function I posted uses 2k8 row constructors. Easy enough to change it to CTE's, back shortly...


“Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
Exploring Recursive CTEs by Example Dwain Camps
Post #1337744
Posted Tuesday, July 31, 2012 3:29 AM


SSCertifiable

SSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiable

Group: General Forum Members
Last Login: Today @ 7:55 AM
Points: 7,230, Visits: 13,709
Here you go mate:

CREATE FUNCTION [dbo].[FuzzyMatch_iTVF2k5] 
(
@Reference VARCHAR(100),
@Target VARCHAR(100)
)
RETURNS table
AS

-- See also http://research.microsoft.com/pubs/75996/bm_sigmod03.pdf

RETURN
SELECT d.Result, MatchRatio = CAST(CASE
WHEN d.Result = 1 THEN 100
WHEN d.Result = 3 THEN DATALENGTH(@Target)*100.00/DATALENGTH(@Reference)
WHEN d.Result = 4 THEN DATALENGTH(@Reference)*100.00/DATALENGTH(@Target)
WHEN d.Result = 5 THEN
(SELECT MatchPC = CAST((100.00*COUNT(*)/(DATALENGTH(@Reference)-2.00)) AS DECIMAL(5,2))
* CASE WHEN DATALENGTH(@Target) > DATALENGTH(@Reference) THEN DATALENGTH(@Reference)*1.00 / DATALENGTH(@Target)*1.00 ELSE 1 END
FROM (
SELECT TOP (ISNULL(DATALENGTH(@Reference),0)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM (
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1) a (u)
CROSS JOIN (
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1) b (t)
) Tally (n)
WHERE CHARINDEX(SUBSTRING(@Reference, Tally.n, 3), @Target) > 0
AND DATALENGTH(SUBSTRING(@Reference, Tally.n, 3)) = 3)
WHEN d.Result = 6 THEN
(SELECT MatchPC = CAST((100.00*COUNT(*)/(DATALENGTH(@Reference)-1.00)) AS DECIMAL(5,2))
* CASE WHEN DATALENGTH(@Target) > DATALENGTH(@Reference) THEN DATALENGTH(@Reference)*1.00 / DATALENGTH(@Target)*1.00 ELSE 1 END
FROM (
SELECT n = t1.n
FROM (SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL
SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9) t1 (n)
WHERE t1.n <= DATALENGTH(@Reference)
) Tally
WHERE CHARINDEX(SUBSTRING(@Reference, Tally.n, 2), @Target) > 0
AND DATALENGTH(SUBSTRING(@Reference, Tally.n, 2)) = 2)
ELSE NULL
END AS DECIMAL(5,2))

FROM (
SELECT Result = CASE
WHEN @Reference = @Target THEN 1
WHEN @Reference IS NULL OR @Target IS NULL THEN 2
WHEN @Reference LIKE '%'+@Target+'%' THEN 3
WHEN @Target LIKE '%'+@Reference+'%' THEN 4
WHEN DATALENGTH(@Reference) > 7 AND DATALENGTH(@Target) > 7 THEN 5
WHEN DATALENGTH(@Reference) > 2 AND DATALENGTH(@Target) > 2 THEN 6 -- SOUNDEX may work better
ELSE 7
END
) d



“Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
Exploring Recursive CTEs by Example Dwain Camps
Post #1337751
« Prev Topic | Next Topic »

Add to briefcase 123»»»

Permissions Expand / Collapse