Record linkage advice required – Long read (YOU HAVE BEEN WARNED!)

  • Hi,

    I have the following table structure:

    create table dbo.PreMerge (
        master_id    int,
        Title        varchar(20),
        Forename     varchar(50),
        Surname      varchar(50),
        CompanyName  varchar(200),
        CompanyName2 varchar(200),
        Address1     varchar(200),
        Address2     varchar(200),
        Address3     varchar(200),
        Address4     varchar(200),
        Town         varchar(100),
        County       varchar(100),
        Postcode     varchar(20),
        Hierarchy    int
    )

    insert into PreMerge (master_id, Title, Forename, Surname, CompanyName, Address1, Address2, Town, Postcode, Hierarchy)
    select 1547300,'Mr','E','Glowver',NULL,'21 Harcourt Crescent','Nuthall','Nottingham','NG14 6BT','20100309' union all
    select 1581378,'Mr','C','Wallis',NULL,'116 Lynch Gate',NULL,'Lydbury North','SY7 8AE','20100309' union all
    select 43196681,'Mrs','X','Collows',NULL,'14 Ferndown Road','Eltham','London','SE4 5JP','20120501' union all
    select 43198564,'Mrs','Z','Brogdenson',NULL,'36 Rhodes Park',NULL,'North Berwick','EH11 5QA','20120501' union all
    select 1581378,'Mr','Y','Walls',NULL,'117 Lynch Gate',NULL,'Lydbury South','SX7 8OE','20100309' union all
    select 1547300,'Mr','P','Glowver',NULL,'21 Harcourt','Nuthall','Notts','NG14 6BT','20100309'

    This is a very small sample of the data that I'm going to be working with, but the table columns will always be the same.

    Assumptions:

    1) Data hygiene will be performed by a separate process.

    2) Addresses are all valid - No record will exist without a UK postcode

    The method that was previously developed is as follows:

    The PreMerge table is joined to itself. The left table (A) holds the record we want to retain, and the right table (B) holds the record we want to remove as a duplicate.

    So it will be something like this:

    select a.master_id as retained_id, b.master_id as duplicate_id,
           a.title as retained_title, b.title as dropped_title,
           a.forename as retained_forename, b.forename as dropped_forename,
           .
           .
           .
    from PreMerge as a
    inner join PreMerge as b
       on left(a.Surname,5) = left(b.Surname,5)
      and left(a.Forename,1) = left(b.Forename,1)
      and a.Postcode = b.Postcode
    where ((a.hierarchy < b.hierarchy) or (a.hierarchy = b.hierarchy and a.master_id > b.master_id))
      and left(a.CompanyName,8) = left(b.CompanyName,8)
      and left(a.Address1,8) = left(b.Address1,8)
      and a.Postcode is not null and b.Postcode is not null
      and a.Address1 is not null and b.Address1 is not null
      and a.Forename is not null and b.Forename is not null
      and a.CompanyName is not null and b.CompanyName is not null

    This is what is called Pass 1. Various other passes are then applied, each with different merge criteria. For example, we could look for:

    Pass 2:

    • left(a.Surname,5) = left(b.Surname,5)
    • left(a.Forename,1) = left(b.Forename,1)
    • a.Postcode = b.Postcode
    • left(a.Address1,12) = left(b.Address1,12)

    Pass 3:

    • left(a.Surname,5) = left(b.Surname,5)
    • left(a.Forename,1) = left(b.Forename,1)
    • a.Postcode = b.Postcode
    • left(a.CompanyName,14) = left(b.CompanyName,14)

    Pass 4:

    • left(a.Surname,5) = left(b.Surname,5)
    • a.Title = b.Title
    • a.Postcode = b.Postcode
    • left(a.CompanyName,12) = left(b.CompanyName,12)

    There are more than 20 passes that I have identified within the hundreds of lines of code! Almost all of them use a deterministic approach rather than a probabilistic one.

    So you're wondering what I'm after?! Well, I'd like some opinions on our current technique. I could list the logic for all the passes, but what I really want to know is whether I should scrap this technique and apply a probabilistic methodology instead.
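For illustration only, a probabilistic-style alternative replaces each pass's hard criteria with per-field scores that are summed and compared against a single threshold. The weights (40/25/20/15) and the cutoff (60) below are invented for the sketch and would need tuning against your real data:

```sql
-- Sketch of a scored (probabilistic-style) dedupe pass over dbo.PreMerge.
-- All weights and the 60-point cutoff are illustrative, not tuned.
select s.retained_id, s.duplicate_id, s.match_score
from (
    select a.master_id as retained_id,
           b.master_id as duplicate_id,
             case when a.Surname = b.Surname then 40
                  when left(a.Surname,5) = left(b.Surname,5) then 25 else 0 end
           + case when left(a.Forename,1) = left(b.Forename,1) then 15 else 0 end
           + case when left(a.Address1,12) = left(b.Address1,12) then 25
                  when left(a.Address1,8)  = left(b.Address1,8)  then 15 else 0 end
           + case when left(a.CompanyName,8) = left(b.CompanyName,8) then 20 else 0 end
           as match_score
    from PreMerge as a
    inner join PreMerge as b
        on a.Postcode = b.Postcode          -- blocking key: only compare within a postcode
       and ((a.hierarchy < b.hierarchy)
         or (a.hierarchy = b.hierarchy and a.master_id > b.master_id))
) s
where s.match_score >= 60
order by s.match_score desc
```

Ranking candidate pairs by match_score also gives you a review queue: pairs just under the cutoff can be eyeballed rather than silently kept or merged.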

    I've read several articles on the web, including a couple from this excellent site, but I'm none the wiser! I could do with some expert advice please!

    Thanks for taking the time to read this!

    ---------------------------------------------------------

    It takes a minimal capacity for rational thought to see that the corporate 'free press' is a structurally irrational and biased, and extremely violent, system of elite propaganda.
    David Edwards - Media Lens

    Society has varying and conflicting interests; what is called objectivity is the disguise of one of these interests - that of neutrality. But neutrality is a fiction in an unneutral world. There are victims, there are executioners, and there are bystanders... and the 'objectivity' of the bystander calls for inaction while other heads fall.
    Howard Zinn

  • The number of rows you will be dealing with will play a big part in which way you decide to go.

    Either way, my goal would be to minimize the number of passes you have to make.

    The probability of survival is inversely proportional to the angle of arrival.

  • Data sets will vary but on average we're talking 2-4 million records.


    Performance on this is going to be awful. Your where clause is full of non-SARGable predicates.

    It seems you have a number of duplicated checks in each pass too. You join on postcode but then also test both tables for NOT NULL in the where clause. A NULL never satisfies an equality join: if the postcode (or any other joined column) is NULL in either table, the join simply won't return the row, so those IS NOT NULL checks are redundant.

    from PreMerge as a
    inner join PreMerge as b
       on left(a.Surname,5) = left(b.Surname,5)
      and left(a.Forename,1) = left(b.Forename,1)
      and a.Postcode = b.Postcode
      and left(a.CompanyName,8) = left(b.CompanyName,8) --moved from the where clause to the join
      and left(a.Address1,8) = left(b.Address1,8)       --moved from the where clause to the join
    where ((a.hierarchy < b.hierarchy) or (a.hierarchy = b.hierarchy and a.master_id > b.master_id))
    --and a.Postcode is not null and b.Postcode is not null (redundant because of the join)
      and a.Address1 is not null
      and b.Address1 is not null
    --and a.Forename is not null and b.Forename is not null (redundant because of the join)
      and a.CompanyName is not null
      and b.CompanyName is not null

    _______________________________________________________________

    Need help? Help us help you.

    Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

    Need to split a string? Try Jeff Moden's splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

    Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
    Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
    Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
    Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

  • Thanks for this Sean. You are right about the badly written query but for now I'm trying to work out if the approach adopted by the previous developer is the right one or not.

    Should I stick with the Pass logic he/she used and just optimize the code or take a different approach to resolving the duplicates?

    ---------------------------------------------------------

    It takes a minimal capacity for rational thought to see that the corporate 'free press' is a structurally irrational and biased, and extremely violent, system of elite propaganda.
    David Edwards - Media lens[/url]

    Society has varying and conflicting interests; what is called objectivity is the disguise of one of these interests - that of neutrality. But neutrality is a fiction in an unneutral world. There are victims, there are executioners, and there are bystanders... and the 'objectivity' of the bystander calls for inaction while other heads fall.
    Howard Zinn

  • Abu Dina (7/30/2012)


    Thanks for this Sean. You are right about the badly written query but for now I'm trying to work out if the approach adopted by the previous developer is the right one or not.

    Should I stick with the Pass logic he/she used and just optimize the code or take a different approach to resolving the duplicates?

    Well, I can't really say, but if using the current approach means doing lots of string manipulation like this, I would try to find another approach.

    From what I can see, you don't need multiple passes; a few CASE expressions in your where clause to help find the rows in each "group" should do it.
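As a sketch of that idea (only Passes 1 and 2 are shown, and the pass labelling is made up for illustration), several passes can be collapsed into one query that reports which rule matched:

```sql
-- Hypothetical sketch: collapse several passes into a single query and
-- label which pass's criteria matched. Only Passes 1 and 2 are shown.
select a.master_id as retained_id,
       b.master_id as duplicate_id,
       case
           when left(a.CompanyName,8) = left(b.CompanyName,8)
            and left(a.Address1,8)    = left(b.Address1,8)    then 1  -- Pass 1
           when left(a.Address1,12)   = left(b.Address1,12)   then 2  -- Pass 2
       end as matched_pass
from PreMerge as a
inner join PreMerge as b
    on a.Postcode = b.Postcode                    -- shared by every pass
   and left(a.Surname,5)  = left(b.Surname,5)     -- shared by every pass
   and left(a.Forename,1) = left(b.Forename,1)    -- shared by Passes 1-3
where ((a.hierarchy < b.hierarchy)
    or (a.hierarchy = b.hierarchy and a.master_id > b.master_id))
  and (   (left(a.CompanyName,8) = left(b.CompanyName,8) and left(a.Address1,8) = left(b.Address1,8))
       or  left(a.Address1,12) = left(b.Address1,12) )
```

One scan replaces two sequential passes, and the matched_pass column preserves the audit trail the old scripts gave you.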


    Here's a method I've used with some success in the past. I've changed a few data elements in your sample set so that the dupes no longer share the same key, and Walls/Wallis now have the same postcode so they will be dupe-checked. I've also added a clustered index to the table so the target table rows are picked up by seeks.

    Here's the sample data script:

    DROP TABLE dbo.PreMerge

    create table dbo.PreMerge (
        master_id    int,
        Title        varchar(20),
        Forename     varchar(50),
        Surname      varchar(50),
        CompanyName  varchar(200),
        CompanyName2 varchar(200),
        Address1     varchar(200),
        Address2     varchar(200),
        Address3     varchar(200),
        Address4     varchar(200),
        Town         varchar(100),
        County       varchar(100),
        Postcode     varchar(20),
        Hierarchy    int
    )

    insert into PreMerge (master_id, Title, Forename, Surname, CompanyName, Address1, Address2, Town, Postcode, Hierarchy)
    select 1547300,'Mr','E','Glowver',NULL,'21 Harcourt Crescent','Nuthall','Nottingham','NG14 6BT','20100309' union all
    select 1581378,'Mr','C','Wallis',NULL,'116 Lynch Gate',NULL,'Lydbury North','SX7 8OE','20100309' union all
    select 43196681,'Mrs','X','Collows',NULL,'14 Ferndown Road','Eltham','London','SE4 5JP','20120501' union all
    select 43198564,'Mrs','Z','Brogdenson',NULL,'36 Rhodes Park',NULL,'North Berwick','EH11 5QA','20120501' union all
    select 1581379,'Mr','Y','Walls',NULL,'117 Lynch Gate',NULL,'Lydbury South','SX7 8OE','20100309' union all
    select 1547310,'Mr','P','Glowver',NULL,'21 Harcourt','Nuthall','Notts','NG14 6BT','20100309'

    CREATE CLUSTERED INDEX cx_PreMerge ON PreMerge (Postcode, master_id)

    Here's a fuzzy-matching inline table-valued function:

    -- =============================================
    -- Author: Chris Morris
    -- Create date: 30/07/2012
    -- =============================================
    CREATE FUNCTION [dbo].[FuzzyMatch_iTVF]
    (
        @Reference VARCHAR(100),
        @Target    VARCHAR(100)
    )
    RETURNS table
    AS
    -- See also http://research.microsoft.com/pubs/75996/bm_sigmod03.pdf
    RETURN
    SELECT d.Result, MatchRatio = CAST(CASE
        WHEN d.Result = 1 THEN 100
        WHEN d.Result = 3 THEN DATALENGTH(@Target)*100.00/DATALENGTH(@Reference)
        WHEN d.Result = 4 THEN DATALENGTH(@Reference)*100.00/DATALENGTH(@Target)
        WHEN d.Result = 5 THEN
            (SELECT MatchPC = CAST((100.00*COUNT(*)/(DATALENGTH(@Reference)-2.00)) AS DECIMAL(5,2))
                    * CASE WHEN DATALENGTH(@Target) > DATALENGTH(@Reference)
                           THEN DATALENGTH(@Reference)*1.00 / DATALENGTH(@Target)*1.00 ELSE 1 END
             FROM (
                 SELECT n = t1.n+t2.n
                 FROM (VALUES (0),(10),(20),(30),(40),(50),(60),(70),(80),(90)) t1(n)
                 CROSS JOIN (VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10)) t2(n)
                 WHERE t1.n+t2.n <= DATALENGTH(@Reference)
             ) Tally
             WHERE CHARINDEX(SUBSTRING(@Reference, Tally.n, 3), @Target) > 0
               AND DATALENGTH(SUBSTRING(@Reference, Tally.n, 3)) = 3)
        WHEN d.Result = 6 THEN
            (SELECT MatchPC = CAST((100.00*COUNT(*)/(DATALENGTH(@Reference)-1.00)) AS DECIMAL(5,2))
                    * CASE WHEN DATALENGTH(@Target) > DATALENGTH(@Reference)
                           THEN DATALENGTH(@Reference)*1.00 / DATALENGTH(@Target)*1.00 ELSE 1 END
             FROM (
                 SELECT n = t1.n
                 FROM (VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10)) t1(n)
                 WHERE t1.n <= DATALENGTH(@Reference)
             ) Tally
             WHERE CHARINDEX(SUBSTRING(@Reference, Tally.n, 2), @Target) > 0
               AND DATALENGTH(SUBSTRING(@Reference, Tally.n, 2)) = 2)
        ELSE NULL
        END AS DECIMAL(5,2))
    FROM (
        SELECT Result = CASE
            WHEN @Reference = @Target THEN 1
            WHEN @Reference IS NULL OR @Target IS NULL THEN 2
            WHEN @Reference LIKE '%'+@Target+'%' THEN 3
            WHEN @Target LIKE '%'+@Reference+'%' THEN 4
            WHEN DATALENGTH(@Reference) > 7 AND DATALENGTH(@Target) > 7 THEN 5
            WHEN DATALENGTH(@Reference) > 2 AND DATALENGTH(@Target) > 2 THEN 6 -- SOUNDEX may work better
            ELSE 7
        END
    ) d

    Here's an example of usage:

    SELECT r.*, '#' AS '#', t.master_id, t.Surname, Fuzzysurname.*
    FROM dbo.PreMerge r
    INNER JOIN dbo.PreMerge t ON t.Postcode = r.Postcode AND t.master_id < r.master_id
    CROSS APPLY [dbo].[FuzzyMatch_iTVF] (r.Surname, t.Surname) Fuzzysurname
    ORDER BY r.master_id

    Try it out. There's always an element of tuning with this type of dedupe. The key things here are:

    1. Get some kind of solid matching working as efficiently as possible. In this case, it's postcode and master_id.

    This reduces the amount of work that has to be done straight away.

    2. You're left with serious number-crunching over a scan of the reference table and seeks to the target. Do as much work as possible as efficiently as you can. The function is set up as an iTVF so it can parallelise, and you can get it to compare multiple values in the same query.

    Good luck, post back if you get stuck, most of all enjoy.


    Low-hanging fruit picker and defender of the moggies

    For better assistance in answering your questions, please read this.

    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • Thanks for this Chris but when I tried to create the function I get the following error message:

    Msg 156, Level 15, State 1, Procedure FuzzyMatch_iTVF, Line 22

    Incorrect syntax near the keyword 'VALUES'.

    FROM (VALUES (0),(10),(20),(30),(40),(50),(60),(70),(80),(90)) t1(n)

    I'm using SQL Server 2005 SP2

    Any ideas?


  • Abu Dina (7/31/2012)


    Thanks for this Chris but when I tried to create the function I get the following error message:

    Msg 156, Level 15, State 1, Procedure FuzzyMatch_iTVF, Line 22

    Incorrect syntax near the keyword 'VALUES'.

    FROM (VALUES (0),(10),(20),(30),(40),(50),(60),(70),(80),(90)) t1(n)

    I'm using SQL Server 2005 SP2

    Any ideas?

    Sure thing, apologies, but the function I posted uses 2k8 row constructors. Easy enough to change it, back shortly...

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • Here you go mate:

    CREATE FUNCTION [dbo].[FuzzyMatch_iTVF2k5]
    (
        @Reference VARCHAR(100),
        @Target    VARCHAR(100)
    )
    RETURNS table
    AS
    -- See also http://research.microsoft.com/pubs/75996/bm_sigmod03.pdf
    RETURN
    SELECT d.Result, MatchRatio = CAST(CASE
        WHEN d.Result = 1 THEN 100
        WHEN d.Result = 3 THEN DATALENGTH(@Target)*100.00/DATALENGTH(@Reference)
        WHEN d.Result = 4 THEN DATALENGTH(@Reference)*100.00/DATALENGTH(@Target)
        WHEN d.Result = 5 THEN
            (SELECT MatchPC = CAST((100.00*COUNT(*)/(DATALENGTH(@Reference)-2.00)) AS DECIMAL(5,2))
                    * CASE WHEN DATALENGTH(@Target) > DATALENGTH(@Reference)
                           THEN DATALENGTH(@Reference)*1.00 / DATALENGTH(@Target)*1.00 ELSE 1 END
             FROM (
                 SELECT TOP (ISNULL(DATALENGTH(@Reference),0)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
                 FROM (SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
                       SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1) a (u)
                 CROSS JOIN (SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
                             SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1) b (t)
             ) Tally (n)
             WHERE CHARINDEX(SUBSTRING(@Reference, Tally.n, 3), @Target) > 0
               AND DATALENGTH(SUBSTRING(@Reference, Tally.n, 3)) = 3)
        WHEN d.Result = 6 THEN
            (SELECT MatchPC = CAST((100.00*COUNT(*)/(DATALENGTH(@Reference)-1.00)) AS DECIMAL(5,2))
                    * CASE WHEN DATALENGTH(@Target) > DATALENGTH(@Reference)
                           THEN DATALENGTH(@Reference)*1.00 / DATALENGTH(@Target)*1.00 ELSE 1 END
             FROM (
                 SELECT n = t1.n
                 FROM (SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL
                       SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9) t1 (n)
                 WHERE t1.n <= DATALENGTH(@Reference)
             ) Tally
             WHERE CHARINDEX(SUBSTRING(@Reference, Tally.n, 2), @Target) > 0
               AND DATALENGTH(SUBSTRING(@Reference, Tally.n, 2)) = 2)
        ELSE NULL
        END AS DECIMAL(5,2))
    FROM (
        SELECT Result = CASE
            WHEN @Reference = @Target THEN 1
            WHEN @Reference IS NULL OR @Target IS NULL THEN 2
            WHEN @Reference LIKE '%'+@Target+'%' THEN 3
            WHEN @Target LIKE '%'+@Reference+'%' THEN 4
            WHEN DATALENGTH(@Reference) > 7 AND DATALENGTH(@Target) > 7 THEN 5
            WHEN DATALENGTH(@Reference) > 2 AND DATALENGTH(@Target) > 2 THEN 6 -- SOUNDEX may work better
            ELSE 7
        END
    ) d


  • Thanks Chris. You're my new best friend 🙂

    I'm going to spend the rest of today on this and will report back in the afternoon.

    Once again, I really appreciate your efforts!


  • Hi Chris, I have some questions below. Any advice would be much appreciated. In your original post you said:

    Try it out. There's always an element of tuning with this type of dedupe. The key things here are:

    1. Get some kind of solid matching working as efficiently as possible. In this case, it's postcode and master_id.

    The master_id is the unique identifier for each record so I'm not sure what you mean to be honest.

    The merge script I'm trying to rebuild contains the following for Pass 1:

    • left(a.Surname,5) = left(b.Surname,5)
    • left(a.Forename,1) = left(b.Forename,1)
    • a.Postcode = b.Postcode
    • left(a.CompanyName,8) = left(b.CompanyName,8)
    • left(a.Address1,8) = left(b.Address1,8)

    Your function and example work great. Maybe I'm just stupid, but I just don't see how I can use this to replace the passes system used by our current merge script.

    I'm not looking for someone to do the dirty work for me so please don't take this the wrong way! I just need some guidance so I know I'm on the right track.

    Should I create a new table with match ratios for each field then do the filtering on the combination of ratios?

    I'm really confused!

    2. You're left with serious numbercrunching over a scan of the reference table and seeks to the target. Do as much work as possible as efficiently as you can. The function is set up as an iTVF so it can parallelise, and you can get it to compare multiple values in the same query.

    But from the example select I don't see how I can compare more than one field at a time or am I missing something?


  • Abu Dina (7/31/2012)


    Hi Chris, I have some questions below. Any advice would be much appreciated. In your original post you said:

    Try it out. There's always an element of tuning with this type of dedupe. The key things here are:

    1. Get some kind of solid matching working as efficiently as possible. In this case, it's postcode and master_id.

    The master_id is the unique identifier for each record so I'm not sure what you mean to be honest.

    You want to join the table to itself by the most efficient means possible, where the master_id in the two tables is different and the postcode is the same. Indexing on postcode and master_id does this for you. However, you're looking at quite a few columns in each row, so a clustered index, or an index covering all of the columns being examined, will do the trick.
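A covering index along those lines might look like this - the INCLUDE column list is hypothetical and should match whatever columns your comparison query actually touches:

```sql
-- Hypothetical covering index: postcode and master_id as keys for the
-- self-join, with the compared columns included so the whole query can
-- be answered from the index without lookups to the base table.
CREATE NONCLUSTERED INDEX ix_PreMerge_Postcode
ON dbo.PreMerge (Postcode, master_id)
INCLUDE (Title, Forename, Surname, CompanyName, Address1, Hierarchy)
```

Either this or the clustered index shown earlier does the blocking work; you wouldn't normally need both.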

    The merge script I'm trying to rebuild contains the following for Pass 1:

    • left(a.Surname,5) = left(b.Surname,5)
    • left(a.Forename,1) = left(b.Forename,1)
    • a.Postcode = b.Postcode
    • left(a.CompanyName,8) = left(b.CompanyName,8)
    • left(a.Address1,8) = left(b.Address1,8)

    Your function and example works great. Maybe I'm just stupid but I just don't see how I can use this to replace the passes system used by the current merge script we have.

    The passes system isn't very efficient - it's a pile of scripts that someone used to identify dupes, run sequentially. Why not find out exactly how closely matched each column is (surname, forename, title etc.) and then set a cutoff point for each one?

    I'm not looking for someone to do the dirty work for me so please don't take this the wrong way! I just need some guidance so I know I'm on the right track.

    Should I create a new table with match ratios for each field then do the filtering on the combination of ratios?

    No, it can all be done in a single step.

    I'm really confused!

    2. You're left with serious numbercrunching over a scan of the reference table and seeks to the target. Do as much work as possible as efficiently as you can. The function is set up as an iTVF so it can parallelise, and you can get it to compare multiple values in the same query.

    But from the example select I don't see how I can compare more than one field at a time or am I missing something?

    Here's how to compare several columns at once:

    SELECT r.*, '#' AS '#', t.master_id,
        t.Surname, Fuzzysurname.Result, Fuzzysurname.MatchRatio,
        t.Address1, FuzzyAddress1.Result, FuzzyAddress1.MatchRatio,
        t.Title, FuzzyTitle.Result, FuzzyTitle.MatchRatio,
        t.CompanyName, FuzzyCompanyName.Result, FuzzyCompanyName.MatchRatio
    FROM dbo.PreMerge r
    INNER JOIN dbo.PreMerge t
        ON t.Postcode = r.Postcode
        AND t.master_id > r.master_id
    CROSS APPLY [dbo].[FuzzyMatch_iTVF2k5] (r.Surname, t.Surname) Fuzzysurname
    CROSS APPLY [dbo].[FuzzyMatch_iTVF2k5] (r.Address1, t.Address1) FuzzyAddress1
    CROSS APPLY [dbo].[FuzzyMatch_iTVF2k5] (r.Title, t.Title) FuzzyTitle
    CROSS APPLY [dbo].[FuzzyMatch_iTVF2k5] (r.CompanyName, t.CompanyName) FuzzyCompanyName
    WHERE Fuzzysurname.MatchRatio > 55
    AND FuzzyAddress1.MatchRatio > 50
    ORDER BY r.master_id


  • Thanks Chris.

    Just to confirm: is FuzzyMatch_iTVF2k5 a full implementation of http://research.microsoft.com/pubs/75996/bm_sigmod03.pdf ?


  • Abu Dina (7/31/2012)


    Thanks Chris.

    Just to confirm. Is the FuzzyMatch_iTVF2k5 is a full implementation of http://research.microsoft.com/pubs/75996/bm_sigmod03.pdf ?

    No - it's just the closest online reference I could find at the time. The iTVF uses the token-matching method, which is fairly well understood and quite widely used, but folks appear reluctant to publish anything about it - or at least they were around 7 years ago when I first wrote this.
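The token-matching idea can be seen in isolation with a small standalone snippet: count how many 3-character tokens of the reference string appear anywhere in the target, as the Result = 5 branch of the function does. The example strings and the tally size here are purely illustrative:

```sql
-- Illustration only: score two literal strings by counting how many
-- 3-character tokens (trigrams) of the reference appear in the target.
-- Written with a UNION ALL tally so it runs on SQL Server 2005.
DECLARE @Reference VARCHAR(100), @Target VARCHAR(100)
SELECT @Reference = 'Glowver', @Target = 'Glover'

SELECT MatchPC = 100.00 * COUNT(*) / (DATALENGTH(@Reference) - 2.00)
FROM (SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL
      SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8) Tally (n)
WHERE Tally.n <= DATALENGTH(@Reference) - 2                      -- one row per trigram start
  AND CHARINDEX(SUBSTRING(@Reference, Tally.n, 3), @Target) > 0  -- trigram found in target?
```

The numerator is the count of shared trigrams, the denominator the number of trigrams in the reference, so identical strings score 100 and disjoint strings score 0.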

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
