Record linkage advice required– Long read (YOU HAVE BEEN WARNED!)

  • I see.

    Another question if you don't mind (you're going to hate me once this little project of mine is done! but if you're anywhere near the northwest then I'd be happy to repay your kindness with drinks of your choice lol).

    How can Willis & Lilley have a match ratio = 60 ?! I've been tweaking the surname setting and I've settled on 54 for now but this is a strange result or is this one of many inevitable false positives I will always encounter with any record set?

    Thanks.

    ---------------------------------------------------------

    It takes a minimal capacity for rational thought to see that the corporate 'free press' is a structurally irrational and biased, and extremely violent, system of elite propaganda.
    David Edwards - Media lens[/url]

    Society has varying and conflicting interests; what is called objectivity is the disguise of one of these interests - that of neutrality. But neutrality is a fiction in an unneutral world. There are victims, there are executioners, and there are bystanders... and the 'objectivity' of the bystander calls for inaction while other heads fall.
    Howard Zinn

  • Abu Dina (7/31/2012)


    I see.

    Another question if you don't mind (you're going to hate me once this little project of mine is done! but if you're anywhere near the northwest then I'd be happy to repay your kindness with drinks of your choice lol).

    Northwest UK? Sure!

    Keep coming with the questions, there are plenty of good folks here to lend a hand.

    How can Willis & Lilley have a match ratio = 60 ?! I've been tweaking the surname setting and I've settled on 54 for now but this is a strange result or is this one of many inevitable false positives I will always encounter with any record set?

    Thanks.

    That's because I made a mistake playing with the function yesterday :blush:

    Here are the 2-char overlapping tokens from 'Willis' which are matched in 'Lilley':

    il, ll, li

    Here's the fixed function:

    -- =============================================

    -- Author:Chris Morris

    -- Create date: 30/07/2012

    -- =============================================

    ALTER FUNCTION [dbo].[FuzzyMatch_iTVF2k5]

    (@Reference VARCHAR(100) = NULL,

    @Target VARCHAR(100) = NULL)

    RETURNS table WITH SCHEMABINDING

    AS

    -- See also http://research.microsoft.com/pubs/75996/bm_sigmod03.pdf

    RETURN

    SELECT d.Result, MatchRatio = CAST(CASE

    WHEN d.Result = 1 THEN 100

    WHEN d.Result = 3 THEN DATALENGTH(@Target)*100.00/DATALENGTH(@Reference)

    WHEN d.Result = 4 THEN DATALENGTH(@Reference)*100.00/DATALENGTH(@Target)

    WHEN d.Result = 5 THEN

    (SELECT MatchPC = CAST((100.00*COUNT(*)/(DATALENGTH(@Reference)-2.00)) AS DECIMAL(5,2))

    * CASE WHEN DATALENGTH(@Target) > DATALENGTH(@Reference) THEN DATALENGTH(@Reference)*1.00 / DATALENGTH(@Target)*1.00 ELSE 1 END

    FROM (

    SELECT n = t1.n+t2.n

    FROM (VALUES (0),(10),(20),(30),(40),(50),(60),(70),(80),(90)) t1(n)

    CROSS JOIN (VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10)) t2(n)

    WHERE t1.n+t2.n <= DATALENGTH(@Reference)

    ) Tally

    WHERE CHARINDEX(SUBSTRING(@Reference, Tally.n, 3), @Target) > 0

    AND DATALENGTH(SUBSTRING(@Reference, Tally.n, 3)) = 3)

    WHEN d.Result = 6 THEN

    (SELECT MatchPC = CAST((100.00*COUNT(*)/(DATALENGTH(@Reference)-0.00)) AS DECIMAL(5,2))

    * CASE WHEN DATALENGTH(@Target) > DATALENGTH(@Reference) THEN DATALENGTH(@Reference)*1.00 / DATALENGTH(@Target)*1.00 ELSE 1 END

    FROM (

    SELECT n = t1.n

    FROM (VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10)) t1(n)

    WHERE t1.n <= DATALENGTH(@Reference)

    ) Tally

    WHERE CHARINDEX(SUBSTRING(@Reference, Tally.n, 2), @Target) > 0

    AND DATALENGTH(SUBSTRING(@Reference, Tally.n, 2)) = 2)

    ELSE NULL

    END AS DECIMAL(5,2))

    FROM (

    SELECT Result = CASE

    WHEN @Reference = @Target THEN 1

    WHEN @Reference IS NULL OR @Target IS NULL THEN 2

    WHEN @Reference LIKE '%'+@Target+'%' THEN 3

    WHEN @Target LIKE '%'+@Reference+'%' THEN 4

    WHEN DATALENGTH(@Reference) > 7 AND DATALENGTH(@Target) > 7 THEN 5

    WHEN DATALENGTH(@Reference) > 2 AND DATALENGTH(@Target) > 2 THEN 6 -- SOUNDEX may work better

    ELSE 7

    END

    ) d

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • Northwest UK yes 🙂 I will PM you.. seriously you deserve drinks for all this help!!!!!

    I've made the change and the function is working better now. 🙂

    Going to play around with this more and will report back again tomorrow.

    Once again, thanks so much for your efforts I really appreciate it.

    Have a good evening.

    p.s.

    Sorry I must have PMed you 3 times already by mistake. The sent folder doesn't get updated with sent messages then :rolleyes:

    ---------------------------------------------------------

    It takes a minimal capacity for rational thought to see that the corporate 'free press' is a structurally irrational and biased, and extremely violent, system of elite propaganda.
    David Edwards - Media lens[/url]

    Society has varying and conflicting interests; what is called objectivity is the disguise of one of these interests - that of neutrality. But neutrality is a fiction in an unneutral world. There are victims, there are executioners, and there are bystanders... and the 'objectivity' of the bystander calls for inaction while other heads fall.
    Howard Zinn

  • Abu Dina (7/31/2012)


    Northwest UK yes 🙂 I will PM you.. seriously you deserve drinks for all this help!!!!!

    I've made the change and the function is working better now. 🙂

    Going to play around with this more and will report back again tomorrow.

    Once again, thanks so much for your efforts I really appreciate it.

    Have a good evening.

    p.s.

    Sorry I must have PMed you 3 times already by mistake. The sent folder doesn't get updated with sent messages then :rolleyes:

    We're all volunteers here, but thanks!

    When you're ready you will probably need an explanation of how this function works - a good place to start would be one of Jeff Moden's tally table articles.

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • Hi Chris,

    I've been trying to work out a stragey for my merging but I'm struggling a little. :blush:

    The merge stored procedure will accept two parameters, the source table and the merge criteria

    1) Merge by Contact

    2) Merge by Business

    3) Merge by Address

    Maybe I've started with the wrong one but I'm trying to work out the Merge by Address but I just can't get it right.

    As mentioned previoulsy, the source table will always be in the following format:

    create table dbo.PreMerge (

    master_idint,

    Titlevarchar(20),

    Forenamevarchar(50),

    Surnamevarchar(50),

    CompanyNamevarchar(200),

    CompanyName2varchar(200),

    Address1varchar(200),

    Address2varchar(200),

    Address3varchar(200),

    Address4varchar(200),

    Townvarchar(100),

    Countyvarchar(100),

    Postcodevarchar(20),

    Hierarchyint

    )

    I've applied the iTVF to get a match ratio for each column in the following sample data.

    insert into PreMerge (master_id, Title, Forename, Surname, CompanyName, Address1, Address2, Town, Postcode, Hierarchy)

    select 1547300,'Mr','E','Glowver',NULL,'21 Harcourt Crescent','Nuthall','Nottingham','NG14 6BT','20100309' union all

    select 1581378,'Mrs','C E','Glowver',NULL,'21 Harcourt Crescent',NULL,'Nottingham','NG14 6BT','20100309' union all

    select 43196681, NULL,NULL,'Glowveer',NULL,'21 Harcourt','Crescent',NULL,'NG14 6BT','20120501' union all

    select 43196681, 'Mr','C J E','Glowser',NULL,'Harcourt','21 Harcourt Crescent',NULL,'NG14 6BT','20120501' union all

    select 64273, 'Mrs', 'A', 'Tomlinson', NULL, '13 Harwood New Road', NULL, 'BLACKBURN', 'BB4 6QS', 2 union all

    select 74575, 'Mr', 'M', 'Tomlinson', NULL, '13 Harwood New Road Great Harw', 'Great Harwood', 'BLACKBURN', 'BB4 6QS', 2

    go

    if object_id('PostMergeTest') > 0

    drop table dbo.PostMergeTest

    go

    SELECT

    r.master_id as retained_master_id,

    t.master_id as deleted_master_id,

    '#' AS '#',

    isnull(FuzzyTitle.MatchRatio, 0) as Title_MatchRatio,

    isnull(FuzzyForename.MatchRatio, 0) as Forename_MatchRatio,

    isnull(FuzzySurname.MatchRatio, 0) as Surname_MatchRatio,

    isnull(FuzzyAddress1.MatchRatio, 0) as Address1_MatchRatio,

    isnull(FuzzyTown.MatchRatio, 0) as Town_MatchRatio,

    '#' AS '##',

    r.title as retained_title,

    r.forename as retained_forename,

    r.surname as retained_surname,

    r.Address1 as retained_Address1,

    r.Address2 as retained_Address2,

    r.Town as retained_Town,

    '#' AS '###',

    t.title as deleted_title,

    t.forename as deleted_forename,

    t.Surname as deleted_surname,

    t.Address1 as deleted_Address1,

    t.Address2 as deleted_Address2,

    t.Town as deleted_Town

    into dbo.PostMergeTest

    FROM dbo.PreMerge r

    INNER JOIN dbo.PreMerge t

    ON t.postcode = r.postcode

    AND ((r.hierarchy<t.hierarchy) or (r.hierarchy=t.hierarchy and r.master_id>t.master_id))

    -------------------------------------------------------------------------------------------------------------------------------------

    CROSS APPLY [dbo].[FuzzyMatch_iTVF2k5] (r.Title, t.Title) FuzzyTitle-- Title

    CROSS APPLY [dbo].[FuzzyMatch_iTVF2k5] (r.Forename, t.Forename) FuzzyForename-- Forename

    CROSS APPLY [dbo].[FuzzyMatch_iTVF2k5] (r.Surname, t.Surname) FuzzySurname-- Surname

    CROSS APPLY [dbo].[FuzzyMatch_iTVF2k5] (r.address1, t.address2) FuzzyAddress12-- Address1&Address2

    CROSS APPLY [dbo].[FuzzyMatch_iTVF2k5] (r.address1, t.Address1) FuzzyAddress1-- Address1

    CROSS APPLY [dbo].[FuzzyMatch_iTVF2k5] (r.Town, t.Town) FuzzyTown-- Town

    --------------------------------------------------------------------------------------------------------------------------------------

    order by Fuzzysurname.MatchRatio

    go

    I then created a new table to hold the duplicate IDs:

    -- Match Level 1 = Surname, Address1 100% match

    insert into dbo.FinalSiteDuplicates (retained_master_id, deleted_master_id, MatchLevel)

    select

    retained_master_id,

    deleted_master_id,

    1

    from PostMergeTest

    where (Surname_MatchRatio > 99 and Address1_MatchRatio > 99)

    .

    .

    .

    .

    .

    This find one duplicate but I've been trying different ration numbers and I jst can't get it to work.

    am I doing this whole thing wrong? Is there a better way to do this?

    Thanks in advance.

    ---------------------------------------------------------

    It takes a minimal capacity for rational thought to see that the corporate 'free press' is a structurally irrational and biased, and extremely violent, system of elite propaganda.
    David Edwards - Media lens[/url]

    Society has varying and conflicting interests; what is called objectivity is the disguise of one of these interests - that of neutrality. But neutrality is a fiction in an unneutral world. There are victims, there are executioners, and there are bystanders... and the 'objectivity' of the bystander calls for inaction while other heads fall.
    Howard Zinn

  • You need to tune the original query to reduce the false positives. The easiest way to do this is to tweak the match ratios in the WHERE clause until you can only see false positives, then reverse the logic;

    -- select for false positives

    WHERE (FuzzySurname.MatchRatio < 50 AND FuzzyAddress1.MatchRatio < 30)

    ORDER BY surname, address1

    -- eliminate false positives

    WHERE NOT (FuzzySurname.MatchRatio < 50 AND FuzzyAddress1.MatchRatio < 30)

    ORDER BY surname, address1

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • The more I think about this the more I come to realise that the address fields must be standardised before any merge work is done. I've spent hours tweaking my cut off points but no matter what figure or combination of ratios I use, I always end up missing out what I class as obvious duplicates!

    The problem is I can’t really PAF my source file as some customers will not want their addresses rebuilding so I’m having to work with the original address data.

    ---------------------------------------------------------

    It takes a minimal capacity for rational thought to see that the corporate 'free press' is a structurally irrational and biased, and extremely violent, system of elite propaganda.
    David Edwards - Media lens[/url]

    Society has varying and conflicting interests; what is called objectivity is the disguise of one of these interests - that of neutrality. But neutrality is a fiction in an unneutral world. There are victims, there are executioners, and there are bystanders... and the 'objectivity' of the bystander calls for inaction while other heads fall.
    Howard Zinn

  • Abu Dina (8/2/2012)


    The more I think about this the more I come to realise that the address fields must be standardised before any merge work is done. I've spent hours tweaking my cut off points but no matter what figure or combination of ratios I use, I always end up missing out what I class as obvious duplicates!

    The problem is I can’t really PAF my source file as some customers will not want their addresses rebuilding so I’m having to work with the original address data.

    Been there...

    Try putting more than one address element into the function, address1+address2+address3 (replace NULL with '').

    Don't forget that you can have many surnames at an address, and many individuals with the same surname but different fornames, also at the same address. It might help if you think in terms of Person and Household dedupe levels.

    So long as your postcodes are diamond you'll get there.

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

Viewing 8 posts - 16 through 23 (of 23 total)

You must be logged in to reply to this topic. Login to reply