July 31, 2012 at 9:50 am
I see.
Another question if you don't mind (you're going to hate me once this little project of mine is done! but if you're anywhere near the northwest then I'd be happy to repay your kindness with drinks of your choice lol).
How can Willis & Lilley have a match ratio = 60 ?! I've been tweaking the surname setting and I've settled on 54 for now but this is a strange result or is this one of many inevitable false positives I will always encounter with any record set?
Thanks.
---------------------------------------------------------
It takes a minimal capacity for rational thought to see that the corporate 'free press' is a structurally irrational and biased, and extremely violent, system of elite propaganda.
David Edwards - Media lens[/url]
Society has varying and conflicting interests; what is called objectivity is the disguise of one of these interests - that of neutrality. But neutrality is a fiction in an unneutral world. There are victims, there are executioners, and there are bystanders... and the 'objectivity' of the bystander calls for inaction while other heads fall.
Howard Zinn
July 31, 2012 at 10:05 am
Abu Dina (7/31/2012)
I see.Another question if you don't mind (you're going to hate me once this little project of mine is done! but if you're anywhere near the northwest then I'd be happy to repay your kindness with drinks of your choice lol).
Northwest UK? Sure!
Keep coming with the questions, there are plenty of good folks here to lend a hand.
How can Willis & Lilley have a match ratio = 60 ?! I've been tweaking the surname setting and I've settled on 54 for now but this is a strange result or is this one of many inevitable false positives I will always encounter with any record set?
Thanks.
That's because I made a mistake playing with the function yesterday :blush:
Here are the 2-char overlapping tokens from 'Willis' which are matched in 'Lilley':
il, ll, li
Here's the fixed function:
-- =============================================
-- Author:Chris Morris
-- Create date: 30/07/2012
-- =============================================
ALTER FUNCTION [dbo].[FuzzyMatch_iTVF2k5]
(@Reference VARCHAR(100) = NULL,
@Target VARCHAR(100) = NULL)
RETURNS table WITH SCHEMABINDING
AS
-- See also http://research.microsoft.com/pubs/75996/bm_sigmod03.pdf
RETURN
SELECT d.Result, MatchRatio = CAST(CASE
WHEN d.Result = 1 THEN 100
WHEN d.Result = 3 THEN DATALENGTH(@Target)*100.00/DATALENGTH(@Reference)
WHEN d.Result = 4 THEN DATALENGTH(@Reference)*100.00/DATALENGTH(@Target)
WHEN d.Result = 5 THEN
(SELECT MatchPC = CAST((100.00*COUNT(*)/(DATALENGTH(@Reference)-2.00)) AS DECIMAL(5,2))
* CASE WHEN DATALENGTH(@Target) > DATALENGTH(@Reference) THEN DATALENGTH(@Reference)*1.00 / DATALENGTH(@Target)*1.00 ELSE 1 END
FROM (
SELECT n = t1.n+t2.n
FROM (VALUES (0),(10),(20),(30),(40),(50),(60),(70),(80),(90)) t1(n)
CROSS JOIN (VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10)) t2(n)
WHERE t1.n+t2.n <= DATALENGTH(@Reference)
) Tally
WHERE CHARINDEX(SUBSTRING(@Reference, Tally.n, 3), @Target) > 0
AND DATALENGTH(SUBSTRING(@Reference, Tally.n, 3)) = 3)
WHEN d.Result = 6 THEN
(SELECT MatchPC = CAST((100.00*COUNT(*)/(DATALENGTH(@Reference)-0.00)) AS DECIMAL(5,2))
* CASE WHEN DATALENGTH(@Target) > DATALENGTH(@Reference) THEN DATALENGTH(@Reference)*1.00 / DATALENGTH(@Target)*1.00 ELSE 1 END
FROM (
SELECT n = t1.n
FROM (VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10)) t1(n)
WHERE t1.n <= DATALENGTH(@Reference)
) Tally
WHERE CHARINDEX(SUBSTRING(@Reference, Tally.n, 2), @Target) > 0
AND DATALENGTH(SUBSTRING(@Reference, Tally.n, 2)) = 2)
ELSE NULL
END AS DECIMAL(5,2))
FROM (
SELECT Result = CASE
WHEN @Reference = @Target THEN 1
WHEN @Reference IS NULL OR @Target IS NULL THEN 2
WHEN @Reference LIKE '%'+@Target+'%' THEN 3
WHEN @Target LIKE '%'+@Reference+'%' THEN 4
WHEN DATALENGTH(@Reference) > 7 AND DATALENGTH(@Target) > 7 THEN 5
WHEN DATALENGTH(@Reference) > 2 AND DATALENGTH(@Target) > 2 THEN 6 -- SOUNDEX may work better
ELSE 7
END
) d
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
July 31, 2012 at 10:17 am
Northwest UK yes 🙂 I will PM you.. seriously you deserve drinks for all this help!!!!!
I've made the change and the function is working better now. 🙂
Going to play around with this more and will report back again tomorrow.
Once again, thanks so much for your efforts I really appreciate it.
Have a good evening.
p.s.
Sorry I must have PMed you 3 times already by mistake. The sent folder doesn't get updated with sent messages then :rolleyes:
---------------------------------------------------------
It takes a minimal capacity for rational thought to see that the corporate 'free press' is a structurally irrational and biased, and extremely violent, system of elite propaganda.
David Edwards - Media lens[/url]
Society has varying and conflicting interests; what is called objectivity is the disguise of one of these interests - that of neutrality. But neutrality is a fiction in an unneutral world. There are victims, there are executioners, and there are bystanders... and the 'objectivity' of the bystander calls for inaction while other heads fall.
Howard Zinn
August 1, 2012 at 1:42 am
Abu Dina (7/31/2012)
Northwest UK yes 🙂 I will PM you.. seriously you deserve drinks for all this help!!!!!I've made the change and the function is working better now. 🙂
Going to play around with this more and will report back again tomorrow.
Once again, thanks so much for your efforts I really appreciate it.
Have a good evening.
p.s.
Sorry I must have PMed you 3 times already by mistake. The sent folder doesn't get updated with sent messages then :rolleyes:
We're all volunteers here, but thanks!
When you're ready you will probably need an explanation of how this function works - a good place to start would be one of Jeff Moden's tally table articles.
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
August 2, 2012 at 9:03 am
Hi Chris,
I've been trying to work out a stragey for my merging but I'm struggling a little. :blush:
The merge stored procedure will accept two parameters, the source table and the merge criteria
1) Merge by Contact
2) Merge by Business
3) Merge by Address
Maybe I've started with the wrong one but I'm trying to work out the Merge by Address but I just can't get it right.
As mentioned previoulsy, the source table will always be in the following format:
create table dbo.PreMerge (
master_idint,
Titlevarchar(20),
Forenamevarchar(50),
Surnamevarchar(50),
CompanyNamevarchar(200),
CompanyName2varchar(200),
Address1varchar(200),
Address2varchar(200),
Address3varchar(200),
Address4varchar(200),
Townvarchar(100),
Countyvarchar(100),
Postcodevarchar(20),
Hierarchyint
)
I've applied the iTVF to get a match ratio for each column in the following sample data.
insert into PreMerge (master_id, Title, Forename, Surname, CompanyName, Address1, Address2, Town, Postcode, Hierarchy)
select 1547300,'Mr','E','Glowver',NULL,'21 Harcourt Crescent','Nuthall','Nottingham','NG14 6BT','20100309' union all
select 1581378,'Mrs','C E','Glowver',NULL,'21 Harcourt Crescent',NULL,'Nottingham','NG14 6BT','20100309' union all
select 43196681, NULL,NULL,'Glowveer',NULL,'21 Harcourt','Crescent',NULL,'NG14 6BT','20120501' union all
select 43196681, 'Mr','C J E','Glowser',NULL,'Harcourt','21 Harcourt Crescent',NULL,'NG14 6BT','20120501' union all
select 64273, 'Mrs', 'A', 'Tomlinson', NULL, '13 Harwood New Road', NULL, 'BLACKBURN', 'BB4 6QS', 2 union all
select 74575, 'Mr', 'M', 'Tomlinson', NULL, '13 Harwood New Road Great Harw', 'Great Harwood', 'BLACKBURN', 'BB4 6QS', 2
go
if object_id('PostMergeTest') > 0
drop table dbo.PostMergeTest
go
SELECT
r.master_id as retained_master_id,
t.master_id as deleted_master_id,
'#' AS '#',
isnull(FuzzyTitle.MatchRatio, 0) as Title_MatchRatio,
isnull(FuzzyForename.MatchRatio, 0) as Forename_MatchRatio,
isnull(FuzzySurname.MatchRatio, 0) as Surname_MatchRatio,
isnull(FuzzyAddress1.MatchRatio, 0) as Address1_MatchRatio,
isnull(FuzzyTown.MatchRatio, 0) as Town_MatchRatio,
'#' AS '##',
r.title as retained_title,
r.forename as retained_forename,
r.surname as retained_surname,
r.Address1 as retained_Address1,
r.Address2 as retained_Address2,
r.Town as retained_Town,
'#' AS '###',
t.title as deleted_title,
t.forename as deleted_forename,
t.Surname as deleted_surname,
t.Address1 as deleted_Address1,
t.Address2 as deleted_Address2,
t.Town as deleted_Town
into dbo.PostMergeTest
FROM dbo.PreMerge r
INNER JOIN dbo.PreMerge t
ON t.postcode = r.postcode
AND ((r.hierarchy<t.hierarchy) or (r.hierarchy=t.hierarchy and r.master_id>t.master_id))
-------------------------------------------------------------------------------------------------------------------------------------
CROSS APPLY [dbo].[FuzzyMatch_iTVF2k5] (r.Title, t.Title) FuzzyTitle-- Title
CROSS APPLY [dbo].[FuzzyMatch_iTVF2k5] (r.Forename, t.Forename) FuzzyForename-- Forename
CROSS APPLY [dbo].[FuzzyMatch_iTVF2k5] (r.Surname, t.Surname) FuzzySurname-- Surname
CROSS APPLY [dbo].[FuzzyMatch_iTVF2k5] (r.address1, t.address2) FuzzyAddress12-- Address1&Address2
CROSS APPLY [dbo].[FuzzyMatch_iTVF2k5] (r.address1, t.Address1) FuzzyAddress1-- Address1
CROSS APPLY [dbo].[FuzzyMatch_iTVF2k5] (r.Town, t.Town) FuzzyTown-- Town
--------------------------------------------------------------------------------------------------------------------------------------
order by Fuzzysurname.MatchRatio
go
I then created a new table to hold the duplicate IDs:
-- Match Level 1 = Surname, Address1 100% match
insert into dbo.FinalSiteDuplicates (retained_master_id, deleted_master_id, MatchLevel)
select
retained_master_id,
deleted_master_id,
1
from PostMergeTest
where (Surname_MatchRatio > 99 and Address1_MatchRatio > 99)
.
.
.
.
.
This find one duplicate but I've been trying different ration numbers and I jst can't get it to work.
am I doing this whole thing wrong? Is there a better way to do this?
Thanks in advance.
---------------------------------------------------------
It takes a minimal capacity for rational thought to see that the corporate 'free press' is a structurally irrational and biased, and extremely violent, system of elite propaganda.
David Edwards - Media lens[/url]
Society has varying and conflicting interests; what is called objectivity is the disguise of one of these interests - that of neutrality. But neutrality is a fiction in an unneutral world. There are victims, there are executioners, and there are bystanders... and the 'objectivity' of the bystander calls for inaction while other heads fall.
Howard Zinn
August 2, 2012 at 9:32 am
You need to tune the original query to reduce the false positives. The easiest way to do this is to tweak the match ratios in the WHERE clause until you can only see false positives, then reverse the logic;
-- select for false positives
WHERE (FuzzySurname.MatchRatio < 50 AND FuzzyAddress1.MatchRatio < 30)
ORDER BY surname, address1
-- eliminate false positives
WHERE NOT (FuzzySurname.MatchRatio < 50 AND FuzzyAddress1.MatchRatio < 30)
ORDER BY surname, address1
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
August 2, 2012 at 9:35 am
The more I think about this the more I come to realise that the address fields must be standardised before any merge work is done. I've spent hours tweaking my cut off points but no matter what figure or combination of ratios I use, I always end up missing out what I class as obvious duplicates!
The problem is I can’t really PAF my source file as some customers will not want their addresses rebuilding so I’m having to work with the original address data.
---------------------------------------------------------
It takes a minimal capacity for rational thought to see that the corporate 'free press' is a structurally irrational and biased, and extremely violent, system of elite propaganda.
David Edwards - Media lens[/url]
Society has varying and conflicting interests; what is called objectivity is the disguise of one of these interests - that of neutrality. But neutrality is a fiction in an unneutral world. There are victims, there are executioners, and there are bystanders... and the 'objectivity' of the bystander calls for inaction while other heads fall.
Howard Zinn
August 2, 2012 at 9:48 am
Abu Dina (8/2/2012)
The more I think about this the more I come to realise that the address fields must be standardised before any merge work is done. I've spent hours tweaking my cut off points but no matter what figure or combination of ratios I use, I always end up missing out what I class as obvious duplicates!The problem is I can’t really PAF my source file as some customers will not want their addresses rebuilding so I’m having to work with the original address data.
Been there...
Try putting more than one address element into the function, address1+address2+address3 (replace NULL with '').
Don't forget that you can have many surnames at an address, and many individuals with the same surname but different fornames, also at the same address. It might help if you think in terms of Person and Household dedupe levels.
So long as your postcodes are diamond you'll get there.
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
Viewing 8 posts - 16 through 23 (of 23 total)
You must be logged in to reply to this topic. Login to reply