December 17, 2014 at 9:30 am
Hi all
I need to do some fuzzy-matching on several fields to find the "close but not quite" matches.
I've concatenated all the fields together to (hopefully) make life a bit easier.
This is the code I'm using so far:-
;with cte_main as
(select top 0.1 percent
ROW_NUMBER() over (order by isnull(SocialSecurityNumber,'') + Upper(NameLast) + isnull(Upper(NameFirst),'')
+ isnull(convert(varchar(10),Birthdate,103),'') + isnull([Address1],'') + isnull([Address2],'')
+ isnull([City],'') + isnull([Zip],'')) as rn
,isnull(SocialSecurityNumber,'') + Upper(NameLast) + isnull(Upper(NameFirst),'')
+ isnull(convert(varchar(10),Birthdate,103),'') + isnull([Address1],'') + isnull([Address2],'')
+ isnull([City],'') + isnull([Zip],'') as Unique_String
,SocialSecurityNumber as [NHS Number]
,med.[PrefixMedicalRecordNumber] as [RU Number]
,Upper(NameLast) as [Surname]
,Upper(NameFirst) as [Forename]
,Birthdate
,[Address1]
,[Address2]
,[City]
,[Zip] as PostCode
,CreatedDateTime
FROM
[livefocdb_daily].[dbo].[HimRec_Main] main
left join [livefocdb_daily].[dbo].[HimRec_MedicalRecordNumbers] med
on main.[PatientID]=med.[PatientID]
and main.[SourceID]=med.[SourceID]
left join [livefocdb_daily].[dbo].[HimRec_Address] addr
on main.[PatientID]=addr.[PatientID]
and main.[SourceID]=addr.[SourceID]
Where
MergedTo_HimRecID Is Null
and SocialSecurityNumber is not null
and (med.[PrefixMedicalRecordNumber] like 'RU%'
or med.[PrefixMedicalRecordNumber] is null)
Group By
SocialSecurityNumber
,med.[PrefixMedicalRecordNumber]
,Upper(NameLast)
,Upper(NameFirst)
,Birthdate
,[Address1]
,[Address2]
,[City]
,[Zip]
,CreatedDateTime
order by
isnull(SocialSecurityNumber,'') + Upper(NameLast) + isnull(Upper(NameFirst),'')
+ isnull(convert(varchar(10),Birthdate,103),'') + isnull([Address1],'') + isnull([Address2],'')
+ isnull([City],'') + isnull([Zip],'')
)
--select * from cte_main
select
a.[NHS Number]
,a.[RU Number]
,a.[Surname]
,a.[Forename]
,a.Birthdate
,a.[Address1]
,a.[Address2]
,a.[City]
,a.PostCode
,a.CreatedDateTime
,'is like' as Header
,[RFT_LOOKUPS].[dbo].[edit_distance](a.Unique_String,b.Unique_String) as Differences
,len(a.Unique_String) as String_Length_A
,len(b.Unique_String) as String_Length_B
,cast([RFT_LOOKUPS].[dbo].[edit_distance](a.Unique_String,b.Unique_String) as float)
/cast(len(a.Unique_String) as float) as Percentage_Match
,b.[NHS Number]
,b.[RU Number]
,b.[Surname]
,b.[Forename]
,b.Birthdate
,b.[Address1]
,b.[Address2]
,b.[City]
,b.PostCode
from
cte_main a
left join cte_main b
on a.rn=b.rn+1
--where
--[RFT_LOOKUPS].[dbo].[edit_distance](a.Unique_String,b.Unique_String)<10
--a.Unique_String<>b.Unique_String
order by
--[RFT_LOOKUPS].[dbo].[edit_distance](a.Unique_String,b.Unique_String)
a.Unique_String
And this is the code for the function referred to in the above code:-
CREATE FUNCTION edit_distance(@s1 nvarchar(3999), @s2 nvarchar(3999))
RETURNS int
AS
BEGIN
DECLARE @s1_len int, @s2_len int
DECLARE @i int, @j-2 int, @s1_char nchar, @C int, @c_temp int
DECLARE @cv0 varbinary(8000), @cv1 varbinary(8000)
SELECT
@s1_len = LEN(@s1),
@s2_len = LEN(@s2),
@cv1 = 0x0000,
WHILE @j-2 <= @s2_len
SELECT @cv1 = @cv1 + CAST(@j AS binary(2)), @j-2 = @j-2 + 1
WHILE @i <= @s1_len
BEGIN
SELECT
@s1_char = SUBSTRING(@s1, @i, 1),
@C = @i,
@cv0 = CAST(@i AS binary(2)),
@j-2 = 1
WHILE @j-2 <= @s2_len
BEGIN
SET @c_temp = CAST(SUBSTRING(@cv1, @j-2+@j-1, 2) AS int) +
CASE WHEN @s1_char = SUBSTRING(@s2, @j-2, 1) THEN 0 ELSE 1 END
IF @C > @c_temp SET @C = @c_temp
SET @c_temp = CAST(SUBSTRING(@cv1, @j-2+@j+1, 2) AS int)+1
IF @C > @c_temp SET @C = @c_temp
SELECT @cv0 = @cv0 + CAST(@c AS binary(2)), @j-2 = @j-2 + 1
END
SELECT @cv1 = @cv0, @i = @i + 1
END
RETURN @C
END
As things stand, the above code is taking around 90 seconds to process 470 records (I've got almost half a million to check) and seems to be quite slow.
Has anyone got any suggestions for speeding it up please?
Can anyone also suggest any other methods of fuzzy-matching which might be quicker?
Viewing 0 posts
You must be logged in to reply to this topic. Login to reply