Fuzzy-matching taking forever

  • Hi all

    I need to do some fuzzy-matching on several fields to find the "close but not quite" matches.

    I've concatenated all the fields together to (hopefully) make life a bit easier.

    This is the code I'm using so far:-

    ;with cte_main as

    (select top 0.1 percent

    ROW_NUMBER() over (order by isnull(SocialSecurityNumber,'') + Upper(NameLast) + isnull(Upper(NameFirst),'')

    + isnull(convert(varchar(10),Birthdate,103),'') + isnull([Address1],'') + isnull([Address2],'')

    + isnull([City],'') + isnull([Zip],'')) as rn

    ,isnull(SocialSecurityNumber,'') + Upper(NameLast) + isnull(Upper(NameFirst),'')

    + isnull(convert(varchar(10),Birthdate,103),'') + isnull([Address1],'') + isnull([Address2],'')

    + isnull([City],'') + isnull([Zip],'') as Unique_String

    ,SocialSecurityNumber as [NHS Number]

    ,med.[PrefixMedicalRecordNumber] as [RU Number]

    ,Upper(NameLast) as [Surname]

    ,Upper(NameFirst) as [Forename]

    ,Birthdate

    ,[Address1]

    ,[Address2]

    ,[City]

    ,[Zip] as PostCode

    ,CreatedDateTime

    FROM

    [livefocdb_daily].[dbo].[HimRec_Main] main

    left join [livefocdb_daily].[dbo].[HimRec_MedicalRecordNumbers] med

    on main.[PatientID]=med.[PatientID]

    and main.[SourceID]=med.[SourceID]

    left join [livefocdb_daily].[dbo].[HimRec_Address] addr

    on main.[PatientID]=addr.[PatientID]

    and main.[SourceID]=addr.[SourceID]

    Where

    MergedTo_HimRecID Is Null

    and SocialSecurityNumber is not null

    and (med.[PrefixMedicalRecordNumber] like 'RU%'

    or med.[PrefixMedicalRecordNumber] is null)

    Group By

    SocialSecurityNumber

    ,med.[PrefixMedicalRecordNumber]

    ,Upper(NameLast)

    ,Upper(NameFirst)

    ,Birthdate

    ,[Address1]

    ,[Address2]

    ,[City]

    ,[Zip]

    ,CreatedDateTime

    order by

    isnull(SocialSecurityNumber,'') + Upper(NameLast) + isnull(Upper(NameFirst),'')

    + isnull(convert(varchar(10),Birthdate,103),'') + isnull([Address1],'') + isnull([Address2],'')

    + isnull([City],'') + isnull([Zip],'')

    )

    --select * from cte_main

    select

    a.[NHS Number]

    ,a.[RU Number]

    ,a.[Surname]

    ,a.[Forename]

    ,a.Birthdate

    ,a.[Address1]

    ,a.[Address2]

    ,a.[City]

    ,a.PostCode

    ,a.CreatedDateTime

    ,'is like' as Header

    ,[RFT_LOOKUPS].[dbo].[edit_distance](a.Unique_String,b.Unique_String) as Differences

    ,len(a.Unique_String) as String_Length_A

    ,len(b.Unique_String) as String_Length_B

    ,cast([RFT_LOOKUPS].[dbo].[edit_distance](a.Unique_String,b.Unique_String) as float)

    /cast(len(a.Unique_String) as float) as Percentage_Match

    ,b.[NHS Number]

    ,b.[RU Number]

    ,b.[Surname]

    ,b.[Forename]

    ,b.Birthdate

    ,b.[Address1]

    ,b.[Address2]

    ,b.[City]

    ,b.PostCode

    from

    cte_main a

    left join cte_main b

    on a.rn=b.rn+1

    --where

    --[RFT_LOOKUPS].[dbo].[edit_distance](a.Unique_String,b.Unique_String)<10

    --a.Unique_String<>b.Unique_String

    order by

    --[RFT_LOOKUPS].[dbo].[edit_distance](a.Unique_String,b.Unique_String)

    a.Unique_String

    And this is the code for the function referred to in the above code:-

    CREATE FUNCTION edit_distance(@s1 nvarchar(3999), @s2 nvarchar(3999))

    RETURNS int

    AS

    BEGIN

    DECLARE @s1_len int, @s2_len int

    DECLARE @i int, @j-2 int, @s1_char nchar, @C int, @c_temp int

    DECLARE @cv0 varbinary(8000), @cv1 varbinary(8000)

    SELECT

    @s1_len = LEN(@s1),

    @s2_len = LEN(@s2),

    @cv1 = 0x0000,

    @j-2 = 1, @i = 1, @C = 0

    WHILE @j-2 <= @s2_len

    SELECT @cv1 = @cv1 + CAST(@j AS binary(2)), @j-2 = @j-2 + 1

    WHILE @i <= @s1_len

    BEGIN

    SELECT

    @s1_char = SUBSTRING(@s1, @i, 1),

    @C = @i,

    @cv0 = CAST(@i AS binary(2)),

    @j-2 = 1

    WHILE @j-2 <= @s2_len

    BEGIN

    SET @C = @C + 1

    SET @c_temp = CAST(SUBSTRING(@cv1, @j-2+@j-1, 2) AS int) +

    CASE WHEN @s1_char = SUBSTRING(@s2, @j-2, 1) THEN 0 ELSE 1 END

    IF @C > @c_temp SET @C = @c_temp

    SET @c_temp = CAST(SUBSTRING(@cv1, @j-2+@j+1, 2) AS int)+1

    IF @C > @c_temp SET @C = @c_temp

    SELECT @cv0 = @cv0 + CAST(@c AS binary(2)), @j-2 = @j-2 + 1

    END

    SELECT @cv1 = @cv0, @i = @i + 1

    END

    RETURN @C

    END

    As things stand, the above code is taking around 90 seconds to process 470 records (I've got almost half a million to check) and seems to be quite slow.

    Has anyone got any suggestions for speeding it up please?

    Can anyone also suggest any other methods of fuzzy-matching which might be quicker?

Viewing 0 posts

You must be logged in to reply to this topic. Login to reply