Fuzzy Matching

  • I'm reaching out looking for pointers as I start to brainstorm ideas on how to handle a task that has been presented to me. We receive a set of data that has a manually entered piece to it, so we can receive a file that has the entries "ABC Movers", "A.B.C. Movers", or "ABC Mvoers". We would like to maintain a list of what we know the value should be (i.e. "ABC Movers") and have any variation that is close to it changed to show the correct value. I started some research and came across the SOUNDEX function, but I'm not sure how reliable it can be or if it is the best way. Does anyone have experience doing something like this?

  • RonMexico (8/30/2016)


    I'm reaching out looking for pointers as I start to brainstorm ideas on how to handle a task that has been presented to me. We receive a set of data that has a manually entered piece to it, so we can receive a file that has the entries "ABC Movers", "A.B.C. Movers", or "ABC Mvoers". We would like to maintain a list of what we know the value should be (i.e. "ABC Movers") and have any variation that is close to it changed to show the correct value. I started some research and came across the SOUNDEX function, but I'm not sure how reliable it can be or if it is the best way. Does anyone have experience doing something like this?

    oh yay, lucky you!

    Start of Minor rant: I hate free-flowing note fields that contain data that belongs in columns in perhaps lookup tables. End minor rant.

    Your best bet is going to be regular expressions (fun fun; luckily there are a lot of websites with pre-made ones to get you started), and T-SQL will not be the easiest way to do that. If possible, do it in C# or VB.NET or something that supports full regular expressions.

    This whole thing is problematic though, because people who are entering things in free-flowing notes will not always format them the way you assume they would. There will always be mismatches. Therefore, in addition to the matches, you should report on the non-matches so someone can be tasked with seeing if there are additional patterns to add to the regexes that are scrubbing the data for you.

  • The SOUNDEX hashing function is primitive and will result in many false matches when comparing strings of multiple words, even if they are very dissimilar. Basically, it retains the first character, strips additional vowels, hashes the first character of each word, and is sensitive to spaces. It is better suited to hashing single words.

    For example, all of the following convert to the same SOUNDEX hash (shown as a comment next to each call), despite being obviously different.

    print soundex('ABC Movers')      -- A120
    print soundex('ABC Maplethorpe') -- A120
    print soundex('ABC M')           -- A120

    It is also sensitive to punctuation and symbols, which again is not what you want:

    print soundex('A.B.C. Movers')   -- A000
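    Where SOUNDEX holds up a little better is on single words, since a transposed vowel usually drops out of the hash entirely. As a rough illustration (and not a recommendation for the multi-word case), both of these should return the same code:

    print soundex('Movers')
    print soundex('Mvoers')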

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • I've got a better function than SOUNDEX for you. It's called "StringsClose" and it compares strings while forgiving one and only one typo. A typo is defined as a mistyped character, an extra character, an omitted character or a single transposition of two characters. It is not perfect, and really shouldn't be trusted where both strings are 1-5 characters. All disclaimers aside, it will still handle what you're asking for in your examples.

    Before comparing the strings in your example, you will probably want to get rid of all periods (and probably other special characters as well). Like this:

    declare @StringA varchar(50) = 'A.B.C. Movers'
           ,@StringB varchar(50) = 'ABC Movres'

    select @StringA as [@StringA], @StringB as [@StringB], sc.StringsClose
    from dbo.itvf_StringsClose(replace(@StringA,'.',''),
                               replace(@StringB,'.','')
                              ) sc

    The code for the function follows. It looks clunky but runs fast enough that I've never had to revisit it and make it more elegant. Let me know if this works for you, or if you have any questions.

    CREATE FUNCTION [dbo].[itvf_StringsClose]
    (
     @String1 varchar(50)
    ,@String2 varchar(50)
    )
    RETURNS TABLE
    AS
    -- =================================================================================================
    -- Author: Bob Hovious
    -- Create date: 3/19/2008
    -- Description: This function compares two Strings and deems them to be "close" if:
    --   1. One has one more character than the other, in any position
    --   2. One character has been mistyped
    --   3. Two characters are transposed
    -- =================================================================================================
    RETURN
    (
    WITH
    -- remove all blanks from Strings
    trimmedStrings AS (SELECT replace(@String1,' ','') AS String1
                             ,replace(@String2,' ','') AS String2)
    -- make sure that the longest String is always put in String1;
    -- if they are 2 or more apart they aren't close, so don't bother passing rows to the "stuffing" CTE
    ,basevalues AS
     (SELECT CASE WHEN len(String1) >= len(String2) THEN String1 ELSE String2 END AS String1 -- long String 1
            ,CASE WHEN len(String1) >= len(String2) THEN String2 ELSE String1 END AS String2 -- short String 2
      FROM trimmedStrings
      WHERE abs(len(String1)-len(String2)) < 2
     )
    ,tally1 AS (SELECT * FROM (VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10),
                                      (11),(12),(13),(14),(15),(16),(17),(18),(19),(20),
                                      (21),(22),(23),(24),(25),(26),(27),(28),(29),(30),
                                      (31),(32),(33),(34),(35),(36),(37),(38),(39),(40),
                                      (41),(42),(43),(44),(45),(46),(47),(48),(49),(50)) v (N1)
               )
    ,tally2 AS (SELECT * FROM (VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10),
                                      (11),(12),(13),(14),(15),(16),(17),(18),(19),(20),
                                      (21),(22),(23),(24),(25),(26),(27),(28),(29),(30),
                                      (31),(32),(33),(34),(35),(36),(37),(38),(39),(40),
                                      (41),(42),(43),(44),(45),(46),(47),(48),(49),(50)) v (N2)
               )
    ----------------------------------------------------------------------------------------
    -- build sets of character Strings for "LIKE" comparisons and transposition comparisons
    ----------------------------------------------------------------------------------------
    ,comparisons AS (-- if they're equal, they're equal
                     SELECT TOP (1) 'Y' AS StringsClose
                     FROM basevalues
                     WHERE String1 = String2
                     UNION ALL
                     -- replace each character in String1 with a wildcard to handle single keystroke typos,
                     -- whether a wrong character was typed or a single character was omitted;
                     -- this works whether or not the length of String1 is greater than String2
                     SELECT TOP (1) 'Y' AS StringsClose
                     FROM basevalues
                     CROSS JOIN tally1 t1
                     WHERE 1 = 1
                       AND String2 LIKE stuff(String1,N1,1,'%')
                       AND N1 BETWEEN 2 AND len(String1)
                     UNION ALL
                     -- where Strings are the same length, check for transposition
                     SELECT TOP (1) 'Y' AS StringsClose
                     FROM basevalues
                     CROSS JOIN tally2 t2
                     WHERE 1 = 1
                       AND len(String1) = len(String2)
                       AND N2 < len(String1)
                       AND String2 = stuff(String1,N2,2,reverse(substring(String1,N2,2)))
                    )
    SELECT isnull((SELECT TOP (1) StringsClose FROM comparisons),'N') AS StringsClose
    )

    Edited to add: Having re-read and thought about your problem, this may not be much help, as it would run slowly. It runs fast as written because it compares two strings almost instantly. In your case, you would have to adapt it to run against your table of correct values for each of the String1 "LIKE" combinations generated by the function. That would NOT run almost instantly.

    The deciding factor is how often you hit these misspellings. If it's one time in a thousand, this approach might be workable for you.

    How many rows are in the table with correct values? I assume you have an index over the column with the correct values. What is the length of that column?
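    To make the cost concrete, a brute-force version of that adaptation might look like the sketch below, where dbo.IncomingNames(RawName) and dbo.CorrectNames(CorrectName) are hypothetical tables standing in for your feed and your list of correct values. Every incoming name is compared against every correct value, which is exactly the part that would not run almost instantly:

    select i.RawName, c.CorrectName
    from dbo.IncomingNames i
    cross join dbo.CorrectNames c
    cross apply dbo.itvf_StringsClose(replace(i.RawName,'.',''),
                                      replace(c.CorrectName,'.','')) sc
    where sc.StringsClose = 'Y';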

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • RonMexico (8/30/2016)


    I'm reaching out looking for pointers as I start to brainstorm ideas on how to handle a task that has been presented to me. We receive a set of data that has a manually entered piece to it, so we can receive a file that has the entries "ABC Movers", "A.B.C. Movers", or "ABC Mvoers". We would like to maintain a list of what we know the value should be (i.e. "ABC Movers") and have any variation that is close to it changed to show the correct value. I started some research and came across the SOUNDEX function, but I'm not sure how reliable it can be or if it is the best way. Does anyone have experience doing something like this?

    You need to maintain a list of what you know the value should be (i.e. "ABC Movers") along with a table of all the variations that have been identified as belonging to the correct value.

    I have extensive experience with attempts at matching "close strings" on the fly - it never works.

    I bet "A.B.C" in "A.B.C. Movers" stands for something.

    I can guarantee that somebody will enter the full name, not the abbreviation.

    And it will never be close to "ABC Movers". You just need to know that, say, "Authentic British Cars Movers" is another name for "ABC Movers".
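    To make that concrete, a rough sketch of the synonym-table approach might look like this; the table and column names (dbo.CompanySynonyms, dbo.IncomingNames) are made up for illustration:

    create table dbo.CompanySynonyms
    (
        Synonym     varchar(100) not null primary key,
        CorrectName varchar(100) not null
    );

    insert into dbo.CompanySynonyms (Synonym, CorrectName)
    values ('A.B.C. Movers', 'ABC Movers'),
           ('Authentic British Cars Movers', 'ABC Movers');

    -- resolve incoming values via exact lookup; anything unresolved goes to a person
    select i.RawName,
           ResolvedName = coalesce(s.CorrectName, i.RawName)
    from dbo.IncomingNames i
    left join dbo.CompanySynonyms s on s.Synonym = i.RawName;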

    _____________
    Code for TallyGenerator

  • Sergiy, the problem with that approach is that it works for synonyms but not for typos in the correct character string. For example:

    ABC Movres

    A.B.C Movers

    Authentic British Cars Movers

    Authentic British Car Movers

    Authntic British Cars Movers

    Auth. Brit. Car Movers

    Auth Br Car Moevrs

    I was about to say you can't anticipate typos, but if you are willing to pay the price in disk space, it's possible to write a routine that generates the typo patterns I handle with StringsClose. Obviously, the larger the string, the more rows of acceptable combinations will be produced. But with some limits on what is considered acceptable, it might be workable.

    There would need to be additional synonym capability to associate 'ABC Movers' with 'Authentic British Car Movers' so you could handle typos against the longer string and against abbreviated strings such as 'Auth Br Car Movres'. There is no taking all of the manual work out of it, and it will always be imperfect. Some edge case you didn't anticipate will always slip through the cracks. But once it's running, it's probably going to be the fastest of many imperfect solutions.
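    For what it's worth, a rough sketch of generating those one-character wildcard patterns might look like this, against a hypothetical dbo.CorrectNames(CorrectName) table and assuming names of at most 20 characters once spaces are stripped. Each correct name yields one LIKE pattern per character position; persisting and indexing those rows is where the disk-space cost comes in:

    select c.CorrectName,
           Pattern = stuff(replace(c.CorrectName,' ',''), v.N, 1, '%')
    from dbo.CorrectNames c
    join (values (1),(2),(3),(4),(5),(6),(7),(8),(9),(10),
                 (11),(12),(13),(14),(15),(16),(17),(18),(19),(20)) v (N)
      on v.N <= len(replace(c.CorrectName,' ',''));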

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • This article describes a CLR function to calculate the Levenshtein distance. Implementing the full functionality in an iTVF is tricky and results in poor performance. Here's an iTVF which is almost there but runs fast enough for most purposes - about 10,000 matches per second.

    USE [Matching]
    GO
    /****** Object: UserDefinedFunction [dbo].[IF_Levenshtein02] Script Date: 27/01/2014 20:12:53 ******/
    SET ANSI_NULLS ON
    GO
    SET QUOTED_IDENTIFIER ON
    GO
    -- this will score around 10,000 word pairs per second on 2010 laptop technology
    ALTER FUNCTION [dbo].[IF_Levenshtein02]
    (
     @Reference VARCHAR(20), @Target VARCHAR(20)
    )
    RETURNS TABLE WITH SCHEMABINDING AS
    RETURN
    ( -- output query
    SELECT [Score %] = CASE
        WHEN @Reference = @Target THEN CAST(100 AS NUMERIC(5,2))
        WHEN 0 = 1 THEN CAST(100 AS NUMERIC(5,2)) -- placeholder for any other shortcuts
        ELSE
        (SELECT
            [Score %] = CAST(SUM(LetterScore)*100.0/MAX(WordLength*WordLength) AS NUMERIC(5,2))
         FROM ( -- do
            SELECT
                seq = t1.n,
                ref.Letter,
                v.WordLength,
                LetterScore = v.WordLength - ISNULL(MIN(tgt.n),v.WordLength)
            FROM ( -- v
                SELECT
                    Reference = LEFT(@Reference + REPLICATE('_',WordLength),WordLength),
                    Target = LEFT(@Target + REPLICATE('_',WordLength),WordLength),
                    WordLength = WordLength
                FROM ( -- di
                    SELECT WordLength = MAX(WordLength)
                    FROM (VALUES (DATALENGTH(@Reference)),(DATALENGTH(@Target))) d (WordLength)
                ) di
            ) v
            CROSS APPLY ( -- t1
                SELECT TOP(WordLength) n
                FROM (VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10),(11),(12),(13),(14),(15),(16),(17),(18),(19),(20)) t2 (n)
            ) t1
            CROSS APPLY (SELECT Letter = SUBSTRING(Reference,t1.n,1)) ref
            OUTER APPLY ( -- tgt
                SELECT TOP(WordLength) n = ABS(t1.n - t2.n)
                FROM (VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10),(11),(12),(13),(14),(15),(16),(17),(18),(19),(20)) t2 (n)
                WHERE SUBSTRING(@Target,t2.n,1) = ref.Letter
            ) tgt
            GROUP BY t1.n, ref.Letter, v.WordLength
         ) do
        )
        END
    ) -- output query
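    A quick usage sketch, using the example strings from this thread (the function returns a single [Score %] column):

    select lv.[Score %]
    from dbo.IF_Levenshtein02('ABC Movers', 'ABC Mvoers') lv;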

    Edit: see also these articles by Alan.B

    "Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms." - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • The Dixie Flatline (8/30/2016)

    if you are willing to pay the price in disk space, it's possible to write a routine that generates the typo patterns I handle with StringsClose.

    No - not generating all possible typos.

    Just record every resolution an operator applied when a name was not recognised by the system.

    On every such occasion the system should stop processing the file and notify someone about the issue. Once the correct match is suggested, the system saves it in the [Synonyms] table and uses it on every further occasion.

    In my experience, over the course of 10 years every name accumulated 5-7 nicknames, with a couple of unfortunate ones having >20 variations.

    Initially it will require some attention, but once all the most typical errors are recorded it will bother people maybe once a month.
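    A minimal sketch of that "stop and record" step, with hypothetical table names (dbo.IncomingNames for the feed, dbo.CorrectNames for the known-good list, dbo.CompanySynonyms for the recorded resolutions, dbo.UnresolvedNames for the queue an operator works through):

    -- queue up anything that matches neither the correct list nor a recorded synonym
    insert into dbo.UnresolvedNames (RawName, LoadDate)
    select i.RawName, getdate()
    from dbo.IncomingNames i
    left join dbo.CorrectNames    c on c.CorrectName = i.RawName
    left join dbo.CompanySynonyms s on s.Synonym     = i.RawName
    where c.CorrectName is null
      and s.Synonym is null;

    Once the operator supplies the correct match, it goes into the synonym table and that variation never needs attention again.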

    _____________
    Code for TallyGenerator

  • Sergiy, I too have been working with fuzzy matching on people's names for over a decade. I hear what you're saying, but that remains a manual process - when something doesn't match, look at it and make some table entries. Obviously that has to happen for edge cases that fall through the cracks in code. My point was that, for the price in disk space, one-character typos can be anticipated by the system.

    Having slept on this, it may still be possible to generate LIKE patterns on the fly, but substantially reduce the I/O. I have to go to work in a few, but I could do some coding and testing tonight. If anyone else wants to tackle this, here are my thoughts.

    Have indexes on the correct string (targetstring) and on REVERSE(targetstring).

    Remove spaces and special characters.

    First pass: get all rows where targetstring LIKE LEFT(sourcestring,2) + '%' AND ABS(LEN(targetstring) - LEN(sourcestring)) <= 1.

    Cross join the result set to the set of patterns generated by StringsClose.

    Take the top (1) result.

    If no hit, get all rows where REVERSE(targetstring) LIKE REVERSE(RIGHT(sourcestring,2)) + '%' AND ABS(LEN(targetstring) - LEN(sourcestring)) <= 1.

    Cross join the result set to the set of patterns generated by StringsClose.

    Take the top (1) result.

    This would not run as fast as generating a table with all possible one-character typos, but it would certainly run much faster than a manual process.
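    A rough sketch of the first pass, using the itvf_StringsClose function posted earlier in place of a pre-generated pattern set, and a hypothetical dbo.CorrectNames table with a targetstring column:

    declare @sourcestring varchar(50) = 'ABCMovres'; -- already stripped of spaces and special characters

    select top (1) c.targetstring
    from dbo.CorrectNames c
    cross apply dbo.itvf_StringsClose(@sourcestring, c.targetstring) sc
    where c.targetstring like left(@sourcestring,2) + '%'          -- index-friendly prefix filter
      and abs(len(c.targetstring) - len(@sourcestring)) <= 1       -- lengths within 1
      and sc.StringsClose = 'Y';

    The second pass would be the same idea against REVERSE(targetstring), using the reversed last two characters of the source string as the prefix.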

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • Thanks for all the help! It sounds like I might have to maintain a table with the correct name and create a list of all of the incorrect names that go along with it as they happen. It's a monthly process that brings in 300k records but I don't anticipate there being many misspellings given that it is a feed from a reputable source.

  • Ron, how many correct names are we talking about? I'm assuming this is something like a customer name. Is the number of correct names in the thousands, tens of thousands, or millions?

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • If the manual process works for you, then go with it. But thanks for bringing up an interesting problem. 🙂

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • Mis-posted. Withdrawn.

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • This is a new process so we haven't really taken a close look at the data at this point. I can't put a number on it but would venture to say the vast majority will be correct.

  • No, I'm not talking about the input batches. I mean how many names are in your client table, or whatever.

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills
