Fuzzy Search

  • Bill Talada

    SSChampion

    Points: 11956

    Comments posted to this topic are about the item Fuzzy Search

  • Rod Weir

    Old Hand

    Points: 323

    Love it.  Brilliant work.  So simple and effective.  I just tried this on one of my databases, and it's a solid performer.  I'm thinking of updating the algorithm to account for common transpositions.  ie = ei and vice-versa..

  • David.Poole

    SSC Guru

    Points: 75191

    I've tried this on MySQL with similar results.

    As a homage to Jeff Modem I created a tally table with as least as many number as the longest string to be compared.

    MySQL uses LOCATE rather than CHARINDEX but the principal is the same.

    SELECT COUNT( DISTINCT LOCATE(SUBSTRING(@name1,tally_id,2),@name2))
    FROM tally_table
    WHERE tally_id<=LENGTH(@name1);

    I then ran the following to make sure that I understood what was going on
    SELECT LOCATE(SUBSTRING(@name1,tally_id,2),@name2),SUBSTRING(@name2,tally_id,2),@name1, @name2
    FROM tally_table
    WHERE tally_id<=LENGTH(@name1);

    This revealed a gotcha to consider.  Non-alphabetic characters crop up in pairs and count towards the score.  This can increase the score without improving the match thus suggesting that some string cleansing would be required in the source data or some considerations to be added to the function.  The combinations that can inflate the score that I can see are as follows

    • Punctuation & space
    • Letter & space
    • Space &letter
    • Letter & non-letter (such as in hyphenated names)

    • Non letter & letter

    • Double space or indeed any double non-letter

    As a quick bodge to get around this I tried changing the score to a fraction of the 2 character matches to 3 character matches.
    SELECT
    CASE trigram_position_match
    WHEN 0 THEN 0
    ELSE trigram_match_count/bigram_match_count END AS Score
    FROM (
    SELECT
    COUNT(DISTINCT LOCATE(SUBSTRING(@name1,tally_id,2),@name2)) AS bigram_match_count,
    COUNT(DISTINCT LOCATE(SUBSTRING(@name1,tally_id,3),@name2)) AS trigram_match_count,
    MAX(LOCATE(SUBSTRING(@name1,tally_id,3),@name2)) AS trigram_position_match
    FROM tally_table
    WHERE tally_id<=LENGTH(@name1)
    ) AS D;

    By having the score as a 3 letter match / 2 letter match the score range is from 0 to 1 with 1 being a perfect match.

  • ChrisM@Work

    SSC Guru

    Points: 186045

    Tried this method years ago - it's not very accurate. Here's an alternative:

    ChrisM@Work - Wednesday, November 19, 2008 5:43 AM

    Token matching works quite well... [font="Courier New"]DROP TABLE #Table CREATE TABLE #Table (TheValues VARCHAR (20)) INSERT INTO #Table (TheValues) SELECT '0010' UNION ALL         SELECT '11242' UNION ALL       SELECT '0011246' UNION ALL   SELECT '0011264' UNION ALL     SELECT '11284' UNION ALL                                 SELECT '11345' UNION ALL                                 SELECT '585' UNION ALL SELECT '5852' DECLARE @mcode VARCHAR(20) SET @mcode= '5857275' SET @mcode= '00112476' SELECT TOP 1 c.TheValues, (COUNT(*) * 100) / (LEN(@mcode)-2.00) AS MatchLevel FROM [Numbers] n INNER JOIN #Table c ON CHARINDEX(SUBSTRING(c.TheValues, n.number, 3), @mcode) > 0 WHERE n.number <= LEN(@mcode) AND LEN(SUBSTRING(c.TheValues, n.number, 3)) = 3 GROUP BY c.TheValues ORDER BY COUNT(*) DESC [/font] Results: TheValues TokenMatches ---------- ------------ 585 2 0011246 4 Cheers ChrisM

    [font="Arial"]“Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw[/font]


    For fast, accurate and documented assistance in answering your questions, please read this article[/url].
    Understanding and using APPLY, (I)[/url] and (II)[/url] Paul White[/url]
    Hidden RBAR: Triangular Joins[/url] / The "Numbers" or "Tally" Table: What it is and how it replaces a loop[/url] Jeff Moden[/url]
    [url

  • Matt Algate

    SSC Enthusiast

    Points: 170

    I like the "brute-force" approach to this - just whip through the data and count matches.

    For organization matches I have a table of common abbreviations (Co., Company, LLC, Inc., Brothers, And Sons etc.) and industry-specific words (Construction, Contractors, Builders, etc.) that I toss when looking for matches.

    ONE caveat to the function here - I would add a SUBTRACTION to account for extra characters.  
    Example:  
    [James Madison] --> [James Madison]  *SHOULD SCORE HIGHER THAN: [James Madison, JR] --> [James Madison]

  • lauri.pietarinen

    SSC Veteran

    Points: 236

    There has actually been some research on this approach,  see for instance:
    http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.14.6750&rep=rep1&type=pdf

    And Microsoft SSIS uses a similar approach in it's "fuzzy matching" component, see:
    https://www.microsoft.com/en-us/research/project/data-cleaning/

  • Bill Talada

    SSChampion

    Points: 11956

    lauri.pietarinen - Thursday, August 3, 2017 5:18 AM

    There has actually been some research on this approach,  see for instance:
    http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.14.6750&rep=rep1&type=pdf

    And Microsoft SSIS uses a similar approach in it's "fuzzy matching" component, see:
    https://www.microsoft.com/en-us/research/project/data-cleaning/

    From reading those links, I believe those approaches use old procedural logical methodologies to solve the problem.  I've tried many "intelligent" approaches and they are inferior.  The approach in this article uses what I would call a holistic approach.  Notice there is virtually no process to this function.  I'm forced to loop because of the language but with the right construct there would be no need for looping and you would just "get" a score.

  • lauri.pietarinen

    SSC Veteran

    Points: 236

    Well, not really procedural, since all those "side tables" can be created using "vanilla" SQL. Your approach may run into problems if you are comparing thousands of values with each other.  (1000x1000 = 1M, 1M x 1 sec = a very long time!).  So basically the idea is exactly the same, but some "preprocessing" is done in order to use SQL join power to process lot's of rows efficiently.

    see:
    https://www.dropbox.com/s/v6rn2n4e81sp63a/big_fuzzy.sql?dl=0

    and
    https://www.dropbox.com/s/7i84qf8p8m13g80/Fuzzy%20Matching%20with%20SQL%201.pptx?dl=0

    That being said your solution is certainly very simple and elegant for certain situations!

    Lauri

  • Sergiy

    SSC Guru

    Points: 109670

    Did you try it for pairs like
    "UCLA" <--> "University of California Los Angeles" ?
    What kind of score does it return?

    And which pair is a better match:
    [James Madison, JR] --> [James Madison, SR]
    or 
    [James Madison, JR] --> [James Madison Junior]
    ?

  • Bill Talada

    SSChampion

    Points: 11956

    Sergiy - Thursday, August 3, 2017 6:09 AM

    Did you try it for pairs like
    "UCLA" <--> "University of California Los Angeles" ?
    What kind of score does it return?

    And which pair is a better match:
    [James Madison, JR] --> [James Madison, SR]
    or 
    [James Madison, JR] --> [James Madison Junior]
    ?

    In the first parameter include all known variations so any of them can be found in the second parameter.


    SELECT dbo.FuzzyMatchString('UCLA','University of California Los Angeles'); -- returns 0
    SELECT dbo.FuzzyMatchString('UCLA University of California Los Angeles','University of California Los Angeles'); -- returns 36

    SELECT dbo.FuzzyMatchString('James Madison, JR','James Madison, Junior'); -- returns 15
    SELECT dbo.FuzzyMatchString('Mr. Dr. James R. Madison, JR Junior','James Madison, Junior'); -- returns 22

  • Kimberly Blum

    SSC Enthusiast

    Points: 141

    While this is very useful for returning a list to be examined manually, it's not appropriate for most automated name matching. For instance, if you were running this against first name, and used William as the search string, an exact match on William is rated the same as William-Frank, and Bill is rated the same as Lillian. Additional logic to weight the score to account for things such as exact match, matching substrings, etc. would increase the accuracy.

  • Bill Talada

    SSChampion

    Points: 11956

    Kimberly Blum - Thursday, August 3, 2017 9:04 AM

    While this is very useful for returning a list to be examined manually, it's not appropriate for most automated name matching. For instance, if you were running this against first name, and used William as the search string, an exact match on William is rated the same as William-Frank, and Bill is rated the same as Lillian. Additional logic to weight the score to account for things such as exact match, matching substrings, etc. would increase the accuracy.

    Yes, it is a fuzzy search and not applicable to equi-join situations otherwise we would just do exact matching.  I question any "automated name matching" routine as being 100% reliable.  The magic is that the old saying "garbage in, garbage out" is no longer true.  If a receptionist types your name in wrong while searching for your previous account, it will still find it, thus, "garbage in, good data out".

  • Matt Algate

    SSC Enthusiast

    Points: 170

    Bill Talada - Thursday, August 3, 2017 10:09 AM

    Kimberly Blum - Thursday, August 3, 2017 9:04 AM

    While this is very useful for returning a list to be examined manually, it's not appropriate for most automated name matching. For instance, if you were running this against first name, and used William as the search string, an exact match on William is rated the same as William-Frank, and Bill is rated the same as Lillian. Additional logic to weight the score to account for things such as exact match, matching substrings, etc. would increase the accuracy.

    Yes, it is a fuzzy search and not applicable to equi-join situations otherwise we would just do exact matching.  I question any "automated name matching" routine as being 100% reliable.  The magic is that the old saying "garbage in, garbage out" is no longer true.  If a receptionist types your name in wrong while searching for your previous account, it will still find it, thus, "garbage in, good data out".

    I see this as a FABULOUS approach to give the user a quick-and-dirty listing of what MIGHT be what they are looking for on a lookup.  Hey...I didn't find EXACTLY what you are looking for (else I return that first) but here are some items that MIGHT be what you are looking for.

  • J Livingston SQL

    SSC Guru

    Points: 51272

    any links to code/data sources to generate this 1 million row "Names" table please?

    ________________________________________________________________
    you can lead a user to data....but you cannot make them think
    and remember....every day is a school day

  • Lowell

    SSC Guru

    Points: 323400

    this was really neat Bill, I like it;
    the thing i wanted was to also see a perfect score alongside the actual score, since longer strings would return larger scores
    I slapped together a ITVF version, and my scrores seem to be +1 compared to your score, but i get a bit more detail for me to see
    IF OBJECT_ID('[dbo].[FuzzyMatchString_ITVF]') IS NOT NULL
    DROP FUNCTION [dbo].[FuzzyMatchString_ITVF]
    GO
    CREATE FUNCTION dbo.FuzzyMatchString_ITVF (@s1 varchar(100), @s2 varchar(100))
    RETURNS TABLE
    AS
    RETURN
      WITH MiniTally
      AS
      (
      --yet another small tally generator: 256 values
      select
       ROW_NUMBER() over (order by num) as n
      from (values (1)) t (num)
      group by cube (num, num, num, num, num, num,num,num)
      --more num = another order power of 2
      )
      SELECT
       SUM(CASE WHEN CHARINDEX(SUBSTRING(@s2,MiniTally.n,2),@s1) > 0 THEN 1 ELSE 0 END) AS SearchScore,
       SUM(CASE WHEN CHARINDEX(SUBSTRING(@s1,MiniTally.n,2),@s1) > 0 THEN 1 ELSE 0 END) AS PerfectScore,
       CONVERT(DECIMAL(5,2),100 * ((SUM(CASE WHEN CHARINDEX(SUBSTRING(@s2,MiniTally.n,2),@s1) > 0 THEN 1 ELSE 0 END) * 1.0) / (SUM(CASE WHEN CHARINDEX(SUBSTRING(@s1,MiniTally.n,2),@s1) > 0 THEN 1 ELSE 0 END) * 1.0))) AS PercentScore,
       @s1 AS SearchTerm,
       @s2 AS ItemCompared
      FROM MiniTally
      WHERE MiniTally.n<=LEN(@s1)
    GO

    and here's some sample data and the results

    ;WITH MovieStarNames([ID],[StarFirstName],[StarLastName])
    AS
    (
    SELECT '40','Harrison','Ford' UNION ALL
    SELECT '93','Jason','Robards' UNION ALL
    SELECT '88','Edward','Norton' UNION ALL
    SELECT '25','Orson','Welles' UNION ALL
    SELECT '3','Marlon','Brando' UNION ALL
    SELECT '61','Michael','Caine' UNION ALL
    SELECT '56','Denzel','Washington'
    )
    SELECT * , dbo.FuzzyMatchString('Harrison Ford',msn.[StarFirstName] + ' ' + msn.[StarLastName]) AS ScalarVal
    FROM [MovieStarNames] AS [msn]
    CROSS APPLY dbo.FuzzyMatchString_ITVF('Harrison Ford',msn.[StarFirstName] + ' ' + msn.[StarLastName]) fn

    Lowell


    --help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

Viewing 15 posts - 1 through 15 (of 33 total)

You must be logged in to reply to this topic. Login to reply