Roll Your Own Fuzzy Match / Grouping (Jaro Winkler) - T-SQL

  • Any update on the potential error in the function?

    Kindest Regards,

    Martin

  • Hi,

    Article is very interesting.

    I used to work with name and address matching for a while.

    Never tried Jaro Winkler algorithm and want to thank Ira for that.

    What struck me is that good matching itself is just one part of the matching process.

    Another interesting part of matching is data parsing.

    Good parsing gives better matching results.

    It would be interesting to hear about data parsing and its common issues.

    Regards

    Anatoliy

  • I agree that parsing is critical. On the other hand, how to parse is also very much an "it depends" question.

    In general, for my only slightly imperfect matching I tend to combine split fields (addr1, addr2, first name, last name), clean them up (nested REPLACE() for SQL), remove extraneous elements (professional designations, etc.), remove all remaining spaces, and then compare them. With at least some matching on another selective field (license number, for instance), this produces high quality results.

    For later, fuzzier passes... well, that's where splitting pieces apart, cross-matching, Jaro-Winkler, Double Metaphone, and so on come in.

    As far as splitting pieces apart, for the ' ' + addr1 + ' ' + addr2 + ' ' (etc.), I have code based on U.S. addresses that tries to find all leading digits, all suite/apartment indicators (whether they be first or last), whatever's in the middle, and the first "word" of whatever's in the middle. These are then used in various combinations with City+State and/or Zip code on the address portion of the matching.

    In pure T-SQL, splitting the pieces apart is fairly resource heavy.
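The address-splitting pass described above (leading digits, suite/apartment indicators, the middle portion, and its first word) is much cheaper in a regex-capable environment than in pure T-SQL. The following Python sketch is an illustration under assumptions, not the poster's U.S.-address code; the unit keywords and return shape are invented for demonstration:

```python
import re

# Assumed unit keywords -- a real list would be longer
UNIT_RE = re.compile(r'\b(?:apartment|apt|suite|ste|unit)\s*\S+', re.I)
NUM_RE = re.compile(r'^\s*(\d+)')

def split_address(addr):
    """Split a one-line street address into (number, unit, middle, first word)."""
    num_m = NUM_RE.match(addr)
    number = num_m.group(1) if num_m else ''
    rest = addr[num_m.end():] if num_m else addr
    unit_m = UNIT_RE.search(rest)
    unit = unit_m.group(0) if unit_m else ''
    middle = UNIT_RE.sub('', rest).strip()
    first_word = middle.split()[0] if middle else ''
    return number, unit, middle, first_word
```

The pieces can then be recombined with City+State and/or ZIP in the various matching passes described above.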

  • Hi,

    In my case I also had the middle name involved in the matching process.

    Also, we could have initials instead of full names, which actually makes it more complicated,

    and even more complicated for Asian names.

    Regarding addresses, I want to mention that from my point of view addresses are virtual.

    There are different address databases around, but there is no absolute one.

    The addresses people use are another source of the address virtualization we are dealing with.

    And building an up-to-date address DB is quite a complex task.

  • In my case, middle name/initial wasn't of any real value in and of itself.

    First initials are often an issue.

    One step I have not yet taken is splitting up names into individual words, so Guthridge-Henderson matches with either Guthridge or Henderson at a fuzzyish level of matching.

    Likewise, matching on common nicknames (Dick vs. Richard, Bob vs. Rob vs. Robert) I haven't implemented yet.

    Matching on the first initial of nicknames, however, I'd judge as a lost cause (D matching R).
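The two not-yet-implemented ideas above (splitting hyphenated names into words and matching common nicknames) are both small lookups in code. This is a hypothetical Python sketch; the nickname table is a tiny illustrative sample, and note that it compares full canonical names, not initials of nicknames, consistent with the "lost cause" point:

```python
import re

# Tiny illustrative nickname table -- real lists are far larger (assumption)
NICKNAMES = {
    'dick': 'richard', 'rick': 'richard',
    'bob': 'robert', 'rob': 'robert', 'bobby': 'robert',
    'bill': 'william', 'will': 'william',
}

def canonical_first_name(name):
    """Map a nickname to its canonical form, else return the name itself."""
    n = name.strip().lower()
    return NICKNAMES.get(n, n)

def first_names_match(a, b):
    return canonical_first_name(a) == canonical_first_name(b)

def any_word_match(a, b):
    """Split on hyphens/spaces so Guthridge-Henderson can match
    either Guthridge or Henderson."""
    words = lambda s: set(re.split(r'[-\s]+', s.lower()))
    return bool(words(a) & words(b))
```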

    Addresses take quite a lot of cleanup, and even then, they're hard to match. Sending them out to a good address cleanup service is probably the best bet, but failing that, we have to try ourselves.

  • OK,

    Address matching isn't exactly related to the topic, though of course it's related to name matching in some ways. Because of the scale of the address matching topic I will not discuss it on this blog, and will instead have a bit of discussion on name matching.

    Regarding matching on initials: it's quite useful when matching at the household level, for example a father/son relationship where they share the same first name.

    For each name part we can have the following cases if we are using fuzzy matching:

    1. Exact match

    2. Nick name match

    3. Fuzzy match

    4. Match on initial

    5. No match

    All cases are listed according to their matching weight; actually, you can order them differently. For surnames we don't match on the first letter (case 4) or on nicknames (case 2); there can also be cases where a surname changes, as in the case of marriage.

    Each name part itself has a matching weight; for example, a match on surname has a higher weight than a match on first or second name, though I'm not sure how to compare the weight of first and second name against surname.

    That means we have a scoring engine here with tunable parameters.

    That was my approach.
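The weighted-case scheme above can be sketched as a tiny scoring engine. All weights, case scores, and the crude shared-prefix stand-in for fuzzy matching below are illustrative assumptions, not the poster's production values:

```python
# Assumed scores per match case (ordered by strength, as in the post)
CASE_SCORES = {'exact': 1.0, 'nickname': 0.9, 'fuzzy': 0.7, 'initial': 0.4, 'none': 0.0}
# Assumed per-part weights; surname weighted highest, as suggested above
PART_WEIGHTS = {'surname': 0.5, 'first': 0.3, 'middle': 0.2}

def classify(a, b, nicknames=frozenset()):
    """Classify one name-part pair into a match case."""
    a, b = a.lower(), b.lower()
    if a == b:
        return 'exact'
    if (a, b) in nicknames or (b, a) in nicknames:
        return 'nickname'
    if len(a) == 1 or len(b) == 1:
        return 'initial' if a[:1] == b[:1] else 'none'
    # Crude stand-in for a real fuzzy comparator (e.g. Jaro-Winkler):
    # treat a shared 3-letter prefix as a fuzzy match
    if a[:3] == b[:3]:
        return 'fuzzy'
    return 'none'

def score(rec1, rec2, nicknames=frozenset()):
    """Weighted sum over name parts -- the tunable scoring engine."""
    return sum(PART_WEIGHTS[p] * CASE_SCORES[classify(rec1[p], rec2[p], nicknames)]
               for p in PART_WEIGHTS)
```

Tuning then consists of adjusting the weights and score table, and of disabling cases per part (e.g. no initial or nickname matching for surnames, as noted above).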

    Also a word about cleansing function used by Ira.

    I think it's a kind of so-called matchkey generator, and it takes me back to the times when no fuzzy matching was used and we relied on matchkeys to do our matching.

    For example, keeping the first 3 letters and removing vowels and doubled letters.

    Match key generation could be another topic of discussion, but I would stop here.
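One plausible reading of that matchkey recipe, sketched in Python (the exact rule order here, keep the first letter, drop later vowels, collapse doubles, truncate, is an assumption; the original recipe is ambiguous):

```python
import re

def matchkey(name, length=3):
    """Old-school matchkey: keep first letter, drop later vowels,
    collapse doubled letters, then truncate."""
    s = re.sub(r'[^A-Z]', '', name.upper())
    if not s:
        return ''
    s = s[0] + re.sub(r'[AEIOU]', '', s[1:])  # keep first letter, drop vowels
    s = re.sub(r'(.)\1+', r'\1', s)           # collapse doubled letters
    return s[:length]
```

Records sharing a key are then candidate matches; it's a coarse blocking device compared with true fuzzy scoring.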

    Thank you again for the interesting discussion and useful script.

  • Update: Based on several excellent forum contributions, I asked Michael Capes (a friend with a big brain) to review them and provide feedback. The result is a revision to the original script/article. The following is the result of his review, and the code for the article has been updated. The code in the article for the Jaro-Winkler algorithm has been updated, and the new code is in the zip file TestJaroWinkler.ZIP, which will be available soon. Any questions, please let me know.

    I believe the "expected" results that the forum poster gave are incorrect. I was looking through the PDF document from Winkler from which, I believe, the poster took his test set. The paper is titled "AN APPLICATION OF THE FELLEGI-SUNTER MODEL OF RECORD LINKAGE TO THE 1990 U.S. DECENNIAL CENSUS". There's a table in that paper which matches the poster's comments. I don't believe this table was produced with Jaro-Winkler; instead, it looks like it was produced by another algorithm.

    Furthermore, I got the C# SimMetrics utility to run this morning. Its results also matched the earlier results from my Oracle and T-SQL tests.

    I believe the revised T-SQL correctly implements the Jaro-Winkler algorithm. I've verified it against two other implementations, and all three results match.

    Ira

    [font="Comic Sans MS"]Ira Warren Whiteside[/font]
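For readers who want to sanity-check any implementation against the same reference values, here is a compact Python version of Jaro-Winkler (a sketch for verification, not the article's T-SQL). It reproduces the commonly cited scores for Winkler's test pairs:

```python
def jaro(s1, s2):
    """Jaro similarity: matches within a sliding window, minus transpositions."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if not len1 or not len2:
        return 0.0
    window = max(len1, len2) // 2 - 1
    m1, m2 = [False] * len1, [False] * len2
    matches = 0
    for i, c in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not m2[j] and s2[j] == c:
                m1[i] = m2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count transpositions: matched characters out of order, halved
    t, k = 0, 0
    for i in range(len1):
        if m1[i]:
            while not m2[k]:
                k += 1
            if s1[i] != s2[k]:
                t += 1
            k += 1
    t //= 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3

def jaro_winkler(s1, s2, p=0.1):
    """Boost the Jaro score by the shared prefix (up to 4 chars)."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == 4:
            break
        prefix += 1
    return j + prefix * p * (1 - j)

print(round(jaro_winkler("MARTHA", "MARHTA"), 4))  # 0.9611
```

"DWAYNE"/"DUANE" scores 0.84 and "DIXON"/"DICKSONX" about 0.813 with this code, matching the standard published values for these pairs.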

  • I agree, but keep in mind many folks still rely on T-SQL. I have used SSIS extensively but have found it has a bit of a learning curve. I wanted to remove the "black box" aura surrounding fuzzy matching.

    [font="Comic Sans MS"]Ira Warren Whiteside[/font]

  • Martin,

    I have updated the article and code.

    Thanks

    Ira

    [font="Comic Sans MS"]Ira Warren Whiteside[/font]

  • I got Jaro-Winkler functionality working by exposing a C# class via the CLR. Any thoughts on how this might compare speed/efficiency-wise? I figured C# would be a more efficient/speedy string handler.

    Interested parties can PM me for the code if they want.

    Regards,

    Mike

  • Hi guys, can you please tell us how we can achieve fuzzy grouping with this?

    I am able to get a fuzzy lookup but no fuzzy grouping.

    Can you please give me the query?

    Thanks in advance.

  • Hi all, I am trying to do a fuzzy grouping

    with the example given, but I am not able to do it. Please let me know if I am missing anything.

  • Hi, can you provide more detail?

    [font="Comic Sans MS"]Ira Warren Whiteside[/font]

  • Hi,

    I have a set of records as shown below.

    ID  String
    1   help
    2   HELP
    3   hel
    4   he
    5   helipad
    6   Helpad
    7   vice
    8   VICEadmin
    9   vice admin
    10  Vice President

    I want to aggregate the count of common strings based on their similarity.

    My end result should look like:

    COUNTS  STRING
    2       help
    1       hel
    1       he
    2       helipad
    1       vice
    2       Vice admin
    1       Vice President

    Please let me know; thanks in advance.
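One way to get a fuzzy-grouping-style result over such a list is a greedy single pass: lowercase each string, compare it to the representative of every existing group, and merge when the similarity clears a threshold; otherwise start a new group. The sketch below is an illustration, not SSIS Fuzzy Grouping: it uses Python's difflib ratio as a stand-in similarity measure, and the 0.9 threshold is an assumption tuned to this sample (in T-SQL you would call the article's Jaro-Winkler function in the inner comparison instead):

```python
from difflib import SequenceMatcher

def fuzzy_group(strings, threshold=0.9):
    """Greedy one-pass grouping: each string joins the first group whose
    representative is similar enough, else it starts a new group."""
    groups = []  # list of (lowercased representative, [original members])
    for s in strings:
        key = s.lower()
        for rep, members in groups:
            if SequenceMatcher(None, rep, key).ratio() >= threshold:
                members.append(s)
                break
        else:
            groups.append((key, [s]))
    # Report each group's size with its first-seen member as the label
    return [(len(members), members[0]) for rep, members in groups]

data = ["help", "HELP", "hel", "he", "helipad", "Helpad",
        "vice", "VICEadmin", "vice admin", "Vice President"]
print(fuzzy_group(data))
# [(2, 'help'), (1, 'hel'), (1, 'he'), (2, 'helipad'),
#  (1, 'vice'), (2, 'VICEadmin'), (1, 'Vice President')]
```

On the sample above this reproduces the requested counts; note that greedy grouping is order-sensitive, so sorting the input (e.g. longest first) can change which string becomes a group's label.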
