Fuzzy-String Search: Find misspelled information with T-SQL

  • Thomas Keller

    Ten Centuries

    Points: 1222

    Comments posted to this topic are about the item Fuzzy-String Search: Find misspelled information with T-SQL

  • Dwain Camps

    SSC Guru

    Points: 86873

    This is really cool stuff!

    Only scanned the article so far but have bookmarked it for a return read this weekend.

    I had often wondered if there’s an existing algorigthm out there to do this sort of thing.

    Thanks Thomas!


    My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]

    My thought question: Have you ever been told that your query runs too fast?

    My advice:
    INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
    The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

    Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
    Since random numbers are too important to be left to chance, let's generate some![/url]
    Learn to understand recursive CTEs by example.[/url]
    [url url=http://www.sqlservercentral.com/articles/St

  • ZA_Crafty

    Ten Centuries

    Points: 1118

    Awesome!!!!! Thanks for sharing this knowledge!!!

    I’ve been pondering this for a while now. Especially with my interest in Neural Networks as well.

    Perhaps the performance can be improved by implementing an external assembly, and CLR function?

  • Thomas Keller

    Ten Centuries

    Points: 1222

    @ZA_Crafty: Perhaps CLR could be faster, but a Transact-SQL-only solution was important to me and to my clients.

  • Davin21

    SSC Eights!

    Points: 850

    I’ve tended to find using a CLR implementation of the Jaro-Winkler algorithm has suited my needs a bit more, but really cool to see a T-SQL only alternative!

  • Thomas Keller

    Ten Centuries

    Points: 1222

    @Davin21 I have some non-T-SQL implementations (including a Jaro-Winkler) listed on my Reference[/url] page – please let me know of others I should consider linking to.

  • Eric Hobbs

    SSC Veteran

    Points: 218

    😀

    Fantastic article. Thanks for sharing some of this knowledge. I’m sure that I am not the only one who will be making use of this technique.

  • Davin21

    SSC Eights!

    Points: 850

    @ThomasKeller – I found the information from SimMetrics (linky) to be quite useful – what i think would be really interesting to see would be a weighted version of this to give precedence to the first n characters in the string (as I’ve found that weighting the first four characters particularly for first name / surname validation) gives much more reliable results.

  • Jinxseee

    Old Hand

    Points: 378

    Thanks for the info.

    I find the CLR/external assembly implementation runs a lot faster for the algorithms I have implemented (Jaro, Levenstein etc), but having a T-SQL version is certainly handy.

  • Martin Bastable

    SSCommitted

    Points: 1606

    Great article! 🙂 Brilliantly written too!

    Out of interest, what other fuzzy search methods do people use?

    One I came up with for a crm system a fair few years ago works as follows…

    (please note examples given below are all bad examples but quick to type 🙂 )

    Requirements:

    Fuzzy matching of data for automated de-duplication, automated `existing record` matching so duplicates wouldn’t get created in the first place, and fuzzy searches by users on varying fields of data

    Solution:

    Firstly, identified main fields our system was interested in handling the above. e.g. Firstname, Lastname, email address, 1st line of their address, phone numbers

    Secondly, came up with some algorithms for `preparing data`- i.e. cleaning up variants of strings and common errors, and getting rid of useless data, while retaining the ability to not touch the original data.

    e.g. Phone numbers: internationalization of numbers, stripping of excess characters etc. so (+44) 123 45678 90 -> 441234567890, 01234567890 -> 441234567890, (0712)34567890 ex 123 -> 071234567890.

    e.g. Names: Get rid of characters that don’t matter much, punctuation, extra words, etc.. Making up some examples (not the actual algorithm we use): Fred-> frd, Mc Donald -> MCDONALD, Bob Smithly-Smithington -> BOB, Frank J. -> FRANK

    e.g. Addresses: 123, Little House -> 123LITTLEHOUSE, Flat 4b, 57, Some Street -> FLAT4B57SOMEST

    You get the idea. Character removal can help with misspellings, variants, etc. – and there are plenty of fancy algorithms, encoding methods, etc. around to tweak this. Can vary per field/data type. We did some analysis on our data to find what was effective – which may vary by data type – names, addresses, emails, etc.

    Anyway

    Thirdly,

    Create a second duplicate set of fields to hold the above encoded data. I.e. EncodedFirstname, EncodedPhoneNumber, etc.

    So we end up with a set of `encoded fields` – e.g.

    Firstname: Mc Donald EncodedFirstname: MCDONALD

    Address1: , 12 3 I can’t type road, EncodedAddress1: 123ICANTTYPERD

    etc.

    Note that users can still see their method of entered data e.g. (+44) 567890 as a phone number and not a purified 44567890 that we search against.

    Results:

    Multi-part searches can be completed very easily based on the encoded fields.

    E.g. search for a phone number 0123 456 789, depending on your level of accuracy require you can exactly match on 44123456789, part match on 123456789, or return matches of only a certain number of matching numbers e.g. 1234567 – but you always search against a consistent phone number

    E.g. A name search might return exact matches only on the encoded string, or include exact matches + e.g. matches where the first X characters or last X characters match.

    As we have multiple separate versions of EncodedFields with different encoding/cleaning methods for the same source field.

    Depending on the fuzziness of results required by the end user, we just vary the scope of the match – the number of characters matched on, etc.

    You can use multiple fields combined with each other to generate unique keys for automated duplicate matching, with varying levels of confidence in a match.

    e.g. ENCODEDFIRSTNAME+ENCODEDSURNAME+DOB

    e.g. ENCODEDFIRSTNAME+LEFT(ENCODEDADDRESSLINE1, 5)

    e.g. LEFT(ENCODEDFIRSTNAME, 3) + LEFT(ENCODEDSURNAME, 3) + LEFT(ADDRESSLINE1, 3)

    The common combinations of the encoded fields we stick into separate extra fields KEYFIRSTNAMESURNAMEDOB etc. for speed of later searches.

    (Incidentally, we don’t use firstname + surname + date of birth as an automated duplicate match as there were too many people in our datasets with the same name and potential date of birth 🙂 )

    Very fuzzy matches against multiple fields can return small sets of records (e.g. first 3 letters or last 3 letters of encodedfirstname = “abc” + first 3 letters or last 3 letters of surname = “zyx”) – 2 keyfields greatly reduces the number of matches, even with very fuzzy input data.

    User interface – often overlooked – when we present the results of a search to the users, rather than just a massive list, we give a massive list with visual clues as to the more likely matches. So e.g. a firstname/surname search, exactly matching first/surname entries are highlighted, exactly matching first OR surname gets a different highlight, fuzzy matches appear normally. The users have e.g.`Fuzzy, Partial and Exact`buttons they can click to quickly try more restrictive variants of the search if too many results are returned. (Can also auto-research on a more restrictive result set if too many records are returned). Can also stick the more likely matches at the top of the search results.

    We share the `encoding` functions that generate our encoded strings, numbers etc. across parts our applications. To update or upgrade what we encode, how we encode it, our algorithms or some of our searches, we merely update our code in one place (.net libraries, which we cross use with SQL server – good old CLR 🙂 ) and then apply against our encoded fields/table to update the encoded data. Any part of our system (which has desktop front end, public web side, back end applications and automated processes) all point at the same data. Of course, all search code passes the search criteria through the encoding routines first before doing the search, to make sure our search data is in the correct format.

    Creation of the encoded fields is spread across time (i.e. when records are added or updated). Records are always up to date. Possible to update all encoded fields in bulk at the same time too (e.g. when updating algorithms). Doesn’t touch the original data.

    Searches are very very fast – all searches are performed against what are simply extra fields in the database, with relevant indexes.

    Automated De-duplication routines can match on various combinations of relevant (note I say relevant, we only de-duplicate on certain levels of certainty of match) matching fields and feed back to us on statistics of the most use matches, which help us identify areas of improvement.

    We have had this running for years and years, catching, preventing and hunting down huge numbers of potential duplicates. Wipes the floor with the `bought in` CRM systems I have tried.

    Of course we also use front end validation etc. in our interfaces to help, but hey – you know what data can be like 🙂

    Unfortunately not been allowed the time for a few years to add more fancy techniques in to the mix – ahh the potential!

    There are overheads – overhead of the extra encoded/key fields etc. – but this isn’t a problem for us.

    We run the above against millions of records.

    Hoping this might be of interest to someone!

    M.

  • Thomas Keller

    Ten Centuries

    Points: 1222

    @Davin21: Indeed, I had originally linked the very informative SimMetrics Web Site, which was linked from your sourceforge link, but it seems to be permanently down now:(

  • Thomas Keller

    Ten Centuries

    Points: 1222

    @Martin Bastable: Thanks for your kind words, and thanks to those who proofread various drafts of my article, including Kellie Sanchez, Dan Hess, Eric Freeman, Ernest Cook, and Chris Anderson! (Sorry I didn’t thank y’all in the article itself.)

    I have also implemented systems such as you describe, that pre-process the data and combine it in various ways, to make “fuzzy” searching quicker and more flexible. But for this article, I wanted to present a method that can be dropped into any database with a minimum of work, and still have acceptable performance when “fuzzy” searching on one field.

  • Martin Bastable

    SSCommitted

    Points: 1606

    I call it a win win! You could use it as a stand alone search in a field, or combine it with other things for a super search 🙂

    M.

  • dmusick

    SSC Enthusiast

    Points: 119

    Several years ago, I had the recurring task of locating and merging records for duplicate customer names in a large database. Because the names, DOB and SSN are not generally identical, I had to develop an algorithm that could compare any two strings and give a “percent similar” number, so I could filter out only the most likely duplicates and go through those manually.

    I put a space at the beginning and end of each string and break each string up into 2 character sequences of strings. Then, I count how many of these sequences in the shorter string are also in the longer string, pairing them up, so I don’t double-count. The number of common 2 character strings divided by the total in the shorter string is the percent similar.

    For example, the word ‘farm’ and ‘charm’ can be broken into (using an underscore to represent the space):

    _f fa ar rm m_

    and

    _c ch ha ar rm m_

    Comparing the two, we can pair up three of these 2 character strings (ar rm m_). Since the shorter string has 5, this is 3/5, or 60% similarity.

    If the second string was ‘family charm’, it would be 100% similar, because it would contain all the 2 character sequences. This is one weakness of this algorithm, although it is easy to compensate for by comparing string length also. This problem mostly comes up when one string is very short.

    Overall, I have used this algorithm quite a lot, and it seems to work very well, finding duplicates and similarities that would be extremely difficult otherwise.

    Here is the code I wrote to implement this algorithm:

    CREATE FUNCTION [dbo].[percentSimilar](@string1 as varchar(1000), @string2 as varchar(1000) )

    RETURNS decimal(10,9)

    AS

    BEGIN

    Declare @percentSimilar decimal(10,9);

    Declare @length1 int;

    Declare @length2 int;

    Declare @inCommon int;

    Declare @minSequences int;

    Declare @maxLength int;

    Declare @position int;

    Declare @tally table ( sequence char(2),

    times1 int,

    times2 int);

    Declare @intersect table ( sequence char(2),

    overlap int);

    If @string1 <> ” and @string2 <> ” –if either string is an empty string, return 0

    Begin

    Set @string1 = ‘ ‘ + @string1 + ‘ ‘;

    Set @string2 = ‘ ‘ + @string2 + ‘ ‘; –put a space at the beginning and end of each string

    Set @length1 = Len(@string1);

    Set @length2 = Len(@string2);

    If @length1 < @length2 –find the minimum 2-character sequences, to divide the total overlap

    Set @minSequences = @length1 – 1; –by and get the percentage

    Else

    Set @minSequences = @length2 – 1;

    Set @position = 1; –add the two character strings from string 1 to the tally table

    While @position < @length1

    Begin

    Insert Into @tally (sequence, times1, times2) Values(Substring(@string1, @position, 2),1,0);

    Set @position = @position + 1

    End

    Set @position = 1; –add the two character strings from string 2 to the tally table

    While @position < @length2

    Begin

    Insert Into @tally (sequence, times1, times2) Values(Substring(@string2, @position, 2),0,1);

    Set @position = @position + 1

    End

    Insert Into @intersect –count how many times each sequence overlaps the other

    Select sequence, case when sum(times1) < sum(times2) then sum(times1) else sum(times2) end as overlap

    From @tally

    Group By sequence

    Select @inCommon = Sum(overlap) From @intersect; –the total overlap of two character sequences

    Set @percentSimilar = 1.000000000 * @inCommon / @minSequences;

    End

    Else

    Set @percentSimilar = 0;

    RETURN @percentSimilar;

    END

  • dmusick

    SSC Enthusiast

    Points: 119

    I wanted to add that one of the big strengths of this algorithm is that it will detect similarities even if the words are scrambled in different orders or if one string has an additional word in the beginning. Because it is only looking at 2 character sequences, it doesn’t care about positioning within the string. This is very helpful when dealing with names, which can often be out of order, such as when the data entry person switches the first and last name, or they put a double last name (common among Hispanics) in the middle name, then the last name, and then the next person does it the opposite way.

    I have seen so many bizarre data entry issues, where the same person was entered into the database quite differently each time, but my algorithm was able to find them.

Viewing 15 posts - 1 through 15 (of 41 total)

You must be logged in to reply to this topic. Login to reply