• Great article! 🙂 Brilliantly written too!

    Out of interest, what other fuzzy search methods do people use?

    One I came up with for a CRM system a fair few years ago works as follows...

    (please note the examples given below are all bad ones, but quick to type 🙂 )

    Requirements:

    Fuzzy matching of data for automated de-duplication, automated `existing record` matching so duplicates don't get created in the first place, and fuzzy searches by users across various fields of data.

    Solution:

    Firstly, we identified the main fields our system was interested in for handling the above, e.g. Firstname, Lastname, email address, first line of the address, phone numbers.

    Secondly, we came up with some algorithms for `preparing data` - i.e. cleaning up variants of strings and common errors and getting rid of useless data, while leaving the original data untouched.

    e.g. Phone numbers: internationalization of numbers, stripping of excess characters, etc., so (+44) 123 45678 90 -> 441234567890, 01234567890 -> 441234567890, (0712) 34567890 ext 123 -> 4471234567890.

    e.g. Names: get rid of characters that don't matter much, punctuation, extra words, etc. Making up some examples (not the actual algorithm we use): Fred -> FRD, Mc Donald -> MCDONALD, Bob Smithly-Smithington -> BOB, Frank J. -> FRANK

    e.g. Addresses: 123, Little House -> 123LITTLEHOUSE, Flat 4b, 57, Some Street -> FLAT4B57SOMEST

    You get the idea. Character removal can help with misspellings, variants, etc., and there are plenty of fancy algorithms, encoding methods and so on around to tweak this. We did some analysis on our data to find what was effective, which can vary per field/data type - names, addresses, emails, etc.
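
    To give a flavour of the `preparing data` step, here's a rough C# sketch of that kind of cleanup - not our actual code, and the rules (the 44 prefix, the ST/RD abbreviations, etc.) are only there to reproduce the toy examples above:

    ```csharp
    // Rough sketch only - not production code; the rules below just mirror
    // the toy examples above (UK-style numbers, a couple of abbreviations).
    using System.Text.RegularExpressions;

    public static class Encode
    {
        // (+44) 123 45678 90 -> 441234567890, 01234567890 -> 441234567890
        public static string Phone(string raw)
        {
            if (string.IsNullOrWhiteSpace(raw)) return "";
            // Drop any trailing "ext 123" style extension, then keep digits only.
            var s = Regex.Replace(raw, @"\b(ex|ext|extension)\.?\s*\d+\s*$", "", RegexOptions.IgnoreCase);
            var digits = Regex.Replace(s, @"\D", "");
            if (digits.StartsWith("00")) digits = digits.Substring(2);        // 0044... -> 44...
            if (digits.StartsWith("0")) digits = "44" + digits.Substring(1);  // 0123...  -> 44123...
            return digits;
        }

        // Mc Donald -> MCDONALD, Frank J. -> FRANKJ (the real rules also drop
        // extra words, initials, etc. - not shown here)
        public static string Name(string raw) =>
            Regex.Replace((raw ?? "").ToUpperInvariant(), "[^A-Z]", "");

        // Flat 4b, 57, Some Street -> FLAT4B57SOMEST
        public static string Address(string raw)
        {
            var s = Regex.Replace((raw ?? "").ToUpperInvariant(), "[^A-Z0-9]", "");
            return s.Replace("STREET", "ST").Replace("ROAD", "RD"); // etc.
        }
    }
    ```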

    Anyway. Thirdly, create a second, duplicate set of fields to hold the above encoded data, i.e. EncodedFirstname, EncodedPhoneNumber, etc.

    So we end up with a set of `encoded fields` - e.g.

    Firstname: `Mc Donald` -> EncodedFirstname: `MCDONALD`

    Address1: `, 12 3 I can't type road` -> EncodedAddress1: `123ICANTTYPERD`

    etc.

    Note that users still see the data as they entered it, e.g. (+44) 567890 as a phone number, not the purified 44567890 that we search against.
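
    In code terms the shape of it is roughly this - a sketch only, with made-up property names, reusing the made-up `Encode` helpers from the earlier sketch:

    ```csharp
    // Sketch of the "original + encoded" field pairs. Names are illustrative.
    public class Contact
    {
        // What the user typed - never modified by the encoding step.
        public string Firstname { get; set; }
        public string Lastname { get; set; }
        public string Phone { get; set; }
        public string Address1 { get; set; }

        // The duplicate, encoded fields we index and search against.
        public string EncodedFirstname { get; set; }
        public string EncodedLastname { get; set; }
        public string EncodedPhone { get; set; }
        public string EncodedAddress1 { get; set; }

        // Run on insert/update (or in bulk when the algorithms change).
        public void Reencode()
        {
            EncodedFirstname = Encode.Name(Firstname);
            EncodedLastname  = Encode.Name(Lastname);
            EncodedPhone     = Encode.Phone(Phone);
            EncodedAddress1  = Encode.Address(Address1);
        }
    }
    ```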

    Results:

    Multi-part searches can be completed very easily based on the encoded fields.

    E.g. searching for the phone number 0123 456 789: depending on the level of accuracy required, you can exactly match on 44123456789, part match on 123456789, or return matches on only a certain number of digits, e.g. 1234567 - but you always search against a consistent phone number.

    E.g. a name search might return only exact matches on the encoded string, or include exact matches plus, say, matches where the first X or last X characters match.

    We also have multiple separate versions of the encoded fields, with different encoding/cleaning methods for the same source field.

    Depending on the fuzziness of results required by the end user, we just vary the scope of the match - the number of characters matched on, etc.
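
    As a sketch of what varying the scope looks like - in reality this is SQL against the indexed encoded columns, and the `Contact`/`Encode` types are the made-up ones from the earlier sketches:

    ```csharp
    // Sketch of varying the match scope on a phone search. Illustrative only.
    using System.Collections.Generic;
    using System.Linq;

    public enum MatchScope { Exact, Partial, Fuzzy }

    public static class PhoneSearch
    {
        public static IEnumerable<Contact> Run(IEnumerable<Contact> contacts,
                                               string userInput, MatchScope scope)
        {
            // Always pass the criterion through the same encoding first.
            var full = Encode.Phone(userInput);                               // "0123 456 789" -> "44123456789"
            var national = full.StartsWith("44") ? full.Substring(2) : full;  // -> "123456789"
            var some = national.Length > 7 ? national.Substring(0, 7) : national; // -> "1234567"

            return scope switch
            {
                MatchScope.Exact   => contacts.Where(c => (c.EncodedPhone ?? "") == full),
                MatchScope.Partial => contacts.Where(c => (c.EncodedPhone ?? "").EndsWith(national)),
                _                  => contacts.Where(c => (c.EncodedPhone ?? "").Contains(some)),
            };
        }
    }
    ```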

    You can use multiple fields combined with each other to generate unique keys for automated duplicate matching, with varying levels of confidence in a match.

    e.g. ENCODEDFIRSTNAME+ENCODEDSURNAME+DOB

    e.g. ENCODEDFIRSTNAME+LEFT(ENCODEDADDRESSLINE1, 5)

    e.g. LEFT(ENCODEDFIRSTNAME, 3) + LEFT(ENCODEDSURNAME, 3) + LEFT(ENCODEDADDRESSLINE1, 3)

    We stick the common combinations of the encoded fields into separate extra fields (KEYFIRSTNAMESURNAMEDOB etc.) for speed of later searches.
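
    Something along these lines, reusing the made-up `Contact` fields from the sketches above (that sketch has no DOB field, so it's just passed in here):

    ```csharp
    // Sketch of building the combined key fields (stored as extra columns).
    // Illustrative only.
    using System;

    public static class Keys
    {
        static string Left(string s, int n) =>
            string.IsNullOrEmpty(s) ? "" : s.Substring(0, Math.Min(n, s.Length));

        // ENCODEDFIRSTNAME + ENCODEDSURNAME + DOB
        public static string NameAndDob(Contact c, DateTime dob) =>
            c.EncodedFirstname + c.EncodedLastname + dob.ToString("yyyyMMdd");

        // ENCODEDFIRSTNAME + LEFT(ENCODEDADDRESSLINE1, 5)
        public static string NameAndAddress(Contact c) =>
            c.EncodedFirstname + Left(c.EncodedAddress1, 5);

        // LEFT(ENCODEDFIRSTNAME, 3) + LEFT(ENCODEDSURNAME, 3) + LEFT(ENCODEDADDRESSLINE1, 3)
        public static string ShortKey(Contact c) =>
            Left(c.EncodedFirstname, 3) + Left(c.EncodedLastname, 3) + Left(c.EncodedAddress1, 3);
    }
    ```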

    (Incidentally, we don't use firstname + surname + date of birth as an automated duplicate match as there were too many people in our datasets with the same name and potential date of birth 🙂 )

    Very fuzzy matches against multiple fields can still return small sets of records (e.g. first or last 3 letters of EncodedFirstname = "abc" + first or last 3 letters of EncodedSurname = "zyx") - two key fields greatly reduce the number of matches, even with very fuzzy input data.

    User interface - often overlooked: when we present the results of a search to the users, rather than just a massive list we give a massive list with visual cues as to the more likely matches. So for a firstname/surname search, entries where both first name and surname match exactly are highlighted, entries where the first name OR the surname matches exactly get a different highlight, and fuzzy matches appear normally. The users have `Fuzzy, Partial and Exact` buttons they can click to quickly try more restrictive variants of the search if too many results are returned. (We can also automatically re-run a more restrictive search if too many records come back.) The more likely matches can also be stuck at the top of the search results.
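
    The classification behind the highlighting and ordering is conceptually just something like this (illustrative only, not our actual code):

    ```csharp
    // Sketch of result ranking: classify each hit so the UI can highlight it
    // and float the more likely matches to the top.
    public enum MatchQuality { BothExact, OneExact, Fuzzy }

    public static class Ranking
    {
        public static MatchQuality Classify(Contact c, string encFirst, string encSurname)
        {
            bool firstExact   = c.EncodedFirstname == encFirst;
            bool surnameExact = c.EncodedLastname == encSurname;
            if (firstExact && surnameExact) return MatchQuality.BothExact; // strongest highlight
            if (firstExact || surnameExact) return MatchQuality.OneExact;  // different highlight
            return MatchQuality.Fuzzy;                                     // shown normally
        }
    }

    // e.g. results.OrderBy(c => Ranking.Classify(c, encFirst, encSurname))
    // puts the more likely matches at the top of the list.
    ```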

    We share the `encoding` functions that generate our encoded strings, numbers, etc. across parts of our applications. To update or upgrade what we encode, how we encode it, our algorithms or some of our searches, we merely update our code in one place (.NET libraries, which we also use from SQL Server - good old CLR 🙂 ) and then apply it against our encoded fields/tables to regenerate the encoded data. Every part of our system (desktop front end, public web side, back-end applications and automated processes) points at the same data. Of course, all search code passes the search criteria through the encoding routines before doing the search, to make sure the criteria are in the same format as the encoded data.
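
    For anyone curious, the CLR bit is roughly this shape (not our actual assembly, just a sketch wrapping the made-up `Encode` helper from above):

    ```csharp
    // Sketch of exposing a shared .NET encoder to SQL Server as a CLR scalar
    // function, so the database and the applications run the same code.
    using System.Data.SqlTypes;
    using Microsoft.SqlServer.Server;

    public static class SqlEncoders
    {
        [SqlFunction(IsDeterministic = true, DataAccess = DataAccessKind.None)]
        public static SqlString EncodePhone(SqlString raw) =>
            raw.IsNull ? SqlString.Null : new SqlString(Encode.Phone(raw.Value));
    }
    ```

    After `CREATE ASSEMBLY` and `CREATE FUNCTION ... EXTERNAL NAME`, bulk re-encoding is then just something along the lines of `UPDATE Contacts SET EncodedPhone = dbo.EncodePhone(Phone)`.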

    Creation of the encoded fields is spread across time (i.e. they are generated when records are added or updated), so records are always up to date. It's also possible to update all encoded fields in bulk (e.g. when updating the algorithms). None of this touches the original data.

    Searches are very, very fast - all searches are performed against what are simply extra fields in the database, with relevant indexes.

    Automated de-duplication routines can match on various combinations of relevant matching fields (note I say relevant - we only de-duplicate at certain levels of certainty of a match) and feed back statistics on the most-used matches, which help us identify areas for improvement.
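
    Conceptually, finding candidate duplicates is just grouping on the key fields - something like this (illustrative only, reusing the made-up types from the sketches above):

    ```csharp
    // Sketch of candidate-duplicate detection: group on a composite key and
    // flag any key shared by more than one record.
    using System.Collections.Generic;
    using System.Linq;

    public static class Dedupe
    {
        public static IEnumerable<IGrouping<string, Contact>> Candidates(
            IEnumerable<Contact> contacts) =>
            contacts.GroupBy(c => Keys.NameAndAddress(c))
                    .Where(g => g.Count() > 1); // each group = one set of likely duplicates
    }
    ```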

    We have had this running for years and years, catching, preventing and hunting down huge numbers of potential duplicates. Wipes the floor with the `bought in` CRM systems I have tried.

    Of course we also use front end validation etc. in our interfaces to help, but hey - you know what data can be like 🙂

    Unfortunately I've not been allowed the time for a few years to add more fancy techniques into the mix - ahh, the potential!

    There are overheads - the extra encoded/key fields, etc. - but this isn't a problem for us.

    We run the above against millions of records.

    Hoping this might be of interest to someone!

    M.