Fuzzy-String Search: Find misspelled information with T-SQL
Thomas Keller
Comments posted to this topic are about the item Fuzzy-String Search: Find misspelled information with T-SQL
dwain.c
This is really cool stuff!

Only scanned the article so far but have bookmarked it for a return read this weekend.

I had often wondered if there's an existing algorithm out there to do this sort of thing.

Thanks Thomas!


My mantra: No loops! No CURSORs! No RBAR! Hoo-uh!

My thought question: Have you ever been told that your query runs too fast?

My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.


Need to UNPIVOT? Why not CROSS APPLY VALUES instead?
Since random numbers are too important to be left to chance, let's generate some!
Learn to understand recursive CTEs by example.
Splitting strings based on patterns can be fast!
My temporal SQL musings: Calendar Tables, an Easter SQL, Time Slots and Self-maintaining, Contiguous Effective Dates in Temporal Tables
ZA_Crafty
Awesome! Thanks for sharing this knowledge!
I've been pondering this for a while now, especially given my interest in neural networks.

Perhaps the performance could be improved by implementing it as an external assembly and CLR function?
Thomas Keller
@ZA_Crafty: Perhaps CLR could be faster, but a Transact-SQL-only solution was important to me and to my clients.
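
For anyone who wants a feel for what a pure T-SQL edit-distance function involves before reading the article, below is a naive nested-loop Levenshtein sketch. It is purely illustrative (all names are mine), not the tuned function the article presents:

CREATE FUNCTION dbo.LevenshteinSketch (@s NVARCHAR(50), @t NVARCHAR(50))
RETURNS INT
AS
BEGIN
    DECLARE @sLen INT = LEN(@s), @tLen INT = LEN(@t);
    IF @sLen = 0 RETURN @tLen;
    IF @tLen = 0 RETURN @sLen;

    -- d(i, j) = edit distance between the first i chars of @s and the first j of @t.
    DECLARE @d TABLE (i INT, j INT, cost INT, PRIMARY KEY (i, j));
    DECLARE @i INT = 0, @j INT = 1, @del INT, @ins INT, @sub INT;

    WHILE @i <= @sLen BEGIN INSERT @d VALUES (@i, 0, @i); SET @i += 1; END;
    WHILE @j <= @tLen BEGIN INSERT @d VALUES (0, @j, @j); SET @j += 1; END;

    SET @i = 1;
    WHILE @i <= @sLen
    BEGIN
        SET @j = 1;
        WHILE @j <= @tLen
        BEGIN
            -- Cheapest of: delete a character, insert one, or substitute.
            SELECT @del = cost + 1 FROM @d WHERE i = @i - 1 AND j = @j;
            SELECT @ins = cost + 1 FROM @d WHERE i = @i AND j = @j - 1;
            SELECT @sub = cost + CASE WHEN SUBSTRING(@s, @i, 1) = SUBSTRING(@t, @j, 1)
                                      THEN 0 ELSE 1 END
            FROM @d WHERE i = @i - 1 AND j = @j - 1;

            INSERT @d VALUES (@i, @j,
                CASE WHEN @del <= @ins AND @del <= @sub THEN @del
                     WHEN @ins <= @sub THEN @ins
                     ELSE @sub END);
            SET @j += 1;
        END;
        SET @i += 1;
    END;

    RETURN (SELECT cost FROM @d WHERE i = @sLen AND j = @tLen);
END;

SELECT dbo.LevenshteinSketch(N'kitten', N'sitting');  -- returns 3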
Davin21
I've tended to find that a CLR implementation of the Jaro-Winkler algorithm suits my needs a bit more, but it's really cool to see a T-SQL-only alternative!
Thomas Keller
@Davin21 I have some non-T-SQL implementations (including a Jaro-Winkler) listed on my Reference page - please let me know of others I should consider linking to.
Eric Hobbs
:-D
Fantastic article. Thanks for sharing some of this knowledge. I'm sure that I am not the only one who will be making use of this technique.
Davin21
@ThomasKeller - I found the information from SimMetrics (linky) to be quite useful. What I think would be really interesting to see is a weighted version of this that gives precedence to the first n characters of the string, as I've found that weighting the first four characters (particularly for first name / surname validation) gives much more reliable results.
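
(That weighting idea is essentially what the `Winkler` part of Jaro-Winkler does: it boosts the base Jaro similarity in proportion to the length of the common prefix, conventionally capped at 4 characters. A tiny T-SQL illustration of the arithmetic, with made-up values:)

-- @jaro = base Jaro similarity, @l = common-prefix length (capped at 4),
-- @p = prefix scaling constant, conventionally 0.1.
DECLARE @jaro FLOAT = 0.90, @l INT = 4, @p FLOAT = 0.1;
SELECT @jaro + (@l * @p * (1 - @jaro)) AS JaroWinklerSimilarity;  -- = 0.94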
Jinxseee
Thanks for the info.
I find the CLR/external assembly implementation runs a lot faster for the algorithms I have implemented (Jaro, Levenshtein etc.), but having a T-SQL version is certainly handy.
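
(For anyone going the CLR route, the SQL Server side of the wiring is just two statements once the .NET assembly is compiled. The assembly, class and method names below are invented for illustration:)

CREATE ASSEMBLY FuzzyMetrics
FROM 'C:\clr\FuzzyMetrics.dll'    -- hypothetical path to the compiled DLL
WITH PERMISSION_SET = SAFE;
GO
CREATE FUNCTION dbo.JaroWinkler (@s NVARCHAR(4000), @t NVARCHAR(4000))
RETURNS FLOAT
AS EXTERNAL NAME FuzzyMetrics.[FuzzyMetrics.StringMetrics].JaroWinkler;
GO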
Martin Bastable
Great article! :-) Brilliantly written too!

Out of interest, what other fuzzy search methods do people use?

One I came up with for a CRM system a fair few years ago works as follows...

(Please note: the examples given below are all bad examples, but quick to type :-) )

Requirements:
Fuzzy matching of data for automated de-duplication, automated `existing record` matching (so duplicates don't get created in the first place), and fuzzy searches by users on varying fields of data.

Solution:
First, we identified the main fields our system needed to handle for the above, e.g. Firstname, Lastname, email address, first line of the address, phone numbers.

Second, we came up with some algorithms for `preparing data` - i.e. cleaning up variants of strings and common errors, and getting rid of useless data, while leaving the original data untouched.
E.g. phone numbers: internationalization of numbers, stripping of excess characters, etc., so (+44) 123 45678 90 -> 441234567890, 01234567890 -> 441234567890, (0712)34567890 ex 123 -> 071234567890.
E.g. names: get rid of characters that don't matter much, punctuation, extra words, etc. Making up some examples (not the actual algorithm we use): Fred -> FRD, Mc Donald -> MCDONALD, Bob Smithly-Smithington -> BOB, Frank J. -> FRANK
E.g. addresses: 123, Little House -> 123LITTLEHOUSE, Flat 4b, 57, Some Street -> FLAT4B57SOMEST

You get the idea. Character removal can help with misspellings, variants, etc., and there are plenty of fancy algorithms and encoding methods around to tweak this; it can vary per field. We did some analysis on our data to find what was effective for each data type - names, addresses, emails, etc.
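
(As a concrete sketch of the phone-number case - the function name and details are invented, and simplified to the UK examples above:)

CREATE FUNCTION dbo.EncodePhone (@raw NVARCHAR(50))
RETURNS NVARCHAR(50)
AS
BEGIN
    -- Keep digits only, then swap a leading 0 trunk prefix for the 44 country code.
    DECLARE @digits NVARCHAR(50) = N'', @i INT = 1;
    WHILE @i <= LEN(@raw)
    BEGIN
        IF SUBSTRING(@raw, @i, 1) LIKE '[0-9]'
            SET @digits += SUBSTRING(@raw, @i, 1);
        SET @i += 1;
    END;
    IF LEFT(@digits, 1) = '0'
        SET @digits = '44' + SUBSTRING(@digits, 2, LEN(@digits));
    RETURN @digits;  -- e.g. '(+44) 123 45678 90' -> '441234567890'
END;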

Third, we created a second, duplicate set of fields to hold the above encoded data, i.e. EncodedFirstname, EncodedPhoneNumber, etc.

So we end up with a set of `encoded fields` - e.g.
Firstname: Mc Donald EncodedFirstname: MCDONALD
Address1: , 12 3 I can't type road, EncodedAddress1: 123ICANTTYPERD
etc.

Note that users still see the data the way they entered it, e.g. (+44) 567890 as a phone number, and not the purified 44567890 that we search against.
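
(In table terms - all names invented - it amounts to something like:)

-- Shadow `encoded` columns beside the raw ones, refreshed whenever
-- a record is added or updated.
ALTER TABLE dbo.Contact ADD
    EncodedFirstname NVARCHAR(50) NULL,
    EncodedPhone     NVARCHAR(50) NULL;

UPDATE dbo.Contact
SET EncodedFirstname = dbo.EncodeName(Firstname),   -- hypothetical name cleaner
    EncodedPhone     = dbo.EncodePhone(Phone);

CREATE INDEX IX_Contact_EncodedPhone ON dbo.Contact (EncodedPhone);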

Results:

Multi-part searches can be completed very easily based on the encoded fields.
E.g. searching for a phone number 0123 456 789: depending on the level of accuracy you require, you can match exactly on 44123456789, part-match on 123456789, or return matches on only a certain number of digits, e.g. 1234567 - but you always search against a consistent phone number.
E.g. a name search might return exact matches only on the encoded string, or also include matches where the first X or last X characters match.
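
(In query terms, those accuracy levels are just different predicates against the same encoded column - names hypothetical again:)

DECLARE @p NVARCHAR(50) = dbo.EncodePhone(N'0123 456 789');                          -- '44123456789'

SELECT * FROM dbo.Contact WHERE EncodedPhone = @p;                                   -- exact
SELECT * FROM dbo.Contact WHERE EncodedPhone LIKE '%' + SUBSTRING(@p, 3, 50);        -- part: '123456789'
SELECT * FROM dbo.Contact WHERE EncodedPhone LIKE '%' + SUBSTRING(@p, 3, 7) + '%';   -- fuzzier: '1234567'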

We keep multiple separate versions of encoded fields, with different encoding/cleaning methods, for the same source field.

Depending on the fuzziness of results required by the end user, we just vary the scope of the match - the number of characters matched on, which encoded version is used, etc.

You can use multiple fields combined with each other to generate unique keys for automated duplicate matching, with varying levels of confidence in a match.

e.g. ENCODEDFIRSTNAME+ENCODEDSURNAME+DOB
e.g. ENCODEDFIRSTNAME+LEFT(ENCODEDADDRESSLINE1, 5)
e.g. LEFT(ENCODEDFIRSTNAME, 3) + LEFT(ENCODEDSURNAME, 3) + LEFT(ADDRESSLINE1, 3)

The common combinations of the encoded fields we stick into separate extra fields, KEYFIRSTNAMESURNAMEDOB etc., for speed of later searches.
(Incidentally, we don't use first name + surname + date of birth as an automated duplicate match, as there were too many people in our datasets with the same name and potential date of birth :-) )
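
(Those pre-combined key fields can simply be computed columns, e.g. - invented names again:)

ALTER TABLE dbo.Contact ADD
    KeyFirstnameSurnameDOB AS
        (EncodedFirstname + EncodedSurname + CONVERT(CHAR(8), DOB, 112)),
    KeyFirstnameAddr AS
        (EncodedFirstname + LEFT(EncodedAddressLine1, 5));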

Very fuzzy matches against multiple fields can still return small sets of records (e.g. first 3 or last 3 letters of EncodedFirstname = "abc" + first 3 or last 3 letters of EncodedSurname = "zyx") - two key fields greatly reduce the number of matches, even with very fuzzy input data.
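
(Which in predicate form is roughly:)

-- First-3-or-last-3 characters of each encoded name, combined.
SELECT *
FROM dbo.Contact
WHERE (EncodedFirstname LIKE 'abc%' OR EncodedFirstname LIKE '%abc')
  AND (EncodedSurname   LIKE 'zyx%' OR EncodedSurname   LIKE '%zyx');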

User interface - often overlooked. When we present the results of a search to the users, rather than just a massive list, we give a massive list with visual clues as to the more likely matches. So for a firstname/surname search, entries matching exactly on both names are highlighted, entries matching exactly on first OR surname get a different highlight, and fuzzy matches appear normally. The users have `Fuzzy`, `Partial` and `Exact` buttons they can click to quickly try more restrictive variants of the search if too many results are returned. (We can also automatically re-search on a more restrictive result set if too many records come back, and stick the more likely matches at the top of the search results.)

We share the `encoding` functions that generate our encoded strings, numbers, etc. across parts of our applications. To update or upgrade what we encode, how we encode it, our algorithms or some of our searches, we merely update our code in one place (.NET libraries, which we share with SQL Server - good old CLR :-) ) and then apply it against our encoded fields/tables to refresh the encoded data. Every part of our system (desktop front end, public web side, back-end applications and automated processes) points at the same data. Of course, all search code passes the search criteria through the encoding routines first, to make sure the search data is in the correct format.

Creation of the encoded fields is spread across time (i.e. they are populated when records are added or updated), so records are always up to date. It's also possible to update all encoded fields in bulk (e.g. when updating algorithms), and none of this touches the original data.

Searches are very fast - they are all performed against what are simply extra fields in the database, with relevant indexes.

Automated de-duplication routines can match on various combinations of relevant matching fields (note I say relevant - we only de-duplicate at certain levels of certainty of a match) and feed back statistics on the most-used matches, which help us identify areas for improvement.

We have had this running for years and years, catching, preventing and hunting down huge numbers of potential duplicates. It wipes the floor with the `bought in` CRM systems I have tried.

Of course we also use front-end validation etc. in our interfaces to help, but hey - you know what data can be like :-)

Unfortunately I've not been allowed the time for a few years to add more fancy techniques into the mix - ah, the potential!

There are overheads - the extra encoded/key fields and so on - but this isn't a problem for us.

We run the above against millions of records.

Hoping this might be of interest to someone!

M.