Doing Fuzzy Searches in SQL Server

A series of arguments with developers who insist that fuzzy searches or spell-checking be done within the application rather than a relational database inspired Phil Factor to show how it is done. When the database must find relevant material from search terms entered by users, it must learn to expect, and deal with, both expected and unexpected search terms.

When an application searches for specific text data, you can generally rely on it to get the search term right. If it needs to find ‘sausages’, for example, you won’t expect to receive a search on ‘sossyjez’. When people search your website or application, however, they have grown to expect it to cope with that sort of thing.

In this article, I’ll be explaining some of the strategies that you can adopt to make the searches that users make more satisfactory for them. Although these techniques are well-tried in the industry, it is rare to hear of them being used in SQL.

General principles

When you create your application, you will need an ‘inversion table’ that lists all the words that are legitimately ‘searchable’. This technique is described here. If, for example, you are selling widgets, the inversion table would contain a list of widgets, and the widget spares, repairs, advice, instructions and so on. At this stage, we’ll stick to a single-language site; if your site is multi-language, the structure of the related tables is rather different.

This word list will be referenced, by a foreign key constraint, from a large, narrow table that records where the various entities associated with each word are stored, so that relevant matches, maybe a link, a phrase, a description or an image, can be returned to the user. This second table usually records at least the location of the text and the sequence number of the word where it occurs within it. It is the inversion table, containing the list of words, that we’ll focus on in this article.
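As a rough illustration, here is a minimal sketch of the two tables. The names (CommonWords and WordOccurrence) and column sizes are mine, chosen for the examples that follow, and not necessarily those of any downloadable code.

CREATE TABLE dbo.CommonWords        -- the 'inversion table' of searchable words
  (Word NVARCHAR(40) NOT NULL PRIMARY KEY);

CREATE TABLE dbo.WordOccurrence     -- the large, narrow table of where each word appears
  (Word NVARCHAR(40) NOT NULL REFERENCES dbo.CommonWords (Word),
   LocationID INT NOT NULL,         -- the document, product page or image it appears in
   WordPosition INT NOT NULL,       -- sequence number of the word within that text
   PRIMARY KEY (Word, LocationID, WordPosition));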

The task

When users search the site, even if they are using a single language, they will misspell or use national spellings. They will also use dialect words, or specialist words, for the widget, and even then they may get the spelling wrong. They will use argot or even textspeak (‘I H8 U 4 UR HCC’).

In short, your “fuzzy search” algorithm ought to be able to cope with a lot of creative ways to search for the same thing, for example:

  • dialect differences – crayfish, crawfish, écrevisse, crawdads, yabbies
  • national spelling differences – yoghourt, yogurt and yoghurt
  • national word differences – pants and trousers
  • multiple valid spellings – encyclopaedia, encyclopedia
  • misspellings – definitely often rendered as definitly, definately or defiantly
  • mistyping – computer as copmuter or comptuer.

We’ll use three basic techniques to cope with all this: the ‘edit difference’, the hardcoded ‘common misspelling’ lookup, and the sound-alike. The lookup deals with some of the ways that words become difficult to match, and the ‘edit difference’ approach is good for interpreting mistyping, but in truth the edit-difference and sound-alike techniques work together to catch a wide variety of errors.

The Test-bed

To set this up, we’ll have our inversion table. We’ll also need a function to generate metaphones for when we get to the ‘sound-alike’ techniques.

All we need to start is a list of all the common words in the English language. To this we can then add the common mistypings and ‘aliases’ (by which I mean dialect words or national differences). The list of common words can be found all over the internet but is here in the SQL of Scrabble and Rapping or the Fireside Fun of Decapitations. Common mistypings and common misspellings are available in the public domain. Dialect words and national differences are dependent on your location. You’ll also need to load the code for the metaphone function from here.
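As a sketch of the setup, assuming the word list has already been bulk-loaded into a staging table dbo.WordList(Word) and the metaphone function has been created as dbo.MetaPhone (both names are placeholders for whatever you downloaded), we can extend the CommonWords table and populate it:

-- add the sound-alike key, and a column that lets an alias point at its canonical word
ALTER TABLE dbo.CommonWords
  ADD Metaphone VARCHAR(40) NULL,
      CanonicalWord NVARCHAR(40) NULL REFERENCES dbo.CommonWords (Word);
GO
-- load the canonical words, storing each word's metaphone alongside it
INSERT INTO dbo.CommonWords (Word, Metaphone)
  SELECT DISTINCT LOWER(Word), dbo.MetaPhone(LOWER(Word))
    FROM dbo.WordList;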

You can then insert your secondary ‘alias’ words like this:
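For instance, with the hypothetical CanonicalWord column, each alias row simply points at the canonical word it should resolve to (the canonical words must already be present, because of the foreign key, and the aliases must not already exist as words in their own right):

INSERT INTO dbo.CommonWords (Word, Metaphone, CanonicalWord)
  VALUES (N'crawdad',  dbo.MetaPhone(N'crawdad'),  N'crayfish'),
         (N'yabby',    dbo.MetaPhone(N'yabby'),    N'crayfish'),
         (N'yoghourt', dbo.MetaPhone(N'yoghourt'), N'yogurt');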

…and so on, through the rest of the dialect words, national variants and common misspellings.

Now, you can select both the standard ‘canonical’ words and the ‘aliases’:
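With that design, a query such as this sketch returns each word along with the canonical word it resolves to:

SELECT Word, COALESCE(CanonicalWord, Word) AS ResolvesTo
  FROM dbo.CommonWords;   -- both the canonical words and their aliases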

Or you may want to search just the canonical words, excluding the aliases, by adding a condition:
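…which, with the hypothetical column above, is simply a test for the rows that don’t point at another word:

SELECT Word
  FROM dbo.CommonWords
  WHERE CanonicalWord IS NULL;   -- canonical words only; aliases excluded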

Edit Difference

We can use the ‘edit difference’ technique to cope with simple mistyping or misspelling errors.

When you compare words and need a measure of how alike they are, you generally calculate how many single-character edits (deletions, transpositions, replacements and insertions) you would need to make to get from one to the other. It is commonly believed that, to search for a word, you need to calculate the number of edits required to change your candidate word into every other word in the inversion table. Well, no. That would take at least thirteen seconds on our word list with the full Damerau-Levenshtein algorithm.

A much simpler and faster technique is to select just the candidates that are one ‘edit difference’, or Levenshtein Distance, away because you only want candidate matches that are similar.

You just produce all the character strings that are one edit distance away and do an inner join with your table of words to eliminate the vast majority that aren’t legitimate words in your table. This deals with the majority of errors. No fuss at all.

Here is the code:
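What follows is a minimal re-creation of the idea rather than the downloadable original: an inline table-valued function that generates every string one edit away from the search term (assuming lower-case words and the hypothetical CommonWords table above) and keeps only those that exist as words.

CREATE FUNCTION dbo.OneEditAwayMatches (@Word NVARCHAR(40))
RETURNS TABLE
AS
RETURN
WITH Positions AS   -- one row for each character position in the search term
  (SELECT TOP (LEN(@Word)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS p
     FROM sys.all_objects),
Letters AS          -- the alphabet used for substitutions and insertions
  (SELECT SUBSTRING('abcdefghijklmnopqrstuvwxyz', n, 1) AS letter
     FROM (SELECT TOP (26) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS n
             FROM sys.all_objects) AS Numbers),
Candidates AS
  (SELECT STUFF(@Word, p, 1, '') AS Candidate                     -- deletions
     FROM Positions
   UNION
   SELECT STUFF(@Word, p, 1, letter)                              -- substitutions
     FROM Positions CROSS JOIN Letters
   UNION
   SELECT STUFF(@Word, p, 0, letter)                              -- insertions before each position
     FROM Positions CROSS JOIN Letters
   UNION
   SELECT @Word + letter                                          -- insertions at the end
     FROM Letters
   UNION
   SELECT STUFF(STUFF(@Word, p, 1, SUBSTRING(@Word, p + 1, 1)),   -- transpositions of
                p + 1, 1, SUBSTRING(@Word, p, 1))                 -- adjacent characters
     FROM Positions
     WHERE p < LEN(@Word))
SELECT cw.Word
  FROM Candidates AS c
  JOIN dbo.CommonWords AS cw
    ON cw.Word = c.Candidate      -- the inner join throws away anything that isn't a real word
  WHERE c.Candidate <> @Word;

Because the join is against the clustered primary key of the word table, the few hundred candidate strings are checked very cheaply.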

With a test bank of 147 common misspellings, the function failed to come up with the correct version only thirteen times: less than a ten percent failure rate. In all, it took 380 ms on a slow test server to find all the one-edit-difference matches for 147 words. That’s roughly 2.6 ms each.

This might be all you need for some applications, but we can pick up most of the misspellings that the edit-difference search failed to find, such as phonetic misspellings, using sound-alike matching.

Sound-Alike

For words that are mis-typed phonetically, there is the Metaphone algorithm. There are more advanced and much more complicated versions of this, but I’m using the early public-domain form; I have a SQL version here. Soundex, which is built into SQL, isn’t much use because it was developed for hand-coding, before computers, and isn’t discriminating enough.

The sample word table has the metaphone stored with each word. This means that all you need to do is to find the metaphone for the word and search the metaphone column.

Let’s return to our extreme example from the introduction.
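A sketch, using the Metaphone column and the dbo.MetaPhone function assumed earlier:

DECLARE @SearchTerm NVARCHAR(40) = N'sossyjez';
SELECT Word
  FROM dbo.CommonWords
  WHERE Metaphone = dbo.MetaPhone(@SearchTerm);   -- should come back with just 'sausages'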

We aren’t always that lucky in getting just one candidate that matches; a less distinctive misspelling, searched by the same technique, may return several candidates.

We can get the most likely ones listed first in the candidate list by ordering the candidates by their edit difference:
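Here I’m assuming an edit-distance function with the signature dbo.DamerauLevenshtein(@string1, @string2, @maxDistance); the name is a placeholder for whichever implementation, such as Steve Hatchett’s, you have installed:

DECLARE @SearchTerm NVARCHAR(40) = N'definately';
SELECT Word
  FROM dbo.CommonWords
  WHERE Metaphone = dbo.MetaPhone(@SearchTerm)
  ORDER BY dbo.DamerauLevenshtein(Word, @SearchTerm, 4);   -- nearest candidates first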

In a simple test run, 30 of the common misspellings, a fifth of the total, produced a different metaphone from that of the correct spelling. Only three of these were misspellings that the edit-difference approach also failed to resolve correctly, but it means that the metaphone approach, used by itself, was quite likely to suggest the wrong word. In other words, it would be no good on its own for resolving those common misspellings, but it is still a good fallback for when the edit-difference approach fails to deliver.

The general problem with the metaphone approach comes from the language itself, in that a great number of common words sound alike once you eliminate the ‘syllabic peak or nucleus’ by making all vowels alike. Because a metaphone tries to represent the sounds, in some cases it is very discriminating, too discriminating in fact, and in other cases it will come up with fifty matches.

However, if you get too many, then you can fall back on a function that calculates the edit distance, preferably a fast Damerau-Levenshtein implementation such as Steve Hatchett’s version, described in Optimizing the Damerau-Levenshtein Algorithm in TSQL. You can then order the candidates by their edit difference to get the most likely ones. The best approach is to limit the search to strings that are two edit differences away; anything further is pretty pointless for finding similar strings because, at three edit differences away, they just aren’t similar! I believe that some applications attempt to find edit differences between the metaphones themselves, but I think that is probably taking things too far.
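Putting the cap and the ordering together, still with the placeholder dbo.DamerauLevenshtein function, the fall-back looks something like this:

DECLARE @SearchTerm NVARCHAR(40) = N'definitly';
SELECT TOP (5) Word
  FROM dbo.CommonWords
  WHERE Metaphone = dbo.MetaPhone(@SearchTerm)
    AND dbo.DamerauLevenshtein(Word, @SearchTerm, 2) <= 2   -- anything further away isn't similar
  ORDER BY dbo.DamerauLevenshtein(Word, @SearchTerm, 2);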

Hard-coded Common Misspellings, Dialect Words and Aliases

Sometimes, a common misspelling is more than a single edit away from the intended search term, and won’t be picked up by either our “edit difference” or sound-alike algorithms. In such cases, you can treat it as a hard-coded alias. We also need a special way of catching dialect words and other synonyms that can’t be caught any other way.

Let’s illustrate this simply.
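With the hypothetical CanonicalWord column, a misspelling that is two edits adrift, ‘seperatly’ for ‘separately’, say, can be hard-coded as an alias and then resolved with a simple primary-key lookup:

INSERT INTO dbo.CommonWords (Word, Metaphone, CanonicalWord)
  VALUES (N'seperatly', dbo.MetaPhone(N'seperatly'), N'separately');

SELECT COALESCE(CanonicalWord, Word) AS Suggestion
  FROM dbo.CommonWords
  WHERE Word = N'seperatly';      -- returns 'separately'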

You will see that this technique doesn’t allow an alias or misspelling that is a real ‘canonical’ word. This means that word confusions can only be spotted in context, and also you’re not going to cope with the difference between ‘pants’ and ‘trousers’ this way.

It also doesn’t allow you to misspell an alias. ‘Crawdads’ need to be spelled right if you are asking for a ‘crayfish’. However, in this case, you can allow aliases in the initial edit-difference approach. It is a matter of fine-tuning for your application.

Putting it all together

By now, if you’ve been following my argument, the best tactic is probably going to be a composite one. The first stage is to see if the word is an alias or a known misspelling. If not, then has it an edit distance of one to any canonical word in your table? If not, then does it share a metaphone with any canonical words in your table? If yes, and there are too many, sort them in ascending edit difference and take the top few; otherwise, just return them. If no, then it is time to salvage something by fetching a limited number of words with an edit distance of two, using Steve Hatchett’s version of the Damerau-Levenshtein algorithm and specifying that it abandons its work on a particular word once it realises that the word is more than two edit distances away.

The general principles are:

  • Use fast table searches where the word is the clustered primary key wherever possible.
  • Where you are forced to use a slow algorithm, reduce your candidate set as much as possible beforehand using SQL’s built-in text searching.
  • Make sure that your search feature can learn, or rather be taught, from the search terms that cause difficulty. This means logging all searches and their success, and using this data to maintain the aliases.

Here’s the complete “fuzzy search” algorithm:
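The procedure below is a sketch of such a composite search rather than the original download; it assumes the objects defined earlier (dbo.CommonWords, dbo.OneEditAwayMatches, dbo.MetaPhone) and the placeholder edit-distance function dbo.DamerauLevenshtein.

CREATE PROCEDURE dbo.FuzzySearch @SearchTerm NVARCHAR(40)
AS
BEGIN
  SET NOCOUNT ON;
  SET @SearchTerm = LOWER(@SearchTerm);

  -- Stage 1: an exact hit on a canonical word, an alias or a known misspelling
  IF EXISTS (SELECT 1 FROM dbo.CommonWords WHERE Word = @SearchTerm)
    BEGIN
      SELECT COALESCE(CanonicalWord, Word) AS Suggestion
        FROM dbo.CommonWords
        WHERE Word = @SearchTerm;
      RETURN;
    END;

  -- Stage 2: words one edit away from the search term, aliases resolved to canonical words
  IF EXISTS (SELECT 1 FROM dbo.OneEditAwayMatches(@SearchTerm))
    BEGIN
      SELECT DISTINCT COALESCE(cw.CanonicalWord, cw.Word) AS Suggestion
        FROM dbo.OneEditAwayMatches(@SearchTerm) AS m
        JOIN dbo.CommonWords AS cw ON cw.Word = m.Word;
      RETURN;
    END;

  -- Stage 3: canonical words that share a metaphone, nearest (by edit distance) first
  IF EXISTS (SELECT 1 FROM dbo.CommonWords
              WHERE Metaphone = dbo.MetaPhone(@SearchTerm) AND CanonicalWord IS NULL)
    BEGIN
      SELECT TOP (5) Word AS Suggestion
        FROM dbo.CommonWords
        WHERE Metaphone = dbo.MetaPhone(@SearchTerm) AND CanonicalWord IS NULL
        ORDER BY dbo.DamerauLevenshtein(Word, @SearchTerm, 4);
      RETURN;
    END;

  -- Stage 4: last resort; anything within two edits, pre-filtered cheaply on length
  SELECT TOP (5) Word AS Suggestion
    FROM dbo.CommonWords
    WHERE CanonicalWord IS NULL
      AND ABS(LEN(Word) - LEN(@SearchTerm)) <= 2
      AND dbo.DamerauLevenshtein(Word, @SearchTerm, 2) <= 2
    ORDER BY dbo.DamerauLevenshtein(Word, @SearchTerm, 2);
END;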

We can now test it.
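For example, using a few of the misspellings and aliases mentioned earlier:

EXEC dbo.FuzzySearch N'sossyjez';    -- sound-alike: should suggest 'sausages'
EXEC dbo.FuzzySearch N'copmuter';    -- one transposition away from 'computer'
EXEC dbo.FuzzySearch N'crawdad';     -- a dialect alias for 'crayfish'
EXEC dbo.FuzzySearch N'seperatly';   -- a hard-coded misspelling of 'separately'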

Conclusions

Many times in the past, I’ve had arguments with members of the development teams who, when we are discussing fuzzy searches, draw themselves up to their full height, look dignified, and say that a relational database is no place to be doing fuzzy searches or spell-checking. It should, they say, be done within the application layer. This is nonsense, and we can prove it with a stopwatch.

We are dealing with data. Relational databases do this well, but it just has to be done right. This implies searching on well-indexed fields such as the primary key, and not being ashamed of having quite large working tables. It means dealing with the majority of cases as rapidly as possible. It implies learning from failures to find a match. It means, most of all, a re-think away from a procedural strategy.