RE: Owning the Spelling Suggestion Feature

July 21, 2008 at 7:08 pm

tnolan (7/21/2008)
Sorry guys, my Soundex suggestion was not meant entirely seriously. The typo function provided here is pretty well done, but I think adding phonetic search would add to its overall usefulness. I also do not like SQL SOUNDEX very much. 😉 http://microsoft.apress.com/index.php?id=72 or even a conversion of http://everything2.com/node/459981 could easily be worked into the code here for phonetic search.

LOL, I wrote that Apress article ages ago for their old e-zine. I'm surprised it's still up there - thanks for the link 🙂 There are problems with using Soundex, and phonetic matching algorithms in general, for general-purpose applications like spell-checking. Here are a few of the problems I've run into in this area:

- The majority of phonetic match algorithms are geared toward surname-based searching, and they don't work well for non-surname data (like general dictionary/spell-check applications or even a wide variety of first names).

- Most phonetic match algorithms are geared toward surnames of western European origin. Most of them are based to some extent on Soundex, which is 90+ years old and was designed to index common American names of the time (mostly of western European origin). Because of this, most phonetic match algorithms don't index names (or words) of other origins (eastern European, Asian, African, Indian, Spanish, etc.) very well.

- Almost all phonetic algorithms preserve the first letter, which makes them pretty useless for spell-checking when the first letter is wrong (transposed letters in the first two positions, a missing letter in the first position, etc.). There's usually an implicit assumption that the first letter of the word being encoded is always correct. You can get around this limitation by using string difference calculations, LCS, n-grams, edit distance, etc.; but if you're going to use them that way, you may as well consider them to begin with.
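To make the first-letter problem concrete, here's a rough Python sketch. The `soundex` function below is my own simplified take on American Soundex (not SQL Server's SOUNDEX and not the code from the Apress article), and `edit_distance` is plain Levenshtein distance; the sample words are just made up for illustration.

```python
def soundex(word: str) -> str:
    """Simplified American Soundex: first letter plus up to three digits."""
    codes = {}
    for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                           ("L", "4"), ("MN", "5"), ("R", "6")):
        for ch in letters:
            codes[ch] = digit

    word = "".join(ch for ch in word.upper() if ch.isalpha())
    if not word:
        return ""

    result = word[0]                 # the first letter is kept verbatim
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        if ch in "HW":
            continue                 # H and W do not separate equal codes
        digit = codes.get(ch, "")
        if digit and digit != prev:
            result += digit
        prev = digit                 # a vowel resets prev, so a repeat after it counts
    return (result + "000")[:4]


def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance (insert, delete, substitute), row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # delete from a
                            curr[j - 1] + 1,            # insert into a
                            prev[j - 1] + (ca != cb)))  # substitute or match
        prev = curr
    return prev[-1]


for typo, intended in (("psell", "spell"), ("nowledge", "knowledge")):
    print(typo, soundex(typo), intended, soundex(intended),
          edit_distance(typo, intended))
```

Both typos get a different Soundex code from the intended word because the first letter differs, so a lookup keyed on the code never finds the right word, while the edit distance stays small (2 and 1 respectively).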

If you wanted to go the phonetic search route, I would highly recommend using a better, more modern algorithm than Soundex. NYSIIS is a Soundex variant that improves recall quite a bit; Double Metaphone accounts for some non-English characters that Soundex misses; Daitch-Mokotoff provides better handling for Eastern European names/words, etc. For spell-checking, though, an edit distance, LCS, or n-gram-type algorithm can provide better results than phonetic matching. As examples, SSIS uses an n-gram variant in the Fuzzy Lookup component, and an edit distance algorithm is used by MS Word's spell-checker. The hardest part of these algorithms is making them efficient for large dictionaries.
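As a rough idea of what the n-gram approach looks like, here's a minimal Python sketch. To be clear, this is not how SSIS or Word actually do it; the function names (`ngrams`, `dice`, `build_index`, `suggest`), the `#` padding character, and the tiny word list are all assumptions of mine. It scores words by the Dice coefficient over character bigrams and uses an inverted index so only dictionary words sharing at least one bigram with the input get scored, which is one simple way to keep large dictionaries manageable.

```python
from collections import defaultdict


def ngrams(word: str, n: int = 2) -> set:
    """Character n-grams of a word, padded with '#' so the edges count too."""
    pad = "#" * (n - 1)
    padded = f"{pad}{word.lower()}{pad}"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def dice(a: str, b: str, n: int = 2) -> float:
    """Dice coefficient over the two n-gram sets (1.0 means identical sets)."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga and not gb:
        return 1.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))


def build_index(dictionary, n: int = 2):
    """Inverted index: n-gram -> set of dictionary words containing it."""
    index = defaultdict(set)
    for word in dictionary:
        for gram in ngrams(word, n):
            index[gram].add(word)
    return index


def suggest(word: str, index, n: int = 2, limit: int = 3):
    """Score only the words that share at least one n-gram with the input."""
    candidates = set()
    for gram in ngrams(word, n):
        candidates |= index.get(gram, set())
    # Highest score first; ties are broken alphabetically.
    return sorted(candidates, key=lambda w: (-dice(word, w, n), w))[:limit]


index = build_index(["apple", "people", "spell"])
print(suggest("psell", index))
```

Note that "psell" still finds "spell" even though the first letter is wrong, because the match is driven by the shared bigrams later in the word rather than by a code anchored on the first letter.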