Owning the Spelling Suggestion Feature

  • Comments posted to this topic are about the item Owning the Spelling Suggestion Feature

    Bill Nicolich: www.SQLFave.com.
    Daily tweet of what's new and interesting: AppendNow

  • Not a single reference to SOUNDEX? That doesn't seem right in a spelling suggestion article.

  • So add it!

    SOUNDEX doesn't cover, particularly, the typos angle that Bill seems to have concentrated on. Probably a lot better for the search results, given its sometimes 'strange' output.

    Seriously though, if you incorporated this with some Ajax, and used that for dynamic spelling correction, like Word does, it would be very nifty.

    I recently built an Ajax menu that works on the very simple concept of 'update as you type'. Given folks' ability for typos and fat-finger, this might just get added to second-guess them.

  • My concern would be the number of round trips that might get generated. Other than that, I've always used a 3rd party spell checker (including the one in Word at times) and while they weren't cheap, they typically aren't hugely expensive either.

  • We went with the premise that people would rather find what they're looking for than be prompted with spelling corrections, so we wrote the search feature to join the product keywords to the misspellings as synonyms of the correct term, and match on whatever the customer entered. We monitor the misses between requested keywords and those already in the search table, so new misspellings are added from those seen in the wild. The programmatic permutations could be easily incorporated.

    BTW, an added benefit of the "synonyms" notion is that we can make the plural form of the keyword a synonym of the singular so the content editors need only associate the singular keyword with the product. Also those keywords that are similar in concept can be aliased together so only one of that concept group needs to be applied to the product to allow any of the group to match. (different rank/weightings can be used to resolve exact matches higher in the results than alias matches)

  • I think a hash table would be a much better implementation. It would be hugely faster, and if the word doesn't hash, it would be easy to grab all the words near the hash as suggestions.

  • tnolan (7/21/2008)


    Not a single reference to SOUNDEX? That doesn't seem right in a spelling suggestion article.

    Soundex is horrible. And the SQL Server implementation doesn't match with the NARA Soundex standard anyway. If you want to do phonetic matching, I'd recommend starting with a better phonetic match algorithm. For spell-checking you'll get better accuracy using LCS, edit distance, or n-gram matching. Set-based operations work well with n-gram matching.

  • gcopeland (7/21/2008)


    I think a hash table would be a much better implementation. It would be hugely faster, and if the word doesn't hash, it would be easy to grab all the words near the hash as suggestions.

    A ternary search tree might work even better (although it would be hard to develop in SQL without using CLR). In the TST words that are similar tend to "bunch" together as well. BTW, just wondering but what type of hash function would you use to get similar words to group together? If the function is too simple you're going to pay a penalty in access costs.

  • Regarding number of round trips... In my implementation of spell suggestion, I use a single stored procedure call which uses a single T-SQL statement using a set-based approach even though the search entry is often multiple words. The search entry goes in, empty or filled spelling suggestion message goes out.

    Although the performance seems great, I'd like to take the chance later to go into the specifics of the stored procedure including the performance angle and see if Andy and some of the other performance gurus around here will share their optimization tips.

    Thank you everybody for taking the time to read. Let me know if you find other helper table techniques!

    Bill

    Bill Nicolich: www.SQLFave.com.
    Daily tweet of what's new and interesting: AppendNow

  • Sorry guys, my soundex suggestion was not meant entirely seriously. The typo function provided here is pretty well done, but I think adding phonetic search would add to its overall use. I also do not like SQL SOUNDEX very much. 😉 http://microsoft.apress.com/index.php?id=72 or even a conversion of http://everything2.com/node/459981 could easily be worked in to the code here for phonetic search.

  • tnolan (7/21/2008)


    Sorry guys, my soundex suggestion was not meant entirely seriously. The typo function provided here is pretty well done, but I think adding phonetic search would add to its overall use. I also do not like SQL SOUNDEX very much. 😉 http://microsoft.apress.com/index.php?id=72 or even a conversion of http://everything2.com/node/459981 could easily be worked in to the code here for phonetic search.

    LOL I wrote that Apress article ages ago for their old e-zine. I'm surprised it's still up there - thanks for the link 🙂 There are problems with using Soundex, and phonetic match algorithms in general, for general-purpose applications like spell-checking. Here's a few of the problems I've run into in this area:

    - The majority of phonetic match algorithms are geared toward surname-based searching, and they don't work well for non-surname data (like general dictionary/spell-check applications or even a wide variety of first names).

    - Most phonetic match algorithms are geared toward surnames of western European origin, since most are based to some extent on Soundex which is about 90+ years old and designed to index common American names of the time (mostly of western European origin). Because of this most phonetic match algorithms don't index names (or words) with other origins (eastern European, Asian, African, Indian, Spanish, etc.) very well.

    - Almost all phonetic algorithms preserve the first letter, which makes them pretty useless in spell-checks when the first letter is wrong (transposed letters in the first two positions, missing letter in the first position, etc.) There's usually an implicit assumption that the first letter of the word being encoded is always correct. You can get around this limitation by using string difference calculations, LCS, n-grams, edit distance, etc.; but if you're going to use them in that way may as well consider them to begin with.

    If you wanted to go the phonetic search route I would highly recommend using a better, more modern algorithm than Soundex. NYSIIS is a Soundex variant that improves recall quite a bit; Double Metaphone accounts for some non-English characters that Soundex misses; Daitch-Mokotoff provides better handling for Eastern European names/words, etc. An edit distance, LCS, or n-gram-type algorithm can provide better results than phonetic match. As examples, SSIS uses an n-gram variant in the fuzzy lookup component, and an edit distance algorithm is used by MS Word's spell-checker. The hardest part of these algorithms is making them efficient for large dictionaries.

  • Bill, on the round trips I should have said per word. Think of typing in a forum post, if every word is a proc call, thats still verbose. I'd say my challenge is take in a paragraph or two of text, split it and spell check it, return a list of words that were potentially wrong and their answers. On the Dev side that works reasonably well, in most cases if a word appears more than once it would be spelled wrong in both cases.

  • I have really enjoyed reading this thread. Mike C especially has been extremely enlightening. What I take from his expertise is that there are numerous phonetic matching algorithms, and you need to analyze your requirements carefully to chose the right one.

    I note that spell check is (or was when I was in school) a common problem given in college CS classes. That was the last time I encountered this problem. Like the original article stated, when I need this kind of functionality, I always use a third-party vendor. Dabbling in this stuff is fun as an excercise, but it is too complicated for an amateur solution. It might be almost as silly as trying to roll your own encryption.

  • After this discussion of phonetic algorythms that we have been having, I cannot believe that I just ran across this:

    http://www.prnewswire.co.uk/cgi/news/release?id=189324

    The gist of the article is that the US Transportation Security Administration uses Soundex to search databases for terrorists, and that the search results are almost useless.

  • I thought it was a very interesting article with an interesting concept.

    One note is that there is a set based way of generating the helper table that may be more efficient. Jeff Moden described it in http://www.sqlservercentral.com/articles/TSQL/62867/ .

    ---
    Timothy A Wiseman
    SQL Blog: http://timothyawiseman.wordpress.com/

Viewing 15 posts - 1 through 15 (of 15 total)

You must be logged in to reply to this topic. Login to reply