|
|
|
Ten Centuries
      
Group: General Forum Members
Last Login: Wednesday, March 27, 2013 3:22 PM
Points: 1,276,
Visits: 1,112
|
|
tnolan (7/21/2008) Sorry guys, my soundex suggestion was not meant entirely seriously. The typo function provided here is pretty well done, but I think adding phonetic search would add to its overall use. I also do not like SQL SOUNDEX very much. ;) http://microsoft.apress.com/index.php?id=72 or even a conversion of http://everything2.com/node/459981 could easily be worked in to the code here for phonetic search.
LOL I wrote that Apress article ages ago for their old e-zine. I'm surprised it's still up there - thanks for the link :) There are problems with using Soundex, and phonetic match algorithms in general, for general-purpose applications like spell-checking. Here's a few of the problems I've run into in this area:
- The majority of phonetic match algorithms are geared toward surname-based searching, and they don't work well for non-surname data (like general dictionary/spell-check applications or even a wide variety of first names).
- Most phonetic match algorithms are geared toward surnames of western European origin, since most are based to some extent on Soundex which is about 90+ years old and designed to index common American names of the time (mostly of western European origin). Because of this most phonetic match algorithms don't index names (or words) with other origins (eastern European, Asian, African, Indian, Spanish, etc.) very well.
- Almost all phonetic algorithms preserve the first letter, which makes them pretty useless in spell-checks when the first letter is wrong (transposed letters in the first two positions, missing letter in the first position, etc.) There's usually an implicit assumption that the first letter of the word being encoded is always correct. You can get around this limitation by using string difference calculations, LCS, n-grams, edit distance, etc.; but if you're going to use them in that way may as well consider them to begin with.
If you wanted to go the phonetic search route I would highly recommend using a better, more modern algorithm than Soundex. NYSIIS is a Soundex variant that improves recall quite a bit; Double Metaphone accounts for some non-English characters that Soundex misses; Daitch-Mokotoff provides better handling for Eastern European names/words, etc. An edit distance, LCS, or n-gram-type algorithm can provide better results than phonetic match. As examples, SSIS uses an n-gram variant in the fuzzy lookup component, and an edit distance algorithm is used by MS Word's spell-checker. The hardest part of these algorithms is making them efficient for large dictionaries.
|
|
|
|
|
SSCertifiable
       
Group: Moderators
Last Login: Thursday, May 09, 2013 12:38 PM
Points: 6,462,
Visits: 1,384
|
|
|
|
|
|
SSC-Enthusiastic
      
Group: General Forum Members
Last Login: Monday, May 06, 2013 12:50 PM
Points: 129,
Visits: 223
|
|
I have really enjoyed reading this thread. Mike C especially has been extremely enlightening. What I take from his expertise is that there are numerous phonetic matching algorithms, and you need to analyze your requirements carefully to chose the right one.
I note that spell check is (or was when I was in school) a common problem given in college CS classes. That was the last time I encountered this problem. Like the original article stated, when I need this kind of functionality, I always use a third-party vendor. Dabbling in this stuff is fun as an excercise, but it is too complicated for an amateur solution. It might be almost as silly as trying to roll your own encryption.
|
|
|
|
|
SSC-Enthusiastic
      
Group: General Forum Members
Last Login: Monday, May 06, 2013 12:50 PM
Points: 129,
Visits: 223
|
|
After this discussion of phonetic algorythms that we have been having, I cannot believe that I just ran across this:
http://www.prnewswire.co.uk/cgi/news/release?id=189324
The gist of the article is that the US Transportation Security Administration uses Soundex to search databases for terrorists, and that the search results are almost useless.
|
|
|
|
|
Right there with Babe
      
Group: General Forum Members
Last Login: Thursday, February 14, 2013 12:01 PM
Points: 743,
Visits: 900
|
|
I thought it was a very interesting article with an interesting concept.
One note is that there is a set based way of generating the helper table that may be more efficient. Jeff Moden described it in http://www.sqlservercentral.com/articles/TSQL/62867/ .
--- Timothy A Wiseman SQL Blog: http://timothyawiseman.wordpress.com/
|
|
|
|
|
Hall of Fame
       
Group: General Forum Members
Last Login: Monday, July 30, 2012 10:42 AM
Points: 3,434,
Visits: 519
|
|
Try this to create and populate the table with numbers: CREATE TABLE [#tblNumbers]( [number] [int]) Go insert #tblnumbers(number) values( 1) Go 100 Declare @i int Set @i = 0 Update #tblnumbers set @i = number = @i+1 --select * from #tblnumbers --truncate table #tblnumbers
Go 100 - will execute the batch 100 times as somebody mentioned on this site a while ago. set @i = number = @i+1 is a trick pointed out in the book "Inside Microsoft SQL Server 7.0 " by Ron Soukup and Kalen Delaney to get running totals. This is a valid syntax for UPDATE statement.
Regards, Yelena Varshal
|
|
|
|