Click here to monitor SSC
SQLServerCentral is supported by Red Gate Software Ltd.
 
Log in  ::  Register  ::  Not logged in
 
 
 
        
Home       Members    Calendar    Who's On


Add to briefcase ««12

Owning the Spelling Suggestion Feature Expand / Collapse
Author
Message
Posted Monday, July 21, 2008 7:08 PM


Ten Centuries

Ten CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen Centuries

Group: General Forum Members
Last Login: Wednesday, September 24, 2014 1:20 PM
Points: 1,276, Visits: 1,135
tnolan (7/21/2008)
Sorry guys, my soundex suggestion was not meant entirely seriously. The typo function provided here is pretty well done, but I think adding phonetic search would add to its overall use. I also do not like SQL SOUNDEX very much. ;) http://microsoft.apress.com/index.php?id=72 or even a conversion of http://everything2.com/node/459981 could easily be worked in to the code here for phonetic search.


LOL I wrote that Apress article ages ago for their old e-zine. I'm surprised it's still up there - thanks for the link :) There are problems with using Soundex, and phonetic match algorithms in general, for general-purpose applications like spell-checking. Here's a few of the problems I've run into in this area:

- The majority of phonetic match algorithms are geared toward surname-based searching, and they don't work well for non-surname data (like general dictionary/spell-check applications or even a wide variety of first names).

- Most phonetic match algorithms are geared toward surnames of western European origin, since most are based to some extent on Soundex which is about 90+ years old and designed to index common American names of the time (mostly of western European origin). Because of this most phonetic match algorithms don't index names (or words) with other origins (eastern European, Asian, African, Indian, Spanish, etc.) very well.

- Almost all phonetic algorithms preserve the first letter, which makes them pretty useless in spell-checks when the first letter is wrong (transposed letters in the first two positions, missing letter in the first position, etc.) There's usually an implicit assumption that the first letter of the word being encoded is always correct. You can get around this limitation by using string difference calculations, LCS, n-grams, edit distance, etc.; but if you're going to use them in that way may as well consider them to begin with.

If you wanted to go the phonetic search route I would highly recommend using a better, more modern algorithm than Soundex. NYSIIS is a Soundex variant that improves recall quite a bit; Double Metaphone accounts for some non-English characters that Soundex misses; Daitch-Mokotoff provides better handling for Eastern European names/words, etc. An edit distance, LCS, or n-gram-type algorithm can provide better results than phonetic match. As examples, SSIS uses an n-gram variant in the fuzzy lookup component, and an edit distance algorithm is used by MS Word's spell-checker. The hardest part of these algorithms is making them efficient for large dictionaries.
Post #538096
Posted Tuesday, July 22, 2008 5:31 AM
SSCertifiable

SSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiable

Group: Moderators
Last Login: Today @ 7:12 AM
Points: 6,790, Visits: 1,903
Bill, on the round trips I should have said per word. Think of typing in a forum post, if every word is a proc call, thats still verbose. I'd say my challenge is take in a paragraph or two of text, split it and spell check it, return a list of words that were potentially wrong and their answers. On the Dev side that works reasonably well, in most cases if a word appears more than once it would be spelled wrong in both cases.

Andy
SQLAndy - My Blog!
Connect with me on LinkedIn
Follow me on Twitter
Post #538336
Posted Tuesday, July 22, 2008 8:57 AM
SSC-Enthusiastic

SSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-Enthusiastic

Group: General Forum Members
Last Login: Wednesday, October 15, 2014 7:46 AM
Points: 193, Visits: 291
I have really enjoyed reading this thread. Mike C especially has been extremely enlightening. What I take from his expertise is that there are numerous phonetic matching algorithms, and you need to analyze your requirements carefully to chose the right one.

I note that spell check is (or was when I was in school) a common problem given in college CS classes. That was the last time I encountered this problem. Like the original article stated, when I need this kind of functionality, I always use a third-party vendor. Dabbling in this stuff is fun as an excercise, but it is too complicated for an amateur solution. It might be almost as silly as trying to roll your own encryption.
Post #538535
Posted Tuesday, July 22, 2008 1:35 PM
SSC-Enthusiastic

SSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-Enthusiastic

Group: General Forum Members
Last Login: Wednesday, October 15, 2014 7:46 AM
Points: 193, Visits: 291
After this discussion of phonetic algorythms that we have been having, I cannot believe that I just ran across this:

http://www.prnewswire.co.uk/cgi/news/release?id=189324

The gist of the article is that the US Transportation Security Administration uses Soundex to search databases for terrorists, and that the search results are almost useless.
Post #538796
Posted Tuesday, July 22, 2008 5:46 PM


Right there with Babe

Right there with BabeRight there with BabeRight there with BabeRight there with BabeRight there with BabeRight there with BabeRight there with BabeRight there with Babe

Group: General Forum Members
Last Login: Friday, April 4, 2014 4:40 PM
Points: 751, Visits: 917
I thought it was a very interesting article with an interesting concept.

One note is that there is a set based way of generating the helper table that may be more efficient. Jeff Moden described it in http://www.sqlservercentral.com/articles/TSQL/62867/ .


---
Timothy A Wiseman
SQL Blog: http://timothyawiseman.wordpress.com/
Post #538911
Posted Tuesday, July 29, 2008 2:11 PM
Hall of Fame

Hall of FameHall of FameHall of FameHall of FameHall of FameHall of FameHall of FameHall of FameHall of Fame

Group: General Forum Members
Last Login: Monday, September 29, 2014 2:44 PM
Points: 3,475, Visits: 583
Try this to create and populate the table with numbers:
CREATE TABLE [#tblNumbers]( [number] [int])
Go
insert #tblnumbers(number) values( 1)
Go 100
Declare @i int
Set @i = 0
Update #tblnumbers
set @i = number = @i+1
--select * from #tblnumbers
--truncate table #tblnumbers

Go 100 - will execute the batch 100 times as somebody mentioned on this site a while ago.
set @i = number = @i+1
is a trick pointed out in the book "Inside Microsoft SQL Server 7.0 "
by Ron Soukup and Kalen Delaney to get running totals. This is a valid syntax for UPDATE statement.



Regards,
Yelena Varshal

Post #543032
« Prev Topic | Next Topic »

Add to briefcase ««12

Permissions Expand / Collapse