|
|
|
Grasshopper
      
Group: General Forum Members
Last Login: Wednesday, August 30, 2006 11:06 AM
Points: 11,
Visits: 1
|
|
An example: select dbo.udf_levenshtein('Yacht Sales', 'Yacht Charters International')--21 select dbo.udf_levenshtein('Yacht Charters International', 'Yacht Sales')--22 The first parameter issue for the udf's that are used in select clauses is a known issue. I don't have the link though. Thanks
|
|
|
|
|
SSC Rookie
      
Group: General Forum Members
Last Login: Tuesday, April 30, 2013 2:01 PM
Points: 39,
Visits: 25
|
|
Thank you! I had just been grousing about soundex and a better solution drops into my lap. I hope to use this in production ASAP. A Question... When I run: select 'SCHUMAKER' as Name, dbo.udf_nysiis ('SCHUMAKER') as Result union all select 'SHOEMAKER', dbo.udf_nysiis ('SHOEMAKER') union all select 'SHEWMAKER', dbo.udf_nysiis ('SHEWMAKER') union all select 'SCHUWMACKER', dbo.udf_nysiis ('SCHUWMACKER')
I get: Name Result ----------- --------------- SCHUMAKER SANAKAR SHOEMAKER SANAKAR SHEWMAKER SAENAKAR SCHUWMACKER SAUNACKAR
The last two results bother me. Aren't all vowels supposed to be replaced with "A" or have I misread the algorithm? I'm using the version of the code included in the install directory. Thanks again
|
|
|
|
|
Ten Centuries
      
Group: General Forum Members
Last Login: Wednesday, March 27, 2013 3:22 PM
Points: 1,276,
Visits: 1,112
|
|
Hi and thanks for the comments! You're right, all vowels should be replaced with 'A', and adjacent vowels should be reduced to a single 'A' during the process. I'll have a look at the code and see where the issue lies. I believe it's probably a simple matter of the way in which embedded 'W' characters are being handled. It looks like the rule for encoding the 'W' is being applied correctly (when a 'W' is encountered, if previous letter is a vowel, then the 'W' is replaced with the previous letter); however, it looks like the previous letter is not being properly encoded as 'A' when it is a vowel. I'll get a fix for that together and ask the good guys here to re-post. I should have the fix together by Tuesday at the latest. I'll re-post direct contact info. here at that time if you'd like to have the update sent directly to you. Thanks again!
|
|
|
|
|
Valued Member
      
Group: General Forum Members
Last Login: Tuesday, July 17, 2012 6:17 AM
Points: 55,
Visits: 91
|
|
Mike, Thanks for this tool, it's quite useful. It appears that the 'PH' replacement portion needs a fix too. -- this part works (it makes a 'P' into an 'F') } else if (currentchar == 'P' && nextchar == 'H') { enc_char = 'F';
-- this part takes the 'H' and makes it a 'P' because the original value from the string (P) is being evaluated instead of the encoded value (F). So you end up with 'FP' instead of 'FF' (the 'FF' would be made 'F' by the dupe checker). } else if (currentchar == 'H') { if (!IsVowel(prevchar) || !IsVowel(nextchar)) { { if (IsVowel(prevchar)) enc_char = 'A'; else enc_char = prevchar; } Example: select fn_NYSIIS('joseph') Returns: 'JASAFP' Thanks again
|
|
|
|
|
Ten Centuries
      
Group: General Forum Members
Last Login: Wednesday, March 27, 2013 3:22 PM
Points: 1,276,
Visits: 1,112
|
|
Hi Robert, Duray and Shewmaker - I resolved the issues and have sent the updated files to the good guys here at SQLServerCentral. Here's the fixes: -NYSIIS multi-character encoding (i.e., "vowel + H", "vowel + W", "SCH", "EV", "PH", etc.) was modified to resolve the issues found. 'JOSEPH' now encodes as 'JASAF'. 'SCHUMAKER', 'SHOEMAKER', 'SHEWMAKER', and 'SCHUWMAKER' now all properly encode as 'SANAKAR'. -Levenshtein Edit Distance algorithm was revised to provide "symmetric" results. I also tested the Levenshtein Edit Distance against a couple of versions in other languages to make sure the results are consistent. dbo.udf_levenshtein('Yacht Sales', 'Yacht Charters International') and dbo.udf_levenshtein('Yacht Charters International', 'Yacht Sales') both return an accurate edit distance calculation of 19. Also, a small correction to the article. In NYSIIS, 'johnson', 'johnsen' encode as 'JANSAN'. 'johansen' and 'johannsen' encode as 'JAHANSAN'. The way NYSIIS works, if an 'H' is surrounded by vowels, it is kept as an 'H' otherwise it is eliminated from the encoded string. The good guys here uploaded the new version, and it can now be downloaded directly from the article. Thanks for the feedback guys!
|
|
|
|
|
Grasshopper
      
Group: General Forum Members
Last Login: Wednesday, August 30, 2006 11:06 AM
Points: 11,
Visits: 1
|
|
Hi, I downloaded the update, and replaced the dll's. I guess that's all I need to do... I have another issue (more like a confession), possibly out of NYSIIS and LEVENSHTEIN context, but might help others... I was using soundex to filter out the results, and udf_levenshtein to sort them, which gave me quite a bit of performance increase, and sounded like a great idea... I was thinking that NYSIIS is an improved soundex, so filtering with soundex would do just fine... Well, it does UNLESS you have punctuation characters such as ' (as in O'brian) ... declare @myname varchar(255) declare @mydex varchar(4) declare @mydex2 varchar(10) set @myname = 'o''brian' set @mydex = soundex(@myname) set @mydex2 = dbo.udf_nysiis(@myname) select @mydex --( returns O000) select @mydex2 -- (returns OBRAN) however declare @mydex varchar(4) declare @mydex2 varchar(10) set @myname = 'obrian' set @mydex = soundex(@myname) set @mydex2 = dbo.udf_nysiis(@myname) select @mydex --( returns O165) select @mydex2 --( returns OBRAN ) So, soundex does NOT eliminate the punctuation errors, but NYSIIS does. Just fyi... I guess I will be researching some more "optimization"  Thank you again...
|
|
|
|
|
Ten Centuries
      
Group: General Forum Members
Last Login: Wednesday, March 27, 2013 3:22 PM
Points: 1,276,
Visits: 1,112
|
|
That's it, just copy the new DLLs into the MSSQL\BINN directory (may require you to stop and restart MSSQLServer service). That's a very good point. The handling of punctuation is implementation-specific. There's really no 'standards body' or organization that sets specific rules for implementation of the various phonetic algorithms. (This is really apparent with newer algorithms, like Double Metaphone!) It appears that the MS implementation of soundex stops encoding when it hits a punctuation mark. You'll get 'O000' if you try to encode SOUNDEX('O.BRIEN'), SOUNDEX('O''BRIEN') or SOUNDEX ('O BRIEN'). The NYSIIS implementation in the toolkit, on the other hand, ignores punctuation altogether. One way to deal with it on the SOUNDEX side is to eliminate all non-alphabetic characters from your string before attempting to encode it with SOUNDEX. This could be done in a UDF, but keep in mind that string manipulations and looping in T-SQL are not extremely efficient (that's one of the main reasons I implemented NYSIIS and Levenshtein Edit Distance as XP's in C++). I would recommend that you create a separate table with the proper SOUNDEX encoding and the NYSIIS encoding, and relate it to your main table. That way you would only have to perform these string manipulations once; also, you'll be able to take advantage of proper indexes on the Encodings table in your queries. All of the encoding overhead would be handled at load time, and your queries would run a lot more efficiently. The down-side is that you'll be using up more space to store these encodings, but it shouldn't be significantly more. For 1 million rows, you're looking at about ~10 MB to store the Soundex encodings and six-byte NYSIIS encodings. Proper non-clustered indexes will take up additional space, of course. You could cut down the amount of space used in the Encodings table by using the actual name as a foreign key, to avoid duplicate entries for "JOHNSON, JAMES" and other common names (example). Alternatively, since NYSIIS encodes the first letter as itself (as does SOUNDEX), you could narrow down your searches by just comparing the first letter. SELECT * FROM People WHERE Name LIKE 'J%' AND dbo.udf_NYSIIS('JOHNSON') = dbo.udf_NYSIIS(Name) Thanks!
|
|
|
|
|
Forum Newbie
      
Group: General Forum Members
Last Login: Wednesday, March 08, 2006 10:43 AM
Points: 1,
Visits: 1
|
|
Hi all, I hope this doesn't fall on deaf ears. There's a memory leak in both of these extended procedures. In proc.cpp, you allocate new params structures but don't delete them. in proc.cpp in the xp_levenshtein dll you need to add the lines delete P1; delete P2; delete P3;before the line return sqlerr; and in proc.cpp in the xp_nysiis project likewise, you need to add the lines delete P1; delete P2;
before the same line; return sqlerr; Hope that helps someone, as the provided dlls helped bring down one of our production sql servers! All the best, Leon
|
|
|
|
|
Ten Centuries
      
Group: General Forum Members
Last Login: Wednesday, March 27, 2013 3:22 PM
Points: 1,276,
Visits: 1,112
|
|
Hi Leon, Thanks for pointing that out. I actually included that in the clean-up of the most recent version, but sent the SSC guys an older version. I've asked them to repost the updated version that handles the memory leak, a potential memory reclamation issue, and includes a DROP_XP.SQL script as well as the REDIST directory with support DLL's required by some installations (some installations might have difficulty locating the ODS and VC++ Runtime libraries if they're not in the \Binn directory). Thanks again for the feedback, and sorry for the inconvenience.
|
|
|
|
|
Forum Newbie
      
Group: General Forum Members
Last Login: Tuesday, May 14, 2013 4:15 PM
Points: 5,
Visits: 392
|
|
This is exactly what I've been looking for; thank you. However I am running SQLServer 2005 32-bit (Std Edition) on Win2003 Server 64-bit. Should I be compatable with the DLL's and toolkit? Thanks again...
|
|
|
|