Click here to monitor SSC
SQLServerCentral is supported by Red Gate Software Ltd.
 
Log in  ::  Register  ::  Not logged in
 
 
 
        
Home       Members    Calendar    Who's On


Add to briefcase ««1234»»»

Sound Matching and a Phonetic Toolkit Expand / Collapse
Author
Message
Posted Wednesday, August 3, 2005 3:14 PM
Grasshopper

GrasshopperGrasshopperGrasshopperGrasshopperGrasshopperGrasshopperGrasshopperGrasshopper

Group: General Forum Members
Last Login: Wednesday, August 30, 2006 11:06 AM
Points: 11, Visits: 1

An example:

select dbo.udf_levenshtein('Yacht Sales', 'Yacht Charters International')--21
select dbo.udf_levenshtein('Yacht Charters International', 'Yacht Sales')--22

The first parameter issue for the udf's that are used in select clauses is a known issue. I don't have the link though.

Thanks

Post #207115
Posted Wednesday, August 3, 2005 5:11 PM
SSC Rookie

SSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC Rookie

Group: General Forum Members
Last Login: Thursday, April 24, 2014 9:21 AM
Points: 40, Visits: 30

Thank you! I had just been grousing about soundex and a better solution drops into my lap. I hope to use this in production ASAP.

A Question...

When I run:

select 'SCHUMAKER' as Name, dbo.udf_nysiis ('SCHUMAKER') as Result union all
select 'SHOEMAKER', dbo.udf_nysiis ('SHOEMAKER') union all
select 'SHEWMAKER', dbo.udf_nysiis ('SHEWMAKER') union all
select 'SCHUWMACKER', dbo.udf_nysiis ('SCHUWMACKER')

I get:

Name               Result
-----------        ---------------
SCHUMAKER          SANAKAR
SHOEMAKER          SANAKAR
SHEWMAKER          SAENAKAR
SCHUWMACKER        SAUNACKAR

The last two results bother me. Aren't all vowels supposed to be replaced with "A" or have I misread the algorithm?

I'm using the version of the code included in the install directory.

Thanks again




Post #207139
Posted Sunday, August 7, 2005 9:58 PM


Ten Centuries

Ten CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen Centuries

Group: General Forum Members
Last Login: Monday, April 14, 2014 4:18 PM
Points: 1,276, Visits: 1,132

Hi and thanks for the comments!

You're right, all vowels should be replaced with 'A', and adjacent vowels should be reduced to a single 'A' during the process.  I'll have a look at the code and see where the issue lies.  I believe it's probably a simple matter of the way in which embedded 'W' characters are being handled.  It looks like the rule for encoding the 'W' is being applied correctly (when a 'W' is encountered, if previous letter is a vowel, then the 'W' is replaced with the previous letter); however, it looks like the previous letter is not being properly encoded as 'A' when it is a vowel.

I'll get a fix for that together and ask the good guys here to re-post.  I should have the fix together by Tuesday at the latest.  I'll re-post direct contact info. here at that time if you'd like to have the update sent directly to you.

Thanks again!

Post #208220
Posted Monday, August 8, 2005 6:22 AM
Valued Member

Valued MemberValued MemberValued MemberValued MemberValued MemberValued MemberValued MemberValued Member

Group: General Forum Members
Last Login: Wednesday, November 20, 2013 7:09 AM
Points: 55, Visits: 92

Mike,

Thanks for this tool, it's quite useful. It appears that the 'PH' replacement portion needs a fix too.

-- this part works (it makes a 'P' into an 'F')

   } else if (currentchar == 'P' && nextchar == 'H') {
    enc_char = 'F';

-- this part takes the 'H' and makes it a 'P' because the original value from the string (P) is being evaluated instead of the encoded value (F). So you end up with 'FP' instead of 'FF' (the 'FF' would be made 'F' by the dupe checker).

   } else if (currentchar == 'H') {
    if (!IsVowel(prevchar) || !IsVowel(nextchar)) {
     {
      if (IsVowel(prevchar))
       enc_char = 'A';
      else
       enc_char = prevchar;
     }

Example:

select fn_NYSIIS('joseph')

Returns: 'JASAFP'

Thanks again

Post #208290
Posted Monday, August 8, 2005 9:04 PM


Ten Centuries

Ten CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen Centuries

Group: General Forum Members
Last Login: Monday, April 14, 2014 4:18 PM
Points: 1,276, Visits: 1,132

Hi Robert, Duray and Shewmaker -

I resolved the issues and have sent the updated files to the good guys here at SQLServerCentral.  Here's the fixes:

-NYSIIS multi-character encoding (i.e., "vowel + H", "vowel + W", "SCH", "EV", "PH", etc.) was modified to resolve the issues found.  'JOSEPH' now encodes as 'JASAF'. 'SCHUMAKER', 'SHOEMAKER', 'SHEWMAKER', and 'SCHUWMAKER' now all properly encode as 'SANAKAR'.

-Levenshtein Edit Distance algorithm was revised to provide "symmetric" results.  I also tested the Levenshtein Edit Distance against a couple of versions in other languages to make sure the results are consistent.  dbo.udf_levenshtein('Yacht Sales', 'Yacht Charters International') and dbo.udf_levenshtein('Yacht Charters International', 'Yacht Sales') both return an accurate edit distance calculation of 19.

Also, a small correction to the article.  In NYSIIS,  'johnson', 'johnsen' encode as 'JANSAN'.  'johansen' and 'johannsen' encode as 'JAHANSAN'.  The way NYSIIS works, if an 'H' is surrounded by vowels, it is kept as an 'H' otherwise it is eliminated from the encoded string.

The good guys here uploaded the new version, and it can now be downloaded directly from the article.

Thanks for the feedback guys!

Post #208542
Posted Monday, August 15, 2005 8:27 AM
Grasshopper

GrasshopperGrasshopperGrasshopperGrasshopperGrasshopperGrasshopperGrasshopperGrasshopper

Group: General Forum Members
Last Login: Wednesday, August 30, 2006 11:06 AM
Points: 11, Visits: 1

Hi,  

I downloaded the update, and replaced the dll's. I guess that's all I need to do...

I have another issue (more like a confession), possibly out of NYSIIS and LEVENSHTEIN context, but might help others...

I was using soundex to filter out the results, and udf_levenshtein to sort them, which gave me quite a bit of performance increase, and sounded like a great idea... I was thinking that NYSIIS is an improved soundex, so filtering with soundex would do just fine...

Well,  it does UNLESS you have punctuation characters such as ' (as in O'brian) ...

declare @myname varchar(255)
declare @mydex varchar(4)
declare @mydex2 varchar(10)
set @myname = 'o''brian'
set @mydex = soundex(@myname)
set @mydex2 = dbo.udf_nysiis(@myname)
select @mydex --( returns O000)
select @mydex2 -- (returns OBRAN)

however

declare @mydex varchar(4)
declare @mydex2 varchar(10)
set @myname = 'obrian'
set @mydex = soundex(@myname)
set @mydex2 = dbo.udf_nysiis(@myname)
select @mydex --( returns O165)
select @mydex2 --( returns OBRAN )

So, soundex does NOT eliminate the punctuation errors, but NYSIIS does.

Just fyi...

I guess I will be researching some more "optimization"

Thank you again...

Post #210461
Posted Monday, August 15, 2005 8:59 AM


Ten Centuries

Ten CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen Centuries

Group: General Forum Members
Last Login: Monday, April 14, 2014 4:18 PM
Points: 1,276, Visits: 1,132

That's it, just copy the new DLLs into the MSSQL\BINN directory (may require you to stop and restart MSSQLServer service).

That's a very good point.  The handling of punctuation is implementation-specific.  There's really no 'standards body' or organization that sets specific rules for implementation of the various phonetic algorithms.  (This is really apparent with newer algorithms, like Double Metaphone!)  It appears that the MS implementation of soundex stops encoding when it hits a punctuation mark.  You'll get 'O000' if you try to encode SOUNDEX('O.BRIEN'), SOUNDEX('O''BRIEN') or SOUNDEX ('O BRIEN').

The NYSIIS implementation in the toolkit, on the other hand, ignores punctuation altogether.  One way to deal with it on the SOUNDEX side is to eliminate all non-alphabetic characters from your string before attempting to encode it with SOUNDEX.  This could be done in a UDF, but keep in mind that string manipulations and looping in T-SQL are not extremely efficient (that's one of the main reasons I implemented NYSIIS and Levenshtein Edit Distance as XP's in C++).

I would recommend that you create a separate table with the proper SOUNDEX encoding and the NYSIIS encoding, and relate it to your main table.  That way you would only have to perform these string manipulations once; also, you'll be able to take advantage of proper indexes on the Encodings table in your queries.  All of the encoding overhead would be handled at load time, and your queries would run a lot more efficiently.  The down-side is that you'll be using up more space to store these encodings, but it shouldn't be significantly more.  For 1 million rows, you're looking at about ~10 MB to store the Soundex encodings and six-byte NYSIIS encodings.  Proper non-clustered indexes will take up additional space, of course.

You could cut down the amount of space used in the Encodings table by using the actual name as a foreign key, to avoid duplicate entries for "JOHNSON, JAMES" and other common names (example).

Alternatively, since NYSIIS encodes the first letter as itself (as does SOUNDEX), you could narrow down your searches by just comparing the first letter.

SELECT * FROM People
WHERE Name LIKE 'J%'
AND dbo.udf_NYSIIS('JOHNSON') = dbo.udf_NYSIIS(Name)

Thanks!

Post #210482
Posted Wednesday, March 8, 2006 9:45 AM
Forum Newbie

Forum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum Newbie

Group: General Forum Members
Last Login: Wednesday, March 8, 2006 10:43 AM
Points: 1, Visits: 1

Hi all,

I hope this doesn't fall on deaf ears. There's a memory leak in both of these extended procedures. In proc.cpp, you allocate new params structures but don't delete them.

in proc.cpp in the xp_levenshtein dll you need to add the lines

delete P1;
delete P2;
delete P3;

before the line

return sqlerr;

and in proc.cpp in the xp_nysiis project likewise, you need to add the lines

delete P1;
delete P2;

before the same line;

return sqlerr;

Hope that helps someone, as the provided dlls helped bring down one of our production sql servers!

All the best,

Leon

Post #264211
Posted Wednesday, March 8, 2006 5:53 PM


Ten Centuries

Ten CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen Centuries

Group: General Forum Members
Last Login: Monday, April 14, 2014 4:18 PM
Points: 1,276, Visits: 1,132

Hi Leon,

Thanks for pointing that out.  I actually included that in the clean-up of the most recent version, but sent the SSC guys an older version.  I've asked them to repost the updated version that handles the memory leak, a potential memory reclamation issue, and includes a DROP_XP.SQL script as well as the REDIST directory with support DLL's required by some installations (some installations might have difficulty locating the ODS and VC++ Runtime libraries if they're not in the \Binn directory).

Thanks again for the feedback, and sorry for the inconvenience.

Post #264356
Posted Thursday, July 13, 2006 10:05 PM
Forum Newbie

Forum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum Newbie

Group: General Forum Members
Last Login: Tuesday, May 14, 2013 4:15 PM
Points: 5, Visits: 392

This is exactly what I've been looking for; thank you.  However I am running SQLServer 2005 32-bit (Std Edition) on Win2003 Server 64-bit.  Should I be compatable with the DLL's and toolkit?

Thanks again...




Post #294303
« Prev Topic | Next Topic »

Add to briefcase ««1234»»»

Permissions Expand / Collapse