Click here to monitor SSC
SQLServerCentral is supported by Redgate
 
Log in  ::  Register  ::  Not logged in
 
 
 


Sound Matching and a Phonetic Toolkit


Sound Matching and a Phonetic Toolkit

Author
Message
Duray AKAR
Duray AKAR
Grasshopper
Grasshopper (11 reputation)Grasshopper (11 reputation)Grasshopper (11 reputation)Grasshopper (11 reputation)Grasshopper (11 reputation)Grasshopper (11 reputation)Grasshopper (11 reputation)Grasshopper (11 reputation)

Group: General Forum Members
Points: 11 Visits: 1

An example:

select dbo.udf_levenshtein('Yacht Sales', 'Yacht Charters International')--21
select dbo.udf_levenshtein('Yacht Charters International', 'Yacht Sales')--22

The first parameter issue for the udf's that are used in select clauses is a known issue. I don't have the link though.

Thanks


Shewmaker
Shewmaker
SSC Rookie
SSC Rookie (41 reputation)SSC Rookie (41 reputation)SSC Rookie (41 reputation)SSC Rookie (41 reputation)SSC Rookie (41 reputation)SSC Rookie (41 reputation)SSC Rookie (41 reputation)SSC Rookie (41 reputation)

Group: General Forum Members
Points: 41 Visits: 69

Thank you! I had just been grousing about soundex and a better solution drops into my lap. I hope to use this in production ASAP.

A Question...

When I run:

select 'SCHUMAKER' as Name, dbo.udf_nysiis ('SCHUMAKER') as Result union all
select 'SHOEMAKER', dbo.udf_nysiis ('SHOEMAKER') union all
select 'SHEWMAKER', dbo.udf_nysiis ('SHEWMAKER') union all
select 'SCHUWMACKER', dbo.udf_nysiis ('SCHUWMACKER')

I get:

Name Result
----------- ---------------
SCHUMAKER SANAKAR
SHOEMAKER SANAKAR
SHEWMAKER SAENAKAR
SCHUWMACKER SAUNACKAR

The last two results bother me. Aren't all vowels supposed to be replaced with "A" or have I misread the algorithm?

I'm using the version of the code included in the install directory.

Thanks again





Mike C
Mike C
Ten Centuries
Ten Centuries (1.4K reputation)Ten Centuries (1.4K reputation)Ten Centuries (1.4K reputation)Ten Centuries (1.4K reputation)Ten Centuries (1.4K reputation)Ten Centuries (1.4K reputation)Ten Centuries (1.4K reputation)Ten Centuries (1.4K reputation)

Group: General Forum Members
Points: 1351 Visits: 1168

Hi and thanks for the comments!

You're right, all vowels should be replaced with 'A', and adjacent vowels should be reduced to a single 'A' during the process. I'll have a look at the code and see where the issue lies. I believe it's probably a simple matter of the way in which embedded 'W' characters are being handled. It looks like the rule for encoding the 'W' is being applied correctly (when a 'W' is encountered, if previous letter is a vowel, then the 'W' is replaced with the previous letter); however, it looks like the previous letter is not being properly encoded as 'A' when it is a vowel.

I'll get a fix for that together and ask the good guys here to re-post. I should have the fix together by Tuesday at the latest. I'll re-post direct contact info. here at that time if you'd like to have the update sent directly to you.

Thanks again!


Robert A. Booey
Robert A. Booey
Valued Member
Valued Member (55 reputation)Valued Member (55 reputation)Valued Member (55 reputation)Valued Member (55 reputation)Valued Member (55 reputation)Valued Member (55 reputation)Valued Member (55 reputation)Valued Member (55 reputation)

Group: General Forum Members
Points: 55 Visits: 93

Mike,

Thanks for this tool, it's quite useful. It appears that the 'PH' replacement portion needs a fix too.

-- this part works (it makes a 'P' into an 'F')

} else if (currentchar == 'P' && nextchar == 'H') {
enc_char = 'F';

-- this part takes the 'H' and makes it a 'P' because the original value from the string (P) is being evaluated instead of the encoded value (F). So you end up with 'FP' instead of 'FF' (the 'FF' would be made 'F' by the dupe checker).

} else if (currentchar == 'H') {
if (!IsVowel(prevchar) || !IsVowel(nextchar)) {
{
if (IsVowel(prevchar))
enc_char = 'A';
else
enc_char = prevchar;
}

Example:

select fn_NYSIIS('joseph')

Returns: 'JASAFP'

Thanks again


Mike C
Mike C
Ten Centuries
Ten Centuries (1.4K reputation)Ten Centuries (1.4K reputation)Ten Centuries (1.4K reputation)Ten Centuries (1.4K reputation)Ten Centuries (1.4K reputation)Ten Centuries (1.4K reputation)Ten Centuries (1.4K reputation)Ten Centuries (1.4K reputation)

Group: General Forum Members
Points: 1351 Visits: 1168

Hi Robert, Duray and Shewmaker -

I resolved the issues and have sent the updated files to the good guys here at SQLServerCentral. Here's the fixes:

-NYSIIS multi-character encoding (i.e., "vowel + H", "vowel + W", "SCH", "EV", "PH", etc.) was modified to resolve the issues found. 'JOSEPH' now encodes as 'JASAF'. 'SCHUMAKER', 'SHOEMAKER', 'SHEWMAKER', and 'SCHUWMAKER' now all properly encode as 'SANAKAR'.

-Levenshtein Edit Distance algorithm was revised to provide "symmetric" results. I also tested the Levenshtein Edit Distance against a couple of versions in other languages to make sure the results are consistent. dbo.udf_levenshtein('Yacht Sales', 'Yacht Charters International') and dbo.udf_levenshtein('Yacht Charters International', 'Yacht Sales') both return an accurate edit distance calculation of 19.

Also, a small correction to the article. In NYSIIS, 'johnson', 'johnsen' encode as 'JANSAN'. 'johansen' and 'johannsen' encode as 'JAHANSAN'. The way NYSIIS works, if an 'H' is surrounded by vowels, it is kept as an 'H' otherwise it is eliminated from the encoded string.

The good guys here uploaded the new version, and it can now be downloaded directly from the article.

Thanks for the feedback guys!


Duray AKAR
Duray AKAR
Grasshopper
Grasshopper (11 reputation)Grasshopper (11 reputation)Grasshopper (11 reputation)Grasshopper (11 reputation)Grasshopper (11 reputation)Grasshopper (11 reputation)Grasshopper (11 reputation)Grasshopper (11 reputation)

Group: General Forum Members
Points: 11 Visits: 1

Hi,

I downloaded the update, and replaced the dll's. I guess that's all I need to do...

I have another issue (more like a confession), possibly out of NYSIIS and LEVENSHTEIN context, but might help others...

I was using soundex to filter out the results, and udf_levenshtein to sort them, which gave me quite a bit of performance increase, and sounded like a great idea... I was thinking that NYSIIS is an improved soundex, so filtering with soundex would do just fine...

Well, it does UNLESS you have punctuation characters such as ' (as in O'brian) ...

declare @myname varchar(255)
declare @mydex varchar(4)
declare @mydex2 varchar(10)
set @myname = 'o''brian'
set @mydex = soundex(@myname)
set @mydex2 = dbo.udf_nysiis(@myname)
select @mydex --( returns O000)
select @mydex2 -- (returns OBRAN)

however

declare @mydex varchar(4)
declare @mydex2 varchar(10)
set @myname = 'obrian'
set @mydex = soundex(@myname)
set @mydex2 = dbo.udf_nysiis(@myname)
select @mydex --( returns O165)
select @mydex2 --( returns OBRAN )

So, soundex does NOT eliminate the punctuation errors, but NYSIIS does.

Just fyi...

I guess I will be researching some more "optimization"

Thank you again...


Mike C
Mike C
Ten Centuries
Ten Centuries (1.4K reputation)Ten Centuries (1.4K reputation)Ten Centuries (1.4K reputation)Ten Centuries (1.4K reputation)Ten Centuries (1.4K reputation)Ten Centuries (1.4K reputation)Ten Centuries (1.4K reputation)Ten Centuries (1.4K reputation)

Group: General Forum Members
Points: 1351 Visits: 1168

That's it, just copy the new DLLs into the MSSQL\BINN directory (may require you to stop and restart MSSQLServer service).

That's a very good point. The handling of punctuation is implementation-specific. There's really no 'standards body' or organization that sets specific rules for implementation of the various phonetic algorithms. (This is really apparent with newer algorithms, like Double Metaphone!) It appears that the MS implementation of soundex stops encoding when it hits a punctuation mark. You'll get 'O000' if you try to encode SOUNDEX('O.BRIEN'), SOUNDEX('O''BRIEN') or SOUNDEX ('O BRIEN').

The NYSIIS implementation in the toolkit, on the other hand, ignores punctuation altogether. One way to deal with it on the SOUNDEX side is to eliminate all non-alphabetic characters from your string before attempting to encode it with SOUNDEX. This could be done in a UDF, but keep in mind that string manipulations and looping in T-SQL are not extremely efficient (that's one of the main reasons I implemented NYSIIS and Levenshtein Edit Distance as XP's in C++).

I would recommend that you create a separate table with the proper SOUNDEX encoding and the NYSIIS encoding, and relate it to your main table. That way you would only have to perform these string manipulations once; also, you'll be able to take advantage of proper indexes on the Encodings table in your queries. All of the encoding overhead would be handled at load time, and your queries would run a lot more efficiently. The down-side is that you'll be using up more space to store these encodings, but it shouldn't be significantly more. For 1 million rows, you're looking at about ~10 MB to store the Soundex encodings and six-byte NYSIIS encodings. Proper non-clustered indexes will take up additional space, of course.

You could cut down the amount of space used in the Encodings table by using the actual name as a foreign key, to avoid duplicate entries for "JOHNSON, JAMES" and other common names (example).

Alternatively, since NYSIIS encodes the first letter as itself (as does SOUNDEX), you could narrow down your searches by just comparing the first letter.

SELECT * FROM People
WHERE Name LIKE 'J%'
AND dbo.udf_NYSIIS('JOHNSON') = dbo.udf_NYSIIS(Name)

Thanks!


Leon Jollans
Leon Jollans
Forum Newbie
Forum Newbie (1 reputation)Forum Newbie (1 reputation)Forum Newbie (1 reputation)Forum Newbie (1 reputation)Forum Newbie (1 reputation)Forum Newbie (1 reputation)Forum Newbie (1 reputation)Forum Newbie (1 reputation)

Group: General Forum Members
Points: 1 Visits: 1

Hi all,

I hope this doesn't fall on deaf ears. There's a memory leak in both of these extended procedures. In proc.cpp, you allocate new params structures but don't delete them.

in proc.cpp in the xp_levenshtein dll you need to add the lines

delete P1;
delete P2;
delete P3;

before the line

return sqlerr;

and in proc.cpp in the xp_nysiis project likewise, you need to add the lines

delete P1;
delete P2;

before the same line;

return sqlerr;

Hope that helps someone, as the provided dlls helped bring down one of our production sql servers!

All the best,

Leon


Mike C
Mike C
Ten Centuries
Ten Centuries (1.4K reputation)Ten Centuries (1.4K reputation)Ten Centuries (1.4K reputation)Ten Centuries (1.4K reputation)Ten Centuries (1.4K reputation)Ten Centuries (1.4K reputation)Ten Centuries (1.4K reputation)Ten Centuries (1.4K reputation)

Group: General Forum Members
Points: 1351 Visits: 1168

Hi Leon,

Thanks for pointing that out. I actually included that in the clean-up of the most recent version, but sent the SSC guys an older version. I've asked them to repost the updated version that handles the memory leak, a potential memory reclamation issue, and includes a DROP_XP.SQL script as well as the REDIST directory with support DLL's required by some installations (some installations might have difficulty locating the ODS and VC++ Runtime libraries if they're not in the \Binn directory).

Thanks again for the feedback, and sorry for the inconvenience.


a0153494
a0153494
Forum Newbie
Forum Newbie (5 reputation)Forum Newbie (5 reputation)Forum Newbie (5 reputation)Forum Newbie (5 reputation)Forum Newbie (5 reputation)Forum Newbie (5 reputation)Forum Newbie (5 reputation)Forum Newbie (5 reputation)

Group: General Forum Members
Points: 5 Visits: 394

This is exactly what I've been looking for; thank you. However I am running SQLServer 2005 32-bit (Std Edition) on Win2003 Server 64-bit. Should I be compatable with the DLL's and toolkit?

Thanks again...





Go


Permissions

You can't post new topics.
You can't post topic replies.
You can't post new polls.
You can't post replies to polls.
You can't edit your own topics.
You can't delete your own topics.
You can't edit other topics.
You can't delete other topics.
You can't edit your own posts.
You can't edit other posts.
You can't delete your own posts.
You can't delete other posts.
You can't post events.
You can't edit your own events.
You can't edit other events.
You can't delete your own events.
You can't delete other events.
You can't send private messages.
You can't send emails.
You can read topics.
You can't vote in polls.
You can't upload attachments.
You can download attachments.
You can't post HTML code.
You can't edit HTML code.
You can't post IFCode.
You can't post JavaScript.
You can post emoticons.
You can't post or upload images.

Select a forum

































































































































































SQLServerCentral


Search