Sound Matching and a Phonetic Toolkit

Question

Sound Matching and a Phonetic Toolkit

Mike C

SSC-Insane

Points: 23224
More actions
July 11, 2005 at 12:53 pm

#167724

Comments posted to this topic are about the content posted at http://www.sqlservercentral.com/columnists/mcoles/soundmatchingandaphonetictoolkit.asp

Viewing 15 posts - 1 through 15 (of 37 total)

You must be logged in to reply to this topic. Login to reply

Mike C SSC-Insane Points: 23224 More actions · Answer 1

Note that some Windows 2003 systems might require modification to the ADD_XP.SQL installation script. The following is a script change that might need to be made for some Windows 2003 systems.

Change:

EXEC sp_addextendedproc xp_nysiis, 'xp_nysiis.dll'

GO

EXEC sp_addextendedproc xp_levenshtein, 'xp_levenshtein.dll'

To:

EXEC sp_addextendedproc xp_nysiis, 'C:\Program Files\Microsoft SQL Server\MSSQL\BINN\xp_nysiis.dll'

GO

EXEC sp_addextendedproc xp_levenshtein, 'C:\Program Files\Microsoft SQL Server\MSSQL\BINN\xp_levenshtein.dll'

The directory used above is the default directory for C: drive installations. Replace the directory with your appropriate SQL Server directory in the lines above, as necessary.

Also, it may be necessary to download the two Microsoft redistributable DLL files and copy them to your MSSQL\BINN directory on some Windows 2003 systems. These files are OPENDS60.DLL and MSVCR71.DLL. They can be downloaded as part of the SQL Encryption Toolkit here.

David Poole-249495 Ten Centuries Points: 1106 More actions · Answer 2

If you are interested in phonetic matching then try http://aspell.net/metaphone/

If memory serves me correctly Lawrence Philips went on to work for Verity, the search engine people

rkd993 SSC Journeyman Points: 82 More actions · Answer 3

Good to see this being discussed. Sound and speech are only going to become more important as we move forward.

Need to tie the discussion in with English Query and the new variants of this too when possible.

Cheers

Russell

Russell K Darroch

Mike C SSC-Insane Points: 23224 More actions · Answer 4

There's a "compact" double-metaphone implementation over at The Code Project.

Mike C SSC-Insane Points: 23224 More actions · Answer 5

Thanks for the feedback! There is, of course, always more that can be added to most discussions I've used the English Query Automation API to develop interfaces that perform "English"(-like) queries, but haven't really used it to correct user typographical errors or to phonetically match surnames.

I'd love some more information on this, if you have some links. Thanks!

Duray AKAR SSC Veteran Points: 261 More actions · Answer 6

Thanks for the great article and toolkit. Works like a charm.

I just wanted to share my experience, maybe someone else will make use of it .

I have an average sized table, about 40,000 rows, and faced a performance issue while trying to use udf_NYSIIS. The query that takes NO time with soundex took like 12 seconds with udf_NYSIIS.

So I tried a workaround, and added the soundex to my where clause in front of the udf_NYSIIS clause with an AND...

This way, the lazy SQL processor checked soundex first, and if that one passes, then checked the udf_NYSIIS.

I was getting the same result again in NO time.

Try for yourself:

declare @myname varchar(255)

declare @mydex varchar(4)

declare @mydex2 varchar(10)

set @myname = 'smit' -- your criteria goes here

set @mydex = soundex(@myname)

set @mydex2 = master.dbo.udf_NYSIIS(@myname)

select * from customers where soundex(lastname)=@mydex

select * from customers where master.dbo.udf_NYSIIS(lastname)=@mydex2

select * from customers where soundex(lastname)=@mydex and master.dbo.udf_NYSIIS(lastname)=@mydex2

Thanks again.

Duray AKAR

Mike C SSC-Insane Points: 23224 More actions · Answer 7

Thanks for the feedback! I'm glad you found it useful. UDF's in the WHERE clause do tend to slow down your SELECT statements since they have to perform a row-by-row calculation/comparison to return your final result. This is especially true in this instance, since SOUNDEX is a very simple and straightforward algorithm that involves encoding a single character at a time; the NYSIIS algorithm is more complex, as it requires several multi-character position-dependent encodings as well as single-character encodings.

One method that I've used on a table I work with (about 19,000,000 rows and growing) is to generate the SOUNDEX and NYSIIS values for the last name at row insertion time. This gives the advantage of eliminating the redundant calculations/conversions at SELECT time, and allows me to take advantage of indexes on the lastname_SOUNDEX and lastname_NYSIIS columns. The down side is that it slows down row INSERTs (not significantly on my current database however), and it takes up more storage space for the extra columns and indexes.

Additionally, if you find that you are running across a lot of people with the same last name, it might make sense to move the SOUNDEX and NYSIIS encoded values to a separate table using the lastname as the Primary Key, to eliminate a lot of duplicate entries.

Thanks again.

Duray AKAR SSC Veteran Points: 261 More actions · Answer 8

Another quick note:

The user defined function udf_levenshtein is not symmetric. In other words,

udf_levenshtein(a,b) is not always equal to udf_levenshtein(b,a)

You might wonder "so what, why does it matter?"

But if I use it to order my results, and if I have two fields to weight, such as bl_lastname_str and bl_company_str, SQL Server will not allow me to use it as below:

order by dbo.udf_levenshtein(@myname, isnull(bl_lastname_str,''))

, bl_lastname_str

, dbo.udf_levenshtein(@myname, isnull(bl_company_str,''))

, bl_company_str

So I have to use :

order by dbo.udf_levenshtein(isnull(bl_lastname_str,''), @myname)

, bl_lastname_str

, dbo.udf_levenshtein(isnull(bl_company_str,''), @myname)

, bl_company_str

which in fact, produces different results.

I know that this is a general problem with user defined functions in SQL server. SQL server tends to optimize the function results in the query with only the value of the FIRST parameter, so you have to change the value of your FIRST parameter... Otherwise, both function calls will be considered to return the same value... Quick example, lets say you have a udfBinAnd that takes two int parameters a and b, and calculates a & b (Binary and) ... If you use it in an SQL query, it will consider

udfBinAnd(1,2) = udfBinAnd(1,3).

Pretty SCARY thing, so watch out...

That might even be considered an SQL Server "BUG", but I will take it as a "caveat of performance" And I won't go further with the storage optimization of consecutive bit fields and their negative effects on the group by queries, because it is totally out of context.

But again, I thought that might be some useful info for other people, or maybe even trigger some workarounds to make the function symmetric...

Thanks

Duray AKAR

Mike C SSC-Insane Points: 23224 More actions · Answer 9

Thanks for the feedback. Do you have some examples of that issue you can share? A couple of strings that are returning different results dependent on the order? One workaround I can think of might be to modify the UDF to always submit the shorter string first and the longer string second to the XP. If you can post a couple of examples I'll see what I can come up with.

You can also calculate a Levenshtein Edit Distance "score". Basically you take the calculated Edit Distance and divide by the longer of the two strings (be sure to CAST one of the operands as a floating point type). The result is a "score" that estimates how closely the two strings match. You might then set a threshold of returning only strings that match with a score of 80% (0.80) or greater.

As a matter of fact, I just thought of something else - I've had issues using UDF's in an ORDER BY clause before. Not sure if it's documented as an issue on the MS website, but I have encountered that very issue. I'll look into that as well.

Thanks again!

Duray AKAR SSC Veteran Points: 261 More actions · Answer 10

An example:

select dbo.udf_levenshtein('Yacht Sales', 'Yacht Charters International')--21

select dbo.udf_levenshtein('Yacht Charters International', 'Yacht Sales')--22

The first parameter issue for the udf's that are used in select clauses is a known issue. I don't have the link though.

Thanks

Shewmaker Valued Member Points: 73 More actions · Answer 11

Thank you! I had just been grousing about soundex and a better solution drops into my lap. I hope to use this in production ASAP.

A Question...

When I run:

select 'SCHUMAKER' as Name, dbo.udf_nysiis ('SCHUMAKER') as Result union all
select 'SHOEMAKER', dbo.udf_nysiis ('SHOEMAKER') union all
select 'SHEWMAKER', dbo.udf_nysiis ('SHEWMAKER') union all
select 'SCHUWMACKER', dbo.udf_nysiis ('SCHUWMACKER')

I get:

Name               Result
-----------        ---------------
SCHUMAKER          SANAKAR
SHOEMAKER          SANAKAR
SHEWMAKER        SAENAKAR
SCHUWMACKER       SAUNACKAR

The last two results bother me. Aren't all vowels supposed to be replaced with "A" or have I misread the algorithm?

I'm using the version of the code included in the install directory.

Thanks again

Mike C SSC-Insane Points: 23224 More actions · Answer 12

Hi and thanks for the comments!

You're right, all vowels should be replaced with 'A', and adjacent vowels should be reduced to a single 'A' during the process. I'll have a look at the code and see where the issue lies. I believe it's probably a simple matter of the way in which embedded 'W' characters are being handled. It looks like the rule for encoding the 'W' is being applied correctly (when a 'W' is encountered, if previous letter is a vowel, then the 'W' is replaced with the previous letter); however, it looks like the previous letter is not being properly encoded as 'A' when it is a vowel.

I'll get a fix for that together and ask the good guys here to re-post. I should have the fix together by Tuesday at the latest. I'll re-post direct contact info. here at that time if you'd like to have the update sent directly to you.

Thanks again!

Robert A. Booey SSC-Addicted Points: 426 More actions · Answer 13

Mike,

Thanks for this tool, it's quite useful. It appears that the 'PH' replacement portion needs a fix too.

-- this part works (it makes a 'P' into an 'F')

} else if (currentchar == 'P' && nextchar == 'H') {

enc_char = 'F';

-- this part takes the 'H' and makes it a 'P' because the original value from the string (P) is being evaluated instead of the encoded value (F). So you end up with 'FP' instead of 'FF' (the 'FF' would be made 'F' by the dupe checker).

} else if (currentchar == 'H') {

if (!IsVowel(prevchar) || !IsVowel(nextchar)) {

{

if (IsVowel(prevchar))

enc_char = 'A';

else

enc_char = prevchar;

}

Example:

select fn_NYSIIS('joseph')

Returns: 'JASAFP'

Thanks again

Mike C SSC-Insane Points: 23224 More actions · Answer 14

Hi Robert, Duray and Shewmaker -

I resolved the issues and have sent the updated files to the good guys here at SQLServerCentral. Here's the fixes:

-NYSIIS multi-character encoding (i.e., "vowel + H", "vowel + W", "SCH", "EV", "PH", etc.) was modified to resolve the issues found. 'JOSEPH' now encodes as 'JASAF'. 'SCHUMAKER', 'SHOEMAKER', 'SHEWMAKER', and 'SCHUWMAKER' now all properly encode as 'SANAKAR'.

-Levenshtein Edit Distance algorithm was revised to provide "symmetric" results. I also tested the Levenshtein Edit Distance against a couple of versions in other languages to make sure the results are consistent. dbo.udf_levenshtein('Yacht Sales', 'Yacht Charters International') and dbo.udf_levenshtein('Yacht Charters International', 'Yacht Sales') both return an accurate edit distance calculation of 19.

Also, a small correction to the article. In NYSIIS, 'johnson', 'johnsen' encode as 'JANSAN'. 'johansen' and 'johannsen' encode as 'JAHANSAN'. The way NYSIIS works, if an 'H' is surrounded by vowels, it is kept as an 'H' otherwise it is eliminated from the encoded string.

The good guys here uploaded the new version, and it can now be downloaded directly from the article.

Thanks for the feedback guys!