Sound Matching and a Phonetic Toolkit

Question

Post reply

Sound Matching and a Phonetic Toolkit

Viewing 15 posts - 16 through 30 (of 37 total)

You must be logged in to reply to this topic. Login to reply

Duray AKAR SSC Veteran Points: 261 More actions · Answer 1

Hi,

I downloaded the update, and replaced the dll's. I guess that's all I need to do...

I have another issue (more like a confession), possibly out of NYSIIS and LEVENSHTEIN context, but might help others...

I was using soundex to filter out the results, and udf_levenshtein to sort them, which gave me quite a bit of performance increase, and sounded like a great idea... I was thinking that NYSIIS is an improved soundex, so filtering with soundex would do just fine...

Well, it does UNLESS you have punctuation characters such as ' (as in O'brian) ...

declare @myname varchar(255)

declare @mydex varchar(4)

declare @mydex2 varchar(10)

set @myname = 'o''brian'

set @mydex = soundex(@myname)

set @mydex2 = dbo.udf_nysiis(@myname)

select @mydex --( returns O000)

select @mydex2 -- (returns OBRAN)

however

declare @mydex varchar(4)

declare @mydex2 varchar(10)

set @myname = 'obrian'

set @mydex = soundex(@myname)

set @mydex2 = dbo.udf_nysiis(@myname)

select @mydex --( returns O165)

select @mydex2 --( returns OBRAN )

So, soundex does NOT eliminate the punctuation errors, but NYSIIS does.

Just fyi...

I guess I will be researching some more "optimization"

Thank you again...

Mike C SSC-Insane Points: 23224 More actions · Answer 2

That's it, just copy the new DLLs into the MSSQL\BINN directory (may require you to stop and restart MSSQLServer service).

That's a very good point. The handling of punctuation is implementation-specific. There's really no 'standards body' or organization that sets specific rules for implementation of the various phonetic algorithms. (This is really apparent with newer algorithms, like Double Metaphone!) It appears that the MS implementation of soundex stops encoding when it hits a punctuation mark. You'll get 'O000' if you try to encode SOUNDEX('O.BRIEN'), SOUNDEX('O''BRIEN') or SOUNDEX ('O BRIEN').

The NYSIIS implementation in the toolkit, on the other hand, ignores punctuation altogether. One way to deal with it on the SOUNDEX side is to eliminate all non-alphabetic characters from your string before attempting to encode it with SOUNDEX. This could be done in a UDF, but keep in mind that string manipulations and looping in T-SQL are not extremely efficient (that's one of the main reasons I implemented NYSIIS and Levenshtein Edit Distance as XP's in C++).

I would recommend that you create a separate table with the proper SOUNDEX encoding and the NYSIIS encoding, and relate it to your main table. That way you would only have to perform these string manipulations once; also, you'll be able to take advantage of proper indexes on the Encodings table in your queries. All of the encoding overhead would be handled at load time, and your queries would run a lot more efficiently. The down-side is that you'll be using up more space to store these encodings, but it shouldn't be significantly more. For 1 million rows, you're looking at about ~10 MB to store the Soundex encodings and six-byte NYSIIS encodings. Proper non-clustered indexes will take up additional space, of course.

You could cut down the amount of space used in the Encodings table by using the actual name as a foreign key, to avoid duplicate entries for "JOHNSON, JAMES" and other common names (example).

Alternatively, since NYSIIS encodes the first letter as itself (as does SOUNDEX), you could narrow down your searches by just comparing the first letter.

SELECT * FROM People

WHERE Name LIKE 'J%'

AND dbo.udf_NYSIIS('JOHNSON') = dbo.udf_NYSIIS(Name)

Thanks!

Leon Jollans Newbie Points: 7 More actions · Answer 3

Hi all,

I hope this doesn't fall on deaf ears. There's a memory leak in both of these extended procedures. In proc.cpp, you allocate new params structures but don't delete them.

in proc.cpp in the xp_levenshtein dll you need to add the lines

delete P1;

delete P2;

delete P3;

before the line

return sqlerr;

and in proc.cpp in the xp_nysiis project likewise, you need to add the lines

delete P1;

delete P2;

before the same line;

return sqlerr;

Hope that helps someone, as the provided dlls helped bring down one of our production sql servers!

All the best,

Leon

Mike C SSC-Insane Points: 23224 More actions · Answer 4

Hi Leon,

Thanks for pointing that out. I actually included that in the clean-up of the most recent version, but sent the SSC guys an older version. I've asked them to repost the updated version that handles the memory leak, a potential memory reclamation issue, and includes a DROP_XP.SQL script as well as the REDIST directory with support DLL's required by some installations (some installations might have difficulty locating the ODS and VC++ Runtime libraries if they're not in the \Binn directory).

Thanks again for the feedback, and sorry for the inconvenience.

a0153494 Newbie Points: 7 More actions · Answer 5

This is exactly what I've been looking for; thank you. However I am running SQLServer 2005 32-bit (Std Edition) on Win2003 Server 64-bit. Should I be compatable with the DLL's and toolkit?

Thanks again...

Mike C SSC-Insane Points: 23224 More actions · Answer 6

Oh my... I haven't tested any of these on 64-bit Win2003 or SQL 2005. They were only tested on SQL 2000, 32-bit. XP's on SQL 2005 have been deprecated but are still allowed... for now. A better bet might be to convert the routines to .NET assemblies for SQL 2K5. The downloads contain the C++ source code for these routines. If you were using SQL 2000, I'd recommend using the routines here: http://www.sqlservercentral.com/columnists/mcoles/sql2000dbatoolkitpart3.asp, with more optimizations and additional routines added.

Dave Frankson Valued Member Points: 50 More actions · Answer 7

Um... I checked the example given under the NYSIIS and they don't all return JANSAN

Johnson J525 JANSAN

Johnsen J525 JANSAN

Johanson J525 JAHANSAN

Johannsen J525 JAHANSAN

Simon Mallet Grasshopper Points: 23 More actions · Answer 8

Well, this is only a tool to help you out filtering better results than the crappy soundex function, if you make than many mistakes writing down your name, then its normal it doesnt return anything.

If I look for "De la pointe" using soundex, writing: "De la ponte",

it will return a LOT more results than it should...

basically everything starting with DE[SPACE]... will be returned!

Wow.

Will test NYSIIS and let you know how better it is or not.

Mike C SSC-Insane Points: 23224 More actions · Answer 9

The code was changed to reflect the NYSIIS "standard" a little better after the article was published (see this discussion board for details). The "H" in Johanson and Johannsen is not considered silent because it's surrounded by vowels; therefore it's not dropped. In the original version it was dropped.

Mike C SSC-Insane Points: 23224 More actions · Answer 10

Yes, unfortunately the SQL Server SOUNDEX() function stops encoding a name when it encounters a non-alphabetic character. So "DE LA POINTE", "DE LA SOUL", etc. will all encode to "D000". One work-around with Soundex is to eliminate all spaces from a name before encoding with the REPLACE() function:

SELECT

SOUNDEX(REPLACE('DE LA POINTE',' ', ''))

Another option (that NARA recommends) is to eliminate common prefixes like 'DE', 'LA', 'VAN', 'CON', etc. from the name before Soundex-encoding it. See http://www.archives.gov/publications/general-info-leaflets/55.html for more info. Also note that the MS SOUNDEX() function does not follow the official NARA Soundex standard, and may return results that don't match up with external data sources that encode correctly.

This NYSIIS implementation ignores invalid characters (spaces, etc.) in the name and encodes until it runs out of alphabetic characters. Also see http://www.sqlservercentral.com/columnists/mcoles/sql2000dbatoolkitpart3.asp for the update to this set. The update includes NYSIIS, Double-Metaphone, Celko Soundex, Daitch-Mokotoff Soundex, Levenshtein Edit Distance and Jaro-Winkler Distance functions.

Ravi-364966 Newbie Points: 4 More actions · Answer 11

Hi,

Having issues registering the nyssis.dll file on Windows server 2008.

Is there Any updated version of this file is available?

or please tell me how i can use it on windows server 2008 operating system.

Thanks in advance

ravi

citrowske SSC Enthusiast Points: 162 More actions · Answer 12

I have had this implemented for about 2 years on a 32-bit sql 2000 server and it's working like a champ. However, we are now implenting on a sql 2005 64-bit server, and it is blowing up on me. My event viewer shows the following error:

"Description:

Could not load the DLL xp_nysiis.dll, or one of the DLLs it references. Reason: 193(xp_nysiis.dll is not a valid Win32 application.)."

Please help.

davemiller3 SSC Enthusiast Points: 198 More actions · Answer 13

Hello All -

Does anyone have any solutions on how to get NYSIIS.DLL to run on a 64-Bit machine ??

AVB SSChampion Points: 11418 More actions · Answer 14

I had the same problem of getting the dll to work with 2008. I'm in the process of converting a 2000 db to 2008 and I used that xp_NYSIIS a lot. Sooooo I've decided to just write a function that will do it based on the algorithm. Here is a first draft and isn't fully tested. It's also based on the original algorithm and not the modified one. Gonna work on that next. Unfortunately I've found inconsistent results with some of the online tools that I've been using to test this (http://www.dropby.com/NYSIIS.html). Anyway... it's a start for those that want to use it:

Declare

@Word varchar(50),

@FirstCharacter char(1),

@Length int,

@LetterPosition int,

@PreviousPosition int,

@NextPosition int,

@IteratedWord varchar(50)

SET @Word = 'Alexander'

SET @Word = replace(@Word,' ','')

SET @Word = rtrim(ltrim(@Word))

SET @Word = Upper(@Word)

SET @Length = len(@Word)

/*

1.Transcode first characters of name:

MAC»MCC

KN»NN

K»C

PH»FF

PF»FF

SCH»SSS

*/

IF left(@Word,3) = 'MAC' BEGIN SET @Word = 'MCC' + right(@Word,@Length-3) END

IF left(@Word,2) = 'KN' BEGIN SET @Word = 'NN' + right(@Word,@Length-2) END

IF left(@Word,1) = 'K' BEGIN SET @Word = 'C' + right(@Word,@Length-1) END

IF left(@Word,2) in ('PH','PF') BEGIN SET @Word = 'FF' + right(@Word,@Length-2) END

IF left(@Word,3) = 'SCH' BEGIN SET @Word = 'SSS' + right(@Word,@Length-3) END

PRINT '1: ' + @Word

/* 2.Transcode last characters of name:

EE, IE»Y

DT,RT,RD,NT,ND»D */

IF right(@Word,2) in ('EE','IE') BEGIN SET @Word = substring(@Word,1,@Length-2) + 'Y' END

IF right(@Word,2) in ('DT','RT','RD','NT','ND') BEGIN SET @Word = substring(@Word,1,@Length-2) + 'D' END

SET @Length = len(@Word)

PRINT '2: ' + @Word

/*3.First character of key = first character of name. */

SET @FirstCharacter = left(@Word,1)

PRINT '3: ' + @FirstCharacter

--Set @Word to remaining characters for further processing

SET @Word = right(@Word,@Length-1)

SET @Length = len(@Word)

PRINT 'Trimmed of First Character: ' + @Word

/*

4.Transcode remaining characters by following these rules, incrementing by one character each time:

EV»AFelse A,E,I,O,U » A

Q»G

Z»S

M»N

KN»Nelse K » C

SCH»SSS

PH»FF

H»If previous or next is nonvowel, previous

W»If previous is vowel, previous

Add current to key if current != last key character

*/

SET @Word = replace(@Word,'EV','AF')

SET @Word = replace(@Word,'E','A')

SET @Word = replace(@Word,'I','A')

SET @Word = replace(@Word,'O','A')

SET @Word = replace(@Word,'U','A')

SET @Word = replace(@Word,'Q','G')

SET @Word = replace(@Word,'Z','S')

SET @Word = replace(@Word,'M','N')

SET @Word = replace(@Word,'KN','N')

SET @Word = replace(@Word,'K','C')

SET @Word = replace(@Word,'SCH','SSS')

SET @Word = replace(@Word,'PH','FF')

--Iterate through each letter for "H" conversion

SET @IteratedWord = @Word

PRINT 'Iterated word: ' + @IteratedWord

SET @LetterPosition = 1

PRINT '~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~'

While @LetterPosition <= @Length

BEGIN

SET @PreviousPosition = Case When @LetterPosition > 1 Then @LetterPosition-1 Else 1 END

SET @NextPosition = Case When @LetterPosition < @Length Then @LetterPosition+1 Else @Length END

IF substring(@IteratedWord,@LetterPosition,1) = 'H' and

(substring(@IteratedWord,@PreviousPosition,1) not in ('A','E','I','O','U') or

substring(@IteratedWord,@NextPosition,1) not in ('A','E','I','O','U'))

BEGIN

PRINT 'STUFF H!' + ' ' + @Word

SET @Word = stuff(@Word,@LetterPosition,1,'')

PRINT 'New Word: ' + isnull(@Word,'Null')

END

SET @LetterPosition = @LetterPosition + 1

SET @Length = len(@Word)

END

--Iterate through each letter for "W" conversion

SET @IteratedWord = @Word

PRINT 'Iterated word: ' + @IteratedWord

SET @LetterPosition = 1

PRINT '~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~'

While @LetterPosition <= @Length

BEGIN

SET @PreviousPosition = Case When @LetterPosition > 1 Then @LetterPosition-1 Else 1 END

SET @NextPosition = Case When @LetterPosition < @Length Then @LetterPosition+1 Else @Length END

SET @Length = len(@Word)

IF substring(@IteratedWord,@LetterPosition,1) = 'W' and

substring(@IteratedWord,@PreviousPosition,1) in ('A','E','I','O','U')

BEGIN

PRINT 'STUFF W!' + ' ' + @Word

SET @Word = stuff(@Word,@LetterPosition,1,'')

PRINT 'New Word: ' + isnull(@Word,'Null')

END

SET @LetterPosition = @LetterPosition + 1

SET @Length = len(@Word)

END

/* 5. If last character is S, remove it */

IF right(@Word,1) = 'S'

BEGIN

PRINT 'Remove last letter S'

SET @Word = left(@Word,@Length-1)

SET @Length = len(@Word)

END

PRINT '5: ' + @Word

/* 6.If last characters are AY, replace with Y */

IF right(@Word,2) = 'AY'

BEGIN

PRINT 'Remove last characters AY'

SET @Word = stuff(@Word,@Length-1,2,'Y')

SET @Length = len(@Word)

END

PRINT '6: ' + @Word

/* 7.If last character is A, remove it */

IF right(@Word,1) = 'A'

BEGIN

PRINT 'Remove last letter A'

SET @Word = left(@Word,@Length-1)

SET @Length = len(@Word)

END

PRINT '7: ' + @Word

/* 8.Collapse all strings of repeated characters */

PRINT 'Collapse repeated characters...'

SET @LetterPosition = 1

While @LetterPosition < @Length

BEGIN

IF substring(@Word,@LetterPosition,1) = substring(@Word,@LetterPosition+1,1)

BEGIN

PRINT 'Next Letter Matchs. Remove Letter.'

SET @Word = stuff(@Word,@LetterPosition+1,1,'')

END

IF substring(@Word,@LetterPosition,1) <> substring(@Word,@LetterPosition+1,1)

BEGIN

PRINT 'Next Letter does not match. Advance'

SET @LetterPosition = @LetterPosition + 1

END

SET @Length = len(@Word)

END

PRINT '8: ' + @Word

/*9.Add original first character of name as first character of key*/

SET @Word = @FirstCharacter + @Word

PRINT '9: ' + @Word

--Only supposed to use first six characters but display the rest

PRINT left(@Word,6) + Case When @Length > 6 Then '['+ substring(@Word,7,@Length-5)+']' Else '' END

Jeff Moden SSC Guru Points: 1003869 More actions · Answer 15

AVB (12/20/2013)
I had the same problem of getting the dll to work with 2008. I'm in the process of converting a 2000 db to 2008 and I used that xp_NYSIIS a lot. Sooooo I've decided to just write a function that will do it based on the algorithm. Here is a first draft and isn't fully tested. It's also based on the original algorithm and not the modified one. Gonna work on that next.

I'm glad to see someone working on this outside the world of CLR even if I think the CLR will always win performance-wise.

Something to possibly consider. While loops tend to be a bit sluggish in T-SQL. The loops in your code might be an exception because they're "memory only" loops. However, please consider the following. Here's your snippet for getting rid of multiple adjacent identical characters.

/* 8.Collapse all strings of repeated characters */

PRINT 'Collapse repeated characters...'

SET @LetterPosition = 1

While @LetterPosition < @Length

BEGIN

IF substring(@Word,@LetterPosition,1) = substring(@Word,@LetterPosition+1,1)

BEGIN

PRINT 'Next Letter Matchs. Remove Letter.'

SET @Word = stuff(@Word,@LetterPosition+1,1,'')

END

IF substring(@Word,@LetterPosition,1) <> substring(@Word,@LetterPosition+1,1)

BEGIN

PRINT 'Next Letter does not match. Advance'

SET @LetterPosition = @LetterPosition + 1

END

SET @Length = len(@Word)

END

PRINT '8: ' + @Word

Here's a "Tally Table" version that does the same thing.

--===== Remove all duplicated characters (JBM)

DECLARE @Word VARCHAR(1000);

SELECT @Word = 'AAAAABBBBCDEFFFFFFFFFFFFFFGHHHHHHHHIIIIIIIJJKLMNNNNNNNNNNO';

SELECT @Word =

(

SELECT SUBSTRING(@Word,t.N,1)

FROM dbo.Tally t

WHERE t.N <= DATALENGTH(@Word)

AND SUBSTRING(@Word,t.N,1) <> SUBSTRING(@Word,t.N-1,1)

ORDER BY t.N

FOR XML PATH(''),TYPE

).value('(./text())[1]','varchar(max)')

;

PRINT @Word;

Of course, you do lose the capability to the the individual actions taken against each and every character but I suspect you won't want that in the final product anyway.

Looking through the code, there are several places where you can convert the code to similar functionality or even combined prior to the de-duplication of characters.

You can find out more about the Tally Table and how it can be used as a replacement for certain kinds of loops at the following URL.

http://www.sqlservercentral.com/articles/T-SQL/62867/

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)