SQL Clone
SQLServerCentral is supported by Redgate
 
Log in  ::  Register  ::  Not logged in
 
 
 


Sound Matching and a Phonetic Toolkit


Sound Matching and a Phonetic Toolkit

Author
Message
Mike C
Mike C
SSCertifiable
SSCertifiable (6.7K reputation)SSCertifiable (6.7K reputation)SSCertifiable (6.7K reputation)SSCertifiable (6.7K reputation)SSCertifiable (6.7K reputation)SSCertifiable (6.7K reputation)SSCertifiable (6.7K reputation)SSCertifiable (6.7K reputation)

Group: General Forum Members
Points: 6693 Visits: 1172

Oh my... I haven't tested any of these on 64-bit Win2003 or SQL 2005. They were only tested on SQL 2000, 32-bit. XP's on SQL 2005 have been deprecated but are still allowed... for now. A better bet might be to convert the routines to .NET assemblies for SQL 2K5. The downloads contain the C++ source code for these routines. If you were using SQL 2000, I'd recommend using the routines here: http://www.sqlservercentral.com/columnists/mcoles/sql2000dbatoolkitpart3.asp, with more optimizations and additional routines added.


Dave Frankson
Dave Frankson
Forum Newbie
Forum Newbie (8 reputation)Forum Newbie (8 reputation)Forum Newbie (8 reputation)Forum Newbie (8 reputation)Forum Newbie (8 reputation)Forum Newbie (8 reputation)Forum Newbie (8 reputation)Forum Newbie (8 reputation)

Group: General Forum Members
Points: 8 Visits: 3

Um... I checked the example given under the NYSIIS and they don't all return JANSAN

Johnson J525 JANSAN
Johnsen J525 JANSAN
Johanson J525 JAHANSAN
Johannsen J525 JAHANSAN


Simon Mallet
Simon Mallet
Forum Newbie
Forum Newbie (1 reputation)Forum Newbie (1 reputation)Forum Newbie (1 reputation)Forum Newbie (1 reputation)Forum Newbie (1 reputation)Forum Newbie (1 reputation)Forum Newbie (1 reputation)Forum Newbie (1 reputation)

Group: General Forum Members
Points: 1 Visits: 1
Well, this is only a tool to help you out filtering better results than the crappy soundex function, if you make than many mistakes writing down your name, then its normal it doesnt return anything.

If I look for "De la pointe" using soundex, writing: "De la ponte",
it will return a LOT more results than it should...
basically everything starting with DE[SPACE]... will be returned!
Wow.
Will test NYSIIS and let you know how better it is or not.
Mike C
Mike C
SSCertifiable
SSCertifiable (6.7K reputation)SSCertifiable (6.7K reputation)SSCertifiable (6.7K reputation)SSCertifiable (6.7K reputation)SSCertifiable (6.7K reputation)SSCertifiable (6.7K reputation)SSCertifiable (6.7K reputation)SSCertifiable (6.7K reputation)

Group: General Forum Members
Points: 6693 Visits: 1172
The code was changed to reflect the NYSIIS "standard" a little better after the article was published (see this discussion board for details). The "H" in Johanson and Johannsen is not considered silent because it's surrounded by vowels; therefore it's not dropped. In the original version it was dropped.
Mike C
Mike C
SSCertifiable
SSCertifiable (6.7K reputation)SSCertifiable (6.7K reputation)SSCertifiable (6.7K reputation)SSCertifiable (6.7K reputation)SSCertifiable (6.7K reputation)SSCertifiable (6.7K reputation)SSCertifiable (6.7K reputation)SSCertifiable (6.7K reputation)

Group: General Forum Members
Points: 6693 Visits: 1172

Yes, unfortunately the SQL Server SOUNDEX() function stops encoding a name when it encounters a non-alphabetic character. So "DE LA POINTE", "DE LA SOUL", etc. will all encode to "D000". One work-around with Soundex is to eliminate all spaces from a name before encoding with the REPLACE() function:

SELECT SOUNDEX(REPLACE('DE LA POINTE',' ', ''))

Another option (that NARA recommends) is to eliminate common prefixes like 'DE', 'LA', 'VAN', 'CON', etc. from the name before Soundex-encoding it. See http://www.archives.gov/publications/general-info-leaflets/55.html for more info. Also note that the MS SOUNDEX() function does not follow the official NARA Soundex standard, and may return results that don't match up with external data sources that encode correctly.

This NYSIIS implementation ignores invalid characters (spaces, etc.) in the name and encodes until it runs out of alphabetic characters. Also see http://www.sqlservercentral.com/columnists/mcoles/sql2000dbatoolkitpart3.asp for the update to this set. The update includes NYSIIS, Double-Metaphone, Celko Soundex, Daitch-Mokotoff Soundex, Levenshtein Edit Distance and Jaro-Winkler Distance functions.


Ravi-364966
Ravi-364966
Forum Newbie
Forum Newbie (2 reputation)Forum Newbie (2 reputation)Forum Newbie (2 reputation)Forum Newbie (2 reputation)Forum Newbie (2 reputation)Forum Newbie (2 reputation)Forum Newbie (2 reputation)Forum Newbie (2 reputation)

Group: General Forum Members
Points: 2 Visits: 31
Hi,
Having issues registering the nyssis.dll file on Windows server 2008.
Is there Any updated version of this file is available?
or please tell me how i can use it on windows server 2008 operating system.
Thanks in advance
ravi
citrowske
citrowske
Valued Member
Valued Member (50 reputation)Valued Member (50 reputation)Valued Member (50 reputation)Valued Member (50 reputation)Valued Member (50 reputation)Valued Member (50 reputation)Valued Member (50 reputation)Valued Member (50 reputation)

Group: General Forum Members
Points: 50 Visits: 254
I have had this implemented for about 2 years on a 32-bit sql 2000 server and it's working like a champ. However, we are now implenting on a sql 2005 64-bit server, and it is blowing up on me. My event viewer shows the following error:
"Description:
Could not load the DLL xp_nysiis.dll, or one of the DLLs it references. Reason: 193(xp_nysiis.dll is not a valid Win32 application.)."

Please help.
davemiller3
davemiller3
Forum Newbie
Forum Newbie (6 reputation)Forum Newbie (6 reputation)Forum Newbie (6 reputation)Forum Newbie (6 reputation)Forum Newbie (6 reputation)Forum Newbie (6 reputation)Forum Newbie (6 reputation)Forum Newbie (6 reputation)

Group: General Forum Members
Points: 6 Visits: 171
Hello All -

Does anyone have any solutions on how to get NYSIIS.DLL to run on a 64-Bit machine ??
AVB
AVB
SSCrazy
SSCrazy (2.5K reputation)SSCrazy (2.5K reputation)SSCrazy (2.5K reputation)SSCrazy (2.5K reputation)SSCrazy (2.5K reputation)SSCrazy (2.5K reputation)SSCrazy (2.5K reputation)SSCrazy (2.5K reputation)

Group: General Forum Members
Points: 2512 Visits: 820
I had the same problem of getting the dll to work with 2008. I'm in the process of converting a 2000 db to 2008 and I used that xp_NYSIIS a lot. Sooooo I've decided to just write a function that will do it based on the algorithm. Here is a first draft and isn't fully tested. It's also based on the original algorithm and not the modified one. Gonna work on that next. Unfortunately I've found inconsistent results with some of the online tools that I've been using to test this (http://www.dropby.com/NYSIIS.html). Anyway... it's a start for those that want to use it:

Declare
@Word varchar(50),
@FirstCharacter char(1),
@Length int,
@LetterPosition int,
@PreviousPosition int,
@NextPosition int,
@IteratedWord varchar(50)

SET @Word = 'Alexander'
SET @Word = replace(@Word,' ','')
SET @Word = rtrim(ltrim(@Word))
SET @Word = Upper(@Word)
SET @Length = len(@Word)


/*
1. Transcode first characters of name:
MAC » MCC
KN » NN
K » C
PH » FF
PF » FF
SCH » SSS
*/

IF left(@Word,3) = 'MAC' BEGIN SET @Word = 'MCC' + right(@Word,@Length-3) END
IF left(@Word,2) = 'KN' BEGIN SET @Word = 'NN' + right(@Word,@Length-2) END
IF left(@Word,1) = 'K' BEGIN SET @Word = 'C' + right(@Word,@Length-1) END
IF left(@Word,2) in ('PH','PF') BEGIN SET @Word = 'FF' + right(@Word,@Length-2) END
IF left(@Word,3) = 'SCH' BEGIN SET @Word = 'SSS' + right(@Word,@Length-3) END
PRINT '1: ' + @Word

/* 2. Transcode last characters of name:
EE, IE » Y
DT,RT,RD,NT,ND » D */
IF right(@Word,2) in ('EE','IE') BEGIN SET @Word = substring(@Word,1,@Length-2) + 'Y' END
IF right(@Word,2) in ('DT','RT','RD','NT','ND') BEGIN SET @Word = substring(@Word,1,@Length-2) + 'D' END
SET @Length = len(@Word)
PRINT '2: ' + @Word

/*3. First character of key = first character of name. */
SET @FirstCharacter = left(@Word,1)
PRINT '3: ' + @FirstCharacter

--Set @Word to remaining characters for further processing
SET @Word = right(@Word,@Length-1)
SET @Length = len(@Word)
PRINT 'Trimmed of First Character: ' + @Word


/*

4. Transcode remaining characters by following these rules, incrementing by one character each time:
EV » AF else A,E,I,O,U » A
Q » G
Z » S
M » N
KN » N else K » C
SCH » SSS
PH » FF
H » If previous or next is nonvowel, previous
W » If previous is vowel, previous
Add current to key if current != last key character

*/

SET @Word = replace(@Word,'EV','AF')
SET @Word = replace(@Word,'E','A')
SET @Word = replace(@Word,'I','A')
SET @Word = replace(@Word,'O','A')
SET @Word = replace(@Word,'U','A')
SET @Word = replace(@Word,'Q','G')
SET @Word = replace(@Word,'Z','S')
SET @Word = replace(@Word,'M','N')
SET @Word = replace(@Word,'KN','N')
SET @Word = replace(@Word,'K','C')
SET @Word = replace(@Word,'SCH','SSS')
SET @Word = replace(@Word,'PH','FF')


--Iterate through each letter for "H" conversion
SET @IteratedWord = @Word
PRINT 'Iterated word: ' + @IteratedWord
SET @LetterPosition = 1
PRINT '~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~'
While @LetterPosition <= @Length
BEGIN
SET @PreviousPosition = Case When @LetterPosition > 1 Then @LetterPosition-1 Else 1 END
SET @NextPosition = Case When @LetterPosition < @Length Then @LetterPosition+1 Else @Length END

IF substring(@IteratedWord,@LetterPosition,1) = 'H' and
(substring(@IteratedWord,@PreviousPosition,1) not in ('A','E','I','O','U') or
substring(@IteratedWord,@NextPosition,1) not in ('A','E','I','O','U'))
BEGIN
PRINT 'STUFF H!' + ' ' + @Word
SET @Word = stuff(@Word,@LetterPosition,1,'')
PRINT 'New Word: ' + isnull(@Word,'Null')
END

SET @LetterPosition = @LetterPosition + 1
SET @Length = len(@Word)
END


--Iterate through each letter for "W" conversion
SET @IteratedWord = @Word
PRINT 'Iterated word: ' + @IteratedWord
SET @LetterPosition = 1
PRINT '~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~'
While @LetterPosition <= @Length
BEGIN
SET @PreviousPosition = Case When @LetterPosition > 1 Then @LetterPosition-1 Else 1 END
SET @NextPosition = Case When @LetterPosition < @Length Then @LetterPosition+1 Else @Length END
SET @Length = len(@Word)

IF substring(@IteratedWord,@LetterPosition,1) = 'W' and
substring(@IteratedWord,@PreviousPosition,1) in ('A','E','I','O','U')
BEGIN
PRINT 'STUFF W!' + ' ' + @Word
SET @Word = stuff(@Word,@LetterPosition,1,'')
PRINT 'New Word: ' + isnull(@Word,'Null')
END

SET @LetterPosition = @LetterPosition + 1
SET @Length = len(@Word)
END

/* 5. If last character is S, remove it */
IF right(@Word,1) = 'S'
BEGIN
PRINT 'Remove last letter S'
SET @Word = left(@Word,@Length-1)
SET @Length = len(@Word)
END
PRINT '5: ' + @Word

/* 6. If last characters are AY, replace with Y */
IF right(@Word,2) = 'AY'
BEGIN
PRINT 'Remove last characters AY'
SET @Word = stuff(@Word,@Length-1,2,'Y')
SET @Length = len(@Word)
END
PRINT '6: ' + @Word

/* 7. If last character is A, remove it */
IF right(@Word,1) = 'A'
BEGIN
PRINT 'Remove last letter A'
SET @Word = left(@Word,@Length-1)
SET @Length = len(@Word)
END
PRINT '7: ' + @Word

/* 8. Collapse all strings of repeated characters */
PRINT 'Collapse repeated characters...'
SET @LetterPosition = 1
While @LetterPosition < @Length
BEGIN
IF substring(@Word,@LetterPosition,1) = substring(@Word,@LetterPosition+1,1)
BEGIN
PRINT 'Next Letter Matchs. Remove Letter.'
SET @Word = stuff(@Word,@LetterPosition+1,1,'')
END

IF substring(@Word,@LetterPosition,1) <> substring(@Word,@LetterPosition+1,1)
BEGIN
PRINT 'Next Letter does not match. Advance'
SET @LetterPosition = @LetterPosition + 1
END
SET @Length = len(@Word)
END
PRINT '8: ' + @Word

/*9. Add original first character of name as first character of key*/
SET @Word = @FirstCharacter + @Word
PRINT '9: ' + @Word

--Only supposed to use first six characters but display the rest
PRINT left(@Word,6) + Case When @Length > 6 Then '['+ substring(@Word,7,@Length-5)+']' Else '' END
Jeff Moden
Jeff Moden
SSC Guru
SSC Guru (222K reputation)SSC Guru (222K reputation)SSC Guru (222K reputation)SSC Guru (222K reputation)SSC Guru (222K reputation)SSC Guru (222K reputation)SSC Guru (222K reputation)SSC Guru (222K reputation)

Group: General Forum Members
Points: 222940 Visits: 42003
AVB (12/20/2013)
I had the same problem of getting the dll to work with 2008. I'm in the process of converting a 2000 db to 2008 and I used that xp_NYSIIS a lot. Sooooo I've decided to just write a function that will do it based on the algorithm. Here is a first draft and isn't fully tested. It's also based on the original algorithm and not the modified one. Gonna work on that next.


I'm glad to see someone working on this outside the world of CLR even if I think the CLR will always win performance-wise.

Something to possibly consider. While loops tend to be a bit sluggish in T-SQL. The loops in your code might be an exception because they're "memory only" loops. However, please consider the following. Here's your snippet for getting rid of multiple adjacent identical characters.

/* 8.   Collapse all strings of repeated characters */
PRINT 'Collapse repeated characters...'
SET @LetterPosition = 1
While @LetterPosition < @Length
BEGIN
IF substring(@Word,@LetterPosition,1) = substring(@Word,@LetterPosition+1,1)
BEGIN
PRINT 'Next Letter Matchs. Remove Letter.'
SET @Word = stuff(@Word,@LetterPosition+1,1,'')
END

IF substring(@Word,@LetterPosition,1) <> substring(@Word,@LetterPosition+1,1)
BEGIN
PRINT 'Next Letter does not match. Advance'
SET @LetterPosition = @LetterPosition + 1
END
SET @Length = len(@Word)
END
PRINT '8: ' + @Word




Here's a "Tally Table" version that does the same thing.

--===== Remove all duplicated characters (JBM)
DECLARE @Word VARCHAR(1000);
SELECT @Word = 'AAAAABBBBCDEFFFFFFFFFFFFFFGHHHHHHHHIIIIIIIJJKLMNNNNNNNNNNO';

SELECT @Word =
(
SELECT SUBSTRING(@Word,t.N,1)
FROM dbo.Tally t
WHERE t.N <= DATALENGTH(@Word)
AND SUBSTRING(@Word,t.N,1) <> SUBSTRING(@Word,t.N-1,1)
ORDER BY t.N
FOR XML PATH(''),TYPE
).value('(./text())[1]','varchar(max)')
;
PRINT @Word;




Of course, you do lose the capability to the the individual actions taken against each and every character but I suspect you won't want that in the final product anyway.

Looking through the code, there are several places where you can convert the code to similar functionality or even combined prior to the de-duplication of characters.

You can find out more about the Tally Table and how it can be used as a replacement for certain kinds of loops at the following URL.
http://www.sqlservercentral.com/articles/T-SQL/62867/

--Jeff Moden

RBAR is pronounced ree-bar and is a Modenism for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column.
If you think its expensive to hire a professional to do the job, wait until you hire an amateur. -- Red Adair

Helpful Links:
How to post code problems
How to post performance problems
Forum FAQs
Go


Permissions

You can't post new topics.
You can't post topic replies.
You can't post new polls.
You can't post replies to polls.
You can't edit your own topics.
You can't delete your own topics.
You can't edit other topics.
You can't delete other topics.
You can't edit your own posts.
You can't edit other posts.
You can't delete your own posts.
You can't delete other posts.
You can't post events.
You can't edit your own events.
You can't edit other events.
You can't delete your own events.
You can't delete other events.
You can't send private messages.
You can't send emails.
You can read topics.
You can't vote in polls.
You can't upload attachments.
You can download attachments.
You can't post HTML code.
You can't edit HTML code.
You can't post IFCode.
You can't post JavaScript.
You can post emoticons.
You can't post or upload images.

Select a forum

































































































































































SQLServerCentral


Search