Click here to monitor SSC
SQLServerCentral is supported by Red Gate Software Ltd.
 
Log in  ::  Register  ::  Not logged in
 
 
 
        
Home       Members    Calendar    Who's On


Add to briefcase «««1234»»

Sound Matching and a Phonetic Toolkit Expand / Collapse
Author
Message
Posted Thursday, July 13, 2006 10:39 PM


Ten Centuries

Ten CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen Centuries

Group: General Forum Members
Last Login: Wednesday, September 24, 2014 1:20 PM
Points: 1,276, Visits: 1,135

Oh my...  I haven't tested any of these on 64-bit Win2003 or SQL 2005.  They were only tested on SQL 2000, 32-bit.  XP's on SQL 2005 have been deprecated but are still allowed... for now.  A better bet might be to convert the routines to .NET assemblies for SQL 2K5.  The downloads contain the C++ source code for these routines.  If you were using SQL 2000, I'd recommend using the routines here:  http://www.sqlservercentral.com/columnists/mcoles/sql2000dbatoolkitpart3.asp, with more optimizations and additional routines added.

Post #294306
Posted Monday, August 14, 2006 2:59 PM
Forum Newbie

Forum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum Newbie

Group: General Forum Members
Last Login: Wednesday, July 18, 2007 12:01 PM
Points: 2, Visits: 1

Um... I checked the example given under the NYSIIS and they don't all return JANSAN

Johnson J525  JANSAN   
Johnsen J525  JANSAN   
Johanson J525  JAHANSAN 
Johannsen J525  JAHANSAN 

Post #301718
Posted Friday, September 15, 2006 2:54 PM
Forum Newbie

Forum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum Newbie

Group: General Forum Members
Last Login: Thursday, November 16, 2006 8:55 AM
Points: 1, Visits: 1
Well, this is only a tool to help you out filtering better results than the crappy soundex function, if you make than many mistakes writing down your name, then its normal it doesnt return anything.

If I look for "De la pointe" using soundex, writing: "De la ponte",
it will return a LOT more results than it should...
basically everything starting with DE[SPACE]... will be returned!
Wow.
Will test NYSIIS and let you know how better it is or not.
Post #309096
Posted Friday, September 15, 2006 5:41 PM


Ten Centuries

Ten CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen Centuries

Group: General Forum Members
Last Login: Wednesday, September 24, 2014 1:20 PM
Points: 1,276, Visits: 1,135
The code was changed to reflect the NYSIIS "standard" a little better after the article was published (see this discussion board for details).  The "H" in Johanson and Johannsen is not considered silent because it's surrounded by vowels; therefore it's not dropped.  In the original version it was dropped.
Post #309116
Posted Friday, September 15, 2006 5:51 PM


Ten Centuries

Ten CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen Centuries

Group: General Forum Members
Last Login: Wednesday, September 24, 2014 1:20 PM
Points: 1,276, Visits: 1,135

Yes, unfortunately the SQL Server SOUNDEX() function stops encoding a name when it encounters a non-alphabetic character.  So "DE LA POINTE", "DE LA SOUL", etc. will all encode to "D000".  One work-around with Soundex is to eliminate all spaces from a name before encoding with the REPLACE() function:

SELECT SOUNDEX(REPLACE('DE LA POINTE',' ', ''))

Another option (that NARA recommends) is to eliminate common prefixes like 'DE', 'LA', 'VAN', 'CON', etc. from the name before Soundex-encoding it.  See http://www.archives.gov/publications/general-info-leaflets/55.html for more info.  Also note that the MS SOUNDEX() function does not follow the official NARA Soundex standard, and may return results that don't match up with external data sources that encode correctly.

This NYSIIS implementation ignores invalid characters (spaces, etc.) in the name and encodes until it runs out of alphabetic characters.  Also see http://www.sqlservercentral.com/columnists/mcoles/sql2000dbatoolkitpart3.asp for the update to this set.  The update includes NYSIIS, Double-Metaphone, Celko Soundex, Daitch-Mokotoff Soundex, Levenshtein Edit Distance and Jaro-Winkler Distance functions.

Post #309121
Posted Monday, April 13, 2009 6:25 PM
Forum Newbie

Forum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum Newbie

Group: General Forum Members
Last Login: 2 days ago @ 10:58 PM
Points: 2, Visits: 25
Hi,
Having issues registering the nyssis.dll file on Windows server 2008.
Is there Any updated version of this file is available?
or please tell me how i can use it on windows server 2008 operating system.
Thanks in advance
ravi

Post #696253
Posted Tuesday, October 27, 2009 3:46 PM
Forum Newbie

Forum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum Newbie

Group: General Forum Members
Last Login: Tuesday, September 16, 2014 9:05 AM
Points: 4, Visits: 181
I have had this implemented for about 2 years on a 32-bit sql 2000 server and it's working like a champ. However, we are now implenting on a sql 2005 64-bit server, and it is blowing up on me. My event viewer shows the following error:
"Description:
Could not load the DLL xp_nysiis.dll, or one of the DLLs it references. Reason: 193(xp_nysiis.dll is not a valid Win32 application.)."

Please help.
Post #809628
Posted Tuesday, May 31, 2011 10:20 AM
Forum Newbie

Forum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum Newbie

Group: General Forum Members
Last Login: Monday, November 5, 2012 9:59 AM
Points: 6, Visits: 171
Hello All -

Does anyone have any solutions on how to get NYSIIS.DLL to run on a 64-Bit machine ??
Post #1117548
Posted Friday, December 20, 2013 3:02 PM


SSC-Addicted

SSC-AddictedSSC-AddictedSSC-AddictedSSC-AddictedSSC-AddictedSSC-AddictedSSC-AddictedSSC-Addicted

Group: General Forum Members
Last Login: Monday, June 16, 2014 11:58 AM
Points: 496, Visits: 816
I had the same problem of getting the dll to work with 2008. I'm in the process of converting a 2000 db to 2008 and I used that xp_NYSIIS a lot. Sooooo I've decided to just write a function that will do it based on the algorithm. Here is a first draft and isn't fully tested. It's also based on the original algorithm and not the modified one. Gonna work on that next. Unfortunately I've found inconsistent results with some of the online tools that I've been using to test this (http://www.dropby.com/NYSIIS.html). Anyway... it's a start for those that want to use it:

Declare
@Word varchar(50),
@FirstCharacter char(1),
@Length int,
@LetterPosition int,
@PreviousPosition int,
@NextPosition int,
@IteratedWord varchar(50)

SET @Word = 'Alexander'
SET @Word = replace(@Word,' ','')
SET @Word = rtrim(ltrim(@Word))
SET @Word = Upper(@Word)
SET @Length = len(@Word)


/*
1. Transcode first characters of name:
MAC » MCC
KN » NN
K » C
PH » FF
PF » FF
SCH » SSS
*/

IF left(@Word,3) = 'MAC' BEGIN SET @Word = 'MCC' + right(@Word,@Length-3) END
IF left(@Word,2) = 'KN' BEGIN SET @Word = 'NN' + right(@Word,@Length-2) END
IF left(@Word,1) = 'K' BEGIN SET @Word = 'C' + right(@Word,@Length-1) END
IF left(@Word,2) in ('PH','PF') BEGIN SET @Word = 'FF' + right(@Word,@Length-2) END
IF left(@Word,3) = 'SCH' BEGIN SET @Word = 'SSS' + right(@Word,@Length-3) END
PRINT '1: ' + @Word

/* 2. Transcode last characters of name:
EE, IE » Y
DT,RT,RD,NT,ND » D */
IF right(@Word,2) in ('EE','IE') BEGIN SET @Word = substring(@Word,1,@Length-2) + 'Y' END
IF right(@Word,2) in ('DT','RT','RD','NT','ND') BEGIN SET @Word = substring(@Word,1,@Length-2) + 'D' END
SET @Length = len(@Word)
PRINT '2: ' + @Word

/*3. First character of key = first character of name. */
SET @FirstCharacter = left(@Word,1)
PRINT '3: ' + @FirstCharacter

--Set @Word to remaining characters for further processing
SET @Word = right(@Word,@Length-1)
SET @Length = len(@Word)
PRINT 'Trimmed of First Character: ' + @Word


/*

4. Transcode remaining characters by following these rules, incrementing by one character each time:
EV » AF else A,E,I,O,U » A
Q » G
Z » S
M » N
KN » N else K » C
SCH » SSS
PH » FF
H » If previous or next is nonvowel, previous
W » If previous is vowel, previous
Add current to key if current != last key character

*/

SET @Word = replace(@Word,'EV','AF')
SET @Word = replace(@Word,'E','A')
SET @Word = replace(@Word,'I','A')
SET @Word = replace(@Word,'O','A')
SET @Word = replace(@Word,'U','A')
SET @Word = replace(@Word,'Q','G')
SET @Word = replace(@Word,'Z','S')
SET @Word = replace(@Word,'M','N')
SET @Word = replace(@Word,'KN','N')
SET @Word = replace(@Word,'K','C')
SET @Word = replace(@Word,'SCH','SSS')
SET @Word = replace(@Word,'PH','FF')


--Iterate through each letter for "H" conversion
SET @IteratedWord = @Word
PRINT 'Iterated word: ' + @IteratedWord
SET @LetterPosition = 1
PRINT '~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~'
While @LetterPosition <= @Length
BEGIN
SET @PreviousPosition = Case When @LetterPosition > 1 Then @LetterPosition-1 Else 1 END
SET @NextPosition = Case When @LetterPosition < @Length Then @LetterPosition+1 Else @Length END

IF substring(@IteratedWord,@LetterPosition,1) = 'H' and
(substring(@IteratedWord,@PreviousPosition,1) not in ('A','E','I','O','U') or
substring(@IteratedWord,@NextPosition,1) not in ('A','E','I','O','U'))
BEGIN
PRINT 'STUFF H!' + ' ' + @Word
SET @Word = stuff(@Word,@LetterPosition,1,'')
PRINT 'New Word: ' + isnull(@Word,'Null')
END

SET @LetterPosition = @LetterPosition + 1
SET @Length = len(@Word)
END


--Iterate through each letter for "W" conversion
SET @IteratedWord = @Word
PRINT 'Iterated word: ' + @IteratedWord
SET @LetterPosition = 1
PRINT '~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~'
While @LetterPosition <= @Length
BEGIN
SET @PreviousPosition = Case When @LetterPosition > 1 Then @LetterPosition-1 Else 1 END
SET @NextPosition = Case When @LetterPosition < @Length Then @LetterPosition+1 Else @Length END
SET @Length = len(@Word)

IF substring(@IteratedWord,@LetterPosition,1) = 'W' and
substring(@IteratedWord,@PreviousPosition,1) in ('A','E','I','O','U')
BEGIN
PRINT 'STUFF W!' + ' ' + @Word
SET @Word = stuff(@Word,@LetterPosition,1,'')
PRINT 'New Word: ' + isnull(@Word,'Null')
END

SET @LetterPosition = @LetterPosition + 1
SET @Length = len(@Word)
END

/* 5. If last character is S, remove it */
IF right(@Word,1) = 'S'
BEGIN
PRINT 'Remove last letter S'
SET @Word = left(@Word,@Length-1)
SET @Length = len(@Word)
END
PRINT '5: ' + @Word

/* 6. If last characters are AY, replace with Y */
IF right(@Word,2) = 'AY'
BEGIN
PRINT 'Remove last characters AY'
SET @Word = stuff(@Word,@Length-1,2,'Y')
SET @Length = len(@Word)
END
PRINT '6: ' + @Word

/* 7. If last character is A, remove it */
IF right(@Word,1) = 'A'
BEGIN
PRINT 'Remove last letter A'
SET @Word = left(@Word,@Length-1)
SET @Length = len(@Word)
END
PRINT '7: ' + @Word

/* 8. Collapse all strings of repeated characters */
PRINT 'Collapse repeated characters...'
SET @LetterPosition = 1
While @LetterPosition < @Length
BEGIN
IF substring(@Word,@LetterPosition,1) = substring(@Word,@LetterPosition+1,1)
BEGIN
PRINT 'Next Letter Matchs. Remove Letter.'
SET @Word = stuff(@Word,@LetterPosition+1,1,'')
END

IF substring(@Word,@LetterPosition,1) <> substring(@Word,@LetterPosition+1,1)
BEGIN
PRINT 'Next Letter does not match. Advance'
SET @LetterPosition = @LetterPosition + 1
END
SET @Length = len(@Word)
END
PRINT '8: ' + @Word

/*9. Add original first character of name as first character of key*/
SET @Word = @FirstCharacter + @Word
PRINT '9: ' + @Word

--Only supposed to use first six characters but display the rest
PRINT left(@Word,6) + Case When @Length > 6 Then '['+ substring(@Word,7,@Length-5)+']' Else '' END












Post #1525186
Posted Friday, December 20, 2013 6:41 PM


SSC-Dedicated

SSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-Dedicated

Group: General Forum Members
Last Login: Yesterday @ 11:28 PM
Points: 35,263, Visits: 31,750
AVB (12/20/2013)
I had the same problem of getting the dll to work with 2008. I'm in the process of converting a 2000 db to 2008 and I used that xp_NYSIIS a lot. Sooooo I've decided to just write a function that will do it based on the algorithm. Here is a first draft and isn't fully tested. It's also based on the original algorithm and not the modified one. Gonna work on that next.


I'm glad to see someone working on this outside the world of CLR even if I think the CLR will always win performance-wise.

Something to possibly consider. While loops tend to be a bit sluggish in T-SQL. The loops in your code might be an exception because they're "memory only" loops. However, please consider the following. Here's your snippet for getting rid of multiple adjacent identical characters.

/* 8.	Collapse all strings of repeated characters */
PRINT 'Collapse repeated characters...'
SET @LetterPosition = 1
While @LetterPosition < @Length
BEGIN
IF substring(@Word,@LetterPosition,1) = substring(@Word,@LetterPosition+1,1)
BEGIN
PRINT 'Next Letter Matchs. Remove Letter.'
SET @Word = stuff(@Word,@LetterPosition+1,1,'')
END

IF substring(@Word,@LetterPosition,1) <> substring(@Word,@LetterPosition+1,1)
BEGIN
PRINT 'Next Letter does not match. Advance'
SET @LetterPosition = @LetterPosition + 1
END
SET @Length = len(@Word)
END
PRINT '8: ' + @Word


Here's a "Tally Table" version that does the same thing.

--===== Remove all duplicated characters (JBM)
DECLARE @Word VARCHAR(1000);
SELECT @Word = 'AAAAABBBBCDEFFFFFFFFFFFFFFGHHHHHHHHIIIIIIIJJKLMNNNNNNNNNNO';

SELECT @Word =
(
SELECT SUBSTRING(@Word,t.N,1)
FROM dbo.Tally t
WHERE t.N <= DATALENGTH(@Word)
AND SUBSTRING(@Word,t.N,1) <> SUBSTRING(@Word,t.N-1,1)
ORDER BY t.N
FOR XML PATH(''),TYPE
).value('(./text())[1]','varchar(max)')
;
PRINT @Word;


Of course, you do lose the capability to the the individual actions taken against each and every character but I suspect you won't want that in the final product anyway.

Looking through the code, there are several places where you can convert the code to similar functionality or even combined prior to the de-duplication of characters.

You can find out more about the Tally Table and how it can be used as a replacement for certain kinds of loops at the following URL.
http://www.sqlservercentral.com/articles/T-SQL/62867/


--Jeff Moden
"RBAR is pronounced "ree-bar" and is a "Modenism" for "Row-By-Agonizing-Row".

First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column."

(play on words) "Just because you CAN do something in T-SQL, doesn't mean you SHOULDN'T." --22 Aug 2013

Helpful Links:
How to post code problems
How to post performance problems
Post #1525229
« Prev Topic | Next Topic »

Add to briefcase «««1234»»

Permissions Expand / Collapse