Sound Matching and a Phonetic Toolkit

  • Hi,  

    I downloaded the update, and replaced the dll's. I guess that's all I need to do...

    I have another issue (more like a confession), possibly out of NYSIIS and LEVENSHTEIN context, but might help others...

    I was using soundex to filter out the results, and udf_levenshtein to sort them, which gave me quite a bit of performance increase, and sounded like a great idea... I was thinking that NYSIIS is an improved soundex, so filtering with soundex would do just fine...

    Well,  it does UNLESS you have punctuation characters such as ' (as in O'brian) ...

    declare @myname varchar(255)

    declare @mydex varchar(4)

    declare @mydex2 varchar(10)

    set @myname = 'o''brian'

    set @mydex = soundex(@myname)

    set @mydex2 = dbo.udf_nysiis(@myname)

    select @mydex --( returns O000)

    select @mydex2 -- (returns OBRAN)

    however

    declare @mydex varchar(4)

    declare @mydex2 varchar(10)

    set @myname = 'obrian'

    set @mydex = soundex(@myname)

    set @mydex2 = dbo.udf_nysiis(@myname)

    select @mydex --( returns O165)

    select @mydex2 --( returns OBRAN )

    So, soundex does NOT eliminate the punctuation errors, but NYSIIS does.

    Just fyi...

    I guess I will be researching some more "optimization"

    Thank you again...

  • That's it, just copy the new DLLs into the MSSQL\BINN directory (may require you to stop and restart MSSQLServer service).

    That's a very good point.  The handling of punctuation is implementation-specific.  There's really no 'standards body' or organization that sets specific rules for implementation of the various phonetic algorithms.  (This is really apparent with newer algorithms, like Double Metaphone!)  It appears that the MS implementation of soundex stops encoding when it hits a punctuation mark.  You'll get 'O000' if you try to encode SOUNDEX('O.BRIEN'), SOUNDEX('O''BRIEN') or SOUNDEX ('O BRIEN').

    The NYSIIS implementation in the toolkit, on the other hand, ignores punctuation altogether.  One way to deal with it on the SOUNDEX side is to eliminate all non-alphabetic characters from your string before attempting to encode it with SOUNDEX.  This could be done in a UDF, but keep in mind that string manipulations and looping in T-SQL are not extremely efficient (that's one of the main reasons I implemented NYSIIS and Levenshtein Edit Distance as XP's in C++).

    I would recommend that you create a separate table with the proper SOUNDEX encoding and the NYSIIS encoding, and relate it to your main table.  That way you would only have to perform these string manipulations once; also, you'll be able to take advantage of proper indexes on the Encodings table in your queries.  All of the encoding overhead would be handled at load time, and your queries would run a lot more efficiently.  The down-side is that you'll be using up more space to store these encodings, but it shouldn't be significantly more.  For 1 million rows, you're looking at about ~10 MB to store the Soundex encodings and six-byte NYSIIS encodings.  Proper non-clustered indexes will take up additional space, of course.

    You could cut down the amount of space used in the Encodings table by using the actual name as a foreign key, to avoid duplicate entries for "JOHNSON, JAMES" and other common names (example).

    Alternatively, since NYSIIS encodes the first letter as itself (as does SOUNDEX), you could narrow down your searches by just comparing the first letter.

    SELECT * FROM People

    WHERE Name LIKE 'J%'

    AND dbo.udf_NYSIIS('JOHNSON') = dbo.udf_NYSIIS(Name)

    Thanks!

  • Hi all,

    I hope this doesn't fall on deaf ears. There's a memory leak in both of these extended procedures. In proc.cpp, you allocate new params structures but don't delete them.

    in proc.cpp in the xp_levenshtein dll you need to add the lines

    delete P1;

    delete P2;

    delete P3;

    before the line

    return sqlerr;

    and in proc.cpp in the xp_nysiis project likewise, you need to add the lines

    delete P1;

    delete P2;

    before the same line;

    return sqlerr;

    Hope that helps someone, as the provided dlls helped bring down one of our production sql servers!

    All the best,

    Leon

  • Hi Leon,

    Thanks for pointing that out.  I actually included that in the clean-up of the most recent version, but sent the SSC guys an older version.  I've asked them to repost the updated version that handles the memory leak, a potential memory reclamation issue, and includes a DROP_XP.SQL script as well as the REDIST directory with support DLL's required by some installations (some installations might have difficulty locating the ODS and VC++ Runtime libraries if they're not in the \Binn directory).

    Thanks again for the feedback, and sorry for the inconvenience.

  • This is exactly what I've been looking for; thank you.  However I am running SQLServer 2005 32-bit (Std Edition) on Win2003 Server 64-bit.  Should I be compatable with the DLL's and toolkit?

    Thanks again...

  • Oh my...  I haven't tested any of these on 64-bit Win2003 or SQL 2005.  They were only tested on SQL 2000, 32-bit.  XP's on SQL 2005 have been deprecated but are still allowed... for now.  A better bet might be to convert the routines to .NET assemblies for SQL 2K5.  The downloads contain the C++ source code for these routines.  If you were using SQL 2000, I'd recommend using the routines here:  http://www.sqlservercentral.com/columnists/mcoles/sql2000dbatoolkitpart3.asp, with more optimizations and additional routines added.

  • Um... I checked the example given under the NYSIIS and they don't all return JANSAN

    Johnson J525  JANSAN   

    Johnsen J525  JANSAN   

    Johanson J525  JAHANSAN 

    Johannsen J525  JAHANSAN 

  • Well, this is only a tool to help you out filtering better results than the crappy soundex function, if you make than many mistakes writing down your name, then its normal it doesnt return anything.

    If I look for "De la pointe" using soundex, writing: "De la ponte",

    it will return a LOT more results than it should...

    basically everything starting with DE[SPACE]... will be returned!

    Wow.

    Will test NYSIIS and let you know how better it is or not.

  • The code was changed to reflect the NYSIIS "standard" a little better after the article was published (see this discussion board for details).  The "H" in Johanson and Johannsen is not considered silent because it's surrounded by vowels; therefore it's not dropped.  In the original version it was dropped.

  • Yes, unfortunately the SQL Server SOUNDEX() function stops encoding a name when it encounters a non-alphabetic character.  So "DE LA POINTE", "DE LA SOUL", etc. will all encode to "D000".  One work-around with Soundex is to eliminate all spaces from a name before encoding with the REPLACE() function:

    SELECT

    SOUNDEX(REPLACE('DE LA POINTE',' ', ''))

    Another option (that NARA recommends) is to eliminate common prefixes like 'DE', 'LA', 'VAN', 'CON', etc. from the name before Soundex-encoding it.  See http://www.archives.gov/publications/general-info-leaflets/55.html for more info.  Also note that the MS SOUNDEX() function does not follow the official NARA Soundex standard, and may return results that don't match up with external data sources that encode correctly.

    This NYSIIS implementation ignores invalid characters (spaces, etc.) in the name and encodes until it runs out of alphabetic characters.  Also see http://www.sqlservercentral.com/columnists/mcoles/sql2000dbatoolkitpart3.asp for the update to this set.  The update includes NYSIIS, Double-Metaphone, Celko Soundex, Daitch-Mokotoff Soundex, Levenshtein Edit Distance and Jaro-Winkler Distance functions.

  • Hi,

    Having issues registering the nyssis.dll file on Windows server 2008.

    Is there Any updated version of this file is available?

    or please tell me how i can use it on windows server 2008 operating system.

    Thanks in advance

    ravi

  • I have had this implemented for about 2 years on a 32-bit sql 2000 server and it's working like a champ. However, we are now implenting on a sql 2005 64-bit server, and it is blowing up on me. My event viewer shows the following error:

    "Description:

    Could not load the DLL xp_nysiis.dll, or one of the DLLs it references. Reason: 193(xp_nysiis.dll is not a valid Win32 application.)."

    Please help.

  • Hello All -

    Does anyone have any solutions on how to get NYSIIS.DLL to run on a 64-Bit machine ??

  • I had the same problem of getting the dll to work with 2008. I'm in the process of converting a 2000 db to 2008 and I used that xp_NYSIIS a lot. Sooooo I've decided to just write a function that will do it based on the algorithm. Here is a first draft and isn't fully tested. It's also based on the original algorithm and not the modified one. Gonna work on that next. Unfortunately I've found inconsistent results with some of the online tools that I've been using to test this (http://www.dropby.com/NYSIIS.html). Anyway... it's a start for those that want to use it:

    Declare

    @Word varchar(50),

    @FirstCharacter char(1),

    @Length int,

    @LetterPosition int,

    @PreviousPosition int,

    @NextPosition int,

    @IteratedWord varchar(50)

    SET @Word = 'Alexander'

    SET @Word = replace(@Word,' ','')

    SET @Word = rtrim(ltrim(@Word))

    SET @Word = Upper(@Word)

    SET @Length = len(@Word)

    /*

    1.Transcode first characters of name:

    MAC»MCC

    KN»NN

    K»C

    PH»FF

    PF»FF

    SCH»SSS

    */

    IF left(@Word,3) = 'MAC' BEGIN SET @Word = 'MCC' + right(@Word,@Length-3) END

    IF left(@Word,2) = 'KN' BEGIN SET @Word = 'NN' + right(@Word,@Length-2) END

    IF left(@Word,1) = 'K' BEGIN SET @Word = 'C' + right(@Word,@Length-1) END

    IF left(@Word,2) in ('PH','PF') BEGIN SET @Word = 'FF' + right(@Word,@Length-2) END

    IF left(@Word,3) = 'SCH' BEGIN SET @Word = 'SSS' + right(@Word,@Length-3) END

    PRINT '1: ' + @Word

    /* 2.Transcode last characters of name:

    EE, IE»Y

    DT,RT,RD,NT,ND»D */

    IF right(@Word,2) in ('EE','IE') BEGIN SET @Word = substring(@Word,1,@Length-2) + 'Y' END

    IF right(@Word,2) in ('DT','RT','RD','NT','ND') BEGIN SET @Word = substring(@Word,1,@Length-2) + 'D' END

    SET @Length = len(@Word)

    PRINT '2: ' + @Word

    /*3.First character of key = first character of name. */

    SET @FirstCharacter = left(@Word,1)

    PRINT '3: ' + @FirstCharacter

    --Set @Word to remaining characters for further processing

    SET @Word = right(@Word,@Length-1)

    SET @Length = len(@Word)

    PRINT 'Trimmed of First Character: ' + @Word

    /*

    4.Transcode remaining characters by following these rules, incrementing by one character each time:

    EV»AFelse A,E,I,O,U » A

    Q»G

    Z»S

    M»N

    KN»Nelse K » C

    SCH»SSS

    PH»FF

    H»If previous or next is nonvowel, previous

    W»If previous is vowel, previous

    Add current to key if current != last key character

    */

    SET @Word = replace(@Word,'EV','AF')

    SET @Word = replace(@Word,'E','A')

    SET @Word = replace(@Word,'I','A')

    SET @Word = replace(@Word,'O','A')

    SET @Word = replace(@Word,'U','A')

    SET @Word = replace(@Word,'Q','G')

    SET @Word = replace(@Word,'Z','S')

    SET @Word = replace(@Word,'M','N')

    SET @Word = replace(@Word,'KN','N')

    SET @Word = replace(@Word,'K','C')

    SET @Word = replace(@Word,'SCH','SSS')

    SET @Word = replace(@Word,'PH','FF')

    --Iterate through each letter for "H" conversion

    SET @IteratedWord = @Word

    PRINT 'Iterated word: ' + @IteratedWord

    SET @LetterPosition = 1

    PRINT '~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~'

    While @LetterPosition <= @Length

    BEGIN

    SET @PreviousPosition = Case When @LetterPosition > 1 Then @LetterPosition-1 Else 1 END

    SET @NextPosition = Case When @LetterPosition < @Length Then @LetterPosition+1 Else @Length END

    IF substring(@IteratedWord,@LetterPosition,1) = 'H' and

    (substring(@IteratedWord,@PreviousPosition,1) not in ('A','E','I','O','U') or

    substring(@IteratedWord,@NextPosition,1) not in ('A','E','I','O','U'))

    BEGIN

    PRINT 'STUFF H!' + ' ' + @Word

    SET @Word = stuff(@Word,@LetterPosition,1,'')

    PRINT 'New Word: ' + isnull(@Word,'Null')

    END

    SET @LetterPosition = @LetterPosition + 1

    SET @Length = len(@Word)

    END

    --Iterate through each letter for "W" conversion

    SET @IteratedWord = @Word

    PRINT 'Iterated word: ' + @IteratedWord

    SET @LetterPosition = 1

    PRINT '~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~'

    While @LetterPosition <= @Length

    BEGIN

    SET @PreviousPosition = Case When @LetterPosition > 1 Then @LetterPosition-1 Else 1 END

    SET @NextPosition = Case When @LetterPosition < @Length Then @LetterPosition+1 Else @Length END

    SET @Length = len(@Word)

    IF substring(@IteratedWord,@LetterPosition,1) = 'W' and

    substring(@IteratedWord,@PreviousPosition,1) in ('A','E','I','O','U')

    BEGIN

    PRINT 'STUFF W!' + ' ' + @Word

    SET @Word = stuff(@Word,@LetterPosition,1,'')

    PRINT 'New Word: ' + isnull(@Word,'Null')

    END

    SET @LetterPosition = @LetterPosition + 1

    SET @Length = len(@Word)

    END

    /* 5. If last character is S, remove it */

    IF right(@Word,1) = 'S'

    BEGIN

    PRINT 'Remove last letter S'

    SET @Word = left(@Word,@Length-1)

    SET @Length = len(@Word)

    END

    PRINT '5: ' + @Word

    /* 6.If last characters are AY, replace with Y */

    IF right(@Word,2) = 'AY'

    BEGIN

    PRINT 'Remove last characters AY'

    SET @Word = stuff(@Word,@Length-1,2,'Y')

    SET @Length = len(@Word)

    END

    PRINT '6: ' + @Word

    /* 7.If last character is A, remove it */

    IF right(@Word,1) = 'A'

    BEGIN

    PRINT 'Remove last letter A'

    SET @Word = left(@Word,@Length-1)

    SET @Length = len(@Word)

    END

    PRINT '7: ' + @Word

    /* 8.Collapse all strings of repeated characters */

    PRINT 'Collapse repeated characters...'

    SET @LetterPosition = 1

    While @LetterPosition < @Length

    BEGIN

    IF substring(@Word,@LetterPosition,1) = substring(@Word,@LetterPosition+1,1)

    BEGIN

    PRINT 'Next Letter Matchs. Remove Letter.'

    SET @Word = stuff(@Word,@LetterPosition+1,1,'')

    END

    IF substring(@Word,@LetterPosition,1) <> substring(@Word,@LetterPosition+1,1)

    BEGIN

    PRINT 'Next Letter does not match. Advance'

    SET @LetterPosition = @LetterPosition + 1

    END

    SET @Length = len(@Word)

    END

    PRINT '8: ' + @Word

    /*9.Add original first character of name as first character of key*/

    SET @Word = @FirstCharacter + @Word

    PRINT '9: ' + @Word

    --Only supposed to use first six characters but display the rest

    PRINT left(@Word,6) + Case When @Length > 6 Then '['+ substring(@Word,7,@Length-5)+']' Else '' END

  • AVB (12/20/2013)


    I had the same problem of getting the dll to work with 2008. I'm in the process of converting a 2000 db to 2008 and I used that xp_NYSIIS a lot. Sooooo I've decided to just write a function that will do it based on the algorithm. Here is a first draft and isn't fully tested. It's also based on the original algorithm and not the modified one. Gonna work on that next.

    I'm glad to see someone working on this outside the world of CLR even if I think the CLR will always win performance-wise.

    Something to possibly consider. While loops tend to be a bit sluggish in T-SQL. The loops in your code might be an exception because they're "memory only" loops. However, please consider the following. Here's your snippet for getting rid of multiple adjacent identical characters.

    /* 8.Collapse all strings of repeated characters */

    PRINT 'Collapse repeated characters...'

    SET @LetterPosition = 1

    While @LetterPosition < @Length

    BEGIN

    IF substring(@Word,@LetterPosition,1) = substring(@Word,@LetterPosition+1,1)

    BEGIN

    PRINT 'Next Letter Matchs. Remove Letter.'

    SET @Word = stuff(@Word,@LetterPosition+1,1,'')

    END

    IF substring(@Word,@LetterPosition,1) <> substring(@Word,@LetterPosition+1,1)

    BEGIN

    PRINT 'Next Letter does not match. Advance'

    SET @LetterPosition = @LetterPosition + 1

    END

    SET @Length = len(@Word)

    END

    PRINT '8: ' + @Word

    Here's a "Tally Table" version that does the same thing.

    --===== Remove all duplicated characters (JBM)

    DECLARE @Word VARCHAR(1000);

    SELECT @Word = 'AAAAABBBBCDEFFFFFFFFFFFFFFGHHHHHHHHIIIIIIIJJKLMNNNNNNNNNNO';

    SELECT @Word =

    (

    SELECT SUBSTRING(@Word,t.N,1)

    FROM dbo.Tally t

    WHERE t.N <= DATALENGTH(@Word)

    AND SUBSTRING(@Word,t.N,1) <> SUBSTRING(@Word,t.N-1,1)

    ORDER BY t.N

    FOR XML PATH(''),TYPE

    ).value('(./text())[1]','varchar(max)')

    ;

    PRINT @Word;

    Of course, you do lose the capability to the the individual actions taken against each and every character but I suspect you won't want that in the final product anyway.

    Looking through the code, there are several places where you can convert the code to similar functionality or even combined prior to the de-duplication of characters.

    You can find out more about the Tally Table and how it can be used as a replacement for certain kinds of loops at the following URL.

    http://www.sqlservercentral.com/articles/T-SQL/62867/

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 15 posts - 16 through 30 (of 37 total)

You must be logged in to reply to this topic. Login to reply