Click here to monitor SSC
SQLServerCentral is supported by Red Gate Software Ltd.
 
Log in  ::  Register  ::  Not logged in
 
 
 

Sound Matching and a Phonetic Toolkit

By Michael Coles,

Introduction

SQL Server 2000 provides little support for phonetic matching, which is commonly needed functionality when dealing with genealogical, law enforcement, and other databases used primarily for name searches. This article describes the basic SQL Server 2000 functionality and offers tools to provide enhanced functionality via a SQL Server Phonetic Toolkit.

Click here to download the Toolkit

Soundex

Soundex is a 120+ year old system of phonetic encoding. It was first applied to the 1880 U.S. Census to help researchers quickly locate similar surnames. The basics of phonetic encoding are the same today as they were in 1880; however, the methods and implementations have vastly improved.

The basic idea behind Soundex is that certain consonants have similar sounds, and these consonants are grouped together. Examples of this are 'D' and 'T', which form a single grouping. Soundex eliminates most vowels as being unnecessary.

In Soundex encoding, 'Smith', 'Smyth' and 'Smythe' all return an encoded value of 'S530'. SQL Server provides basic Soundex functionality via the SOUNDEX() function. All Soundex-encoded strings are 4 digits long, and follow the pattern: Xnnn, where X is an alphabetic character and nnn is three numeric digits.

SQL Server 2000 has a built-in function for phonetic name matching. SOUNDEX() returns a 4-character Soundex phonetic pattern for a word.

    Usage:  SOUNDEX (@character_expression)

NYSIIS

The major problem with Soundex is that it returns a high number of 'false-positives'. To overcome this problem, New York State commissioned a study of phonetic encodings in 1970. They came up with the New York State Identification and Intelligence System (NYSIIS). The creators of the NYSIIS system overcame many of the problems with Soundex by converting groups of letters such as 'EV' and 'SCH' into equivalent codes. NYSIIS also maintains the relative positioning of vowels in the final encoded string, eliminates trailing S's and vowels from the string during encoding. This all makes matches more precise.

Examples of NYSIIS encoding include: 'johnson', 'johnsen', 'johansen' and 'johannsen' all equal 'JANSAN' in NYSIIS. This implementation of NYSIIS returns a 10-character result. If the result is less than 10 characters, it is right-padded out to 10 characters with spaces. Some implementations provide only the first 6 characters of the NYSIIS encoding. This might be a consideration for your database.

xp_nysiis and udf_nysiis are included in the Toolkit.

    Usage:  udf_nysiis (@character_expression)
            EXEC xp_nysiis @character_expression [, @nysiis_code OUTPUT]

Levenshtein Edit Distance

Levenshtein Edit Distance provides a means for determining how different two strings are. The basic idea behind this edit distance algorithm is to determine how many changes are required to change a string to another string. For instance, to change 'MISTER' to 'MISTRESS' requires 3 changes. We need to: 1) swap the 'E' and the 'R' in 'MISTER'; 2) add one 'S' to the end; 3) add a second 'S' to the end.

Levenshtein Edit Distance is handy in uncovering typographical errors, and is often used in spell-checking algorithms. We can see, for instance, that 'SMIT' requires just one edit to become 'SMITH'. The edit distance between these two words makes it a simple matter for us to suggest potential correct spellings, or to include potential correct matches in our searches.

xp_levenshtein and udf_levenshtein are included in the Toolkit.

    Usage:  udf_levenshtein (@character_expression1, @character_expression2)
            EXEC xp_levenshtein @character_expression1, @character_expression2 [, @edit_distance OUTPUT]

Install and Use

To install and use the Toolkit:

  • Download the Toolkit from this site by clicking on the click here to download link
  • Extract the files from the ZIP file
  • Copy the DLL files in the \Install directory to your Microsoft SQL Server\MSSQL\Binn directory
  • Run the ADD_XP.SQL script in Query Analyzer

Sample scripts are included in the \SQL_Samples directory, and source code for the extended stored procedures are included in the ZIP file for those who want to improve on the code, or if you just want to see how it does what it does.

Click here to download the Toolkit

Conclusions

SQL Server 2000 provides basic phonetic matching ability via the SOUNDEX() function. However, this functionality is limited and inexact. There are several other algorithms that can take advantage of modern computer speed and processing power to provide much more precise phonetic name matching ability.

Further Reading

Total article views: 13817 | Views in the last 30 days: 10
 
Related Articles
FORUM

soundex function

soundex function

ARTICLE

Soundex - Experiments with SQL CLR

Using test driven development to turbocharge Soundex

FORUM

Soundex-Like functionality

Spelled like.... != Soundex

FORUM

SOUNDEX

Comments posted to this topic are about the item [B]SOUNDEX[/B] Interesting question, thanks. Keep i...

ARTICLE

SQL 2000 DBA Toolkit Part 3

Phonetic matching, working on finding terms that are misspelled, is an art and there are a number of...

Tags
 
Contribute

Join the most active online SQL Server Community

SQL knowledge, delivered daily, free:

Email address:  

You make SSC a better place

As a member of SQLServerCentral, you get free access to loads of fresh content: thousands of articles and SQL scripts, a library of free eBooks, a weekly database news roundup, a great Q & A platform… And it’s our huge, buzzing community of SQL Server Professionals that makes it such a success.

Join us!

Steve Jones
Editor, SQLServerCentral.com

Already a member? Jump in:

Email address:   Password:   Remember me: Forgotten your password?
Steve Jones