SQL Server Central is supported by Red Gate Software Ltd.
 
Log in  ::  Register  ::  Not logged in
Search:  
 
 

Sound Matching and a Phonetic Toolkit

By Michael Coles, 2005/08/03

Total article views: 12387 | Views in the last 30 days: 105

Introduction

SQL Server 2000 provides little support for phonetic matching, which is commonly needed functionality when dealing with genealogical, law enforcement, and other databases used primarily for name searches. This article describes the basic SQL Server 2000 functionality and offers tools to provide enhanced functionality via a SQL Server Phonetic Toolkit.

Click here to download the Toolkit

Soundex

Soundex is a 120+ year old system of phonetic encoding. It was first applied to the 1880 U.S. Census to help researchers quickly locate similar surnames. The basics of phonetic encoding are the same today as they were in 1880; however, the methods and implementations have vastly improved.

The basic idea behind Soundex is that certain consonants have similar sounds, and these consonants are grouped together. Examples of this are 'D' and 'T', which form a single grouping. Soundex eliminates most vowels as being unnecessary.

In Soundex encoding, 'Smith', 'Smyth' and 'Smythe' all return an encoded value of 'S530'. SQL Server provides basic Soundex functionality via the SOUNDEX() function. All Soundex-encoded strings are 4 digits long, and follow the pattern: Xnnn, where X is an alphabetic character and nnn is three numeric digits.

SQL Server 2000 has a built-in function for phonetic name matching. SOUNDEX() returns a 4-character Soundex phonetic pattern for a word.

    Usage:  SOUNDEX (@character_expression)

NYSIIS

The major problem with Soundex is that it returns a high number of 'false-positives'. To overcome this problem, New York State commissioned a study of phonetic encodings in 1970. They came up with the New York State Identification and Intelligence System (NYSIIS). The creators of the NYSIIS system overcame many of the problems with Soundex by converting groups of letters such as 'EV' and 'SCH' into equivalent codes. NYSIIS also maintains the relative positioning of vowels in the final encoded string, eliminates trailing S's and vowels from the string during encoding. This all makes matches more precise.

Examples of NYSIIS encoding include: 'johnson', 'johnsen', 'johansen' and 'johannsen' all equal 'JANSAN' in NYSIIS. This implementation of NYSIIS returns a 10-character result. If the result is less than 10 characters, it is right-padded out to 10 characters with spaces. Some implementations provide only the first 6 characters of the NYSIIS encoding. This might be a consideration for your database.

xp_nysiis and udf_nysiis are included in the Toolkit.

    Usage:  udf_nysiis (@character_expression)
            EXEC xp_nysiis @character_expression [, @nysiis_code OUTPUT]

Levenshtein Edit Distance

Levenshtein Edit Distance provides a means for determining how different two strings are. The basic idea behind this edit distance algorithm is to determine how many changes are required to change a string to another string. For instance, to change 'MISTER' to 'MISTRESS' requires 3 changes. We need to: 1) swap the 'E' and the 'R' in 'MISTER'; 2) add one 'S' to the end; 3) add a second 'S' to the end.

Levenshtein Edit Distance is handy in uncovering typographical errors, and is often used in spell-checking algorithms. We can see, for instance, that 'SMIT' requires just one edit to become 'SMITH'. The edit distance between these two words makes it a simple matter for us to suggest potential correct spellings, or to include potential correct matches in our searches.

xp_levenshtein and udf_levenshtein are included in the Toolkit.

    Usage:  udf_levenshtein (@character_expression1, @character_expression2)
            EXEC xp_levenshtein @character_expression1, @character_expression2 [, @edit_distance OUTPUT]

Install and Use

To install and use the Toolkit:

  • Download the Toolkit from this site by clicking on the click here to download link
  • Extract the files from the ZIP file
  • Copy the DLL files in the \Install directory to your Microsoft SQL Server\MSSQL\Binn directory
  • Run the ADD_XP.SQL script in Query Analyzer

Sample scripts are included in the \SQL_Samples directory, and source code for the extended stored procedures are included in the ZIP file for those who want to improve on the code, or if you just want to see how it does what it does.

Click here to download the Toolkit

Conclusions

SQL Server 2000 provides basic phonetic matching ability via the SOUNDEX() function. However, this functionality is limited and inexact. There are several other algorithms that can take advantage of modern computer speed and processing power to provide much more precise phonetic name matching ability.

Further Reading

By Michael Coles, 2005/08/03

Total article views: 12387 | Views in the last 30 days: 105
Your response
 
 
Related tags
 
Like this? Try these...

Free Encryption

By Michael Coles | Category: Security
| 66,565 reads
Already registered?  

Free registration required

To read the rest of this article, and access thousands of other articles, we ask you to register on the site and subscribe to our newsletters.

Register

E-mail address:
Password:
Password (confirm):

  

Subscriptions

We ask you to register on the site and subscribe to our newsletters. Subscribing to our newsletters gets you:

  • ALL of our content (thousands of articles, scripts, and forum postings)
  • A daily newsletter (example)
  • A weekly news round up (example)
  • The opportunity to ask and answer questions in our forums
  • A daily Question of the Day to test and help you increase your knowledge of SQL Server.

We ask that you give the newsletter a try for a week. Over 200,000 SQL Server Professionals a day find it entertaining and useful. If not, you are welcome to unsubscribe at anytime.

Steve Jones
Editor, SQLServerCentral.com