Sound Matching and a Phonetic Toolkit

Introduction

SQL Server 2000 provides little support for phonetic matching, which is commonly needed functionality when dealing with genealogical, law enforcement, and other databases used primarily for name searches. This article describes the basic SQL Server 2000 functionality and offers tools to provide enhanced functionality via a SQL Server Phonetic Toolkit.

Click here to download the Toolkit

Soundex

Soundex is a 120+ year old system of phonetic encoding. It was first applied to the 1880 U.S. Census to help researchers quickly locate similar surnames. The basics of phonetic encoding are the same today as they were in 1880; however, the methods and implementations have vastly improved.

The basic idea behind Soundex is that certain consonants have similar sounds, and these consonants are grouped together. Examples of this are 'D' and 'T', which form a single grouping. Soundex eliminates most vowels as being unnecessary.

In Soundex encoding, 'Smith', 'Smyth' and 'Smythe' all return an encoded value of 'S530'. SQL Server provides basic Soundex functionality via the SOUNDEX() function. All Soundex-encoded strings are 4 digits long, and follow the pattern: Xnnn, where X is an alphabetic character and nnn is three numeric digits.

SQL Server 2000 has a built-in function for phonetic name matching. SOUNDEX() returns a 4-character Soundex phonetic pattern for a word.

    Usage:  SOUNDEX (@character_expression)

NYSIIS

The major problem with Soundex is that it returns a high number of 'false-positives'. To overcome this problem, New York State commissioned a study of phonetic encodings in 1970. They came up with the New York State Identification and Intelligence System (NYSIIS). The creators of the NYSIIS system overcame many of the problems with Soundex by converting groups of letters such as 'EV' and 'SCH' into equivalent codes. NYSIIS also maintains the relative positioning of vowels in the final encoded string, eliminates trailing S's and vowels from the string during encoding. This all makes matches more precise.

Examples of NYSIIS encoding include: 'johnson', 'johnsen', 'johansen' and 'johannsen' all equal 'JANSAN' in NYSIIS. This implementation of NYSIIS returns a 10-character result. If the result is less than 10 characters, it is right-padded out to 10 characters with spaces. Some implementations provide only the first 6 characters of the NYSIIS encoding. This might be a consideration for your database.

xp_nysiis and udf_nysiis are included in the Toolkit.

    Usage:  udf_nysiis (@character_expression)
            EXEC xp_nysiis @character_expression [, @nysiis_code OUTPUT]

Levenshtein Edit Distance

Levenshtein Edit Distance provides a means for determining how different two strings are. The basic idea behind this edit distance algorithm is to determine how many changes are required to change a string to another string. For instance, to change 'MISTER' to 'MISTRESS' requires 3 changes. We need to: 1) swap the 'E' and the 'R' in 'MISTER'; 2) add one 'S' to the end; 3) add a second 'S' to the end.

Levenshtein Edit Distance is handy in uncovering typographical errors, and is often used in spell-checking algorithms. We can see, for instance, that 'SMIT' requires just one edit to become 'SMITH'. The edit distance between these two words makes it a simple matter for us to suggest potential correct spellings, or to include potential correct matches in our searches.

xp_levenshtein and udf_levenshtein are included in the Toolkit.

    Usage:  udf_levenshtein (@character_expression₁, @character_expression₂)
            EXEC xp_levenshtein @character_expression₁, @character_expression₂ [, @edit_distance OUTPUT]

Install and Use

To install and use the Toolkit:

Download the Toolkit from this site by clicking on the click here to download link
Extract the files from the ZIP file
Copy the DLL files in the \Install directory to your Microsoft SQL Server\MSSQL\Binn directory
Run the ADD_XP.SQL script in Query Analyzer

Sample scripts are included in the \SQL_Samples directory, and source code for the extended stored procedures are included in the ZIP file for those who want to improve on the code, or if you just want to see how it does what it does.

Click here to download the Toolkit

Conclusions

SQL Server 2000 provides basic phonetic matching ability via the SOUNDEX() function. However, this functionality is limited and inexact. There are several other algorithms that can take advantage of modern computer speed and processing power to provide much more precise phonetic name matching ability.

Understanding SQL Server Full-Text Search, Part I

by Additional Articles

Redgate

Full-Text Search

SQL Full-text Search (SQL FTS) is an optional component of SQL Server 7 and later, which allows fast and efficient querying when you have large amounts of unstructured data. This is the first of a two-part article that explores the full-text language features that ship with SQL Server versions 7, 2000, and 2005, with particular focus on the new language features in SQL 2005.

2006-05-02

3,641 reads

Discuss

Meeting business needs with SQL Server full-text search

by Additional Articles

SearchSQLServer

Full-Text Search

With the full-text search features in SQL Server, along with several hybrid search methods for comprehensive table scanning, companies can better meet specific business needs.

2009-02-20

2,448 reads

Using full-text search for symbols in SQL Server

by Additional Articles

SQLServerCentral.com

Full-Text Search

Here's how to use full-text search in SQL Server when querying exact SQL expressions that include symbols. Use these steps to search symbols in SQL with these steps.

★ ★ ★ ★ ★ ★ ★ ★ ★ ★

You rated this post out of 5. Change rating

2008-08-27

3,524 reads

Share Database Knowledge with Office and SQL Server

by Bill Wunder

SQLServerCentral.com

Full-Text Search

Longtime SQL Server expert Bill Wunder has written a very interesting article on how to share knowledge among IT staff using the tools you already have in your organization.

★ ★ ★ ★ ★ ★ ★ ★ ★ ★

3.75 (12)

You rated this post out of 5. Change rating

2008-08-06

6,663 reads

Discuss

Full Text Search Noise Words and Thesaurus Configurations

by Additional Articles

MSSQLTips.com

Full-Text Search

I have heard that Full Text Search uses a noise words to eliminate meaningless words in searches. I have also heard that a thesaurus is used, but I am not exactly sure how. Can you provide some details related to how both of these technologies are used in Full Text Search? I am interested in more of a background on these specific Full Text Search technologies as well as where the files are located and how I can update them.

2008-05-19

2,630 reads

Sound Matching and a Phonetic Toolkit

Introduction

Soundex

NYSIIS

Levenshtein Edit Distance

Install and Use

Conclusions

Further Reading

Rate

Share

Categories

Share

Rate

Sound Matching and a Phonetic Toolkit

Introduction

Soundex

NYSIIS

Levenshtein Edit Distance

Install and Use

Conclusions

Further Reading

Rate

Share

Categories

Share

Rate

Related content

Understanding SQL Server Full-Text Search, Part I

Meeting business needs with SQL Server full-text search

Using full-text search for symbols in SQL Server

Share Database Knowledge with Office and SQL Server

Full Text Search Noise Words and Thesaurus Configurations