SQL Server 2000 provides little support for phonetic matching, which is commonly needed functionality when dealing with genealogical, law enforcement, and other databases used primarily for name searches. This article describes the basic SQL Server 2000 functionality and offers tools to provide enhanced functionality via a SQL Server Phonetic Toolkit.
Soundex is a 120+ year old system of phonetic encoding. It was first applied to the 1880 U.S. Census to help researchers quickly locate similar surnames. The basics of phonetic encoding are the same today as they were in 1880; however, the methods and implementations have vastly improved.
The basic idea behind Soundex is that certain consonants have similar sounds, and these consonants are grouped together. Examples of this are 'D' and 'T', which form a single grouping. Soundex eliminates most vowels as being unnecessary.
In Soundex encoding, 'Smith', 'Smyth' and 'Smythe' all return an encoded value of 'S530'. SQL Server provides basic Soundex functionality via the SOUNDEX() function. All Soundex-encoded strings are 4 digits long, and follow the pattern: Xnnn, where X is an alphabetic character and nnn is three numeric digits.
|SQL Server 2000 has a built-in function for phonetic name matching. SOUNDEX() returns a 4-character Soundex phonetic pattern for a word.
Usage: SOUNDEX (@character_expression)
The major problem with Soundex is that it returns a high number of 'false-positives'. To overcome this problem, New York State commissioned a study of phonetic encodings in 1970. They came up with the New York State Identification and Intelligence System (NYSIIS). The creators of the NYSIIS system overcame many of the problems with Soundex by converting groups of letters such as 'EV' and 'SCH' into equivalent codes. NYSIIS also maintains the relative positioning of vowels in the final encoded string, eliminates trailing S's and vowels from the string during encoding. This all makes matches more precise.
Examples of NYSIIS encoding include: 'johnson', 'johnsen', 'johansen' and 'johannsen' all equal 'JANSAN' in NYSIIS. This implementation of NYSIIS returns a 10-character result. If the result is less than 10 characters, it is right-padded out to 10 characters with spaces. Some implementations provide only the first 6 characters of the NYSIIS encoding. This might be a consideration for your database.
|xp_nysiis and udf_nysiis are included in the Toolkit.
Usage: udf_nysiis (@character_expression) EXEC xp_nysiis @character_expression [, @nysiis_code OUTPUT]
Levenshtein Edit Distance
Levenshtein Edit Distance provides a means for determining how different two strings are. The basic idea behind this edit distance algorithm is to determine how many changes are required to change a string to another string. For instance, to change 'MISTER' to 'MISTRESS' requires 3 changes. We need to: 1) swap the 'E' and the 'R' in 'MISTER'; 2) add one 'S' to the end; 3) add a second 'S' to the end.
Levenshtein Edit Distance is handy in uncovering typographical errors, and is often used in spell-checking algorithms. We can see, for instance, that 'SMIT' requires just one edit to become 'SMITH'. The edit distance between these two words makes it a simple matter for us to suggest potential correct spellings, or to include potential correct matches in our searches.
|xp_levenshtein and udf_levenshtein are included in the Toolkit.
Usage: udf_levenshtein (@character_expression1, @character_expression2) EXEC xp_levenshtein @character_expression1, @character_expression2 [, @edit_distance OUTPUT]
Install and Use
To install and use the Toolkit:
- Download the Toolkit from this site by clicking on the click here to download link
- Extract the files from the ZIP file
- Copy the DLL files in the \Install directory to your Microsoft SQL Server\MSSQL\Binn directory
- Run the ADD_XP.SQL script in Query Analyzer
Sample scripts are included in the \SQL_Samples directory, and source code for the extended stored procedures are included in the ZIP file for those who want to improve on the code, or if you just want to see how it does what it does.
SQL Server 2000 provides basic phonetic matching ability via the SOUNDEX() function. However, this functionality is limited and inexact. There are several other algorithms that can take advantage of modern computer speed and processing power to provide much more precise phonetic name matching ability.