After a recent SQL Server Central article on the improvements in
2005 User-Defined Data Types (UDT's) by author
Dinesh Priyankara, I set out to research
practical applications for SQL 2005 UDT's. The result of my research is this implementation of the Double
Metaphone algorithm for SQL 2005.
Click here to download the source code and binaries for this article
From Soundex to Metaphone to DoubleMetaphone
In late 1990, Lawrence Phillips introduced the Metaphone phonetic matching algorithm. This algorithm provided
several advantages over Soundex, the de facto standard for phonetic matching at the time. One of the major
advantages of Metaphone is that it encoded certain groups of letters together, whereas Soundex encoded only
a single letter at a time. For instance, the word "GNU" encodes in Soundex as 'G500' indicating that
the 'G' and 'N' are pronounced separately. In Metaphone "GNU" it encodes as 'N', indicating that the
'G' in 'GN' is silent1.
In June 2000, Mr. Phillips improved phonetic matching again with the introduction of the Double Metaphone
algorithm2,3. This algorithm recognized that different letter combinations can be pronounced differently
in different circumstances. For instance, in Spanish the name "VILLA" would be pronounced "VEE-YAH",
however the 'LL' combination can also cause "VILLA" to be pronounced "VIL-UH" in
English. Double Metaphone accounts for this discrepancy in pronunciation by giving us a primary encoding of
'FL' and an alternate encoding of 'F' for the word "VILLA". Soundex returns the single encoding
of 'V400' for the same word.
The User-Defined Data Type
When designing this application, the first thing we need to do is determine the language we want to write
code in. For this example, we'll give VB Express 2005 (Beta 2)4 along with SQL Express 2005 (June CTP)5
a shot. Note that the versions are very important, since we're still in Beta-ville and things are constantly
With that out of the way, we can get into the implementation details and decide what type of result(s)
we want to store. A good starting point is the output we can expect from Double Metaphone. According to
Mr. Phillips, Double Metaphone should return two encoded strings, each 4 characters. If an encoded string is not
4 characters long, it should be right padded with spaces to make it 4 characters long. To this end, we'll
implement a SQL Server 2005 .NET User-Defined Data Type to hold both of our encodings in a single object. Here
are some of the highlights of this UDT (See the download file for complete source code):
' This UDT is DoubleMetaphoneResults. It contains the results of a Double Metaphone
' Encoding. The namespace for both the DoubleMetaphoneResults UDT and DoubleMetaphone
' class is Phonetics.Tools
' Here we define it as Format.UserDefined. This is a requirement for UDT's that
' expose string or non-value (reference) properties. We also declared it Serializable(),
' although MS has stated this will not be a requirement in the future.
SqlUserDefinedType(Format.UserDefined, IsByteOrdered:=True, MaxByteSize:=16), _
Public Class DoubleMetaphoneResult
Implements INullable, IBinarySerialize
' SQL 2005 .NET UDT's need to Implement INullable. Since it is Format.UserDefined,
' we have to implement IBinarySerialize as well
' IBinarySerialize requires that we implement our own Read and Write Methods
Public Sub Write(ByVal w As System.IO.BinaryWriter) Implements IBinarySerialize.Write
Public Sub Read(ByVal r As System.IO.BinaryReader) Implements IBinarySerialize.Read
' INullable requires us to implement a read-only Null and IsNull properties
Private _IsNull As Boolean
Public Shared ReadOnly Property Null() As DoubleMetaphoneResult
Public ReadOnly Property IsNull() As Boolean Implements INullable.IsNull
' Parse allows us to parse an input string into a DoubleMetaphoneResult
' In our case we accept two 4 byte double-metaphone strings concatenated
' into one 8-byte string as input
Public Shared Function Parse(ByVal data As SqlString) As DoubleMetaphoneResult
' ToString() is another handy function that converts our DoubleMetaphoneResult
' into one 8-byte string. The first four bytes represent the first, or primary,
' encoding. The last four bytes represent the second, or alternate, encoding
Public Overrides Function ToString() As String
' Now for the properties that actually make up the meat of what we're trying to
' do. There is a First property that exposes the first, or primary, encoding
' and a Second property that exposes the second, or alternate, encoding.
' Note that I did not name these properties Primary and Alternate because
' Primary is a reserved keyword.
Private _First As String
Private _Second As String
Public Property First() As String
Public Property Second() As String
If you decide to explore the source code, you'll notice that the actual public properties
we want to expose take up very little code. The majority of the code in our .NET UDT is pretty much
"boiler-plate" interface implementation code. Since we have exposed string type properties in our
UDT we have to use Format.UserDefined serialization. If we exposed only value types (Integers, etc.), we
could have used the Format.Native serialization, which is faster, instead. This would have made things
a little simpler since we wouldn't have had to implement the IBinarySerialize interface either.
Here's our SQL 2005/.NET UDT:
|Overview of Phonetic.Tools.DoubleMetaphoneResult UDT Class|
|First||The first, or primary, Double Metaphone encoded string|
|Second||The second, or alternate, Double Metaphone encoded string|
|Null||A Null DoubleMetaphoneResult UDT object (Read Only)|
|IsNull||Returns True if a DoubleMetaphoneResult UDT object is Null (Read Only)|
|Read()||Implements the IBinarySerialize.Read() method|
|Write()||Implements the IBinarySerialize.Write() method|
|Parse()||Parses a string into a DoubleMetaphoneResult object|
|ToString()||Converts a DoubleMetaphoneResult object into a string|
The User-Defined Functions
Now that we have our UDT defined, we need to implement the class that will utilize our UDT. We will implement two
functionas: 1) Generate a DoubleMetaphoneResult from an input string and 2) Compare two DoubleMetaphoneResult
Double Metaphone is a fairly complex algorithm that takes literally hundreds of combinations of letters, exceptions
to normal pronunciation and other special criteria into account. The implementation and description of Double Metaphone
could fill an entire article by itself. For the implementation, I ported Mr. Phillips' C++ version directly over to
VB Express 2005. So to stay on target, we'll skip most of the implementation details in this article. There are links in
the references section, as well as the downloadable source code, that provide grater detail.
In order to implement Double Metaphone encoding functionality, we will create a class called Phonetic.Tools.DoubleMetaphone.
This class will have two public methods: DoubleMetaphoneEncode() and DoubleMetaphoneCompare(). These two methods will be the
basis for our SQL Server 2005 Double Metaphone User-Defined Functions.
There are a couple of items that are noteworthy during the implementation of the public methods:
- Our methods must be "Shared" (or "static" for you C# programmers out there). This isn't
that big of a deal for us to change in this instance.
- The methods are preceded with the <SqlFunction(IsDeterministic:=True)> attribute, indicating that they are functionally enabled
for SQL Server 2005 and that they are Deterministic functions.
- Additionally, the .NET String data type corresponds to the SQL 2005 NVARCHAR data type, and cannot accept VARCHAR values.
We will have to make sure that our SQL-side function definitions account for this.
Apart from those items, implementing a SQL Server 2005 UDF in .NET was as simple as writing a Function in Visual Basic.
Our DoubleMetaphone class exposes only two public methods:
|Overview of Phonetic.Tools.DoubleMetaphoneResult UDT Class|
|DoubleMetaphoneEncode()||Accepts an NVARCHAR value and encodes it using Double Metaphone.|
Returns a DoubleMetaphoneResult object.
|DoubleMetaphoneCompare()||Accepts two DoubleMetaphoneResult objects as inputs, and compares them.|
Returns an Integer between 0 and 3 indicating the match score.
The DoubleMetaphoneEncode() function accepts an NVARCHAR string as input and then encodes it using the Double Metaphone
algorithm. The result is returned in a DoubleMetaphoneResult UDT object. The DoubleMetaphoneCompare() function accepts
two encoded DoubleMetaphoneResult UDT objects and returns a match score based on the following table6:
|3||Strong Match. Primary1 = Primary2|
|2||Medium Match. Primary1 = Alternate2|
|2||Medium Match. Primary2 = Alternate1|
|1||Weak Match. Alternate1 = Alternate2|
The value returned by the DoubleMetaphoneCompare() function can be used to filter matches and narrow down search result
Installation and Usage
To install Double Metaphone, just follow these steps:
- Copy the DoubleMetaphone.dll file from the DoubleMetaphone\Install directory of the ZIP file to your SQL Server 2005
- Open and run the DoubleM_Install.SQL script (located in the DoubleMetaphone\SQL_Queries directory of the ZIP file). You
may have to modify the directory path of the CREATE ASSEMBLY statement.
To install the sample [People] table with about 4,500 surnames for testing, run the Create_People_Table.SQL
script. Once the [People] table is installed, you can run DoubleM_Query_Demo.SQL which demonstrates one possible
method of using the UDF's and UDT's.
To uninstall, just run the DoubleM_Drop.SQL script. It removes the [People] table, drops the UDF's and UDT, and drops the