SQLServerCentral Article

Soundex - Experiments with SQLCLR Part 2

,

When I wrote "Soundex - Experiments with SQLCLR" the resulting forum posts drew my attention to a number of weaknesses with the Soundex algorithm.

Soundex as implemented in SQL Server has around 7,000 possible values where as the English language has over 1 million words.  Clearly there is huge scope for false matches as many words can generate the same Soundex code.  In addition it makes no concessions to the fact that combinations of letters can make different sounds so will miss occurrences such as "Philip" and "Fillip".

Clearly something better than Soundex is required and fortunately there are quite a few good resources to look at.

  • Nikita's blog lists a number of alternative phonetic algorithms
  • The Apache Commons library contains actual Java implementations of the algorithms and adds ColognePhonetic
  • Utilitymill.com allows you to test words against Soundex and NYSIIS and also take a look at the source code.
  • Rosettacode.org provides many programming language implementations of common algorithms.

I decided to see if I could extend my "SoundexLong" function to take these into account.  I also wanted to see if continuing the Test Driven Development approach helped or hindered further development.

The first step in doing so was to perform some basic analysis into the way that these algorithms work in order to understand the scope of the changes that would have to be made.

Basic analysis of the algorithms

A good starting point is to understand what algorithms are available and what they are used for.  Those of interest are shown in the table below:-

Algorithm Purpose
Soundex Simple character substitution algorithm originally designed for a manual name indexing system.
Refined Soundex Slight enhancement to the original routine intended to reduce the number of false positives.  Still a simple character substitution algorithm
ColognePhonetic Very similar to Soundex but the encoding is to suit German pronounciation.
NYSIIS

New York State Identification and Intelligence System.  Designed to cater for various Americanisations of European names.

Starts to consider combinations (n-Grams) of letters and their changing role depending on their relative positions within a word.

Daitch-Mokotoff

Intended to cater for Jewish and Eastern European names.

Again, starts to consider combinations of letters and their changing role depending on their relative positions within a word.

Introduces the concept that multiple values may be returned

Caverphone Intended to cater for New Zealand names.  Largely a character substitution algorithm
Metaphone

Sophisticated algorithm sensitive to both combinations of letters and their relative positions in a word.  Rules vary in complexity but there a large number to cater for multi-lingual usage.

Double-metaphone Returns two phonetic codes for a given input to allow for multi-lingual pronounciations.  Double-metaphone will match Schmidt and Smith

I also characterised the types of transformations that each algorithm supported.  Again these are summarised in the table below:-

Algorithm

Character

Substitution

Start

Of Word

End Of

Word

After

Vowel

Combinations Case

Sensitive

Input

<

Output

Soundex Y Y
Refined Soundex Y Y
CoognePhonetic Y Y
NYSIIS Y Y Y Y Y
Daitch-Mokotoff Y Y Y Y Y
Caverphone Y Y Y Y Y
Metaphone Y Y Y Y Y Y
Double-Metaphone Y Y Y Y Y Y

The right-most column indicates that the number of substituted characters can exceed the number of characters being replaced.  This is important as in my original article I looked to use a single character array iterating through deciding whether to ignore or overwrite characters as appropriate.  An example of which is shown below.

Daitch-Mokotoff, ColognePhonetic and the Metaphone algorithms can replace a single 'X' with two characters.  If we were to use a single character array then we would risk prematurely overwriting characters rather than considering them appropriately.

This means that we are certainly going to need to have a separate input array from an output array.

Design Decisions

The original function from my "Soundex - Experiments with SQLCLR" article had the following characteristics:

  • Accepted a single argument, a string
  • Converted the input argument to upper case effectively making the function case insensitive
  • Returned a single string value
  • Named the function SoundexLong

If I want to cater for a range of phonetic algorithms then I face a number of choices.

  • Do I refactor my original code or build a new function that duplicates the functionality of the existing algorithm?
  • Do I produce multiple separate functions thereby duplicating code or a single function with a parameter to indicate what I want the function to do?

All the decisions have undesirable implications either from a code complexity, code duplication or alteration of established code perspective.  I am going to do the following:

  • My original SoundexLong will remain as an immutable set of code
  • I will have a single "LongPhonetic" function rather than multiple functions as I have learnt that end users lean towards simplicity and ease of use.
  • I am going to duplicate my SoundexLong code to form the starting point for LongPhonetic and refactor it to achieve the desired result

It is quite possible that these decisions will return to haunt me, however, we have to start somewhere.   Attempt a course of action but do so in such a way that identifies as quickly as possible if it is a dead end. This is the principle of "failing fast" and allows us to develop a strategy for failure that removes fear and doubt.  Such a strategy gives us a number of advantages:

  • Failure has to be viewed as a learning experience.  A success is great but we learn more from our failures.
  • Avoids analysis paralysis.  Fear and doubt can lead to analysis paralysis in an attempt to avoid failure
  • Allows us to take less timid steps and therefore open up potentially greater rewards.

Step One - Addressing some technical debt

jimbobmcgee gave constructive feedback to my original article pointing out that my implementation of an IsLetter() function was unnecessary as a char object has an inbuilt IsLetter() function. To correct this I simply changed both occurences of IsLetter in my code with char.IsLetter

        if(!char.IsLetter(inputArray[0]))
and
        if (char.IsLetter(inputArray[currentCharacterPosition]))
A quick run of the tests confirmed that the change had been carried out corrrectly and that nothing had broken.

Step Two - Refactoring for an outputArray

Given that Daitch-Mokotoff and Metaphone necessitate a 2nd array the next step was to change my code to read from the inputArray[] and write to the outputArray[]. At this stage I don't need to know how long I to make the outputArray as I am simply refactoring my code.  Making it the same length as the inputArray[] will suffice.

          char[] inputArray = InputString.ToUpper().ToCharArray();
          char[] outputArray = new String(' ', inputArray.Length).ToCharArray();
Obviously I need to change from 
        return new string(inputArray).Trim();
to
        return new string(outputArray).Trim();
As my code uses two position variables refactoring to use the outputArray is straight forward.
  • currerntCharacterPosition = The position we read from within the original input string
  • validCharacterPosition = The position we write to in the target array, initially inputArray[] but to be changed to outputArray[]
This means that the refactoring should simply be a case of replacing inputArray[validCharacterPosition] with outputArray[validCharacterPosition].
We rerun the tests and FAIL!

It turns out we need a bit of code to cater for the first character of the input argument being a letter.

        if (!char.IsLetter(inputArray[0]))
        {
            inputArray[0] = ' ';
        }
        else
        {
            outputArray[0] = inputArray[0];
        }
A quick rerun and all tests pass.
At this point I have carried out two refactoring exercises using just my original tests and they have proven that the code still works as intended.

Valid criticism of Test Driven Development

Having reached the point where the refactored code is working we are now at the point where we can consider making the LongPhoneticfunction accept an additional argument to determine which phonetic algorithm it should run.
It is here that we run into one of the biggest flaws in TDD using nUnit/JUnit etc.  The steps you should go through when applying TDD is as follows:-
  • Answer the question "What is the acceptance criteria that would demonstrate that this piece of functionality will deliver what it should"?  
  • Develop a test that expresses that acceptance criteria and will only pass if your code meets that criteria.  
  • Write the code to pass the test
  • Refactor mercilessly
The problem is that when you change the signature of a function or have a new function, method or object your test code cannot complile.  The most we can do is create a stub of the method or class in order to allow the test code to compile.

Frameworks such as Cucumber with its Gherkin language address these points because the common language in which tests are defined is separate from the implementation of the items that for which the tests apply.  By having the separation between the language defining the tests and the language implementing the tests it means that if you took the decision to replatform your application from one language to another all your tests would remain valid for your new application.

Cucumber/Gherkin describes tests in an ubiquitus language that is meaningful to business people so business analysts and testers can work together to write tests that are effectively cataloguing business requirements in the language that the business person can undersstand.  This is known as Behaviour Driven Development (BDD).
It is beyond the scope of this article to demonstrate BDD but a useful research topic would be SpecFlow which is effectively Cucumber for .NET.

Step Three - Refactoring my function to accept an additional argument

So far my LongPhonetic() function (refactored from SoundexLong in my original article) accepts a single string argument.  All I am going to do at this stage is change the signature of my function and refactor all the tests so they are proven to work.  I am NOT going to implement the selection of phonetic algorithms at this stage.  

James Shore and Shane Warden believe that modifying code in small quick steps testing after each step is the fastest way to develop quality working code and my personal experience would back this up. Change two things and the combinations of success/failure for two objects mean that of the four possible outcomes you have one way for it to work and three that could go wrong.
Sometimes it is obvious what isn't working but just as often it is a bit of a head scratcher. The more changes you make in a single step the more permutations and combinations there are of something going wrong. This is one of the cases where refactoring the changes takes more effort than refactoring the code. The code itself is as follows:
    [Microsoft.SqlServer.Server.SqlFunction (IsDeterministic=true,IsPrecise=true)]
    public static SqlString LongPhonetic(short PhoneticType,String InputString)
Each of the test calls to LongPhonetic has to be changed to incorporate the additional argument.
        #region LongPhonetic() tests
        [TestMethod]
        public void StartUpTest()
        {
            Assert.AreEqual("D", Phonetics.LongPhonetic(0, "D"));
       }
        [TestMethod]
        ..... 
       #endregion

Step Four - Acting on the additional argument

I could implement the selection of a phonetic algorithm by building up a complicated series of if() statements.  Although I could make it work the code would become increasingly unwieldy as more algorithms are implemented.

Ideally what I need is a variable that acts as if it were a function so I could set this at the start of the function and keep the heavy lifting as simple as possible.  Fortunately we can do this by using a .NET delegate as shown in the code below.

using System;
using System.Data;
using System.Data.SqlClient;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;
delegate Char PhoneticAlgorithm(Char a);
public partial class Phonetics
{
    [Microsoft.SqlServer.Server.SqlFunction(IsDeterministic=true,IsPrecise=true)]
    public static SqlString LongPhonetic(short PhoneticType,String InputString)
    {
        PhoneticAlgorithm phoneticAlgorithm;
        switch (PhoneticType)
        {
            case 0:
                phoneticAlgorithm = new PhoneticAlgorithm(SoundexChar);
                break;
            default:
                phoneticAlgorithm = new PhoneticAlgorithm(SoundexChar);
                PhoneticType = 0;
                break;
        }

The code works as follows:-

  • The "PhoneticAlgorithm" delegate is simply a signature of a function accepting a Char and returning a Char
  • We declare the variable phoneticAlgorithm to be of type "PhoneticAlgorithm"
  • We instantiate the variable phoneticAlgorithm as a function object matching the signature and return for the delegate.  In this case SoundexChar (as described in my original article) irrespective of what the value of our additional "PhoneticType" argument may be.

Compile and run the tests to check everything is working before moving onto the next stage.

Step Five - Implementing our delegate?

I eventually want to be able to use my additional argument to specify which phonetic algorithm I will use but right now I want to run my existing code and also to default to using my existing algorithm should any other value be specified.  Again this means refactoring my tests to prove that other values for the phonetic type argument work as expected.

        #region LongPhonetic() tests
        [TestMethod]
        public void StartUpTest()
        {
            Assert.AreEqual("D", Phonetics.LongPhonetic(0, "D"));
            Assert.AreEqual("D", Phonetics.LongPhonetic(99, "D"));
       }
        [TestMethod]
        ..... 
       #endregion
Then I simply change
            // If not a double occurrence of either characters or phonetic codes
            if (!((outputArray[validCharacterPosition] == inputArray[currentCharacterPosition]) ||
                outputArray[validCharacterPosition] == SoundexChar(inputArray[currentCharacterPosition])))
            {
                // Only increment the validCharacterPosition if it is not a vowel
                if (outputArray[validCharacterPosition] != '0')
                {
                    validCharacterPosition++;
                }
                outputArray[validCharacterPosition] = SoundexChar(inputArray[currentCharacterPosition]);
            }
 to
            // If not a double occurrence of either characters or phonetic codes
            if (!((outputArray[validCharacterPosition] == inputArray[currentCharacterPosition]) ||
                outputArray[validCharacterPosition] == phoneticAlgorithm(inputArray[currentCharacterPosition])))
            {
                // Only increment the validCharacterPosition if it is not a vowel
                if (outputArray[validCharacterPosition] != '0')
                {
                    validCharacterPosition++;
                }
                outputArray[validCharacterPosition] = phoneticAlgorithm(inputArray[currentCharacterPosition]);
            }
Again, compile and run the tests again to prove that everything is working.

Step Six - Adding a new phonetic algorithm

So far we have made quite a few changes simply to refactor our code to the point where it delivers exactly the same as it did before we started.  We are now ready to implement a 2nd phonetic algorithm that will better demonstrate what we are trying to achieve.

By far the simplest algorithm to implement is the refined Soundex algorithm as it is the same method with additional character encodings.

Encoded

Value

Soundex Refined Soundex
1 B,P,F,V B,P
2 C,S,G,J,K,Q,X,Z F,V
3 D,T C,K,S
4 L G,J
5 M,N Q,X,Z
6 R D,T
7 L
8 M,N
9 R

Again, the starting point is to refactor our tests so a phonetic type of one indicates that refined Soundex should be used.  An example of a refactored test is shown below.

        [TestMethod]
        public void LlangollenHeritageRailwayTest()
        {
            Assert.AreEqual("L5245 H632 R4", Phonetics.LongPhonetic(0, "llangollen heritage railway"));
            Assert.AreEqual("L8478 H964 R7", Phonetics.LongPhonetic(1, "llangollen heritage railway"));
            Assert.AreEqual("L5245 H632 R4", Phonetics.LongPhonetic(99, "llangollen heritage railway"));
        }
The test will fail until I implement the refined Soundex functionality so on failure I start by adding the RefinedSoundexChar() function as follows:-
   /// <summary>
    /// For a given character return the raw refined soundex value.  If the argument is not a letter then return a space.
    /// </summary>
    /// <remarks>Refined Soundex works in the same way as normal Soundex but assigns digits from 1 to 9 as opposed to 1-6.</remarks>
    /// <param name="a">The character to be evalutated.</param>
    /// <returns></returns>
    private static Char RefinedSoundexChar(Char a)
    {
        switch (a)
        {
            case 'A':
            case 'E':
            case 'I':
            case 'O':
            case 'U':
            case 'W':
            case 'H':
            case 'Y':
                a = '0';
                break;
            case 'B':
            case 'P':
                a = '1';
                break;
            case 'F':
            case 'V':
                a = '2';
                break;
            case 'C':
            case 'K':
            case 'S':
                a = '3';
                break;
            case 'G':
            case 'J':
                a = '4';
                break;
            case 'Q':
            case 'X':
            case 'Z':
                a = '5';
                break;
            case 'D':
            case 'T':
                a = '6';
                break;
            case 'L':
                a = '7';
                break;
            case 'N':
            case 'M':
                a = '8';
                break;
            case 'R':
                a = '9';
                break;
            default:
                a = ' ';
                break;
        }
        return a;
    }
After checking that everything compiles I now refactor the switch statement that choose which phonetic algorithmn to implement.
        PhoneticAlgorithm phoneticAlgorithm;
        switch (PhoneticType)
        {
            case 0:
                phoneticAlgorithm = new PhoneticAlgorithm(SoundexChar);
                break;
            case 1:
                phoneticAlgorithm = new PhoneticAlgorithm(RefinedSoundexChar);
                break;
            default:
                phoneticAlgorithm = new PhoneticAlgorithm(SoundexChar);
                PhoneticType = 0;
                break;
        }
If everything has been done correctly then my refactored tests will work confirming that I have two implemented phonetic algorithms.

Step Seven - Retiring unnecessary code

In my SoundexLong original I was writing my results back into the original array as I iterated through it.  In the vast majority of cases my encoded string would be considerably shorter than the string in the originating argument, meaning that the character after the encoding were spurious.
In my original I dealt with this by overwriting all characters after the encoded result with spaces and then trimming the result.  Given that in this version I have a separate inputArray and outputArray I no longer need to do this.
I can simply get rid of the block of code shown below.
        while (validCharacterPosition < currentCharacterPosition)
        {
            outputArray[validCharacterPosition++] = ' ';
        }

Having done so I can recompile and run the tests and they confirm that the code was indeed redundant.

Step Eight - Further refactoring

What we have so far is code that will implement two versions of Soundex whose only difference is the encoding of the letters.
If we want to implement NYSIIS, Daitch-Mokotoff, Metaphone and Caverphone algorithms then at this stage will still intend to iterate through the inputArray but the body the actual processing and writing out to the outputArray are going to be different.  For this reason we are going to take the processing block of code out of the main loop
while (currentCharacterPosition < inputArray.Length)
{
  ...
}
This will be put in a separate private method that is called from within the While Loop.
Ideally we would like our currentPosition and validPosition to be properties of a class so their values can be reused by the new private method however our LongPhonetic() function has to be a static function so this is not possible.  We will have to pass our currentPosition and validPosition to the method.  In fact we are going to have to pass the following values by reference to our function:
  • validCharacterPosition
  • currentCharacterPosition
  • inputArray
  • outputArray
  • phoneticAlgoritm
As our validCharacterPosition and currentCharacterPosition are primitive types we have to be explicit in passing these by reference.  The remaining arguments are reference types in any case.
By passing validCharacterPosition and currentCharacterPosition by reference it means that any alterations made to these variables by our private method will be made to the source variables and not a copied value limited to the scope of the method.  Our while loop simplifies down to the one shown below.
while (currentCharacterPosition < inputArray.Length)
{
  SimpleSoundex(ref validCharacterPosition,ref currentCharacterPosition,inputArray,outputArray,phoneticAlgorithm);
  currentCharacterPosition++;
}
The old body of the while loop finds its way into the new SimpleSoundex method.
 
private static void SimpleSoundex(ref int validCharacterPosition, ref int currentCharacterPosition, char[] inputArray, char[] outputArray,PhoneticAlgorithm phoneticAlgorithm)
    {
        // If start of word
        if (outputArray[validCharacterPosition] == ' ')
        {
           ......
        }
    }
 

Again, recompling and running the tests should confirm if this has been done correctly.

Achievements so far

It may seem that we have made a lot of changes so far to achieve very little.  We have actually achieved more than would appear at first glance.  

  • Addressed some technical debt
  • Refactored our code to split the inputArray from the outputArray and used existing tests to show the refactoring has been successful
  • Refactored our code and tests to accept an additional argument and act on it.  The tests show that the code produces the desired results
  • Refactored our code to lay the foundation stones to be able to plug in further phonetic algorithms.

In short we have taken a series of small steps to get to the point where we have a working function in a fit state for deployment that implements two phonetic algorithms. 

New algorithms, new Challenges 

One challenge we have to face when implementing the more complex phonetic algorithms is that we have to consider more character relationships.

  • We need to be capable of matching a contiguous occurrence
  • We need to determine the location of the occurrence (beginning of word, end of word, after vowel, relative to other combinations
  • We need to be capable of deciding appropriate encoding for that occurence
  • We need to consider cases such as the letter 'X' which encodes to two characters

Problems with the current approach

Looking at the code I have written it is becoming apparent that the approach I have used up to now is not suitable to meet the challenges for the more complex phonetic algorithms.

It is clear that a number of functions and methods are needed.  As the functions are static methods of a static class we cannot maintain state for our validCharacterPosition and currentCharacterPosition variables.  Although we are addressing this by passing the variables between the various functions by reference it will make messy and convoluted code if we continue to do so.

Clearly a more radical rethink is needed and that will be a subject for the next article.

Rate

4.67 (6)

You rated this post out of 5. Change rating

Share

Share

Rate

4.67 (6)

You rated this post out of 5. Change rating