Soundex - Experiments with SQL CLR

  • Comments posted to this topic are about the item Soundex - Experiments with SQL CLR

  • I feel like a bit of a troll for this -- I don't mean to be -- but...

    I can never truly put my finger on the exact problems I have with the TDD approach, but something always feels off about it, whenever I read an article such as yours (well-written, though it is).

    Even in this small, demonstrative function, you have ultimately written an significant quantity of non-functioning code and, in all fairness, what have you actually got to show for it? A handful of boolean checks that only really test for the specific microscopic input/output validations that you were able to think of prior to coding.

    Very little actual 'end-user' functionality has been provided, for possibly 3x the effort. It looks good on your lines-per-day KPI (I feel for you if you have one of these), but most of those are not furthering the project. If you've been told to implement a soundex function in C# and, at the end of the day, you have 20 failing tests and a method stub...well, that's not going to hold you in high stead.

    You touch upon one thing quite nicely when you suggest that many people might find it 'unnatural'. It is jarring to jump back and forth between the method that you are trying to implement and the tests that you write (even when reading the article). I would rather concentrate on implementing the method, then concentrate on confirming that it is correct.

    Not to mention seeing everything fail all the time. I certainly know that I don't like writing things that I know will fail. I know (and work with) many other developers that feel the same. Some of them (I try very hard not to lump myself in the same category but I expect, in some cases, I am included) will start writing their code purely to pass the tests, rather than really understanding why that test is there (I know, bad developers do bad development, move along, nothing to see here!). It can be particularly insidious, especially when [font="Courier New"]Assert.AreEqual('a', func('b'));[/font] is met with [font="Courier New"]if (input == 'b') return 'a';[/font], just so that red X goes away.

    (That said, countering the possibility of a null or empty input parameter with two tests and a substitution that still goes through the loop, rather than a basic [font="Courier New"]if (string.IsNullOrEmpty(input)) return string.Empty;[/font] -- or similar -- at the top of the method further cements the inefficiency of the approach in my mind!)

    I should point out that I'm not opposed to unit tests, automated or otherwise. I use them often. They are, as you say, particularly helpful in protecting existing code from future changes (although, not so much when requirements changes result in having to change both the functional code and the test cases). I just think that 'writing-first' and 'writing-to-fail' are not always the best approach.

    In this particular case, I would be more inclined to have a single test method comprising a number of input/output assertion pairs (including null and empty inputs) -- and I'd write it immediately after the method that actually paid the bills. Not strictly in the spirit of the whole unit-test 'movement', but functional enough to provide a smoke-test for future changes, separate enough to not break concentration during the crux of the task, and typically quick enough not to irk the guy paying my wages!


    PS: Oh, and, by the way...

    David.Poole (9/5/2013)

    - An IsLetter() function is generically useful and would probably sit in a centralised maintained code library rather than in an individual solution

    - This implementation simply checks for capital letters as we know our inputString is being forced to upper case. A more robust implementation would also check for lower case letters.

    See Char.IsLetter -- a generically useful, centrally maintained library method, robust enough to check for lower-case letters. 😉

  • jimbobmcgee (9/12/2013)

    See Char.IsLetter -- a generically useful, centrally maintained library method, robust enough to check for lower-case letters. 😉

    Oh bugger, you can tell I don't do much .Net work!

    This was probably the least troll like and most constructive criticism I have ever received. I agree with much of what you are saying. To be honest what I've done feels like a form of automated walk through break points.

    I am not 100% clear on the difference between TDD and unit testing. I think TDD is simply writing a unit test (too early:hehe:) before coding where as it is perfectly possible to write it afterwards.

    Some developers I know argue like cat, dog and ferret over the merit of BDD vs TDD vs building tests after coding. I don't think there is any clear winner. Frankly, I come down on the side that gets stuff done which ever one that might be.

    For me the trick with TDD is coming up with a test that genuinely does add value and does drive the development forward. It's a tool that helped me by helping me think the problem through. I'm not going to evangelise it because I know not everyone gets the same benefit.

  • David, nice walk through. I've never fully adopted test first, but I don't code every day either, perhaps making it harder to build a new habit. I'm more likely to write unit tests once I've evolved the code a bit, I find (and again, I don't code every day) that as I work the problem the packaging evolves - maybe I split or combine a method, or move something to a shared library, etc, and having too many tests too early just makes me reluctant to tear things up. Maybe a full adoption of test first would remove some of that, make me think more? Dont know.

    Jim, nice tip on IsLetter - I didn't know is there either!

  • I manage a largish custom corporate CRM system, built specifically for our needs.

    We used to use SOUNDEX for the first few years, as it was what SQL Server offered for name searching. But, as the contact database grew, and the call centers moved offshore, we needed a batter name matching systems. In the end we have implemented two CLR functions

    * NYSIIS - which is an encoder similar to SOUNDEX, but more modern, and does a better job of the sound of names

    * LevenshteinDistance - which compares who strings, and gives you the number of differences (missing chars, added chars, transposed chars etc)

    For our contacts database, we cache the NYSIIS values for last_name and first_name + ' ' + last_name. When we search for a name, we do the NYSIIS function on the search expression, we do a direct compare from the cached values to the search NYSIIS result, and return the results. So users can search for last_name, last_name, first name or first_name last_name.

    Note. NYSIIS arranged each encoded word alphabetically after doing the algorithm.

    We then default sort the results by the LevenshteinDistance of the real names, theory being the user spelt it right, so the name is at the top, but the names get progressively different down the list.

    We are transitioning to EF5 at the moment, and all this worked seamlessly as a EF function, which was nice. performance is as good as soundex - as we are cached, but the results are far superior.

    There are other algorithms out there, some much better ones if you want to match Dutch names, but NYSIIS proved ideal for our set of names.

  • Take a look at

    The Rolls Royce solution is Metaphone3 which I believe is in the latest Apache SOLR.

    I had planned a followup article based on the link above. Ultimately the reason I looked at these functions was to experiment with beefing up the search capabiliity of SQL Server.

    Yes, I could export the data to Apache SOLR but there are various reasons why I wouldn't do this.

  • I am interested in Soundex only as a curiosity so I read this article mostly for entertainment. That said, I think the title might have reached a broader audience as "Test twice, code once" because it's a great example of the process. I appreciate the revealed 'oops' that you didn't redact for the sake of appearing 'expert' on the subject.

    I don't have much more to add on the Soundex discussion though - just wanted to say thanks for this article and encourage you to write (and release) another soon. 🙂

  • A few years ago, I worked extensively with Double Metaphone code in straight SQL - it provided very good results for U.S. names and addresses[/url].

    SOUNDEX, however, I rate as of limited value, only useful if you're at a stage where you're willing to accept many false positives.

    select Soundex('LLoyds'),DoubleMetaPhone('LLoyds') -- returns L432, LTS LTS

    select Soundex('Loyds'),DoubleMetaPhone('Loyds') -- returns L320, LTS LTS

    select Soundex('Brighton'),DoubleMetaPhone('Brighton') -- returns B623, PRTN PRTN

    select Soundex('Bristol'),DoubleMetaPhone('Bristol') -- returns B623, PRSTL PRSTL

    select Soundex('Bob'),DoubleMetaPhone('Bob') -- returns B100, PP PP

    select Soundex('Bob Harvey'),DoubleMetaPhone('Bob Harvey') -- returns B100, PPRF PPRF

    select Soundex('Shwardseneger'),generic.dbo.gefn_FUZZY_DoubleMetaPhone('Shwardseneger') -- returns S632, XRTSNJR XRTSNKR

    select Soundex('Schwarzenegger'),generic.dbo.gefn_FUZZY_DoubleMetaPhone('Schwarzenegger') -- returns S262, XRSNKR XRTSNKR

    -- Note the match on the second metaphone interpretation.

    P.S. I also started looking at Jaro Winkler for distance matching, i.e. typing errors - again, a straight SQL implementation is available[/url]

  • Don't put Soundex inside the difference function as Soundex comparison is implicit in the function.

    select Difference(Soundex('Brighton'),Soundex('Bob')) -- Returns 4

    select Difference('Brighton','Bob') --Returns 1

  • David.Poole (9/12/2013)

    Don't put Soundex inside the difference function as Soundex comparison is implicit in the function.

    select Difference(Soundex('Brighton'),Soundex('Bob')) -- Returns 4

    select Difference('Brighton','Bob') --Returns 1

    My mistake! Editing post.

  • Interesting article although it dwells more on C# than SQL. I too find Agile and Extreme Programming books fascinating but never found a way to make them useful in programming T-SQL.

    Application and database programming are too different to treat the same and use XP techniques. Applications are snapshots of exe bytes. Developers only care about the final compile before release.

    Database upgrades contain the entire evolution of changes between releases. The overwriting of new and improved stored procedures is the tiny part of the upgrade; the big part of upgrading is bringing the data up to the new schemata without loss or corruption.

    Applications have objects and methods that have loose coupling and tightly defined interfaces. Stored procedures have parameters (interfaces) but they also access data directly (tight coupling to all tables and columns in a database) without going through a controlled interface. Because of this, I've always found it impossible to write test suites for sql.

  • The hardest part of TDD for me was the change in mind set I needed to make. Couple that to the difference between app development and db development and it was a stretch.

    One of our teams managed to put together a test framework that straddles ssis, app and db. It is a work of art.

    At one point I couldn't see how agile would work with the database world. Ditto continuous integration. Now both are pretty much so-what items.

    I think Red-gate sql test is very promising. I think the barrier is where dba and developer are very seperate roles.

  • Nice article David.

    Pardon my getting off topic a wee bit but since I see the article hits CLR procedures, test driven development, and SOUNDEX it's just one more pellet in a shot gun blast.

    I wonder about adding a Soundex or MetaPhone column to my structure. I know that CLR code runs bloody fast but putting these functions in predicates would make for table scans, right? So if I had a column to store the soundex I could index that and speed up searches.

    ATBCharles Kincaid

  • @Charles, exactly so.

    One of the implementations of Soundex stored the soundex value of a holiday destinations in an indexed field.

    When the customer entered a location the system would take the soundex value of their entry and look up the stored soundex value. This was more likely to retrieve a value or values.

    The app itself would then prioritise an entry that was an exact match if one existed.

    Where there was no match, soundex or exact then the mismatch was stored in a "MissedSearchTerms" table to be analysed later.

    In some cases the search term revealled a gap in our product offering, in others a spelling mistake that Soundex could not handle.

    What we found was that we had to build a thesaurus of place names over and above the soundex keys.

Viewing 14 posts - 1 through 14 (of 14 total)

You must be logged in to reply to this topic. Login to reply