Click here to monitor SSC
SQLServerCentral is supported by Red Gate Software Ltd.
 
Log in  ::  Register  ::  Not logged in
 
 
 
        
Home       Members    Calendar    Who's On


Add to briefcase 12»»

Soundex - Experiments with SQL CLR Expand / Collapse
Author
Message
Posted Thursday, September 5, 2013 3:07 PM
SSCrazy

SSCrazySSCrazySSCrazySSCrazySSCrazySSCrazySSCrazySSCrazy

Group: General Forum Members
Last Login: Today @ 3:07 AM
Points: 2,916, Visits: 1,854
Comments posted to this topic are about the item Soundex - Experiments with SQL CLR

LinkedIn Profile
Newbie on www.simple-talk.com
Post #1491963
Posted Thursday, September 12, 2013 6:15 AM
Valued Member

Valued MemberValued MemberValued MemberValued MemberValued MemberValued MemberValued MemberValued Member

Group: General Forum Members
Last Login: Today @ 2:43 PM
Points: 64, Visits: 801
I feel like a bit of a troll for this -- I don't mean to be -- but...

I can never truly put my finger on the exact problems I have with the TDD approach, but something always feels off about it, whenever I read an article such as yours (well-written, though it is).

Even in this small, demonstrative function, you have ultimately written an significant quantity of non-functioning code and, in all fairness, what have you actually got to show for it? A handful of boolean checks that only really test for the specific microscopic input/output validations that you were able to think of prior to coding.

Very little actual 'end-user' functionality has been provided, for possibly 3x the effort. It looks good on your lines-per-day KPI (I feel for you if you have one of these), but most of those are not furthering the project. If you've been told to implement a soundex function in C# and, at the end of the day, you have 20 failing tests and a method stub...well, that's not going to hold you in high stead.

You touch upon one thing quite nicely when you suggest that many people might find it 'unnatural'. It is jarring to jump back and forth between the method that you are trying to implement and the tests that you write (even when reading the article). I would rather concentrate on implementing the method, then concentrate on confirming that it is correct.

Not to mention seeing everything fail all the time. I certainly know that I don't like writing things that I know will fail. I know (and work with) many other developers that feel the same. Some of them (I try very hard not to lump myself in the same category but I expect, in some cases, I am included) will start writing their code purely to pass the tests, rather than really understanding why that test is there (I know, bad developers do bad development, move along, nothing to see here!). It can be particularly insidious, especially when Assert.AreEqual('a', func('b')); is met with if (input == 'b') return 'a';, just so that red X goes away.

(That said, countering the possibility of a null or empty input parameter with two tests and a substitution that still goes through the loop, rather than a basic if (string.IsNullOrEmpty(input)) return string.Empty; -- or similar -- at the top of the method further cements the inefficiency of the approach in my mind!)

I should point out that I'm not opposed to unit tests, automated or otherwise. I use them often. They are, as you say, particularly helpful in protecting existing code from future changes (although, not so much when requirements changes result in having to change both the functional code and the test cases). I just think that 'writing-first' and 'writing-to-fail' are not always the best approach.

In this particular case, I would be more inclined to have a single test method comprising a number of input/output assertion pairs (including null and empty inputs) -- and I'd write it immediately after the method that actually paid the bills. Not strictly in the spirit of the whole unit-test 'movement', but functional enough to provide a smoke-test for future changes, separate enough to not break concentration during the crux of the task, and typically quick enough not to irk the guy paying my wages!

J.

PS: Oh, and, by the way...

David.Poole (9/5/2013)

- An IsLetter() function is generically useful and would probably sit in a centralised maintained code library rather than in an individual solution
- This implementation simply checks for capital letters as we know our inputString is being forced to upper case. A more robust implementation would also check for lower case letters.


See Char.IsLetter -- a generically useful, centrally maintained library method, robust enough to check for lower-case letters.
Post #1494123
Posted Thursday, September 12, 2013 6:56 AM
SSCrazy

SSCrazySSCrazySSCrazySSCrazySSCrazySSCrazySSCrazySSCrazy

Group: General Forum Members
Last Login: Today @ 3:07 AM
Points: 2,916, Visits: 1,854
jimbobmcgee (9/12/2013)
See Char.IsLetter -- a generically useful, centrally maintained library method, robust enough to check for lower-case letters.


Oh bugger, you can tell I don't do much .Net work!



This was probably the least troll like and most constructive criticism I have ever received. I agree with much of what you are saying. To be honest what I've done feels like a form of automated walk through break points.

I am not 100% clear on the difference between TDD and unit testing. I think TDD is simply writing a unit test (too early) before coding where as it is perfectly possible to write it afterwards.

Some developers I know argue like cat, dog and ferret over the merit of BDD vs TDD vs building tests after coding. I don't think there is any clear winner. Frankly, I come down on the side that gets stuff done which ever one that might be.

For me the trick with TDD is coming up with a test that genuinely does add value and does drive the development forward. It's a tool that helped me by helping me think the problem through. I'm not going to evangelise it because I know not everyone gets the same benefit.





LinkedIn Profile
Newbie on www.simple-talk.com
Post #1494143
Posted Thursday, September 12, 2013 7:02 AM
SSCertifiable

SSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiable

Group: Moderators
Last Login: 2 days ago @ 9:27 AM
Points: 6,800, Visits: 1,917
David, nice walk through. I've never fully adopted test first, but I don't code every day either, perhaps making it harder to build a new habit. I'm more likely to write unit tests once I've evolved the code a bit, I find (and again, I don't code every day) that as I work the problem the packaging evolves - maybe I split or combine a method, or move something to a shared library, etc, and having too many tests too early just makes me reluctant to tear things up. Maybe a full adoption of test first would remove some of that, make me think more? Dont know.

Jim, nice tip on IsLetter - I didn't know is there either!


Andy
SQLAndy - My Blog!
Connect with me on LinkedIn
Follow me on Twitter
Post #1494150
Posted Thursday, September 12, 2013 7:03 AM
SSC Journeyman

SSC JourneymanSSC JourneymanSSC JourneymanSSC JourneymanSSC JourneymanSSC JourneymanSSC JourneymanSSC Journeyman

Group: General Forum Members
Last Login: Wednesday, July 16, 2014 7:37 AM
Points: 83, Visits: 50
I manage a largish custom corporate CRM system, built specifically for our needs.

We used to use SOUNDEX for the first few years, as it was what SQL Server offered for name searching. But, as the contact database grew, and the call centers moved offshore, we needed a batter name matching systems. In the end we have implemented two CLR functions

* NYSIIS - which is an encoder similar to SOUNDEX, but more modern, and does a better job of the sound of names
* LevenshteinDistance - which compares who strings, and gives you the number of differences (missing chars, added chars, transposed chars etc)

For our contacts database, we cache the NYSIIS values for last_name and first_name + ' ' + last_name. When we search for a name, we do the NYSIIS function on the search expression, we do a direct compare from the cached values to the search NYSIIS result, and return the results. So users can search for last_name, last_name, first name or first_name last_name.

Note. NYSIIS arranged each encoded word alphabetically after doing the algorithm.

We then default sort the results by the LevenshteinDistance of the real names, theory being the user spelt it right, so the name is at the top, but the names get progressively different down the list.

We are transitioning to EF5 at the moment, and all this worked seamlessly as a EF function, which was nice. performance is as good as soundex - as we are cached, but the results are far superior.

There are other algorithms out there, some much better ones if you want to match Dutch names, but NYSIIS proved ideal for our set of names.
Post #1494151
Posted Thursday, September 12, 2013 8:31 AM
SSCrazy

SSCrazySSCrazySSCrazySSCrazySSCrazySSCrazySSCrazySSCrazy

Group: General Forum Members
Last Login: Today @ 3:07 AM
Points: 2,916, Visits: 1,854
Take a look at http://ntz-develop.blogspot.co.uk/2011/03/phonetic-algorithms.html

The Rolls Royce solution is Metaphone3 which I believe is in the latest Apache SOLR.

I had planned a followup article based on the link above. Ultimately the reason I looked at these functions was to experiment with beefing up the search capabiliity of SQL Server.

Yes, I could export the data to Apache SOLR but there are various reasons why I wouldn't do this.


LinkedIn Profile
Newbie on www.simple-talk.com
Post #1494215
Posted Thursday, September 12, 2013 9:13 AM
SSC Veteran

SSC VeteranSSC VeteranSSC VeteranSSC VeteranSSC VeteranSSC VeteranSSC VeteranSSC Veteran

Group: General Forum Members
Last Login: Monday, November 10, 2014 6:19 AM
Points: 262, Visits: 919
I am interested in Soundex only as a curiosity so I read this article mostly for entertainment. That said, I think the title might have reached a broader audience as "Test twice, code once" because it's a great example of the process. I appreciate the revealed 'oops' that you didn't redact for the sake of appearing 'expert' on the subject.

I don't have much more to add on the Soundex discussion though - just wanted to say thanks for this article and encourage you to write (and release) another soon. :)
Post #1494239
Posted Thursday, September 12, 2013 9:33 AM
SSC Eights!

SSC Eights!SSC Eights!SSC Eights!SSC Eights!SSC Eights!SSC Eights!SSC Eights!SSC Eights!

Group: General Forum Members
Last Login: 2 days ago @ 2:10 PM
Points: 889, Visits: 2,460
A few years ago, I worked extensively with Double Metaphone code in straight SQL - it provided very good results for U.S. names and addresses http://www.sqlservercentral.com/scripts/Miscellaneous/30219/.

SOUNDEX, however, I rate as of limited value, only useful if you're at a stage where you're willing to accept many false positives.


select Soundex('LLoyds'),DoubleMetaPhone('LLoyds') -- returns L432, LTS LTS
select Soundex('Loyds'),DoubleMetaPhone('Loyds') -- returns L320, LTS LTS
select Soundex('Brighton'),DoubleMetaPhone('Brighton') -- returns B623, PRTN PRTN
select Soundex('Bristol'),DoubleMetaPhone('Bristol') -- returns B623, PRSTL PRSTL
select Soundex('Bob'),DoubleMetaPhone('Bob') -- returns B100, PP PP
select Soundex('Bob Harvey'),DoubleMetaPhone('Bob Harvey') -- returns B100, PPRF PPRF
select Soundex('Shwardseneger'),generic.dbo.gefn_FUZZY_DoubleMetaPhone('Shwardseneger') -- returns S632, XRTSNJR XRTSNKR
select Soundex('Schwarzenegger'),generic.dbo.gefn_FUZZY_DoubleMetaPhone('Schwarzenegger') -- returns S262, XRSNKR XRTSNKR
-- Note the match on the second metaphone interpretation.


P.S. I also started looking at Jaro Winkler for distance matching, i.e. typing errors - again, a straight SQL implementation is available http://www.sqlservercentral.com/articles/Fuzzy+Match/65702/
Post #1494248
Posted Thursday, September 12, 2013 9:53 AM
SSCrazy

SSCrazySSCrazySSCrazySSCrazySSCrazySSCrazySSCrazySSCrazy

Group: General Forum Members
Last Login: Today @ 3:07 AM
Points: 2,916, Visits: 1,854
Don't put Soundex inside the difference function as Soundex comparison is implicit in the function.

select Difference(Soundex('Brighton'),Soundex('Bob'))  -- Returns 4
select Difference('Brighton','Bob') --Returns 1



LinkedIn Profile
Newbie on www.simple-talk.com
Post #1494260
Posted Thursday, September 12, 2013 10:22 AM
SSC Eights!

SSC Eights!SSC Eights!SSC Eights!SSC Eights!SSC Eights!SSC Eights!SSC Eights!SSC Eights!

Group: General Forum Members
Last Login: 2 days ago @ 2:10 PM
Points: 889, Visits: 2,460
David.Poole (9/12/2013)
Don't put Soundex inside the difference function as Soundex comparison is implicit in the function.

select Difference(Soundex('Brighton'),Soundex('Bob'))  -- Returns 4
select Difference('Brighton','Bob') --Returns 1



My mistake! Editing post.
Post #1494277
« Prev Topic | Next Topic »

Add to briefcase 12»»

Permissions Expand / Collapse