Soundex - Experiments with SQLCLR Part 3

  • David.Poole

    SSC Guru

    Points: 75191

    Sergiy - Sunday, November 25, 2018 6:16 PM

    Must be a fascinating technical thing which is absolutely useless for actual users.
    May be even worse - damaging for real life applications.

    Soundex was invented as a manual substitution code to help office clerks find names on a Rolodex card system.  It predates electronic computers by a good few decades.  As you say, it is extremely vague and as I said in the original articles it produces around 8000 combinations to represent the millions of English words.

    Most of the algorithms have something to do with finding the names of people so their scope is limited.  Lawrence Philips metaphone algorithms go beyond simple name matching.  Metaphone3 is a commercial algorithm which significantly improves match rates while diminishing false positives.

    Not sure who you consider to be "users" so the only comment I can offer as to its "absolutely useless" nature is that it improves the performance of a search facility in connecting people to products for which they wish to exchange money.  People don't buy what they can't find,

    No idea what you mean by "damaging for real life applications".

  • Sergiy

    SSC Guru

    Points: 109670

    simon.hewitt - Monday, November 26, 2018 4:24 AM

    Sergiy:
    Most of David's code is general C# so can be used at the front end. The code to use it as back-end is only a few additional lines.

    For the problem you describe, no algorithm will be able to do what you want I suspect.
    My solution (again for front end) is simple: I create an alias list containing the 'real' or 'known' text and a list of known aliases.

    Yep, that's a pretty obvious solution, and, of course, I used this approach back then.
    There was also a protocol for collecting "unrecognized" variations within data uploads and inform users about new names to be allocated.

    As for front/back end - this solution (I mean the approach, not the implementation) is to unreliable to be used anywhere
    Just not good enough.
    It must contain way more intelligence to be trusted.
    It must have a way to figure out that "Lond" must be rather mistyped "Pond" (letters .P and L are pretty close) than "Lord" or "London"or "Long" which are way more closer by the sound of it.

  • Sergiy

    SSC Guru

    Points: 109670

    David.Poole - Monday, November 26, 2018 6:30 AM

    Not sure who you consider to be "users" so the only comment I can offer as to its "absolutely useless" nature is that it improves the performance of a search facility in connecting people to products for which they wish to exchange money.  People don't buy what they can't find,

    No idea what you mean by "damaging for real life applications".

    I don't think it's actually used in popular search engines.
    Type the sequence of symbols "cthdth" into Google Search inbox - it will return you articles about "Ñервер", which is Russian for "server", because that's how English/Russian keyboards are mapped. And there would not be any suggestion based on SOUNDEX mapping.

    Using it may be damaging, if placed in some automated interface.
    And because you posted it on SQL Server site it's natural to assume the method to be used in a back end, within some automated process.
    I saw the consequences of mistakes made by this kind of methods, so I can tell you - it may be very much damaging.

  • David.Poole

    SSC Guru

    Points: 75191

    Sergiy - Monday, November 26, 2018 7:05 AM

    Type the sequence of symbols "cthdth" into Google Search inbox - it will return you articles about "Ñервер", which is Russian for "server", because that's how English/Russian keyboards are mapped. And there would not be any suggestion based on SOUNDEX mapping.

    Using it may be damaging, if placed in some automated interface.
    And because you posted it on SQL Server site it's natural to assume the method to be used in a back end, within some automated process.
    I saw the consequences of mistakes made by this kind of methods, so I can tell you - it may be very much damaging.

    Can you give an example of the automated interface you have in mind and what the scenario was where the consequences were damaging?  Context is all.

    ElasticSearch and Solr both give options to use a range of phonetic algorithms.

  • Sergiy

    SSC Guru

    Points: 109670

    I gave an example in the post above.

    Recognising buyer names in suppliers' data feeds.

    If done wrong - invoices generated for wrong customers, processed according to inappropriate rules in a wrong context, payments done by wrong organisations to wrong organisations.

    Considering - big organisations (like universities) pay only once a month, - the financial penalties (fees and interest) would be significant, in our case they could easily exceed the service fee they paid to us.

    And they would probably start looking for another service provider.

    Is it damaging enough?

  • David.Poole

    SSC Guru

    Points: 75191

    Sergiy - Monday, November 26, 2018 12:47 PM

    I gave an example in the post above.Recognising buyer names in suppliers' data feeds.If done wrong - invoices generated for wrong customers, processed according to inappropriate rules in a wrong context, payments done by wrong organisations to wrong organisations. Considering - big organisations (like universities) pay only once a month, - the financial penalties (fees and interest) would be significant, in our case they could easily exceed the service fee they paid to us.And they would probably start looking for another service provider.Is it damaging enough?

    Yes but I can't imagine anyone being daft enough to use it like that.  It's for searching for things like holidays in Fuerteventura or pre raphaelite wallpaper.

  • Sergiy

    SSC Guru

    Points: 109670

    David.Poole - Monday, November 26, 2018 2:15 PM

    Yes but I can't imagine anyone being daft enough to use it like that.  It's for searching for things like holidays in Fuerteventura or pre raphaelite wallpaper.

    OK, so the use cases are about searching for some name misheard on the radio, and it must be in relatively small database, as running SOUNDEX against any other word stored in a large database (I cannot see how any kind of indexing could apply here) would take forever.
    To me - it looks like extremely rare use case.

  • David.Poole

    SSC Guru

    Points: 75191

    Sergiy - Saturday, December 1, 2018 8:23 AM

    OK, so the use cases are about searching for some name misheard on the radio, and it must be in relatively small database, as running SOUNDEX against any other word stored in a large database (I cannot see how any kind of indexing could apply here) would take forever.
    To me - it looks like extremely rare use case.

    The algorithms are deterministic so values can be precomputed. its as fast as any other index on a text based term such as postcode

    Levenshtein, on the other hand is a bit of a sod.

  • Sergiy

    SSC Guru

    Points: 109670

    David.Poole - Saturday, December 1, 2018 3:08 PM

    Sergiy - Saturday, December 1, 2018 8:23 AM

    OK, so the use cases are about searching for some name misheard on the radio, and it must be in relatively small database, as running SOUNDEX against any other word stored in a large database (I cannot see how any kind of indexing could apply here) would take forever.
    To me - it looks like extremely rare use case.

    The algorithms are deterministic so values can be precomputed. its as fast as any other index on a text based term such as postcode

    Levenshtein, on the other hand is a bit of a sod.

    You know, you can easily decrypt MD5 hashes - by generating them for all possible password combinations.
    There are even hard drives for sale with tables matching words with hashes.
    Work is still in progress - the servers are still to go through all word variations.

  • David.Poole

    SSC Guru

    Points: 75191

    Sergiy - Saturday, December 1, 2018 7:13 PM

    You know, you can easily decrypt MD5 hashes - by generating them for all possible password combinations.
    There are even hard drives for sale with tables matching words with hashes.
    Work is still in progress - the servers are still to go through all word variations.

    Yes, there's a YouTube video somewhere about someone using GPUs to brute force passwords. The conclusion from that is that anything that relies purely on an encrypted password is dead in the water.  Hackers just need to win once, security guys need to win every time.  This is made even more difficult when the vulnerability might be in a connected system rather than the one you have control over.
    Then there are those who persist in building their own encryption and then use it to obfuscate production data. I'm glad i'm not the compliance officer fighting that battle

  • Sergiy

    SSC Guru

    Points: 109670

    David.Poole - Sunday, December 2, 2018 4:44 AM

    Sergiy - Saturday, December 1, 2018 7:13 PM

    You know, you can easily decrypt MD5 hashes - by generating them for all possible password combinations.
    There are even hard drives for sale with tables matching words with hashes.
    Work is still in progress - the servers are still to go through all word variations.

    Yes, there's a YouTube video somewhere about someone using GPUs to brute force passwords. The conclusion from that is that anything that relies purely on an encrypted password is dead in the water.  Hackers just need to win once, security guys need to win every time.  This is made even more difficult when the vulnerability might be in a connected system rather than the one you have control over.
    Then there are those who persist in building their own encryption and then use it to obfuscate production data. I'm glad i'm not the compliance officer fighting that battle

    My point is - those tables of hashes being sold and delivered on physical hard drives, last time I checked it was 3 or 4 TB of data. Just for the table of 2 columns with certain and simple correspondence between the values.
    For SOUNDEX you'll have multiple matches, with different levels of similarity. so the database will be more complicated and bigger in size.
    I don't think indexing such a  massive database is a winning game.

  • David.Poole

    SSC Guru

    Points: 75191

    Wait until i write up the experiment for which I went to all this trouble to build these functions.  All will become clear

Viewing 12 posts - 16 through 27 (of 27 total)

You must be logged in to reply to this topic. Login to reply