Soundex - Experiments with SQLCLR Part 3

Question

Post reply

Soundex - Experiments with SQLCLR Part 3

David.Poole

SSC Guru

Points: 76419
More actions
August 27, 2015 at 12:13 am

#99010

Comments posted to this topic are about the item Soundex - Experiments with SQLCLR Part 3
LinkedIn Profile

Viewing 15 posts - 1 through 15 (of 27 total)

You must be logged in to reply to this topic. Login to reply

akljfhnlaflkj SSC Guru Points: 76202 More actions · Answer 1

Thanks for the great article on phonetics. It's one that has to be reread.

marc 28082 SSC Enthusiast Points: 153 More actions · Answer 2

Thanks for the fine series of articles.

I don't believe Array.Clear() is doing what you expect it to do. The MS documentation says it sets a range of array elements to a default value. I don't think it will delete the array itself (I'm not sure if there even is a way to do that in C#). The GC generally does a very good job managing such things.

I was wondering, is the use of the IDisposable interface solely because of the Array.Clear(), or was there another reason?

David.Poole SSC Guru Points: 76419 More actions · Answer 3

I don't believe Array.Clear() is doing what you expect it to do. The MS documentation says it sets a range of array elements to a default value. I don't think it will delete the array itself (I'm not sure if there even is a way to do that in C#). The GC generally does a very good job managing such things.
I was wondering, is the use of the IDisposable interface solely because of the Array.Clear(), or was there another reason?

You can put down used Array.Clear and IDisposable as my obsolescence and of ignorance of the current state of play with the .NET framework and garbage collection.

The epoch where I learnt to program you had to be careful to clear up resources after you and even though garbage collection existed it got a bit flustered when it came to identifying when larger objects were out of scope and therefore disposing of them. Old objects used to hang around a lot longer than you'd expect leading to memory pressures.

LinkedIn Profile

Andy Warren SSC Guru Points: 119922 More actions · Answer 4

David, I appreciated the candid approach you used in discussing your mistakes. That's how most of us work/learn and it's good to reinforce it, and I liked that showed how hard it can be to do it right even when you're trying to do it right!

Andy
Connect with me on LinkedIn

Solomon Rutzky SSCoach Points: 16431 More actions · Answer 5

David.Poole (8/27/2015)
I don't believe Array.Clear() is doing what you expect it to do. The MS documentation says it sets a range of array elements to a default value. I don't think it will delete the array itself (I'm not sure if there even is a way to do that in C#). The GC generally does a very good job managing such things.
I was wondering, is the use of the IDisposable interface solely because of the Array.Clear(), or was there another reason?
You can put down used Array.Clear and IDisposable as my obsolescence and of ignorance of the current state of play with the .NET framework and garbage collection.
The epoch where I learnt to program you had to be careful to clear up resources after you and even though garbage collection existed it got a bit flustered when it came to identifying when larger objects were out of scope and therefore disposing of them. Old objects used to hang around a lot longer than you'd expect leading to memory pressures.

Excellent article :-). Just a few notes to clarify some items, starting with the Array / Dispose discussion quoted above:

It probably isn't necessary to implement your own Dispose() method, though likely also no harm done. But I agree with marc about Array.Clear() probably not truly getting rid of the object. I think _inputArray = null; would do what you were hoping for.
Regarding marc's comment about "The GC generally does a very good job managing such things": true that the .NET GC does a good job managing it, but also keep in mind that the Garbage Collector running in SQL Server's CLR host doesn't run as often.
Regarding the chart of "SQLCLR function attributes": IsPrecise is also required to be true in order to be indexed.
Regarding the "SQL Types" information:
* first, thanks for mentioning this stuff :-).
* The SqlString type does carry with it the LCID and SqlCompareOptions (i.e. case sensitive, accent sensitive, etc), but the actual string value doesn't have a collation. However, there are certain methods within the SqlString struct that will use the LCID and/or SqlCompareOptions, such as CompareTo, GetNonUnicodeBytes, CompareInfo (translated from SqlCompareOptions), CultureInfo (created from the LCID), and maybe something else.
* The LCID / SqlCompareOptions, and hence the derived CultureInfo / CompareInfo, reflect the collation of the database in which the Assembly resides. This is rather unfortunate as it might be completely different than the collation of the field (if passing in a field from a table and not a variable or string literal).
* String comparisons in .NET are not always case sensitive. That is just the default for Unicode sorting. But using String.Compare you can pass in a StringComparison enum or CompareOptions enum to tailor the sorting, just like the CI / CS, AI / AS, WS, and KS options of the collations. Check out these two overloads of String.Compare: using StringComparison and using CultureInfo and CompareOptions
* Regarding the statement of "my function forces upper case thereby negating the SQL server collation issues": forcing upper case effects only a small portion of the collation. It effectively makes it "IgnoreCase" but doesn't alter any of the other aspects, such as Kana, Width, Accent, or Symbols. And that is something to consider, perhaps, for these types of phonetic comparisons: what aspects of a string really matter? Should "John" and "Jo-hn" be considered the same or different? What about "John" and "Jöhn"? I think I need to revisit this issue as well for the LevenshteinDistance functions in SQL# ;-).

Take care,

Solomon...

SQL# — https://SQLsharp.com/ ( SQLCLR library ofover 340 Functions and Procedures)
Sql Quantum Lift — https://SqlQuantumLift.com/ ( company )
Sql Quantum Leap — https://SqlQuantumLeap.com/ ( blog )
Info sites — Collations • Module Signing • SQLCLR

David.Poole SSC Guru Points: 76419 More actions · Answer 6

Thanks Solomon. The strange additions to the characters in the alphabet can be dealt with by a function called ASCII folding. In the Apache Commons library the Daitch Mokotoff function does have a substitution table to do this.

I did consider building this into the BasePhonetic class but decided against it because while researching it I found that the same character could fold to different characters depending on the language.

In addition I think ASCII folding is probably worthy of its own SQL CLR function.

LinkedIn Profile

Solomon Rutzky SSCoach Points: 16431 More actions · Answer 7

David.Poole (8/29/2015)
Thanks Solomon. The strange additions to the characters in the alphabet can be dealt with by a function called ASCII folding. In the Apache Commons library the Daitch Mokotoff function does have a substitution table to do this.
I did consider building this into the BasePhonetic class but decided against it because while researching it I found that the same character could fold to different characters depending on the language.
In addition I think ASCII folding is probably worthy of its own SQL CLR function.

Hi David. I have also been doing a bit of research lately on Unicode and encodings in general. I think you were right to not include ASCII folding into your algorithms because it appears to be based on a misunderstanding of how Unicode data works. In addition to variations between languages, there are also sequences of characters that can combine into a single visible form (i.e. "glyph"). Here is an example that I worked up for a presentation I am creating on this topic:

SELECT NCHAR(0x00FC) AS [ü],

NCHAR(0x0075) AS ,

NCHAR(0x0075) + NCHAR(0x0308) AS ,

LEN(NCHAR(0x0075) + NCHAR(0x0308)) AS [HowManyChars?]

-- üuu¨2

SELECT 1 WHERE NCHAR(0x00FC) COLLATE Latin1_General_100_CI_AS -- ü

= NCHAR(0x0075) COLLATE Latin1_General_100_CI_AS -- u

-- nothing due to being Accent Sensitive

SELECT 2 WHERE NCHAR(0x00FC) COLLATE Latin1_General_100_CI_AI -- ü

= NCHAR(0x0075) COLLATE Latin1_General_100_CI_AI -- u

-- 2 due to being Accent Insensitive

SELECT 3 WHERE NCHAR(0x00FC) COLLATE Latin1_General_100_CI_AS -- ü

= NCHAR(0x0075) + NCHAR(0x0308) COLLATE Latin1_General_100_CI_AS

-- u + combining diaeresis

-- 3, even being diff number of characters AND Accent Sensitive, due to combination

The follow page shows the official normalizations for various languages at Unicode.org: Normalization Charts

The Unicode algorithms for sorting and comparison are highly complex. However, and fortunately, we don't need ASCII folding because the Unicode algorithms are built into .NET, probably through accessing the Windows Collations. So if you use the CompareInfo class that I noted before, or String.Compare (passing in a CultureInfo), then you can do the same sorting that is being done in SQL Server (when using the XML and N-prefixed datatypes). Want to do Case Insensitive AND Accent Insensitive? Just pass in a CompareOption of:

IgnoreCase | IgnoreNonSpace

Just like the T-SQL tests shown above, doing the "IgnoreNonSpace" alone will equate 0x00FC and 0x0075. But not specifying that option will equate 0x00FC and 0x0075 + 0x0308.

Take care,

Solomon...

SQL# — https://SQLsharp.com/ ( SQLCLR library ofover 340 Functions and Procedures)
Sql Quantum Lift — https://SqlQuantumLift.com/ ( company )
Sql Quantum Leap — https://SqlQuantumLeap.com/ ( blog )
Info sites — Collations • Module Signing • SQLCLR

SumOfDavid Ten Centuries Points: 1357 More actions · Answer 8

David, I appreciate the candor and articulation of the process that you went through. I realize that writing/formatting/proofreading the article was probably just as time consuming as writing the code, so THANK YOU!

David S.

David

simon.hewitt SSC Rookie Points: 42 More actions · Answer 9

Hi David
I'm a bit late coming to this discussion - I'm not really a full on SQL developer - I mainly use SQL for storage - but I do like collecting 'libraries' in case I find a use for them later so I thought I'd have a look at this one.
I'm also not too familiar with phonetic algorithms though I remember the built-in basic Soundex from years ago.
So having looked at the code, I have some comments that aren't meant to flame but be genuine constructive criticism.
The first thing I noticed is that there is some over-engineering - partly due to you wanted all the algorithms in a single tree, partly due to wanting to use char[]s, and partly due to writing Tests that need internal access to the workings. The latter I feel is a case of the tail wagging the dog.
My view (based solely on the source code) is that you need an IPhoneticPhraseSource interface containing a single string CreatePhrase(string) method. That means that the SoundEx implementations and the others can be completely separated.
The next thing I noticed was the use of char[] for the input. String is a char[] so can be used directly which saves making a duplicate string. I also think that a StringBuilder for the output string would help though I've not got to refactoring that just yet.
There is also evidence of premature optimization. Implementing IDisposable is definitely unnecessary here - you don't need to release the resources here as they are all .Net and the garbage collector knows best for what and when.
char[][][] is not the best type to use for the substitution data IMHO. My refactoring uses a small immutable class with two string properties - one for Source and one for Replacement. The advantage is the the .NET compiler will reuse the same string whereas it won't for char[]s. So the generated code is smaller and as I already mentioned the strings are effectively usable as char[] anyway.
I also found it hard to read the code because of the many char[] array checking and accesses. My refactoring is adding protected base methods such as GetLastInputChar() and IsCurrentCharALetter() and wrapping up all access to the input and output allowing it to be changed later without touching the algorithm. Doing this also makes it more self-documenting.

Question I do have:
- I looked up Daitch-Mokotoff and it said that it returns multiple results for the same input string. Is this something that needs to be supported?
- Fixed length outputs. Daitch-Mokotoff always returns 6 characters and Sounds 4 characters, so why the need to create an output array the size of the input string?

David.Poole SSC Guru Points: 76419 More actions · Answer 10

simon.hewitt - Sunday, November 25, 2018 2:46 AM
Question I do have:
- I looked up Daitch-Mokotoff and it said that it returns multiple results for the same input string. Is this something that needs to be supported?
- Fixed length outputs. Daitch-Mokotoff always returns 6 characters and Sounds 4 characters, so why the need to create an output array the size of the input string?

Thanks for the feedback, I'd love to see your refactored code. Have you put it on Github? I'm self taught in C# and .NET so it is likely that the code is suboptimal.

To answer your questions, i avoided capping the string lengths because capping produces a lot of false positives.
Double metaphone also returns multiple values. Strictly speaking yes, any phonetic function returning multiple values should be catered for. I was unsure how to do this. Ultimately what I really meant to do was find ways to use phonetics with full-text search. To do this I had to go through the steps in these 3 articles first as a learning exercise

LinkedIn Profile

simon.hewitt SSC Rookie Points: 42 More actions · Answer 11

Still playing about with it but I've only looked at refactoring what is already there - ie single results - anything returning multiple results would have to return a string[] and those Phonetic classes would have to be rewritten from scratch to be able to produce multiple results from the same input string.
Having said that, what I have so far now uses a separate and independent Processor class with lots of simple properties and functions on it which each Phonetic class uses to do its stuff without even knowing whether a char[] or StringBuilder is being used. So it should be much write additional Phonetic classes or modify existing ones to produce multiple results later.

Sergiy SSC Guru Points: 110209 More actions · Answer 12

Must be a fascinating technical thing which is absolutely useless for actual users.
May be even worse - damaging for real life applications.

In no way I'm critisizing the quality of the technical solution in the article.
I suspect it must be good, may be even the word "brilliant" could be used here.
But it does not matter.
Because the whole idea of SOUNDEX is faulty, and the best way of implementing not to approach it at all.

It might be used and a good illustration of how very good engineers effectively destroyed Apple Inc.
They produced very good and effective solutions, but they've solved the problems they should not have even trying to solve.

I had a practical task of recognising Buyer Names in the data feeds coming from different suppliers.
Of course, every supplier spelled the names differently.
Amongst those names were these:

University of Auckland
Auckland University
Massey Auckland University
Auckland University of Technology
Unitec
Otago University
University of Otago
Otago Uni Auckland
UOA
u.o.a.
AUT
MasseyUni
Auckland Massey
UOO
OU
and some other variations.

My interface had to recornise unique buyer entities under those names and allocate incoices accordingly.
Can SOUNDEX identity how many unique buyer entities are listed above?
Would it be able to assign the names to those entities correctly?

Tools like SOUNDEX may have their place at the front end, at the data source, where they may suggest spelling corrrections, but even there the suggestions must be based more on keyboard layouts than on similarities of the words.
And no way they should be put into automatic mode. Because, like in the commercial offer written by my boss, it may one day replace misspelled "maturation" with "masturbation".
Therefore, it does not have a place in the back end.

_____________
Code for TallyGenerator

simon.hewitt SSC Rookie Points: 42 More actions · Answer 13

Sergiy:
Most of David's code is general C# so can be used at the front end. The code to use it as back-end is only a few additional lines.

For the problem you describe, no algorithm will be able to do what you want I suspect.
My solution (again for front end) is simple: I create an alias list containing the 'real' or 'known' text and a list of known aliases.
It is a simple text file with the 'required' text beginning at the start of new line with correct casing etc. and subsequent lines, indented by a tab, containing aliases (case is not important for these)...

A&C Black
->A & C Black
->A and C Black
->A. & C. Black
->A. C. Black
->A.& C. Black

BBC
->B. B. C.
->b.b.c.

(etc. '->' represents tab - this forum strips out leading whitespace!)

My AliasList class loads the file, ignoring comments and blank lines and creates a case-insensitive Dictionary lookup to translate aliases back to the required text. When I come across a new alias, I just add it to the text file.
The AliasList class is only 120 lines so I'll post it up here if you would find it useful and be able to make it work at the back end.

simon.hewitt SSC Rookie Points: 42 More actions · Answer 14

David:
I've run out of time for now but below is a sample of the direction I've gone.
Note that there is now no inheritance needed - all work is done in the Processor class.
Changes for next weekend:-
1) Add a property to Processor to say whether adjacent duplicates are allowed. If not, they can be skipped before they are added to the output. If that works I can remove the RemoveAdjacentDuplicates() after-processor (not shown here)
2) Change the Iterator to take a bool or enum return indicating whether to MoveNext() or CopyAndMoveNext() or neither (because a replacement has already moved to the next InputChar).
3) Change the method names - there are no char[] used now.

public class ColognePhonetic: IPhoneticProcessor { #region Substitutions static readonly Substitution[] WordMiddleSubstitutions = { new Substitution("CA", "4"), new Substitution("CH", "4"), new Substitution("CK", "4"), new Substitution("CO", "4"), new Substitution("CQ", "4"), new Substitution("CR", "4"), new Substitution("CU", "4"), new Substitution("CX", "4"), new Substitution("DC", "8"), new Substitution("DS", "8"), new Substitution("DZ", "8"), new Substitution("PH", "3"), new Substitution("SC", "8"), new Substitution("TC", "8"), new Substitution("TS", "8"), new Substitution("TZ", "8"), new Substitution("ZC", "8"), new Substitution("A", "0"), new Substitution("B", "1"), new Substitution("C", "8"), new Substitution("D", "2"), new Substitution("E", "0"), new Substitution("F", "3"), new Substitution("G", "4"), new Substitution("H", "-"), new Substitution("H", "-"), new Substitution("I", "0"), new Substitution("J", "0"), new Substitution("K", "4"), new Substitution("L", "5"), new Substitution("M", "6"), new Substitution("N", "6"), new Substitution("O", "0"), new Substitution("P", "1"), new Substitution("Q", "4"), new Substitution("R", "7"), new Substitution("S", "8"), new Substitution("T", "2"), new Substitution("U", "0"), new Substitution("V", "3"), new Substitution("W", "3"), new Substitution("X", "48"), new Substitution("Y", "0"), new Substitution("Z", "8"), }; #endregion Substitutions

public string GetPhrase(string source) { if (string.IsNullOrEmpty(source)) return source;

var processor = new Processor(source.ToUpper());

return processor .SimpleIterate(() => { processor.CalculateNextWordEnd();

if (!processor.ProcessWordPart(WordMiddleSubstitutions)) { processor.CopyCharAndMoveNext(); } }) .StripCharactersFromOutputArray('-') .StripCharactersFromOutputArrayExceptFirst('0') .ToString(); } }