RE: Soundex - Experiments with SQLCLR Part 3

SSCoach

Points: 16430

August 29, 2015 at 2:58 pm

David.Poole (8/29/2015)
Thanks Solomon. The strange additions to the characters in the alphabet can be dealt with by a function called ASCII folding. In the Apache Commons library the Daitch Mokotoff function does have a substitution table to do this.
I did consider building this into the BasePhonetic class but decided against it because while researching it I found that the same character could fold to different characters depending on the language.
In addition I think ASCII folding is probably worthy of its own SQL CLR function.

Hi David. I have also been doing a bit of research lately on Unicode and encodings in general. I think you were right to not include ASCII folding into your algorithms because it appears to be based on a misunderstanding of how Unicode data works. In addition to variations between languages, there are also sequences of characters that can combine into a single visible form (i.e. "glyph"). Here is an example that I worked up for a presentation I am creating on this topic:

SELECT NCHAR(0x00FC) AS [ü],

NCHAR(0x0075) AS ,

NCHAR(0x0075) + NCHAR(0x0308) AS ,

LEN(NCHAR(0x0075) + NCHAR(0x0308)) AS [HowManyChars?]

-- üuu¨2

SELECT 1 WHERE NCHAR(0x00FC) COLLATE Latin1_General_100_CI_AS -- ü

= NCHAR(0x0075) COLLATE Latin1_General_100_CI_AS -- u

-- nothing due to being Accent Sensitive

SELECT 2 WHERE NCHAR(0x00FC) COLLATE Latin1_General_100_CI_AI -- ü

= NCHAR(0x0075) COLLATE Latin1_General_100_CI_AI -- u

-- 2 due to being Accent Insensitive

SELECT 3 WHERE NCHAR(0x00FC) COLLATE Latin1_General_100_CI_AS -- ü

= NCHAR(0x0075) + NCHAR(0x0308) COLLATE Latin1_General_100_CI_AS

-- u + combining diaeresis

-- 3, even being diff number of characters AND Accent Sensitive, due to combination

The follow page shows the official normalizations for various languages at Unicode.org: Normalization Charts

The Unicode algorithms for sorting and comparison are highly complex. However, and fortunately, we don't need ASCII folding because the Unicode algorithms are built into .NET, probably through accessing the Windows Collations. So if you use the CompareInfo class that I noted before, or String.Compare (passing in a CultureInfo), then you can do the same sorting that is being done in SQL Server (when using the XML and N-prefixed datatypes). Want to do Case Insensitive AND Accent Insensitive? Just pass in a CompareOption of:

IgnoreCase | IgnoreNonSpace

Just like the T-SQL tests shown above, doing the "IgnoreNonSpace" alone will equate 0x00FC and 0x0075. But not specifying that option will equate 0x00FC and 0x0075 + 0x0308.

Take care,

Solomon...

SQL# — https://SQLsharp.com/ ( SQLCLR library ofover 340 Functions and Procedures)
Sql Quantum Lift — https://SqlQuantumLift.com/ ( company )
Sql Quantum Leap — https://SqlQuantumLeap.com/ ( blog )
Info sites — Collations • Module Signing • SQLCLR