• jdamm (12/6/2012)


    Another two that could come up are:

    Þ - the archaic letter thorn - which will come up with 'th' in the select.

    ß - the German "Sharp S" - which will come up with 'ss' in the select.

    A few more "undocumented features"...

    I'm guessing on the collation, there are lots of other "opportunities".

    Jim

    You've pushed one of my go buttons there - languages and orthography used to be a big thing for me. (Languages still are.) So now I shall be boring, but perhaps informative.

    In the ISO Latin1 8-bit character set, there are only 5 ligatures: æ (ae), þ (th), Æ (AE), Þ (TH), and ß (ss).

    Of course, this means that Latin1 is missing rather a lot of ligatures used by languages that are written in "roman" characters. Windows Latin1 adds œ and Œ (oe and OE), so the French probably prefer Windows Latin1 to ISO, as these are essential for standard French spelling. So in SQL Server with the default collation (or really the default code page) we see 7 ligature digraphs in the single-byte character set. That's still missing ij and IJ (maybe Hugo will tell us whether Dutch generally uses these ligatures or has generally switched to the non-ligatured representation).
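    The coverage described above can be checked with a minimal sketch. It assumes Python's `latin-1` and `cp1252` codecs stand in for ISO Latin1 and Windows Latin1. The helper names `encodable` and `coverage` are made up for illustration. It says nothing about how a SQL Server collation compares these characters, only whether each one has a single-byte encoding.

```python
# Ligature characters from the discussion, plus the Dutch ij/IJ pair
# (U+0133 and U+0132). The list itself is an illustrative assumption.
LIGATURES = ["æ", "Æ", "þ", "Þ", "ß", "œ", "Œ", "ĳ", "Ĳ"]


def encodable(ch, codec):
    """Return True if ch encodes to exactly one byte in the given codec."""
    try:
        return len(ch.encode(codec)) == 1
    except UnicodeEncodeError:
        return False


def coverage(codec):
    """List the ligature characters that the codec can represent."""
    return [ch for ch in LIGATURES if encodable(ch, codec)]


iso = coverage("latin-1")   # ISO Latin1 (ISO 8859-1)
win = coverage("cp1252")    # Windows Latin1 (code page 1252)

print("ISO Latin1:    ", len(iso), " ".join(iso))
print("Windows Latin1:", len(win), " ".join(win))
print("In neither:    ", " ".join(ch for ch in LIGATURES if ch not in win))
```

    Under those assumptions the two counts come out as 5 and 7, and the ij/IJ pair is in neither code page.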

    There were once other ligatures in languages that use Latin1 (or a small extension of it). Until a couple of decades ago, Spanish treated CH, ch, LL, and ll each as an individual character, not each as two characters. German printers once used a ue ligature (originally with the e above the u, instead of to the right of it) in place of ü, but that is not used now. At least I believe not, though people with old typewriters which can't produce ü still sometimes use ue, as two separate characters rather than a ligature.

    There are 49 other (non-ligature) characters used by languages written in variants of the roman alphabet that are not in ISO Latin1. (I'm not sure of that figure; I think it's 53 altogether, and as 4 are ligatures .... but I'm not sure 53 is right.) Most of them are in Windows Latin1, which has thrown away some obsolete/redundant control characters to make room for them. But they are not ligatures, so they won't be picked up as matching two adjacent characters. Ten of them are needed for Latin (in the orthography that has vowel length marked), which suggests that Latin1 is a bit of a misnomer.

    Tom