character ordering

  • Interesting remarks from vk-kirov, though l'eomot's query is not entirely incorrect.

    The point is that the function CHAR() will give an output based on the character set of the code page of the database's collation (at least, that is the behaviour vk-kirov's example shows).

    So l'eomot's query will only give the right answer if the collation already has a Latin1 code page base, because running it on a Cyrillic- or Japanese-based collation can produce entirely different characters.

    A "collate" function afterwards (in the "between" statement) cannot change the character itself, only its position in the order.

  • vk-kirov (11/2/2012)


    Incorrect. Try the following script and you'll get 62 characters:

    CREATE DATABASE qod_collation_db COLLATE Japanese_CI_AS_KS_WS;

    GO

    USE qod_collation_db;

    GO

    WITH Numbers AS

    ( SELECT 0 AS n

    UNION ALL

    SELECT n + 1

    FROM Numbers

    WHERE n < 255

    )

    SELECT n AS code, CHAR(n) AS symbol

    FROM Numbers

    WHERE CHAR(n) COLLATE Latin1_General_CI_AS BETWEEN '0' COLLATE Latin1_General_CI_AS AND 'Z' COLLATE Latin1_General_CI_AS

    OPTION(MAXRECURSION 255);

    GO

    USE master;

    GO

    DROP DATABASE qod_collation_db;

    GO

    What's interesting is that if we execute 'SELECT CHAR(n)' on a Japanese_CI_AS_KS_WS database, we'll get NULLs for the values of 'n' from 129 to 159 and from 224 to 252 (those ranges are the lead bytes of double-byte characters in code page 932, so no single character corresponds to them).
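    (A quick way to see which codes are affected - a sketch, not part of the original script - is to ask which codes CHAR maps to NULL under the current database's default collation:)

    WITH Numbers AS
    ( SELECT 0 AS n
      UNION ALL
      SELECT n + 1 FROM Numbers WHERE n < 255
    )
    SELECT n AS code          -- on this database: 129-159 and 224-252
    FROM Numbers
    WHERE CHAR(n) IS NULL
    OPTION (MAXRECURSION 255);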

    Yes, of course, you are completely right. The CHAR function depends on the effective default collation, not on any other collation however specified. It's fairly easy to get round it, although somewhat verbose: instead of "CHAR(n) collate Latin1_General_CI_AS" we can use "cast(NCHAR(n) COLLATE Latin1_General_CI_AS as CHAR(1))", resulting in

    CREATE DATABASE qod_collation_db COLLATE Japanese_CI_AS_KS_WS;

    GO

    USE qod_collation_db;

    GO

    WITH Numbers AS

    ( SELECT 0 AS n

    UNION ALL

    SELECT n + 1

    FROM Numbers

    WHERE n < 255

    )

    SELECT n AS code, cast(NCHAR(n) COLLATE Latin1_General_CI_AS as CHAR(1)) AS symbol

    FROM Numbers

    WHERE cast(NCHAR(n) COLLATE Latin1_General_CI_AS as CHAR(1))

    BETWEEN '0' COLLATE Latin1_General_CI_AS AND 'Z' COLLATE Latin1_General_CI_AS

    OPTION(MAXRECURSION 255);

    GO

    USE master;

    GO

    DROP DATABASE qod_collation_db;

    GO

    which delivers 132 rows (and displays the Latin1 symbols).

    I ought to have got that right first time, as I had plenty of involvement over the years in internationalisation of text in SQL databases involving Latin, Cyrillic, Arabic, and both traditional and simplified Chinese writing. Perhaps I never noticed the behaviour of the CHAR function because, after I decreed that we would use Unicode for all text, I didn't bother to look at CHAR - if so, that was careless of me; if not, then forgetting it was careless of me. :blush:

    Another example (sorry for using a non-Latin character here; I don't know how you will see that character - in its original form, as a rectangle, or something else; actually I'm using the Cyrillic letter http://en.wikipedia.org/wiki/Tse_(Cyrillic)):

    In its original form for me, although of course I normally use nchar(1094) rather than char(246) for that character. My default display font is Lucida Sans Unicode, so I tend to see most characters (Arial Unicode MS has slightly better coverage, but it makes an ugly mess of quite a lot of symbols - perhaps it has bad spacing hints).

    Interesting that the upper-case tse shows as '?' both in Latin and in Japanese, while the lower-case tse shows as white space in Japanese.

    Tom

  • Thanks for the QOD, Tom.

    I appreciate the heads-up that a deceptively simple condition (like "where myLetter between 'a' and 'Z'") can produce a surprise if one expects it to filter just the 26 English letters in upper or lower case. The answer of exactly how many characters it would include is, of course, not nearly so important as the possibility that it's greater than 52 (or 62 in your original poser, which starts with the numeric digits).
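    (A sketch along the lines of the scripts above - assuming a database with a case-insensitive Latin1 default collation - that counts the matches; the result comes out well above 52 because accented letters sort inside the range:)

    WITH Numbers AS
    ( SELECT 0 AS n
      UNION ALL
      SELECT n + 1 FROM Numbers WHERE n < 255
    )
    SELECT COUNT(*) AS matched    -- more than the 52 plain English letters
    FROM Numbers
    WHERE CHAR(n) BETWEEN 'a' AND 'Z'
    OPTION (MAXRECURSION 255);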

  • I'm surprised no-one has mentioned another implication of collation ordering and special characters that can be quite important: the impact on LIKE clauses that use sets of characters, e.g.

    DECLARE @var CHAR(1) = '¾'

    SELECT @var

    WHERE @var LIKE '%[0-9]%'

    Similarly, this applies to accented characters, so using [a-z0-9] will not yield just the standard set of alphanumeric characters that might be expected. One way round it is shown in the sketch below.
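    (A sketch, assuming a Latin1 code page: forcing a binary collation on the pattern match turns the range into a raw code-point range, so the fraction character no longer slips in.)

    DECLARE @var CHAR(1) = '¾';

    SELECT @var
    WHERE @var LIKE '%[0-9]%' COLLATE Latin1_General_BIN;  -- no row: 0xBE is outside 0x30-0x39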

    Given the low success rate on this question (including my own error), I suspect this is a common oversight.

    So thank you for what is actually an excellent question.

  • TomThomson (11/5/2012)


    The CHAR function depends on the effective default collation, not on any other collation however specified. It's fairly easy to get round it, although somewhat verbose: instead of "CHAR(n) collate Latin1_General_CI_AS" we can use "cast(NCHAR(n) COLLATE Latin1_General_CI_AS as CHAR(1))", resulting in

    ...

    ...

    which delivers 132 rows (and displays the Latin1 symbols).

    I must have been on a bad day when I wrote that line without noticing :blush: that 132 and 139 were different.

    Recently I noticed a similar problem (it wasn't with that Japanese collation, but the same misbehaviour of the CHAR and NCHAR functions) and did enough work to find out why. So I remembered that I had written something which claimed the problem could be avoided without switching to a context where the default collation was OK, and thought I'd better post a correction, as in fact I now know it can't be done. When I saw that I had claimed that code which produces 132 rows instead of 139 solved the problem, I was rather surprised (more that no-one picked me up on it than that I made a mistake - I do plenty of those).

    Why (with vk-kirov's example Japanese collation) was it missing 7 characters, with all those COLLATE clauses in there? The answer is simple: cast(NCHAR(n) COLLATE Latin1_General_CI_AS as CHAR(1)) might as well (given the range of n is 0 to 255) just be NCHAR(n). The cast is pointless, and the COLLATE clause is pointless too, because it's applied to the result of the call on the NCHAR function, and the NCHAR(1) value doesn't change just because a COLLATE clause is applied to it. So vk-kirov was right: to get the correct result you have to run the query in an environment where the default code set doesn't cause NCHAR to map a code to NULL, or to NCHAR(63) (the latter being the question mark, used for "I don't know what this code means" as well as for a genuine "?"), when that code wouldn't be so mapped under the desired collation. The 7 missing characters are all mapped to NCHAR(63) or CHAR(63) by the NCHAR or CHAR function when that Japanese collation is the database default collation; that's why they go missing. This of course means that this collation doesn't allow Unicode-conformant T-SQL programming. Most collations don't, I believe.

    CHAR of course messes up a lot more stuff than NCHAR does, losing 77 of the "alphanumeric" characters rather than just 7, by generating a mix of 63s and NULLs; but then CHAR doesn't pretend to do Unicode, while NCHAR does.
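    (To see the damage directly - a sketch, not from the original posts - list the codes that the CHAR function, or the nvarchar-to-char round trip, collapses to NULL or to '?' under the current database's default collation:)

    WITH Numbers AS
    ( SELECT 0 AS n
      UNION ALL
      SELECT n + 1 FROM Numbers WHERE n < 255
    )
    SELECT n AS code,
           CHAR(n) AS char_result,                      -- NULL or '?' where CHAR gives up
           CAST(NCHAR(n) AS CHAR(1)) AS nchar_result    -- '?' where the code page has no mapping
    FROM Numbers
    WHERE n <> 63                                       -- exclude the genuine question mark
      AND ( CHAR(n) IS NULL
            OR ASCII(CHAR(n)) = 63
            OR ASCII(CAST(NCHAR(n) AS CHAR(1))) = 63 )
    OPTION (MAXRECURSION 255);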

    Tom

