SQL Clone
SQLServerCentral is supported by Redgate
 
Log in  ::  Register  ::  Not logged in
 
 
 


character ordering


character ordering

Author
Message
ldemeyer@telenet.be
ldemeyer@telenet.be
SSC-Enthusiastic
SSC-Enthusiastic (192 reputation)SSC-Enthusiastic (192 reputation)SSC-Enthusiastic (192 reputation)SSC-Enthusiastic (192 reputation)SSC-Enthusiastic (192 reputation)SSC-Enthusiastic (192 reputation)SSC-Enthusiastic (192 reputation)SSC-Enthusiastic (192 reputation)

Group: General Forum Members
Points: 192 Visits: 110
Interesting remarks form vv-kirov, though l'eomot's querie is not all incorrect.

The point is that the function char() will give an output based on the character set of the code page in the collation of the database (at least that is the behaviour vv-kirov's example shows).

So l'eomot's querie will only give the right answer if the collation has already a latin1 codepage base, because when running it on a cyrillic or japanese based collation its output can give totally other characters.

A "collate" function afterwards (in the "between" statement) cannot change the character itself, only its position in the order.
Tom Thomson
Tom Thomson
One Orange Chip
One Orange Chip (25K reputation)One Orange Chip (25K reputation)One Orange Chip (25K reputation)One Orange Chip (25K reputation)One Orange Chip (25K reputation)One Orange Chip (25K reputation)One Orange Chip (25K reputation)One Orange Chip (25K reputation)

Group: General Forum Members
Points: 25593 Visits: 12494
vk-kirov (11/2/2012)

Incorrect. Try the following script and you'll get 62 characters:

CREATE DATABASE qod_collation_db COLLATE Japanese_CI_AS_KS_WS;
GO
USE qod_collation_db;
GO
WITH Numbers AS
( SELECT 0 AS n
UNION ALL
SELECT n + 1
FROM Numbers
WHERE n < 255
)
SELECT n AS code, CHAR(n) AS symbol
FROM Numbers AS n
WHERE CHAR(n) COLLATE Latin1_General_CI_AS BETWEEN '0' COLLATE Latin1_General_CI_AS AND 'Z' COLLATE Latin1_General_CI_AS
OPTION(MAXRECURSION 255);
GO
USE master;
GO
DROP DATABASE qod_collation_db;
GO


What's interesting if we execute 'SELECT CHAR(n)' on a Japanese_CI_AS_KS_WS database we'll get NULLs for the values of 'n' from 129 to 159 and from 224 to 252.

Yes, of course, you are completely right. The CHAR function depends on the effective default collation, not on any other collation however specified. It's fairly easy to get round it, although somewhat verbose: instad of "CHAR(n) collate Latin1_General_CI_AS" we can use "cast(NCHAR(n) COLLATE Latin1_General_CI_AS as CHAR(1))", resulting in
CREATE DATABASE qod_collation_db COLLATE Japanese_CI_AS_KS_WS;
GO
USE qod_collation_db;
GO
WITH Numbers AS
( SELECT 0 AS n
UNION ALL
SELECT n + 1
FROM Numbers
WHERE n < 255
)
SELECT n AS code, cast(NCHAR(n) COLLATE Latin1_General_CI_AS as CHAR(1)) AS symbol
FROM Numbers
WHERE cast(NCHAR(n) COLLATE Latin1_General_CI_AS as CHAR(1))
BETWEEN '0' COLLATE Latin1_General_CI_AS AND 'Z' COLLATE Latin1_General_CI_AS
OPTION(MAXRECURSION 255);
GO
USE master;
GO
DROP DATABASE qod_collation_db;
GO


which delivers 132 rows (and displays the Latin1 symbols).

I ought to have got that right first time, as I had plenty involvement over the years on internationalisation of text in SQL databases involving Latin, Cyrillic, Arabix, and both traditional and simplified Chinese writing. Perhaps I never noticed the behaviousr of the CHAR function because after I decreed that we would use Unicode for all text I didn't bother to look at CHAR - if so that was careless of me, if not then forgetting it was careless of me. Blush

Another example (sorry for using a non-Latin character here; I don't know how you will see that character – in its original form, as a rectangle or something else; actually I'm using a cyrillic letter http://en.wikipedia.org/wiki/Tse_(Cyrillic)):
original form for me, although of course I normally use nchar(1094) rather than char(246) for that character. My default display font is Lucida Sans Unicode so I tend to see most characters (Arial Unicode MS has slightly better coverage, but it makes an ugly mess of quite a lot of symbols - perhaps has bad spacing hints).
Interesting that the upper-case tse shows as '?' both in Latin and in Japanese, while the lower case tse shows as white space in Japanese.

Tom

john.arnott
john.arnott
Hall of Fame
Hall of Fame (3.5K reputation)Hall of Fame (3.5K reputation)Hall of Fame (3.5K reputation)Hall of Fame (3.5K reputation)Hall of Fame (3.5K reputation)Hall of Fame (3.5K reputation)Hall of Fame (3.5K reputation)Hall of Fame (3.5K reputation)

Group: General Forum Members
Points: 3538 Visits: 3059
Thanks for the QOD, Tom.

I appreciate the heads-up that a deceptively simple condition (like "where myLetter between 'a' and 'Z'") could result in a surprise if one expects it to filter just the 26 English letters in upper or lower case. The answer of exactly how many it would include is, of course not nearly so important as the possibility that it's greater than 52 (or 62 in your original poser which starts with the numeric digits).
bochambers
bochambers
SSC-Enthusiastic
SSC-Enthusiastic (193 reputation)SSC-Enthusiastic (193 reputation)SSC-Enthusiastic (193 reputation)SSC-Enthusiastic (193 reputation)SSC-Enthusiastic (193 reputation)SSC-Enthusiastic (193 reputation)SSC-Enthusiastic (193 reputation)SSC-Enthusiastic (193 reputation)

Group: General Forum Members
Points: 193 Visits: 82
I'm supprised no-one has mentioned another implication of collation ordering and special characters that can be quite important; the impact on LIKE clauses that use set's of characters e.g.

DECLARE @var CHAR(1) = '¾'

SELECT @var
WHERE @var LIKE '%[0-9]%'



Similarly, this applies to the accented characters, so trying to use [a-z0-9] will not yield the standard set of alphanumeric characters that might be expected.
Which given the low success rate on this question (including my own error), I would suspect that this is a common oversight.

So thank you for what is actually an excellent question.
Tom Thomson
Tom Thomson
One Orange Chip
One Orange Chip (25K reputation)One Orange Chip (25K reputation)One Orange Chip (25K reputation)One Orange Chip (25K reputation)One Orange Chip (25K reputation)One Orange Chip (25K reputation)One Orange Chip (25K reputation)One Orange Chip (25K reputation)

Group: General Forum Members
Points: 25593 Visits: 12494
TomThomson (11/5/2012)
The CHAR function depends on the effective default collation, not on any other collation however specified. It's fairly easy to get round it, although somewhat verbose: instad of "CHAR(n) collate Latin1_General_CI_AS" we can use "cast(NCHAR(n) COLLATE Latin1_General_CI_AS as CHAR(1))", resulting in
...
...
which delivers 132 rows (and displays the Latin1 symbols).

I must have been on a bad day when I wrote that line without noticing Blush that 132 and 139 were different.

Recently I noticed a similar problem (it wasn't with that Japanese collation, but the same misbehaviour of the CHAR and NCHAR functions) and did enough work to find out why. So I remembered that I had written something that claimed the problem could be avoided without switching to a context where the default collation was OK and thought I'd better post a correction, as in fact I now know it can't be done. When I saw that I had claimed code which poducse 132 instead of 139 solved the problem I was rather surprised (more that no-one picked me up on it than that I made a mistake - I do plenty of those).

Why (with vk-kirov's example Japanes collation) was it missing 7 characters, with all those collate clauses in there? The answer is simple - cast(NCHAR(n) COLLATE Latin1_General_CI_AS as CHAR(1)) might as well (given the range of n is 0 to 255) just be NCHAR(n), the cast is pointless and the collate clause is pointless too, because it's applied to the result of the call on the NCHAR funcion, and the NCHAR(1) value doesn't change just because a COLLATE clause is applied to it; so vk-kirov was right, and to get the correct result you have to run the query in an environment where the default code set doesn't cause NCHAR to map a code to NULL or to NCHAR(63) (the latter is a question mark, used for "I don't know what this code means" as well as for "?") that wouldn't be so mapped with the default collation. The 7 missing characters are all mapped to NCHAR(63) or CHAR(63) by the NCHAR or CHAR function when that Japanese collation is the database default collation, that's why they go missing. This of course means that this collation doesn't allow unicode conformant T-SQL programming. Most collations don't, I believe.
CHAR of course messes up a lot more stuff than NCHAR does, losing 77 of the "alphnumeric" characters, rather than just 7, by generating a mix of 63s and NULLs, but then CHAR doesn't pretend to do unicode while NCHAR does.

Tom

Go


Permissions

You can't post new topics.
You can't post topic replies.
You can't post new polls.
You can't post replies to polls.
You can't edit your own topics.
You can't delete your own topics.
You can't edit other topics.
You can't delete other topics.
You can't edit your own posts.
You can't edit other posts.
You can't delete your own posts.
You can't delete other posts.
You can't post events.
You can't edit your own events.
You can't edit other events.
You can't delete your own events.
You can't delete other events.
You can't send private messages.
You can't send emails.
You can read topics.
You can't vote in polls.
You can't upload attachments.
You can download attachments.
You can't post HTML code.
You can't edit HTML code.
You can't post IFCode.
You can't post JavaScript.
You can post emoticons.
You can't post or upload images.

Select a forum

































































































































































SQLServerCentral


Search