Between & Collation

Question

Between & Collation

chgn01

SSCarpal Tunnel

Points: 4423
More actions
August 28, 2013 at 8:18 pm

#276622

Comments posted to this topic are about the item Between & Collation
--------------------------------------
;-)“Everything has beauty, but not everyone sees it.” ― Confucius

Viewing 15 posts - 1 through 15 (of 19 total)

You must be logged in to reply to this topic. Login to reply

tilew-948340 Hall of Fame Points: 3431 More actions · Answer 1

Good question. It raises, for me, a question: in the case that the _CS option is not specified, "SQL Server considers the uppercase and lowercase versions of letters to be identical for sorting purposes". So what can be the answer without the _CS option? Not predictable (they are equal, aren't they) OR the uppercase would always be before the lower case (like the default in MS Excel) and the answer would always be "B, b, b., C, c, D, d" (the .B being before B, it is not in the list)?

Thanks!

Ken SSCoach Points: 17191 More actions · Answer 2

tilew-948340 (8/28/2013)
Good question. Thanks.
I have a question (can't try it myself for now): in the case that the _CS option is not specified, "SQL Server considers the uppercase and lowercase versions of letters to be identical for sorting purposes". So what would have been the answer without your _CS option? Not predictable (they are equal, aren't they) or the uppercase would be before the lower case (like the default in MS Excel e.g. .B, B, b, b., C, c, D, d)? And what about the "." (point)?

Without the case sensitivity, assuming the database collation is set for CI. It would be:

B, b, b., C, c, D, d

But this appears to be based more on the order of how the data was inserted into #test than anything. If you change the inserts around to insert the values in a different order like:

insert #test ( name ) values ( 'b');

insert #test ( name ) values ( 'B');

insert #test ( name ) values ( '.B');

insert #test ( name ) values ( 'd');

insert #test ( name ) values ( 'b.');

insert #test ( name ) values ( 'C');

insert #test ( name ) values ( 'c');

insert #test ( name ) values ( 'D');

then the query result is

b, B, b., C, c, D, d

The values will be returned in alphabetical order but in the order in which they were found. I think that is why the lower case d shows up after the upper case D because the scan had to jump over the lower case d to get to C and c and then hit uppercase D and went back for lowercase d. I'll let someone else speak to this point :-).

palotaiarpad SSCertifiable Points: 5617 More actions · Answer 3

palotaiarpad

SSCertifiable

Points: 5617

August 29, 2013 at 12:16 am

#1645834

Please explain, why 'b.' is selected and 'b' not.

Koen Verbeeck SSC Guru Points: 259215 More actions · Answer 4

palotaiarpad (8/29/2013)
Please explain, why 'b.' is selected and 'b' not.

If the sequence is b, B, b., ... and you select data starting from B, the b is dropped.

Interesting question by the way.

Need an answer? No, you need a question
My blog at https://sqlkover.com.
MCSE Business Intelligence - Microsoft Data Platform MVP

palotaiarpad SSCertifiable Points: 5617 More actions · Answer 5

And why is BETWEEN checked regardless of the collation?

If lowercase characters sort ahead of uppercase ones, then 'B' is greater then 'd', and BETWEEN can't evaluate to TRUE! Like with numbers BETWEEN 10 and 1 evaluates to FALSE

Something is odd here. Hope not my brain. :w00t:

Bangla Hall of Fame Points: 3137 More actions · Answer 6

Bangla

Hall of Fame

Points: 3137

August 29, 2013 at 2:16 am

#1645878

Good question......

Hugo Kornelis SSC Guru Points: 64789 More actions · Answer 7

palotaiarpad (8/29/2013)
And why is BETWEEN checked regardless of the collation?
If lowercase characters sort ahead of uppercase ones, then 'B' is greater then 'd', and BETWEEN can't evaluate to TRUE!

No, that's not how collations work. (Or to be pedantic, that's not how most collations work).

Ordering occurs by alphabet first, regardless of collation. First all strings with a single b, then all strings with b., then all c's, and then all d's.

The CS/CI rules then act as tie breaker. In this case, there are two values for each of the single-letter combinations. With CI, they are considered equal, so left in whatever order they happen to be in (which may or may not be the order in which they were inserted - even if testing appears to suggest a strict rule here, there is none and you should not rely on it). With CS, the order within each group is lowercase first. So the final order is b B b. c C d D. The BETWEEN clause then drops some of the results.

Nice question. I decided to rely on my memory of lowercase/uppercase order instead of double checking. Bad decision. 😉

Hugo Kornelis, SQL Server/Data Platform MVP (2006-2016)
Visit my SQL Server blog: https://sqlserverfast.com/blog/
SQL Server Execution Plan Reference: https://sqlserverfast.com/epr/

Michael_Garrison Hall of Fame Points: 3110 More actions · Answer 8

Ah, thanks Hugo for explaining that. Somehow I knew you would be able to explain this since you answered a Colation issue that I had a few weeks ago concerning the QofD.

tilew-948340 Hall of Fame Points: 3431 More actions · Answer 9

Hugo Kornelis (8/29/2013)
...
Nice question. I decided to rely on my memory of lowercase/uppercase order instead of double checking. Bad decision. 😉

As the statistic on the results shows an almost fifty-fifty good and bad answers, it is a nice question indeed... and it is always nice to know that even expert can make bad decision 😉

Thanks for the explanation. I would add to it the reminder that the point "." makes the data not alphabeticle (I guess) if it precedes the letter and is always place as the last one if it follows the letter

Ed Wagner SSC Guru Points: 287027 More actions · Answer 10

I find it interesting that lower case sorts ahead of upper case. In the ASCII character set I remember working with more than a few years ago, A is 65 while a is 97. To have them sort differently here is somewhat counter-intuitive, but maybe the fact that the different collations, some of which handle Unicode, have them laid out differently. Then again, I may way off base here. If anyone has any thoughts, I'd love to hear them.

Don't get me wrong here, I like that lower case sorts ahead of upper case...it's just different.

Tally Tables - Performance Personified
String Splitting with True Performance
Best practices on how to ask questions

TomThomson SSC Guru Points: 104767 More actions · Answer 11

Good question.

Just to amplify what Hugo said, the wording in BOL (which is also in the explanation) is somewhat misleading. Hugo's statement that the tie break between upper and lower case occurs logically after alphabetic sorting is correct, so the statement in BOL which simply says lower case is treated as sorting before upper case is not strictly correct. If the BOL statement were correct, in case sensitive collations ''sysygy' would sort before 'Sally', but in practice it sorts after 'Sally' regardless of whether the collation is case sensitive or not - if 'b' sorted before 'z' in the same sense that 's' sorts before 'S' in case sensitive collations, then presumably 'by' would sort before 'za' in those collations, so obviously "sorts before" means something different when talking about 's' and 'S' than when talking about 'b' and 'z', and the difference is that logically one is part of a global alphabetical sort and the other is a local tie-break between adjacent entities in the already alphabetically sorted set. It's a pity that BOL doesn't use the sort of wording that Hugo does.

Tom

Thomas Abraham SSChampion Points: 10761 More actions · Answer 12

Good question. I was pondering for a while, as I don't use 2012, and wondered if the point of the question was to show some change that was made in 2012. Got it right, but wasn't sure what would happen when I submitted my answer.

Thanks to Hugo and Tom for their usual excellent corrections/clarifications.

[font="Verdana"]Please don't go. The drones need you. They look up to you.[/font]
Connect to me on LinkedIn

TomThomson SSC Guru Points: 104767 More actions · Answer 13

Ed Wagner (8/29/2013)
I find it interesting that lower case sorts ahead of upper case. In the ASCII character set I remember working with more than a few years ago, A is 65 while a is 97. To have them sort differently here is somewhat counter-intuitive, but maybe the fact that the different collations, some of which handle Unicode, have them laid out differently. Then again, I may way off base here. If anyone has any thoughts, I'd love to hear them.
Don't get me wrong here, I like that lower case sorts ahead of upper case...it's just different.

Most people prefer "Sawyer" and "sawyer" to sort to adjacent positions rather than being separated by all strings beginning with lower case letters alphabetically after 'S' and all strings beginning with letters alphabetically before 's', which is what would happen in a sort based on the ascii (or Unicode) character code.

That's why dictionaries sort with a tie break on case (which way the tie break works depends on the particular dictionary); also many languages sort with a similar tie-break for accented characters: in Spanish dictionaries, for example, the difference between 'a' and 'á' is handled by a tie-break, while the difference between 'ñ' (which is not treated as an accented character) and 'n' is alphabetic.

Tom

tilew-948340 Hall of Fame Points: 3431 More actions · Answer 14

L' Eomot Inversé (8/29/2013)
Good question.
... in case sensitive collations ''sysygy' would sort before 'Sally', but in practice it sorts after 'Sally' regardless of whether the collation is case sensitive or not ...

That is interesting! I did some test, to make sure I understand correctly and I got different answer depending if I specify collation or remove the "order by 1".

insert #test ( name ) values ( 'syst');

insert #test ( name ) values ( 'Sally');

insert #test ( name ) values ( 'sally');

insert #test ( name ) values ( 'Syst');

insert #test ( name ) values ( 'Soap');

insert #test ( name ) values ( 'soap');

Case 1: create table #test(name varchar(10) COLLATE Latin1_General_CS_AS);

sally

Sally

soap

Soap

syst

Syst

Case 2: same as Case1 but with no Collate

Sally

sally

Soap

soap

Syst

syst (Hugo, here is your proof that you can't rely on the order that data has been entered!)

Case 3: same as Case 1 but removed the "order by 1" (result is as the data has been entered)

syst

Sally

sally

Syst

Soap

soap

Never stop learning! Thanks!