Between & Collation

  • Comments posted to this topic are about the item Between & Collation

    --------------------------------------
    ;-)“Everything has beauty, but not everyone sees it.” ― Confucius

  • Good question. It raises, for me, a question: in the case that the _CS option is not specified, "SQL Server considers the uppercase and lowercase versions of letters to be identical for sorting purposes". So what can be the answer without the _CS option? Not predictable (they are equal, aren't they) OR the uppercase would always be before the lower case (like the default in MS Excel) and the answer would always be "B, b, b., C, c, D, d" (the .B being before B, it is not in the list)?

    Thanks!

  • tilew-948340 (8/28/2013)


    Good question. Thanks.

    I have a question (can't try it myself for now): in the case that the _CS option is not specified, "SQL Server considers the uppercase and lowercase versions of letters to be identical for sorting purposes". So what would have been the answer without your _CS option? Not predictable (they are equal, aren't they) or the uppercase would be before the lower case (like the default in MS Excel e.g. .B, B, b, b., C, c, D, d)? And what about the "." (point)?

    Without the case sensitivity, assuming the database collation is set for CI. It would be:

    B, b, b., C, c, D, d

    But this appears to be based more on the order of how the data was inserted into #test than anything. If you change the inserts around to insert the values in a different order like:

    insert #test ( name ) values ( 'b');

    insert #test ( name ) values ( 'B');

    insert #test ( name ) values ( '.B');

    insert #test ( name ) values ( 'd');

    insert #test ( name ) values ( 'b.');

    insert #test ( name ) values ( 'C');

    insert #test ( name ) values ( 'c');

    insert #test ( name ) values ( 'D');

    then the query result is

    b, B, b., C, c, D, d

    The values will be returned in alphabetical order but in the order in which they were found. I think that is why the lower case d shows up after the upper case D because the scan had to jump over the lower case d to get to C and c and then hit uppercase D and went back for lowercase d. I'll let someone else speak to this point :-).

  • Please explain, why 'b.' is selected and 'b' not.

  • palotaiarpad (8/29/2013)


    Please explain, why 'b.' is selected and 'b' not.

    If the sequence is b, B, b., ... and you select data starting from B, the b is dropped.

    Interesting question by the way.

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP

  • And why is BETWEEN checked regardless of the collation?

    If lowercase characters sort ahead of uppercase ones, then 'B' is greater then 'd', and BETWEEN can't evaluate to TRUE! Like with numbers BETWEEN 10 and 1 evaluates to FALSE

    Something is odd here. Hope not my brain. :w00t:

  • Good question......

  • palotaiarpad (8/29/2013)


    And why is BETWEEN checked regardless of the collation?

    If lowercase characters sort ahead of uppercase ones, then 'B' is greater then 'd', and BETWEEN can't evaluate to TRUE!

    No, that's not how collations work. (Or to be pedantic, that's not how most collations work).

    Ordering occurs by alphabet first, regardless of collation. First all strings with a single b, then all strings with b., then all c's, and then all d's.

    The CS/CI rules then act as tie breaker. In this case, there are two values for each of the single-letter combinations. With CI, they are considered equal, so left in whatever order they happen to be in (which may or may not be the order in which they were inserted - even if testing appears to suggest a strict rule here, there is none and you should not rely on it). With CS, the order within each group is lowercase first. So the final order is b B b. c C d D. The BETWEEN clause then drops some of the results.

    Nice question. I decided to rely on my memory of lowercase/uppercase order instead of double checking. Bad decision. 😉


    Hugo Kornelis, SQL Server/Data Platform MVP (2006-2016)
    Visit my SQL Server blog: https://sqlserverfast.com/blog/
    SQL Server Execution Plan Reference: https://sqlserverfast.com/epr/

  • Ah, thanks Hugo for explaining that. Somehow I knew you would be able to explain this since you answered a Colation issue that I had a few weeks ago concerning the QofD.

  • Hugo Kornelis (8/29/2013)

    ...

    Nice question. I decided to rely on my memory of lowercase/uppercase order instead of double checking. Bad decision. 😉

    As the statistic on the results shows an almost fifty-fifty good and bad answers, it is a nice question indeed... and it is always nice to know that even expert can make bad decision 😉

    Thanks for the explanation. I would add to it the reminder that the point "." makes the data not alphabeticle (I guess) if it precedes the letter and is always place as the last one if it follows the letter

  • I find it interesting that lower case sorts ahead of upper case. In the ASCII character set I remember working with more than a few years ago, A is 65 while a is 97. To have them sort differently here is somewhat counter-intuitive, but maybe the fact that the different collations, some of which handle Unicode, have them laid out differently. Then again, I may way off base here. If anyone has any thoughts, I'd love to hear them.

    Don't get me wrong here, I like that lower case sorts ahead of upper case...it's just different.

  • Good question.

    Just to amplify what Hugo said, the wording in BOL (which is also in the explanation) is somewhat misleading. Hugo's statement that the tie break between upper and lower case occurs logically after alphabetic sorting is correct, so the statement in BOL which simply says lower case is treated as sorting before upper case is not strictly correct. If the BOL statement were correct, in case sensitive collations ''sysygy' would sort before 'Sally', but in practice it sorts after 'Sally' regardless of whether the collation is case sensitive or not - if 'b' sorted before 'z' in the same sense that 's' sorts before 'S' in case sensitive collations, then presumably 'by' would sort before 'za' in those collations, so obviously "sorts before" means something different when talking about 's' and 'S' than when talking about 'b' and 'z', and the difference is that logically one is part of a global alphabetical sort and the other is a local tie-break between adjacent entities in the already alphabetically sorted set. It's a pity that BOL doesn't use the sort of wording that Hugo does.

    Tom

  • Good question. I was pondering for a while, as I don't use 2012, and wondered if the point of the question was to show some change that was made in 2012. Got it right, but wasn't sure what would happen when I submitted my answer.

    Thanks to Hugo and Tom for their usual excellent corrections/clarifications.

    [font="Verdana"]Please don't go. The drones need you. They look up to you.[/font]
    Connect to me on LinkedIn

  • Ed Wagner (8/29/2013)


    I find it interesting that lower case sorts ahead of upper case. In the ASCII character set I remember working with more than a few years ago, A is 65 while a is 97. To have them sort differently here is somewhat counter-intuitive, but maybe the fact that the different collations, some of which handle Unicode, have them laid out differently. Then again, I may way off base here. If anyone has any thoughts, I'd love to hear them.

    Don't get me wrong here, I like that lower case sorts ahead of upper case...it's just different.

    Most people prefer "Sawyer" and "sawyer" to sort to adjacent positions rather than being separated by all strings beginning with lower case letters alphabetically after 'S' and all strings beginning with letters alphabetically before 's', which is what would happen in a sort based on the ascii (or Unicode) character code.

    That's why dictionaries sort with a tie break on case (which way the tie break works depends on the particular dictionary); also many languages sort with a similar tie-break for accented characters: in Spanish dictionaries, for example, the difference between 'a' and 'á' is handled by a tie-break, while the difference between 'ñ' (which is not treated as an accented character) and 'n' is alphabetic.

    Tom

  • This was removed by the editor as SPAM

Viewing 15 posts - 1 through 15 (of 19 total)

You must be logged in to reply to this topic. Login to reply