CS Collation

  • Ah: that's basically what Luis already said: sorry Luis!

  • Alex Fekken (9/12/2012)


    Taking a atep back, why should the sort order depend on either SOUNDEX (especially when case sensitive) or gramnatical interpretation at all?

    That sounds like total atupidity to me.

    may be... but not sure if it depending on the soundex or not, i used this to show you the value and the words weight which is seen in the result, check this http://msdn.microsoft.com/en-us/library/ms187384.aspx and read the last line of the "Remarks" section where it stats the soundex is the collation sensitive, so it (collation) may be using the soundex result to place the word when the sort is done. (like i showed in the previous posts) πŸ™‚

    ww; Raghu
    --
    The first and the hardest SQL statement I have wrote- "select * from customers" - and I was happy and felt smart.

  • emiddlebrooks (9/12/2012)


    I'm also a bit confused why peppers is at the end. If lowercase p is before uppercase p, which I knew it would be, why was it at the end? And the, I hope joking, about soundex and plurals is not the reason.

    try this:

    VALUES ('Asdfgh'),('ASDFGH'),('asdfgh'),('a3df3r'),('asdfghi');

    which essentially takes the meaning of the word out of the picture. "asdfghi" still sorts at the end... even after the uppercase "a" words.

    If you have installed the SQL 2008 R2 BOL, then read this link ms-help://MS.SQLCC.v10/MS.SQLSVR.v10.en/s10de_1devconc/html/494c8162-f2e4-44ec-b502-80f792bb2a9e.htm which stats like below...

    Although all these collations use the same set of bits to represent

    non-Unicode character data, the sorting and comparison rules they apply

    are slightly different. This is so they can handle the dictionary definitions

    of the correct sequence of characters in the language or alphabet

    associated with the collation.

    So random words like this wouldn't be considered, as long as the word is legit English word and it should has the proper space in the dictionary too. I don't think this 'asdfgh' holds any meaning/place in the list of english words, this type of sorting may or may not return the result as you are expecting but they are never considered as legit sorting. Ofcourse it will try to sort to its own understanding and it will never be consistent.

    (I am not implying plural is only the case for this, but may be "one of the..." along with several other considerations)

    ww; Raghu
    --
    The first and the hardest SQL statement I have wrote- "select * from customers" - and I was happy and felt smart.

  • So random words like this wouldn't be considered, as long as the word is legit English word and it should has the proper space in the dictionary too. I don't think this 'asdfgh' holds any meaning/place in the list of english words, this type of sorting may or may not return the result as you are expecting but they are never considered as legit sorting. Ofcourse it will try to sort to its own understanding and it will never be consistent.

    (I am not implying plural is only the case for this, but may be "one of the..." along with several other considerations)

    I think you have been seriously misled by the "dictionary" analogy and are taking it way too far.

    SQL Server does not go through the same process as somebody who compiles a dictionary. It would not only be far too difficult and ambiguous, it would also be inappropriate (a lot of SQL text data is not "legit English" or "legit anything" and still needs to be sorted consistently and intuitively, even when mixed with "legit" stuff) and biased (why English US/UK/... grammar?).

    Unless I am seriously mistaken, Collation Orders are about the sort order and "equivalence" of individual characters (and perhaps ligatures, i.e. "compound characters") only. No "interpretation" or "understanding" is attempted beyond the character level because that would open up too many cans of worms.

    Having said that, there is nothing in the documentation that I have seen so far in this discussion or elsewhere that defines the precise rules and explains the results that we saw.

  • Wow, interesting discussion. I see a lot of explanations that miss the ball by a long shot (hopefully, some of them were posted in jest!), some that almost or even completely hit the mark, but then add some incorrect sauce. I did not read every bit of every reaction, so maybe the actual true answer has been posted and I have overlooked it.

    First, let's dispell some myths.

    1. Sort order has nothing to do with singular or plural forms of the words, or in fact with any aspect of the meaning at all. For ordering purposes, string values are just a bunch of characters, whosse meaning is unimportant.

    2. SOUNDEX() is not used in any way for ordering. The SOUNDEX() function is intended as a help for finding matches when a name may have been misheard (think: callcenter).

    3. The "LATIN" in the collation name has nothing to do with the extinct Latin language. It refers to the latin alphabet, which is the alphabet used in a.o. Anglo-saxon languages. Special collations are required for countries that have different alphabets - even when the difference is too subtle to see. For Chinese and Arab alphabets, the difference is obvious, but there are also different rules for the "a, b, c" alphabet we use here. For instance, Germans have a special character "ß", which can also be represented as "ss", and which sorts (if I recall correctly) between s and t. So the German collations ensure that both ß and ss sort at that location, and that words with those different spelling compare as equal. Similarly, "ae" and "oe" are considered as equal to "À" and "â", with their appropriate sort locations. In some Scandinavion languages, however, the same "ae" and "oe" translate to other special characters: "æ" and "ø".

    So, how does sorting work?

    Most collations (I think all except the binary collations, but not 100% sure) sort in what's called "dictionary order". That is not related to whether the strings would actually be in a dictionary (see myth 1 above), but to how they would be ordered if they were in a dictionary or encyclopaedia. That is always the first rule. If you look up something in a dictionary, or in a (paper) encyclopaedia, you just check the palce of the term in the alphabet, without considering upper- or lowercase. If you'd want to look up "Netherlands", you'd expect to find it somewhere between "nether" and "nethermore", even though "Netherlands" is capitalized and the other words are not.

    Also, in dictionaries, if a word is the same as another word but with a suffix, the word with suffix comes later. (In computer terms, this is implemented by padding the shorter words with spaces and ensuring the space character sorts first). So you expect "but" before "butter".

    Finally, at the last position, case sensitivity comes into play. In a case insensitive collation, "pepper", "PEPPER" and "pEpPeR" are all the same word; they compare equal and their relative sort order is undefined. In a case sensitive collation, they are considered not to be equal, and a choice had to be made as to their relative order when sorting. The choice made was that lowercase comes before uppercase. Why? No idea - at some point in time, someone somewhere had to make a choice, and this is what came out of it.

    I hope this helps!


    Hugo Kornelis, SQL Server/Data Platform MVP (2006-2016)
    Visit my SQL Server blog: https://sqlserverfast.com/blog/
    SQL Server Execution Plan Reference: https://sqlserverfast.com/epr/

  • Thanks Hugo,

    so is it correct to summarise this as: the sort occurs (logically) in two phases. The first phase is always case/accent/etc (?) INsensitive and the (optional) second phase only sorts each group of ties resulting from the first phase internally according to any case/accent/etc sensitivity that needs to be observed?

    Second question: where on earth is this documented?

  • Alex Fekken (9/13/2012)


    Thanks Hugo,

    so is it correct to summarise this as: the sort occurs (logically) in two phases. The first phase is always case/accent/etc (?) INsensitive and the (optional) second phase only sorts each group of ties resulting from the first phase internally according to any case/accent/etc sensitivity that needs to be observed?

    Second question: where on earth is this documented?

    Hi Alex,

    First answer: Yes, that sounds about right. With a possible exception for collations where accents or case influence dictionary sorting order (no idea if such collations exist, and if so how exactly they are handled).

    Second answer: No idea. I didn't make it up, so I must have read it once, somewhere (maybe even different places), but I can't give you links without first having to hunt them down myself - and for that, I have no time at the moment.


    Hugo Kornelis, SQL Server/Data Platform MVP (2006-2016)
    Visit my SQL Server blog: https://sqlserverfast.com/blog/
    SQL Server Execution Plan Reference: https://sqlserverfast.com/epr/

  • Hugo Kornelis (9/13/2012)


    First, let's dispell some myths.

    .

    .

    .

    I hope this helps!

    Hugo,

    As usual, thank you for the BRILLIANT and detailed explanation on this topic. I found your point #3 most useful

    I had always thought there was some kind of funny or weird method of sorting that did not use the standard ASCII number, but did not know what it was and why. Your explanation made me understand a lot, so I will only complement by reading about collations in BOL.

    "El" Jerry.

    "A watt of Ottawa" - Gerardo Galvan

    To better understand your help request, please follow these best practices.[/url]

  • I think the key to answering my question is along the lines of Hugo's

    Also, in dictionaries, if a word is the same as another word but with a suffix, the word with suffix comes later. (In computer terms, this is implemented by padding the shorter words with spaces and ensuring the space character sorts first). So you expect "but" before "butter".

    But with the key distinction that sql is comparing words without case first for this THEN sorting by case. If it sorted by the literal characters FIRST then peppers would be before Pepper regardless of length.... but by ignoring the case first, then sorting within the words by case I come out with the same answer SQL does.

    So with all words being same length

    VALUES ('apple'), ('Apple'), ('applx')

    If I did character by character "a" being before "A" should be apple, applx, Apple. But you get apple, Apple, applx.

    Interesting. And yes, I read BOL first and didn't feel it actually addressed my question exactly. I can agree with how it sorted them logically, as a person... it is just the programmer in me that wants precision in HOW it does it.

    Edit

    PS. I think this is also what Alex is saying πŸ™‚

  • Hugo Kornelis (9/13/2012)


    hopefully, some of them were posted in jest!

    Mine was. Sorry :blush:

  • Toreador (9/13/2012)


    Hugo Kornelis (9/13/2012)


    hopefully, some of them were posted in jest!

    Mine was. Sorry :blush:

    Don't be: extreme examples are a good and effective way of checking the (non-)validity of an explanation. Especially if they are funny πŸ™‚

  • I have no time at the moment.

    Bedankt Hugo.

    I didn't expect anyone to spend a lot of time on it but mainly tried to confirm that the answer wasn't easily found in the obvious places.

  • Thank you Hugo, the details what I had, like Alex has said possibly I have taken way too far. I need to be more consistent in my saying, as I did not imply plural or soundex is the only way, but I was using it to represent my thought which I had where intending I could have done better.

    (Like I said before It always feels great to interact with experts πŸ™‚ )

    Thank you, All.

    ww; Raghu
    --
    The first and the hardest SQL statement I have wrote- "select * from customers" - and I was happy and felt smart.

  • Great question, thanks!

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP

  • emiddlebrooks (9/13/2012)


    So with all words being same length

    VALUES ('apple'), ('Apple'), ('applx')

    If I did character by character "a" being before "A" should be apple, applx, Apple. But you get apple, Apple, applx.

    Excellent explanation from Hugo.

    To be clear, emiddlebrooks, you could think of it in the following steps:

    1. sort strings based on characters (insensitive to case)

    2. where strings match aside from one being longer, the longer string comes last

    3. next apply the case sensitive sort to break any remaining ties

    Hence, applx comes last because x comes after e. Just as you would expect to find in a dictionary for the same reason that 'and' comes before 'ant'.

    And when different lengths are involved you get 'and', then 'anda', then 'ant', then 'anta'.

    Case sensitivity only affects the sorting after this.

Viewing 15 posts - 31 through 45 (of 45 total)

You must be logged in to reply to this topic. Login to reply