Is collate just the language alphabet?

  • 1. Is collate just the language alphabet? What other alphabets can be saved in a SQL Server column?


    2. You use nvarchar (national variable character) for extended alphabets, and nvarchar uses Unicode. What is Unicode?

  • 1.  Collation determines what character set can be saved in a given column/database, and implicit with that is the default sort order when sorting character data, i.e. whether "a" comes before "b", and so on. In some alphabets (German, Nordic, Arabic, Chinese, etc.) the ordering of characters is not the same as in the standard English alphabet (see the sketch at the end of this post).

    2.  Unicode is a technique used to cater for alphabets which cannot be saved in the standard ASCII character set. It is primarily used for the Chinese/Japanese regions, because those alphabet sets need far more codes than are available within the original ASCII set to cover the full range of characters they incorporate. Unicode uses 2 bytes for every character, compared to ASCII's 1 byte, which has obvious implications for the amount of storage required to save the same amount of data. If you are not dealing with 'beyond the normal' alphabet sets, then there is no need to incur the penalty of working with Unicode. However, if you do work with Chinese/Japanese/etc. data, then you have NO choice but to work with Unicode.
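
    A quick sketch of both points; the COLLATE clause and DATALENGTH are standard T-SQL, and the Danish/Norwegian collation is just one example of a non-English sort order:

        -- Sort order depends on the collation in effect: under the
        -- Danish/Norwegian collation, 'Aa' is treated as the letter å
        -- and sorts after 'Z'.
        SELECT name
        FROM (VALUES (N'Aalborg'), (N'Apple'), (N'Zebra')) AS t(name)
        ORDER BY name COLLATE Danish_Norwegian_CI_AS;

        -- Storage: nvarchar (Unicode) uses 2 bytes per character,
        -- varchar uses 1 byte for single-byte character sets.
        SELECT DATALENGTH('abc')  AS varchar_bytes,   -- 3
               DATALENGTH(N'abc') AS nvarchar_bytes;  -- 6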

  • Super answers, thanks!!!! Well written.

  • It should also be noted that collation settings affect how the case of characters (among other things) is used in comparison operations.

    For example, in a case-sensitive collation, I could have both 'Apple' and 'APPLE' as values in a column upon which a uniqueness constraint is in effect.

    In a case-insensitive collation, this would violate the constraint (see the sketch at the end of this post).

    I'd take a look at BOL for more in-depth information.
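
    A minimal sketch of that example, using temp tables and the Latin1_General collations (the names are purely illustrative):

        -- Case-sensitive collation: 'Apple' and 'APPLE' are distinct values.
        CREATE TABLE #FruitCS (
            name varchar(50) COLLATE Latin1_General_CS_AS UNIQUE
        );
        INSERT INTO #FruitCS VALUES ('Apple');
        INSERT INTO #FruitCS VALUES ('APPLE');  -- succeeds

        -- Case-insensitive collation: the second insert violates UNIQUE.
        CREATE TABLE #FruitCI (
            name varchar(50) COLLATE Latin1_General_CI_AS UNIQUE
        );
        INSERT INTO #FruitCI VALUES ('Apple');
        INSERT INTO #FruitCI VALUES ('APPLE');  -- fails with a duplicate key error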


  • Let me correct an earlier oversimplification.

    Unicode is a set of code points, and it can be encoded in a variety of ways. So Unicode does not use 2 bytes per character, but there is a very popular encoding of an old Unicode standard which did use 2 bytes per character -- actually, at least two such encodings, called "UCS-2LE" and "UCS-2BE" (little-endian and big-endian).

    Way back when NT 4 was created, there were fewer than 65,536 characters in the Unicode list, so it was possible to define encodings giving exactly 2 bytes to every one of them. Windows NT was built to run natively on one of those encodings, specifically "UCS-2LE".

    In the 10 years since then, Unicode has grown beyond the 65,536 limit, so it is no longer possible to encode every character in 2 bytes -- if you know your powers of 2, this will be obvious to you (2^16 = 65,536, so 2 bytes can distinguish at most 65,536 values; the sketch at the end of this post shows one character beyond that range).

    However, if you are willing to stick to the old Unicode 2 list (which includes almost everything you'd want except all of Chinese), you can use that old UCS-2LE mapping.

    My guess is that most applications just use the old mapping like that.

    If you have to conform to the mainland Chinese encoding GB18030, however, you will probably not be able to get away with sticking to the ten-year-old list.

    There are several modern choices for Unicode encoding, but the most popular, by far I believe, is UTF-8. It handles all the Unicode characters, is variable-length, is perfectly backward-compatible with ASCII (an advantage not possessed by UCS-2LE), and is very popular on the World Wide Web, to my knowledge.
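
    As a rough illustration of that 2-byte limit, here is a sketch showing that SQL Server's nvarchar stores a character beyond the 65,536 mark as a surrogate pair of two 2-byte units (U+20000 is a CJK character, built here from its two surrogate halves):

        -- U+20000 does not fit in one 2-byte code unit; nvarchar
        -- stores it as the surrogate pair 0xD840 0xDC00, i.e. 4 bytes.
        DECLARE @c nvarchar(10);
        SET @c = NCHAR(0xD840) + NCHAR(0xDC00);
        SELECT DATALENGTH(@c) AS bytes,  -- 4
               LEN(@c)        AS chars;  -- 2 under a non-SC collation:
                                         -- counted as two 2-byte units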

  • Another point that I'm not sure was mentioned above is that collation affects which characters are considered equal. This includes case sensitivity (are "a" and "A" equal?) and accent sensitivity (are "a" and "á" equal?).
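
    A quick illustration of both, using the COLLATE clause on a comparison (the specific Latin1_General collations are just examples):

        -- Case- and accent-insensitive: both comparisons come back 'equal'.
        SELECT CASE WHEN 'a' = 'A'   COLLATE Latin1_General_CI_AI
                    THEN 'equal' ELSE 'not equal' END AS case_test,
               CASE WHEN N'a' = N'á' COLLATE Latin1_General_CI_AI
                    THEN 'equal' ELSE 'not equal' END AS accent_test;

        -- Case- and accent-sensitive: both come back 'not equal'.
        SELECT CASE WHEN 'a' = 'A'   COLLATE Latin1_General_CS_AS
                    THEN 'equal' ELSE 'not equal' END AS case_test,
               CASE WHEN N'a' = N'á' COLLATE Latin1_General_CS_AS
                    THEN 'equal' ELSE 'not equal' END AS accent_test;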
