I had a recent run in with collation problems and it got me started reading. As I read I started to realize that for a seemingly simple subject there was a lot to it. So I thought I would go over some of what I found out. First things first of course.
What is a collation? Collation is a property assigned to any character field that defines the sort and comparison rules for the field. The information I’m going over is mostly found in BOL under Collation and Unicode Support. It’s something like 10 pages long which just goes to show I wasn’t exaggerating about the depth of the subject.
Prior to SQL Server 2012 there were four options associated with collation with a 5th added in SQL Server 2012. They are specified by appending them to the collation name along with a _. So Collation_CS_AS_KS_WS_SC is case sensitive, accent sensitive, kana sensitive, width sensitive and supplementary character sensitive.
- Case sensitivity (CS/CI) – Will the sort/comparison take case (upper/lower) of the characters into account. If CS (case sensitive) it will and if CI (case insensitive) it will not.
- Accent sensitivity (AS/AI) – Will the sort/comparison take into account accented characters (a/?). If AS (accent sensitive) then it will. If AI (accent insensitive) it will not.
- Kana sensitivity (KS) – This option is specific to Japanese kana characters. There are two types of kana characters, Hiragana and Katakana. If KS is specified then the sort/comparison will be treat the kana characters differently. If it is not specified then they are treated the same. There is no KI.
- Width sensitivity (WS) – I find this an interesting option. This determines if the storage (one byte or two) is taken into account in the sort/comparison. There is no WI. If WS is included then the collation is width sensitive. If it isn’t included then the collation is not width sensitive
- UTF-16 Supplementary Characters (SC) – Thanks to Bob Beauchemin for pointing this one out to me. But basically this is new as of SQL 2012 and it supports UTF16 characters that require more than a 16bit (2 byte) word. It is supported in 90 level and up collations.
SQL Server supports three different collation sets. Windows collations, Binary collations, and SQL Server collations. I honestly found the differences rather confusing so I’ll let you read up on them yourself.
In addition to the sort/comparison options, for non-Unicode data types (char/varchar/text) collation also determines the code page. Because non-Unicode types only have one byte per character they are more restricted in the number of characters available. The code page determines what character set is available. So for example Hebrew collations will use an entirely different set of characters than English ones. Unicode datatypes (nchar/nvarchar/ntext) don’t have the same problem. Because they have the additional byte per character they can hold a large enough number of possible characters that a code page isn’t really necessary.
Over the next few weeks I’m going to do a number of different posts around the surprisingly deep subject of collation.
As a start listen to Rob Farley’s hilarious talk on collation
Collation: Correction -> Expression, Column, Database
Collation: Correction -> Instance
Collation: Temporary objects & variables: Demonstration
Filed under: Collation, Microsoft SQL Server, SQLServerPedia Syndication, T-SQL Tagged: code language, collation, language sql, microsoft sql server, T-SQL