# Differences Between the Various Binary Collations (Cultures, Versions, and BIN vs BIN2)

2019-03-13

(last updated: 2019-03-15 @ 02:45 EST / 2019-03-15 @ 06:45 UTC)

Binary collations are, in many ways, simpler to understand than non-binary collations because they don’t have any complex, culture-based linguistic rules. They just work with the character values and there is little room for mystery: all characters have underlying values; if two characters have the same underlying value then they are equal, else they are not; and the underlying value is the order, right?

Still, there are some complexities related to binary collations that you might not be aware of. To figure out what they are, we need to look at why there are so many binary collations in the first place. I mean, binary collations work on the underlying values of the characters, and comparing numbers doesn’t change between cultures or versions: 12 = 12, 12 > 11, and 12 < 13, always. So, then what is the difference between:

• `Latin1_General_100_BIN2` and `Hebrew_100_BIN2` (only the culture is different), or
• `Latin1_General_100_BIN2` and `Latin1_General_BIN2` (only the version is different), or
• `Latin1_General_100_BIN2` and `Latin1_General_100_BIN` (only the binary comparison type is different)

Let’s take a closer look at each of those three cases.

## Different Cultures

From the names alone we can see that `Latin1_General_100_BIN2` and `Hebrew_100_BIN2` are both “`_100_BIN2`” collations, and the difference between them is just “`Latin1_General`” vs “`Hebrew`” (i.e. the culture). For `NVARCHAR` data this difference is entirely meaningless: the behavior of all binary collations of the same version and same binary comparison type is exactly the same.

For `VARCHAR` data, however, there is an important difference. Every collation has an associated code page, which is the character set used for 8-bit (i.e. `VARCHAR`) data. Even a code page of `0`, meaning “no code page”, determines what characters can be stored in 8-bit datatypes: in that case, none. This is why such collations are called “Unicode-only”, and why they don’t work with `VARCHAR` data:

```sql
SELECT COLLATIONPROPERTY(N'Syriac_100_BIN2', 'CodePage');
-- 0
SELECT 'not gonna work, trust me ;-)' COLLATE Syriac_100_BIN2;
/*
Msg 459, Level 16, State 1, Line XXXXX
Collation 'Syriac_100_BIN2' is supported on Unicode data types
only and cannot be applied to char, varchar or text data types.
*/
```

No code page means that there is no 8-bit (i.e. `VARCHAR`) character set. Of course, the difference in culture will be more obvious if we look at a few non-zero code pages:

```sql
-- Which 8-bit code page is used for each of these collations:
SELECT COLLATIONPROPERTY(N'Latin1_General_100_BIN2', 'CodePage'),
       COLLATIONPROPERTY(N'Hebrew_100_BIN2', 'CodePage'),
       COLLATIONPROPERTY(N'Korean_100_BIN2', 'CodePage');
-- 1252    1255    949
```

Now we will inject the same underlying values into each of those three code pages. Please note that we need to use values in the range of 128 – 255 (0x80 – 0xFF) since the 0 – 127 range (0x00 – 0x7F) is the same across all of the code pages that can be represented in SQL Server, including Unicode data (both UTF-16 via `NVARCHAR` and, starting in SQL Server 2019, UTF-8 via `VARCHAR`).

```sql
DECLARE @Data TABLE
(
  [CodePage-1252]  VARCHAR(50) COLLATE Latin1_General_100_BIN2,
  [CodePage-1255]  VARCHAR(50) COLLATE Hebrew_100_BIN2,
  [CodePage-949]   VARCHAR(50) COLLATE Korean_100_BIN2
);
INSERT INTO @Data VALUES (0xE1FA, 0xE1FA, 0xE1FA);
INSERT INTO @Data VALUES (0xE4E8, 0xE4E8, 0xE4E8);
SELECT * FROM @Data;
/*
CodePage-1252    CodePage-1255    CodePage-949
áú               בת               守
äè               הט               怏
*/
```

As you can see in the results, the exact same two bytes — `0xE1` and `0xFA` — produce very different characters due to the code pages being different (code page 949 is a Double-Byte Character Set (DBCS) which is why there is only a single character returned for the Korean collation). And the same is true for the `0xE4` and `0xE8` bytes. This is the only reason for having more than a single “`_100_BIN2`” collation. And, this is also why Microsoft could have simply provided a single `UnicodeOnly_90_BIN2` collation and a single `UnicodeOnly_100_BIN2` collation instead of the 15 that exist across both of those versions:

```sql
SELECT ROW_NUMBER() OVER (
         PARTITION BY CASE WHEN col.[name] LIKE N'%[_]90[_]%' THEN 1 ELSE 2 END
         ORDER BY CASE WHEN col.[name] LIKE N'%[_]90[_]%' THEN 1 ELSE 2 END,
                  col.[name]
       ),
       col.[name]
FROM   sys.fn_helpcollations() col
WHERE  col.[name] LIKE N'%[_]BIN2'
AND    COLLATIONPROPERTY(col.[name], 'CodePage') = 0;
/*
1      Divehi_90_BIN2
2      Indic_General_90_BIN2
3      Syriac_90_BIN2
1      Assamese_100_BIN2
2      Bengali_100_BIN2
3      Divehi_100_BIN2
4      Indic_General_100_BIN2
5      Khmer_100_BIN2
6      Lao_100_BIN2
7      Maltese_100_BIN2
8      Maori_100_BIN2
9      Nepali_100_BIN2
10     Pashto_100_BIN2
11     Syriac_100_BIN2
12     Tibetan_100_BIN2
*/
```
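The code-page effect shown with the `0xE1` / `0xFA` bytes can be reproduced outside of SQL Server with anything that ships these Windows code pages. The following is a sketch using Python’s standard codecs, assuming only that the `cp1252` / `cp1255` / `cp949` codecs match SQL Server’s code pages of the same numbers (they are the same Windows code pages):

```python
# The same two bytes decode to different characters depending on the code page.
raw = bytes([0xE1, 0xFA])

print(raw.decode("cp1252"))  # áú  (Latin1_General)
print(raw.decode("cp1255"))  # בת  (Hebrew)
print(raw.decode("cp949"))   # one character: cp949 is a DBCS, so 0xE1 0xFA is a single lead/trail pair
```

The `cp949` result decodes to a single character precisely because that code page treats `0xE1` as a lead byte, mirroring the single-character Korean column in the SQL results above.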

## Different Versions

From the names alone we can see that `Latin1_General_100_BIN2` and `Latin1_General_BIN2` are both “`Latin1_General_*_BIN2`” collations, and the difference between them is just “`100`” vs “{no number; implied 80}” (i.e. the version). For `VARCHAR` data this difference is entirely meaningless: the behavior of all binary collations of the same culture is exactly the same.

For Unicode data (mainly `NVARCHAR`, but starting in SQL Server 2019 CTP 2.3, also `VARCHAR` if using the new `UTF8_BIN2` collation), however, there is an important difference. All collations map code points to various properties such as: sort weights (used for sorting and comparisons), numerical values for “numbers”, uppercase and lowercase versions, etc. For example:

```sql
SELECT 1 WHERE N'½' COLLATE Latin1_General_100_CI_AS LIKE N'[3-5]';
-- {no rows returned}
SELECT 2 WHERE N'½' COLLATE Latin1_General_100_CI_AS LIKE N'[0-1]';
-- 2
```
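The numeric-value mapping behind that `LIKE` result is a standard Unicode character property, not something specific to SQL Server. As a quick cross-check (in Python, not SQL Server), the `unicodedata` module exposes the same property:

```python
import unicodedata

# U+00BD VULGAR FRACTION ONE HALF carries the Unicode numeric value 0.5,
# which is consistent with it matching the range [0-1] but not [3-5] above.
print(unicodedata.numeric("½"))  # 0.5
print(unicodedata.name("½"))     # VULGAR FRACTION ONE HALF
```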

But, it is those uppercase and lowercase mappings that are important here. There are quite a few code points that are missing these mappings, but as each new version of collations is introduced, more and more of the code points are having their mappings added. For example, the following shows a code point that is missing its uppercase mapping in both version 80 and 90 collations (if it’s missing in any particular version, then it’s missing in all prior versions as well, which is why I don’t need a field for the version 80 collation):

```sql
SELECT NCHAR(0x1FF3) AS [TheChar],
       UPPER(NCHAR(0x1FF3) COLLATE Japanese_90_BIN2) AS [v90],
       UPPER(NCHAR(0x1FF3) COLLATE Japanese_XJIS_100_BIN2) AS [v100],
       UPPER(NCHAR(0x1FF3) COLLATE Japanese_XJIS_140_BIN2) AS [v140];
/*
TheChar    v90    v100    v140
ῳ          ῳ      ῼ       ῼ
*/
```

And, the following shows a code point that is missing its uppercase mapping in version 80, 90, and 100 collations:

```sql
SELECT NCHAR(0x0250) AS [TheChar],
       UPPER(NCHAR(0x0250) COLLATE Japanese_90_BIN2) AS [v90],
       UPPER(NCHAR(0x0250) COLLATE Japanese_XJIS_100_BIN2) AS [v100],
       UPPER(NCHAR(0x0250) COLLATE Japanese_XJIS_140_BIN2) AS [v140];
/*
TheChar    v90    v100    v140
ɐ          ɐ      ɐ       Ɐ
*/
```
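These mappings come straight from the Unicode Character Database, and newer Unicode data includes them. For instance, a current Python build (assuming its case tables are recent enough to contain the U+0250 mapping, which they have been for many versions) returns the same pair that the version 140 collations do:

```python
# U+0250 LATIN SMALL LETTER TURNED A maps to U+2C6F LATIN CAPITAL LETTER
# TURNED A in newer Unicode data -- the same mapping the version 140
# collations picked up.
lower = "\u0250"          # ɐ
upper = lower.upper()

print(upper, hex(ord(upper)))  # Ɐ 0x2c6f
```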

Ok, so how many of these mappings are missing from each version? I’m glad you asked 😉

The only way we have to determine which mappings are missing is by looking at the differences between the oldest version (highest number of missing mappings) and newest version (least number of missing mappings) collations. Logically, this means that since there are four versions — 80, 90, 100, and 140 — the version 80 data will naturally be “all missing” while the version 140 data will naturally be “none missing”. There might be additional mappings missing from all versions, but a) we have no way to determine that now as that requires cross-referencing the official Unicode Character Database, which is possible but way out of scope, and b) any potential missing mappings aren’t relevant to the question of “what are the differences between the existing collations available in SQL Server”.

The query below finds all code points that return the same character that was passed into the `LOWER` function when using a version 80 collation, but that also return a different character when the same initial character was passed in, but this time using a version 140 collation. For the version 90 and 100 collations, if the `LOWER` function returns a character that is different from what was passed in, the lowercase character is displayed, else it will be `NULL`. The final row is a summary line showing the totals. A field for the version 80 collations is included just to show clearly that there is no lowercase mapping for those characters.

```sql
;WITH nums AS
(
SELECT TOP (65534) ROW_NUMBER() OVER(ORDER BY @@SPID) AS [num]
FROM   master.sys.all_columns ac1
CROSS JOIN master.sys.all_columns ac2
), chars AS
(
SELECT nums.[num] AS [CodePoint],
CONVERT(BINARY(2), nums.[num]) AS [CodePointHex],
NCHAR(nums.[num]) AS [TheChar],
LOWER(NCHAR(nums.[num]) COLLATE Japanese_BIN2) AS [v80],
'    ' AS [----],
IIF(LOWER(NCHAR(nums.[num]) COLLATE Japanese_90_BIN2)
<> NCHAR(nums.[num]),
LOWER(NCHAR(nums.[num]) COLLATE Japanese_90_BIN2),
NULL) AS [v90],
IIF(LOWER(NCHAR(nums.[num]) COLLATE Japanese_XJIS_100_BIN2)
<> NCHAR(nums.[num]),
LOWER(NCHAR(nums.[num]) COLLATE Japanese_XJIS_100_BIN2),
NULL) AS [v100],
IIF(LOWER(NCHAR(nums.[num]) COLLATE Japanese_XJIS_140_BIN2)
<> NCHAR(nums.[num]),
LOWER(NCHAR(nums.[num]) COLLATE Japanese_XJIS_140_BIN2),
NULL) AS [v140]
FROM   nums
WHERE  LOWER(NCHAR(nums.[num]) COLLATE Japanese_BIN2)
= NCHAR(nums.[num]) -- mapping not in version 80 collations
AND    LOWER(NCHAR(nums.[num]) COLLATE Japanese_XJIS_140_BIN2)
<> NCHAR(nums.[num]) -- mapping in version 140 collations
)
SELECT * FROM chars
UNION ALL
SELECT NULL, NULL, N'TOTALS:',
CONVERT(NVARCHAR(5), COUNT(chars.[v80])),
'    ' AS [----],
CONVERT(NVARCHAR(5), COUNT(chars.[v90])),
CONVERT(NVARCHAR(5), COUNT(chars.[v100])),
CONVERT(NVARCHAR(5), COUNT(chars.[v140]))
FROM chars;
/*
CodePoint    CodePointHex    TheChar    v80    --    v90    v100    v140
...
502          0x01F6       Ƕ          Ƕ            NULL   ƕ       ƕ
...
8498         0x2132       Ⅎ          Ⅎ            NULL   ⅎ       ⅎ
...
11374        0x2C6E       Ɱ          Ɱ            NULL   NULL    ɱ
...
42820        0xA744       Ꝅ          Ꝅ            NULL   NULL    ꝅ
...
NULL         NULL         TOTALS:    305           0     200     305
*/
```

I have included a random(ish) sampling of 4 rows from the result set above, plus the final / summary line. The summary shows that there are 305 characters missing their lowercase mapping in the version 80 collations. The version 90 collations add no mappings (sad). But, the version 100 collations add 200 mappings, and the version 140 collations added another 105 mappings, giving it 305 mappings that are missing from the version 80 and 90 collations.

Interesting, but we aren’t done yet. The lowercase mappings are just one direction. We need to include the uppercase mappings as well. And, while it’s tempting, we should not assume that the missing uppercase mappings are merely the uppercase versions of the characters that are missing their lowercase mappings. Each code point has an uppercase or lowercase mapping (or can have), and these are completely independent, so just because one entry was added doesn’t mean that any other mapping has been added, even if it seems like it should have been.

The query below finds all code points that return the same character that was passed into the `UPPER` function when using a version 80 collation, but that also return a different character when the same initial character was passed in, but this time using a version 140 collation. For the version 90 and 100 collations, if the `UPPER` function returns a character that is different from what was passed in, the uppercase character is displayed, else it will be `NULL`. The final row is a summary line showing the totals. A field for the version 80 collations is included just to show clearly that there is no uppercase mapping for those characters.

```sql
;WITH nums AS
(
SELECT TOP (65534) ROW_NUMBER() OVER(ORDER BY @@SPID) AS [num]
FROM   master.sys.all_columns ac1
CROSS JOIN master.sys.all_columns ac2
), chars AS
(
SELECT nums.[num] AS [CodePoint],
CONVERT(BINARY(2), nums.[num]) AS [CodePointHex],
NCHAR(nums.[num]) AS [TheChar],
UPPER(NCHAR(nums.[num]) COLLATE Japanese_BIN2) AS [v80],
'    ' AS [----],
IIF(UPPER(NCHAR(nums.[num]) COLLATE Japanese_90_BIN2)
<> NCHAR(nums.[num]),
UPPER(NCHAR(nums.[num]) COLLATE Japanese_90_BIN2),
NULL) AS [v90],
IIF(UPPER(NCHAR(nums.[num]) COLLATE Japanese_XJIS_100_BIN2)
<> NCHAR(nums.[num]),
UPPER(NCHAR(nums.[num]) COLLATE Japanese_XJIS_100_BIN2),
NULL) AS [v100],
IIF(UPPER(NCHAR(nums.[num]) COLLATE Japanese_XJIS_140_BIN2)
<> NCHAR(nums.[num]),
UPPER(NCHAR(nums.[num]) COLLATE Japanese_XJIS_140_BIN2),
NULL) AS [v140]
FROM   nums
WHERE  UPPER(NCHAR(nums.[num]) COLLATE Japanese_BIN2)
= NCHAR(nums.[num]) -- mapping not in version 80 collations
AND    UPPER(NCHAR(nums.[num]) COLLATE Japanese_XJIS_140_BIN2)
<> NCHAR(nums.[num]) -- mapping in version 140 collations
)
SELECT * FROM chars
UNION ALL
SELECT NULL, NULL, N'TOTALS:',
CONVERT(NVARCHAR(5), COUNT(chars.[v80])),
'    ' AS [----],
CONVERT(NVARCHAR(5), COUNT(chars.[v90])),
CONVERT(NVARCHAR(5), COUNT(chars.[v100])),
CONVERT(NVARCHAR(5), COUNT(chars.[v140]))
FROM chars;
/*
CodePoint    CodePointHex    TheChar    v80    --    v90    v100    v140
...
1165         0x048D       ҍ          ҍ            NULL    Ҍ       Ҍ
...
8097         0x1FA1       ᾡ          ᾡ            NULL    ᾩ       ᾩ
...
42805        0xA735       ꜵ          ꜵ           NULL    NULL    Ꜵ
...
42849        0xA761       ꝡ          ꝡ           NULL    NULL    Ꝡ
...
NULL         NULL         TOTALS:    344          0      238     344
*/
```

I have included a random(ish) sampling of 4 rows from the result set above, plus the final / summary line. The summary shows that there are 344 characters missing their uppercase mapping in the version 80 collations. The version 90 collations add no mappings (sad). But, the version 100 collations add 238 mappings, and the version 140 collations added another 106 mappings, giving it 344 mappings that are missing from the version 80 and 90 collations.

Putting the data from both queries together, the overall totals are:

• Version 80: missing 649 mappings (305 + 344)
• Version 90: missing 649 mappings (305 + 344)
• Version 100: added 438 mappings (200 + 238) over versions 80 and 90 ; only missing 211 mappings (105 + 106, or 649 – 438)
• Version 140: added 211 mappings (105 + 106) over version 100 ; added all 649 over versions 80 and 90

This data shows why:

1. if possible, it is important to use the newest version available (version 100 was introduced in SQL Server 2008, 140 was introduced in SQL Server 2017)! Please keep in mind that these mappings are just one difference. There are also many thousands of sort weights missing from the earlier versions (almost 20k missing from version 80, almost 13k missing from version 90, and almost 5k missing from version 100).
2. the `UTF8_BIN2` collation, new in CTP 2.3 of SQL Server 2019, is better than having no binary UTF-8 collation (which was the case prior to CTP 2.3), however, being a version 80 collation, it’s missing 649 uppercase / lowercase mappings 😿

## Different Binary Comparison Types

From the names alone we can see that `Latin1_General_100_BIN2` and `Latin1_General_100_BIN` are both “`Latin1_General_100`” collations, and the difference between them is just “`_BIN`” vs “`_BIN2`” (i.e. the binary comparison type). For non-UTF-8 `VARCHAR` data this difference is entirely meaningless: the behavior of all binary collations of the same culture is exactly the same.

For Unicode data (mainly `NVARCHAR`, but starting in SQL Server 2019 CTP 2.3, also `VARCHAR` if using the new `UTF8_BIN2` collation), however, there is an important difference. When comparing values, the behavior of all binary collations is exactly the same. So, what we are interested in is sorting.

According to the Microsoft documentation page for Collation and Unicode Support:

> In a BIN2 collation all characters are sorted according to their code points. In a BIN collation only the first character is sorted according to the code point, and remaining characters are sorted according to their byte values. (Because the Intel platform is a little endian architecture, Unicode code characters are always stored byte-swapped.)

That means: the behavior is exactly the same for the first character only. So, if you are testing the difference between “`_BIN`” and “`_BIN2`” by sorting a list of single characters, then you will never see a difference. In order to see the difference in behavior, you need to skip the first character. The following test shows this by providing both a field with a single character, and a field with a prefix character that is the same for both rows, cancelling out the effect of the first character on the sorting:

```sql
SELECT tab.[ID],
tab.[col] AS [TheChar],
N'_' + tab.[col] AS [PrefixedChar],
'--' AS [--],
ROW_NUMBER() OVER (ORDER BY tab.[col]
COLLATE Latin1_General_100_BIN) AS [TheCharBIN],
ROW_NUMBER() OVER (ORDER BY N'_' + tab.[col]
COLLATE Latin1_General_100_BIN) AS [PrefixedBIN],
'--' AS [--],
ROW_NUMBER() OVER (ORDER BY tab.[col]
COLLATE Latin1_General_100_BIN2) AS [TheCharBIN2],
ROW_NUMBER() OVER (ORDER BY N'_' + tab.[col]
COLLATE Latin1_General_100_BIN2) AS [PrefixedBIN2]
FROM   (VALUES (1, NCHAR(0x1225)), (2, NCHAR(0x2113))) tab([ID], [col])
ORDER BY tab.[ID];
/*
ID   TheChar   PrefixedChar   --   TheCharBIN   PrefixedBIN   --   TheCharBIN2   PrefixedBIN2
1    ሥ         _ሥ            --   1            2             --   1             1
2    ℓ         _ℓ            --   2            1             --   2             2
```

The “PrefixedBIN” field shows a difference because starting at byte #3, the “`_BIN`” collation sees `0x25` and `0x13` (due to the byte-swapping of the Little Endian encoding), while the “`_BIN2`” collation sees `0x12` and `0x21`. If that is not clear, maybe the following chart will help. And keep in mind we are working with the “Prefixed” value that effectively skips the first character since there is no sorting difference with that first position. The chart below shows the character and its underlying bytes starting at character position / index #2.

| Code Point | Character | UTF-16 LE | BIN ordering | BIN2 ordering |
|------------|-----------|-----------|--------------|---------------|
| U+1225 | ሥ | `0x2512` | `0x25` then `0x12` | `0x12` then `0x25` |
| U+2113 | ℓ | `0x1321` | `0x13` then `0x21` | `0x21` then `0x13` |

### Supplementary Characters

This next query shows how Supplementary Characters are handled across “`_BIN`” and “`_BIN2`” collations, as well as non-binary collations of version 90 or higher:

```sql
SELECT tab.[ID],
tab.[col] AS [TheChar],
N'_' + tab.[col] AS [PrefixedChar],
'---' AS [---],
ROW_NUMBER() OVER (ORDER BY tab.[col]
COLLATE Latin1_General_100_BIN) AS [Char_BIN],
ROW_NUMBER() OVER (ORDER BY N'_' + tab.[col]
COLLATE Latin1_General_100_BIN) AS [Prefixed_BIN],
'---' AS [---],
ROW_NUMBER() OVER (ORDER BY tab.[col]
COLLATE Latin1_General_100_BIN2) AS [Char_BIN2],
ROW_NUMBER() OVER (ORDER BY N'_' + tab.[col]
COLLATE Latin1_General_100_BIN2) AS [Prefixed_BIN2],
'---' AS [---],
ROW_NUMBER() OVER (ORDER BY tab.[col]
COLLATE Japanese_90_CI_AS) AS [Char_CI_AS],
ROW_NUMBER() OVER (ORDER BY N'_' + tab.[col]
COLLATE Japanese_90_CI_AS) AS [Prefixed_CI_AS]
FROM   (VALUES (1, NCHAR(0xFB06)), (2, NCHAR(0xD802) + NCHAR(0xDC2C)),
(3, NCHAR(0xD83D) + NCHAR(0xDE43))) tab([ID], [col])
ORDER BY tab.[ID];
```

The query above returns the following 3 rows:

| ID | TheChar | PrefixedChar | Char_BIN | Prefixed_BIN | Char_BIN2 | Prefixed_BIN2 | Char_CI_AS | Prefixed_CI_AS |
|----|---------|--------------|----------|--------------|-----------|---------------|------------|----------------|
| 1 | ﬆ | _ﬆ | 3 | 2 | 3 | 3 | 1 | 1 |
| 2 | 𐠬 | _𐠬 | 1 | 1 | 1 | 1 | 2 | 2 |
| 3 | 🙃 | _🙃 | 2 | 3 | 2 | 2 | 3 | 3 |

The reason for including this query is to highlight two related nuances of working with UTF-16 (i.e. the Unicode encoding used by `NVARCHAR` data): one of terminology, and the other of behavior.

Looking at the order number columns (the six right-most columns), we can observe the following:

1. Both the prefixed and non-prefixed columns for “`_BIN2`” sorted in the same order.
2. The non-prefixed characters for both “`_BIN`” and “`_BIN2`” sorted in the same order.
3. The non-prefixed and prefixed characters for “`_BIN`” did not sort in the same order.

So far, all of those behaviors are expected and consistent with what we have seen previously in this post.

But, the non-binary, version 90 collation columns (the right-most two columns) show an ordering that is both:

• the same between the non-prefixed and prefixed columns, yet
• different than either of the two other sort orders seen in the columns to the left

This difference in sort order exposes a technicality that is not often mentioned, probably because most of the time there is little practical benefit in being this technical. By now you are probably aware that the underlying values used in Unicode to uniquely identify each character are called “code points”. A code point can be encoded in various ways. The Unicode specification provides five options for encoding: UTF-8, UTF-16 Big Endian, UTF-16 Little Endian, UTF-32 Big Endian, and UTF-32 Little Endian (I am leaving out the non-Endian variations of UTF-16 and UTF-32 since they merely imply Big Endian). UTF-8 encodes each code point as a single entity of between 1 and 4 bytes. UTF-32 encodes each code point as a single entity of 4 bytes.

In contrast, UTF-16 encodes each code point as either one or two 2-byte entities. For the first 65,536 code points (the total number of combinations of 2 bytes, `0x0000` – `0xFFFF`), everything said so far still holds true. But, things get a little confusing when dealing with Supplementary Characters (i.e. all characters outside of the initial 65,536). Supplementary Characters, like all other characters, each have a unique code point. In UTF-8 and UTF-32 there is no difference, but in UTF-16 each Supplementary Character is comprised of two non-Supplementary Characters (i.e. a “surrogate pair”). So what does the term “code point” refer to? Does it refer to the actual Supplementary Character, or does it refer to each of the two-byte pieces that make up the Supplementary Character? It can’t be both. And it isn’t.

In UTF-16, the two-byte entities are actually known as “code units”. And, it just so happens that the first 65,536 code points are each made up of a single code unit having the same value (e.g. code point U+FB06 is code unit `FB06`). Supplementary Characters are code points comprised of two code units (e.g. code point U+1F643 is comprised of code units `D83D` and `DE43`).
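The split of a supplementary code point into its two code units follows a fixed formula from the Unicode standard: subtract `0x10000`, then the top 10 bits go into a `D800`-based code unit and the bottom 10 bits into a `DC00`-based one. A quick sketch in Python:

```python
def to_code_units(code_point):
    """Return the UTF-16 code unit(s) for a code point (surrogate pair math)."""
    if code_point <= 0xFFFF:
        return [code_point]               # BMP: one code unit, same value
    offset = code_point - 0x10000         # 20 bits remain
    high = 0xD800 + (offset >> 10)        # top 10 bits -> high surrogate
    low  = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits -> low surrogate
    return [high, low]

print([hex(u) for u in to_code_units(0xFB06)])   # ['0xfb06']
print([hex(u) for u in to_code_units(0x1F643)])  # ['0xd83d', '0xde43']
```

For U+1F643 this yields exactly the `D83D` / `DE43` pair mentioned above, and it matches what an actual UTF-16 encoder produces.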

Looking back at the Microsoft documentation quote at the beginning of this section, we can now see that the statement, “In a BIN2 collation all characters are sorted according to their code points”, is not entirely accurate. Sure, for non-Supplementary Characters that is true, but dealing with Supplementary Characters shows us that the definition of BIN2 collations is technically incorrect. The BIN2 collations, when dealing with `NVARCHAR` data, sort by code unit, not by code point.

Am I being unfair, or overly nit-picky? No, because the query and results above show that there are three ways of sorting these values, and it is not possible to explain this behavior without understanding the difference between code unit and code point. Hopefully the chart below will help visualize the difference in these three sorting methods. This chart shows the same three characters in the same progression as the results above: `BIN`, then `BIN2`, and then non-binary-version-90-or-newer. But this time I’m showing how the sorting algorithm saw each character, which will help explain why each approach did something different.

• Binary (`_BIN`): this algorithm goes byte-by-byte (after the first character, so after the first two bytes). And, since we are working on a Little Endian platform, the bytes of each base entity are reversed. A UTF-16 entity (i.e. code unit) is always 2 bytes. The value in parentheses in the “Binary” column is the reversed first code unit, which is why the first byte in the first row is `0x02` instead of `0xD8`, and the first byte in the second row is `0x06` instead of `0xFB`.
• Code Unit (`_BIN2`): this algorithm goes code unit-by-code unit. A code unit’s value stays the same no matter how it is encoded. So the number of bytes and the order that they are in physically does not matter. There is no need to show a different representation in parentheses in the “CodeUnit” column because `D802` is `D802`, even if that code unit is stored as `0x02D8` on disk.
• Code Point (non-binary-version-90-or-newer): this algorithm goes code point-by-code point (for Supplementary Characters only; otherwise non-binary collations use culture-specific linguistic rules). The value in parentheses in the “CodePoint” column is the actual code point, and that is the value used for sorting (not the bytes or the code units).
| Binary (`_BIN`) | CodeUnit (`_BIN2`) | CodePoint (`_90_CI_AS`) |
|-----------------|--------------------|-------------------------|
| 0xD802DC2C (02D8) | 0xD802DC2C | 0xFB06 (U+FB06) |
| 0xFB06 (06FB) | 0xD83DDE43 | 0xD802DC2C (U+1082C) |
| 0xD83DDE43 (3DD8) | 0xFB06 | 0xD83DDE43 (U+1F643) |
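All three orderings can be reproduced from the encodings alone. This Python sketch sorts the same three characters by UTF-16 LE bytes (what “`_BIN`” effectively compares after the first character), by code unit sequence (what “`_BIN2`” compares; comparing UTF-16 Big Endian bytes is equivalent), and by code point:

```python
chars = ["\uFB06", "\U0001082C", "\U0001F643"]   # ﬆ, 𐠬, 🙃

by_le_bytes    = sorted(chars, key=lambda c: c.encode("utf-16-le"))  # ~ _BIN
by_code_units  = sorted(chars, key=lambda c: c.encode("utf-16-be"))  # ~ _BIN2
by_code_points = sorted(chars, key=ord)           # ~ supplementary handling in _90_CI_AS

print([f"U+{ord(c):04X}" for c in by_le_bytes])    # ['U+1082C', 'U+FB06', 'U+1F643']
print([f"U+{ord(c):04X}" for c in by_code_units])  # ['U+1082C', 'U+1F643', 'U+FB06']
print([f"U+{ord(c):04X}" for c in by_code_points]) # ['U+FB06', 'U+1082C', 'U+1F643']
```

The three resulting orders match the three columns of the chart above, row for row.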

Given that the “`_BIN2`” collations are described as being “code point” ordered yet are actually “code unit” ordered, which behavior is correct? According to the following quote from the official Unicode standard (i.e. the specification) in Chapter 2: General Structure / Section 2.5: Encoding Forms (page 39):

> All three encoding forms give the same results for binary string comparisons or string sorting when dealing only with BMP characters (in the range U+0000..U+FFFF). However, when dealing with supplementary characters (in the range U+10000..U+10FFFF), UTF-16 binary order does not match Unicode code point order.

it would appear that the behavior is correct, and the documentation should probably be updated.

### UTF-8 Bonus Round

Starting in SQL Server 2019 it’s possible to use the UTF-8 encoding.

The Unicode specification (same section / page as quoted directly above) states:

> A binary sort of UTF-8 strings gives the same ordering as a binary sort of Unicode code points. This is obviously the same order as for a binary sort of UTF-32 strings.

We can test with the following query:

```sql
SELECT tab.[ID],
tab.[col] AS [TheCharNVC],
CONVERT(VARCHAR(10), tab.[col] COLLATE UTF8_BIN2) AS [TheCharVC],
'--' AS [--],
ROW_NUMBER() OVER (ORDER BY tab.[col] COLLATE UTF8_BIN2)
AS [CharNVC_UTF8_BIN2],
ROW_NUMBER() OVER (ORDER BY CONVERT(VARCHAR(10), tab.[col]
COLLATE UTF8_BIN2)) AS [CharVC_UTF8_BIN2]
FROM   (VALUES (1, NCHAR(0xFB06)),
(2, NCHAR(0xD802) + NCHAR(0xDC2C)),
(3, NCHAR(0xD83D) + NCHAR(0xDE43))) tab([ID], [col])
ORDER BY tab.[ID];
```

which returns:

| ID | TheCharNVC (UTF-16) | TheCharVC (UTF-8) | CharNVC_UTF8_BIN2 (UTF-16) | CharVC_UTF8_BIN2 (UTF-8) |
|----|---------------------|-------------------|----------------------------|--------------------------|
| 1 | ﬆ | ﬆ | 3 | 1 |
| 2 | 𐠬 | 𐠬 | 1 | 2 |
| 3 | 🙃 | 🙃 | 2 | 3 |

The `NVARCHAR` columns show us that we are dealing with the same three characters between the UTF-8 and UTF-16 columns, that the ordering is consistent with what we saw previously for `NVARCHAR` data, and that the ordering is different than the UTF-8 column. We are now interested in UTF-8 behavior, and that can only be found when working with `VARCHAR` data (and using a UTF8 collation, of course). The chart above shows that yes, the `UTF8_BIN2` collation (new in SQL Server 2019 CTP 2.3) does indeed use the code point value for ordering.
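The specification’s claim, that a binary sort of UTF-8 bytes matches code point order while UTF-16 code-unit order does not, can also be checked directly at the encoding level (a Python sketch, independent of SQL Server):

```python
chars = ["\uFB06", "\U0001082C", "\U0001F643"]   # ﬆ, 𐠬, 🙃

by_utf8_bytes  = sorted(chars, key=lambda c: c.encode("utf-8"))
by_code_points = sorted(chars, key=ord)
by_utf16_units = sorted(chars, key=lambda c: c.encode("utf-16-be"))

print(by_utf8_bytes == by_code_points)  # True:  UTF-8 binary order == code point order
print(by_utf8_bytes == by_utf16_units)  # False: UTF-16 code-unit order differs
```

This mirrors the results above: the `VARCHAR` / `UTF8_BIN2` column sorts in code point order, while the `NVARCHAR` column under the same collation sorts in code unit order.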
