Update: I just read the link in Steve's editorial: https://sqlquantumleap.com/2018/09/28/native-utf-8-support-in-sql-server-2019-savior-false-prophet-or-both/ There is a lot of good information in there. It looks like the UTF-8 collation is case insensitive, so I shouldn't have too much rework. I didn't see any mention of specifying the number of characters. It seems that everything is in byte lengths, so I may need to make a few fields wider. Since most of my data is in the 0-127 range, it shouldn't be too bad. Most of them have already been lengthened to accommodate the extra bytes already.
Hi there. Glad you found the info useful. I have since updated the article to remove some obsolete stuff, rearrange some things, add some new info, etc. I can also say that there is no singular UTF-8 collation (i.e. it's not "the UTF-8 collation"). The UTF-8 collations are mostly just a duplicate of all of the "_SC" and "_140_" collations with "_UTF8" added to the name. The only difference for the "_UTF8" version is that the encoding used for
VARCHAR data (columns, variables, and literals) is UTF-8 / code page 65001 instead of one of the 8-bit code page normally associated with that locale (i.e. "Latin1_General" = code page 1252, "Hebrew" = code page 1255, etc). Hence, you can still tailor the sorting and comparison sensitivities for any non-binary UTF8 collation you use (i.e. case, accent, width, kana type, and variation selector).
And yes, when specifying max size/length for the string types (anything that will accept a number), it is always in terms of "code units". Yes, most people, and maybe even most documentation/books, has characterized this value as being "characters", but that is oversimplifying it. To be fair, there was a time when it was accurate to say that the
VARCHAR(n) was "characters" and it was never incorrect, but that ended 20 years ago when SQL Server 2000 was released. But, "code units" are 1 byte in 8-bit datatypes (i.e.
VARCHAR and the deprecated
TEXT ) and 2 bytes in 16-bit datatypes (i.e.
NVARCHAR and the deprecated
NTEXT ). In UTF-16, "characters" are either 1 or 2 code units (i.e. 2 or 4 bytes). In UTF-8, "characters" are 1 - 4 code units (i.e. 1 - 4 bytes).
For more details regarding characters vs bytes, please see:
How Many Bytes Per Character in SQL Server: a Completely Complete Guide