RE: BCP - unicode – SQLServerCentral

SSCoach

Points: 16431

February 4, 2019 at 12:03 pm

#2021075

JohnG69 - Tuesday, August 4, 2015 11:18 AM
I'm building a SP that uses the BCP command to create a file with the -w option.
When I open the file in TextPad, I see the file type as Unicode
When I open the file in editpad pro, I see the file type as UTF-16
If I understand correctly, BCP -w spits out files that are UTF-16LE... Is this correct?
Based on http://unicode.org/faq/utf_bom.html, the difference between UTF-16 and UTF-16LE is:"UTF-16 and UTF-32 use code units that are two and four bytes long respectively. For these UTFs, there are three sub-flavors: BE, LE and unmarked. The BE form uses big-endian byte serialization (most significant byte first), the LE form uses little-endian byte serialization (least significant byte first) and the unmarked form uses big-endian byte serialization by default, but may include a byte order mark at the beginning to indicate the actual byte serialization used."
How does this affect our SQL Servers?
This can only affect developers if they read the files, right?
or does it affect the tables that are loaded by the bcp?

Hello JG.

"UTF-16" by itself should refer to "UTF-16 BE", but that is just a recommendation. In practice, especially in Microsoft-land, it is often used in place of "UTF-16 LE", especially because in many cases, people are unaware of Endianess in the first place.
"UTF-16 LE" is what Windows / .NET / SQL Server uses. It is the encoding used for "string" in .NET and NVARCHAR / NCHAR / NTEXT / XML in SQL Server.
"Unicode" is a synonym for "UTF-16 LE" in Microsoft-land. In .NET, if you want UTF-16 BE, they have "BigEndianUnicode".
Yes, the "-w" option of BCP reads or writes (depending on if you are using BCP to import or export data) Unicode / UTF-16 LE.
None of this affects your SQL Server instances. UTF-16 LE is simply the encoding used for the Unicode-only datatypes (NVARCHAR / NCHAR / NTEXT / XML).
Starting with SP2 for SQL Server 2014, BCP and BULK INSERT support UTF-8 files via code page '65001'. While UTF-8 is Unicode, it is an 8-bit encoding and so you cannot use the " -w " switch. You would use the " -c " switch since along with " -C 65001 ". This data can go into, or come from, NVARCHAR types, at which point it will be converted to, or from, UTF-16 LE (depending on whether you are importing or exporting data).
Starting with SQL Server 2019, UTF-8 is supported internally via new " _UTF8 " collations. These collations use code page 65001 in VARCHAR / CHAR (but not TEXT) datatypes. This has no bearing on how data is imported from files, or exported to files, as the file encoding is handled by the client code (bcp.exe / BULK INSERT / sqlcmd.exe / etc). Having UTF-8 files to import, or wanting to export to UTF-8, will not be made any easier by using UTF-8 collations, however, since again, that is handled by the code doing the import or export. For further details on this new feature, please see my post: Native UTF-8 Support in SQL Server 2019: Savior or False Prophet?
Yes, the encoding used by import / export files is only a concern for the producer of the import file, or consumer of the export file. Import files should use whatever encoding that can be imported by SQL Server and not result in data loss. Meaning, if you have Unicode data, requesting import files be encoded as Unicode / UTF-16 LE will work across all versions of BCP / BULK INSERT, etc. If you are using SQL Server 2014 SP2 or newer, then you have the option of requesting that import files be encoded as UTF-8. Likewise with exports, you need to export in whatever encoding can be read by the consumer and won't result in data loss. If you have Unicode data, exporting as Unicode / UTF-16 LE via the " -w " switch will work across all versions, while using " -c -C 65001 " to encode the file as UTF-8 will work starting with SQL Server 2014 SP2.
None of this affects how the data is stored within SQL Server (i.e. "the tables that are loaded by bcp"). Within SQL Server, the encoding is determined by datatype and collation:

If NVARCHAR / NCHAR / NTEXT / XML, then the encoding is UTF-16 LE.
If VARCHAR / CHAR / TEXT, then the encoding is determined by the code page used by the collation of the column (or the default collation of the database when dealing with string literals and variables), with the following two exceptions:

There are 15 collations that are "Unicode-only" and have a code page of 0, and hence cannot be used with VARCHAR / CHAR / TEXT datatypes
The UTF-8 collations cannot be used with the TEXT datatype

Hopefully this info helps, even if it has been 3.5 years 🙂 .

For more info on working with collations, please visit: Collations Info

Take care, Solomon...

SQL# — https://SQLsharp.com/ ( SQLCLR library ofover 340 Functions and Procedures)
Sql Quantum Lift — https://SqlQuantumLift.com/ ( company )
Sql Quantum Leap — https://SqlQuantumLeap.com/ ( blog )
Info sites — Collations • Module Signing • SQLCLR