UTF-8 to UTF-16 conversion

  • Hi all,

    As you all know the bulk insert / bcp commands are made to work with UTF-16le files.

    on a daily basis, We receive multiple CSV files which are in UTF-8 coding.

    I found a great way to convert the file using powershell:

    Set @SQLStr = 'xp_cmdshell ''powershell.exe -command "get-content ' + @SourcePathFile + ' | out-file -filepath ' + @DestinationPathFile + ' -encoding Unicode"'''

    Exec (@SQLSTr)

    The problem is that the customer that we are dealing with doesn't want to enable xp_cmdshell (which can be understandable for some and not for others)

    I tried the Bulk insert on an UTF-8 encoded file and it seems to work but I see extra characters in the first row, which puts in doubt the validity of loading a UTF8 file with bulk insert...

    drop table #tmp

    create table #tmp (line varchar(8000));

    bulk insert #tmp from 'C:\Users\John_Doe\The_UTF8_File.csv' with (DATAFILETYPE = 'widechar',Firstrow = 1, ROWTERMINATOR = '')

    select * from #tmp

    Without having to install or change anything on the client's server, how can I automate this conversion process, so that the bulk insert works properly?

    The idea is so simple... "convert a UTF8 file to UTF16le"... but it doesn't seem to be that simple since I can't find it anywhere on the web...

    any idea is welcome.

    thanks

    JG

  • JohnG69 - Friday, September 25, 2015 9:47 AM

    As you all know the bulk insert / bcp commands are made to work with UTF-16le files.

    We receive multiple CSV files which are in UTF-8 coding.
    I found a great way to convert the file using powershell ... The problem is that the customer that we are dealing with doesn't want to enable xp_cmdshell. I tried the Bulk insert on an UTF-8 encoded file and it seems to work but I see extra characters in the first row, which puts in doubt the validity of loading a UTF8 file with bulk insert.

    bulk insert #tmp
       from 'C:\Users\John_Doe\The_UTF8_File.csv'
       with (DATAFILETYPE = 'widechar',Firstrow = 1, ROWTERMINATOR = '')

    Without having to install or change anything on the client's server, how can I automate this conversion process, so that the bulk insert works properly?

    Hi John. First off, please see my reply to your related question from another topic here, which covers the basics of UTF-8 / UTF-16LE, etc:

    BCP - unicode

    Secondly, using a DataFileType of "widechar" is incorrect for UTF-8. "widechar" indicates UTF-16 LE, the 16-bit encoding used by NVARCHAR / NCHAR / NTEXT / XML. You need to use "char" as that is for 8-bit encodings, which is what UTF-8 is.

    Thirdly, you don't mention either here or in the related question which version of SQL Server you are using. But you posted both towards the end of 2015, so I should mention that starting with SQL Server 2014 SP2, code page 65001 (which is UTF-8) is supported in bcp and BULK INSERT.

    Fourthly, for anyone using SQL Server 2014 SP1 or older, the incoming data does need to be converted at some point. PowerShell is  good option, and I don't believe that you need to use xp_cmdshell in order to do this. You should be able to use SQL Server Agent by creating either a "PowerShell" step, or an "Operating System (CmdExec)" step in which you call PowerShell. If neither of those options work, it might still be possible to import the data as-is, into a VARCHAR column, and use a function to convert the bytes from UTF-8 into UTF-16 LE (i.e. NVARCHAR). This can be done easily with a SQLCLR function, or I believe I saw a T-SQL version of such a function somewhere, but don't remember off-hand. I can update this later if I find it and it actually works.

    I hope that helps. Take care, Solomon...

    SQL#https://SQLsharp.com/ ( SQLCLR library ofover 340 Functions and Procedures)
    Sql Quantum Lifthttps://SqlQuantumLift.com/ ( company )
    Sql Quantum Leaphttps://SqlQuantumLeap.com/ ( blog )
    Info sitesCollations     •     Module Signing     •     SQLCLR

Viewing 2 posts - 1 through 1 (of 1 total)

You must be logged in to reply to this topic. Login to reply