Maintaining Latin characters in SQL BULK INSERT with Format File

  • I'm trying to import a .CSV file into a SQL Server table using a BULK INSERT with a format file. I can get it to import, but any Latin characters are imported as strange characters. I take a lot of pride in completing this personal project on my own, but I've hit a point where I just need help. I can change the characters by doing some messy UPDATE and REPLACE statements after importing the data, but I really want to be able to import the Latin characters as they appear in the .CSV file in one step.

    Here's the database and table that I created:

    CREATE DATABASE Test;
    GO

    USE Test;
    GO

    CREATE TABLE dbo.rawData

    ([Position] nvarchar(500) NULL,

    [Const] nvarchar(500) NULL,

    [Created] nvarchar(500) NULL,

    [Modified] nvarchar(500) NULL,

    [Description] nvarchar(500) NULL,

    [Title] nvarchar(500) NOT NULL,

    [TitleType] nvarchar(500) NULL,

    [Directors] nvarchar(500) NULL,

    [YouRated] nvarchar(500) NULL,

    [IMDbRating] nvarchar(500) NULL,

    [Runtime] nvarchar(500) NULL,

    [Year] nvarchar(500) NULL,

    [Genres] nvarchar(500) NULL,

    [NumVotes] nvarchar(500) NULL,

    [ReleaseDate] nvarchar(500) NULL,

    [URL] nvarchar(500) NULL

    )

    GO

    And here is some of the data that I'm working with, taken from the .CSV file (saved as ratings.csv). I use Notepad++, and the file is encoded in UTF-8. Notice how the last row, for "Dallas Buyers Club", has a director with an accented Latin character in his name:

    "position","const","created","modified","description","Title","Title type","Directors","You rated","IMDb Rating","Runtime (mins)","Year","Genres","Num. Votes","Release Date (month/day/year)","URL"

    "1","tt0437863","Tue Feb 16 00:00:00 2016","","","The Benchwarmers","Feature Film","Dennis Dugan","5","5.6","80","2006","comedy, romance, sport","39413","2006-04-07","http://www.imdb.com/title/tt0437863/"

    "2","tt0085334","Tue Feb 16 00:00:00 2016","","","A Christmas Story","Feature Film","Bob Clark","6","8.1","94","1983","comedy, family","103770","1983-11-18","http://www.imdb.com/title/tt0085334/"

    "3","tt2403029","Tue Feb 16 00:00:00 2016","","","The Starving Games","Feature Film","Jason Friedberg, Aaron Seltzer","2","3.3","83","2013","comedy","13719","2013-10-31","http://www.imdb.com/title/tt2403029/"

    "4","tt0316465","Tue Feb 16 00:00:00 2016","","","Radio","Feature Film","Michael Tollin","6","6.9","109","2003","biography, drama, sport","31692","2003-10-24","http://www.imdb.com/title/tt0316465/"

    "5","tt0141369","Tue Feb 16 00:00:00 2016","","","Inspector Gadget","Feature Film","David Kellogg","4","4.1","78","1999","action, adventure, comedy, family, sci_fi","35340","1999-07-18","http://www.imdb.com/title/tt0141369/"

    "6","tt0033563","Tue Feb 16 00:00:00 2016","","","Dumbo","Feature Film","Sam Armstrong, Norman Ferguson","6","7.3","64","1941","animation, family, musical","80737","1941-10-23","http://www.imdb.com/title/tt0033563/"

    "7","tt0384642","Tue Feb 16 00:00:00 2016","","","Kicking & Screaming","Feature Film","Jesse Dylan","5","5.5","95","2005","comedy, family, romance, sport","29539","2005-05-01","http://www.imdb.com/title/tt0384642/"

    "8","tt0116705","Tue Feb 16 00:00:00 2016","","","Jingle All the Way","Feature Film","Brian Levant","7","5.4","89","1996","comedy, family","66879","1996-11-16","http://www.imdb.com/title/tt0116705/"

    "9","tt1981677","Tue Feb 16 00:00:00 2016","","","Pitch Perfect","Feature Film","Jason Moore","7","7.2","112","2012","comedy, music, romance","203205","2012-09-28","http://www.imdb.com/title/tt1981677/"

    "10","tt0409459","Tue Feb 16 00:00:00 2016","","","Watchmen","Feature Film","Zack Snyder","7","7.6","162","2009","action, mystery, sci_fi","368137","2009-02-23","http://www.imdb.com/title/tt0409459/"

    "11","tt1343092","Tue Feb 16 00:00:00 2016","","","The Great Gatsby","Feature Film","Baz Luhrmann","5","7.3","143","2013","drama, romance","345664","2013-05-01","http://www.imdb.com/title/tt1343092/"

    "12","tt0332379","Tue Feb 16 00:00:00 2016","","","School of Rock","Feature Film","Richard Linklater","5","7.1","108","2003","comedy, music","202083","2003-09-09","http://www.imdb.com/title/tt0332379/"

    "13","tt0120783","Tue Feb 16 00:00:00 2016","","","The Parent Trap","Feature Film","Nancy Meyers","6","6.4","128","1998","adventure, comedy, drama, family, romance","82087","1998-07-20","http://www.imdb.com/title/tt0120783/"

    "14","tt0790636","Tue Feb 16 00:00:00 2016","","","Dallas Buyers Club","Feature Film","Jean-Marc Vallée","7","8.0","117","2013","biography, drama","308118","2013-09-07","http://www.imdb.com/title/tt0790636/"

    I have a format file (saved as format.fmt), that when opened in Notepad++ looks like this:

    11.0

    16

    1 SQLCHAR 0 1000 "\",\"" 1 Position SQL_Latin1_General_CP1_CI_AS

    2 SQLCHAR 0 1000 "\",\"" 2 Const SQL_Latin1_General_CP1_CI_AS

    3 SQLCHAR 0 1000 "\",\"" 3 Created SQL_Latin1_General_CP1_CI_AS

    4 SQLCHAR 0 1000 "\",\"" 4 Modified SQL_Latin1_General_CP1_CI_AS

    5 SQLCHAR 0 1000 "\",\"" 5 Description SQL_Latin1_General_CP1_CI_AS

    6 SQLCHAR 0 1000 "\",\"" 6 Title SQL_Latin1_General_CP1_CI_AS

    7 SQLCHAR 0 1000 "\",\"" 7 TitleType SQL_Latin1_General_CP1_CI_AS

    8 SQLCHAR 0 1000 "\",\"" 8 Directors SQL_Latin1_General_CP1_CI_AS

    9 SQLCHAR 0 1000 "\",\"" 9 YouRated SQL_Latin1_General_CP1_CI_AS

    10 SQLCHAR 0 1000 "\",\"" 10 IMDbRating SQL_Latin1_General_CP1_CI_AS

    11 SQLCHAR 0 1000 "\",\"" 11 Runtime SQL_Latin1_General_CP1_CI_AS

    12 SQLCHAR 0 1000 "\",\"" 12 Year SQL_Latin1_General_CP1_CI_AS

    13 SQLCHAR 0 1000 "\",\"" 13 Genres SQL_Latin1_General_CP1_CI_AS

    14 SQLCHAR 0 1000 "\",\"" 14 NumVotes SQL_Latin1_General_CP1_CI_AS

    15 SQLCHAR 0 1000 "\",\"" 15 ReleaseDate SQL_Latin1_General_CP1_CI_AS

    16 SQLCHAR 0 1000 "\"" 16 URL SQL_Latin1_General_CP1_CI_AS

    When I run the following code, everything imports; however, the Latin characters are replaced with a series of strange characters:

    BULK INSERT [Test].[dbo].[rawData]

    FROM 'C:\IMDbRatings\Files\ratings.csv' WITH (FIRSTROW = 2, FORMATFILE= 'C:\IMDbRatings\format.fmt');

    A few things I tried were changing the .CSV file's encoding to UCS-2 BE, adding different options in the WITH clause of the BULK INSERT, and changing the data type in the format file to SQLNCHAR instead of SQLCHAR, but nothing worked. In most of these cases the result is "0 rows affected" rather than an error. Any help would be so appreciated.

  • Did you use the 'encode' or 'convert' choice in notepad++?

    I used 'Convert to UCS-2 BE BOM' and saved the file, then ran a simple BULK INSERT like this:

    BULK INSERT dbo.rawData

    FROM 'C:\data\flatFileSamples\ratings.csv'

    WITH

    (

    CODEPAGE = 'ACP',

    FIELDTERMINATOR = ',',

    ROWTERMINATOR = '\n'

    );

    It may not be the format of data you want but the accent in the name came through. I'm assuming this is BULK INSERT not liking UTF8.

  • I tried without the format file (which is ultimately what I will need to use) and by changing the encoding, but I'm still getting an error. This is the code I'm using:

    BULK INSERT [Test].[dbo].[rawData]

    FROM 'C:\IMDbRatings2\Files\ratings.csv' WITH (CODEPAGE = 'ACP', FIELDTERMINATOR = ',', ROWTERMINATOR = '\n');

    And here is the error message:

    Bulk load: DataFileType was incorrectly specified as char. DataFileType will be assumed to be widechar because the data file has a Unicode signature.

    Bulk load: DataFileType was incorrectly specified as char. DataFileType will be assumed to be widechar because the data file has a Unicode signature.

    Msg 4863, Level 16, State 1, Line 3

    Bulk load data conversion error (truncation) for row 1, column 16 (URL).

  • I apologise. I had been editing the source file to remove commas from within the fields, e.g. some of the directors have names separated by commas. I was then able to use comma as the field separator just to prove the encoding issue.

    So to prove the encoding issue without editing the file, convert to UCS-2 BE BOM with Notepad++, then:

    BULK INSERT dbo.rawData

    FROM 'C:\data\flatFileSamples\ratings.csv'

    WITH

    (

    CODEPAGE = 'ACP',

    DATAFILETYPE = 'widechar',

    FIELDTERMINATOR = '","'

    );

    You should get the data to import (not perfectly) and it'll show the accent.

  • Thanks for the response. I tried using your exact instructions, including the encoding change, but I get this error:

    Msg 4863, Level 16, State 1, Line 22

    Bulk load data conversion error (truncation) for row 1, column 16 (URL).

    It's possible that we have differences in our data files, probably hidden characters. You'll have to experiment; maybe look at the CR/LF characters.

    You're missing your row delimiter in the last row of your format file. It should be:

    16 SQLCHAR 0 1000 "\"\r\n" 16 URL SQL_Latin1_General_CP1_CI_AS

    Luis C.
  • Hi, the original problem I'm facing is that I cannot get the Latin characters from the .csv to SQL Server. Due to the format of the .csv (I need a delimiter and a text qualifier), I have to use a format file.

    However, DouglasH was suggesting that I try to see if the collation issues are still there after changing the encoding when using a BULK INSERT without the format file. The error message I was showing was when I try this without the format file. This error is separate from the collation issue.

    Have you tried these options?

    ,CODEPAGE = 'RAW'

    ,DATAFILETYPE = 'char'

    You could also use widechar instead of char if you have unicode values.
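
    Slotted into the statement from the original post, that would look something like the sketch below (the paths, format file, and exact option mix are assumptions to experiment with, not a tested fix):

    BULK INSERT [Test].[dbo].[rawData]
    FROM 'C:\IMDbRatings\Files\ratings.csv'
    WITH (CODEPAGE = 'RAW', DATAFILETYPE = 'char', FIRSTROW = 2, FORMATFILE = 'C:\IMDbRatings\format.fmt');

    -- or, if the file has been converted to Unicode, swap 'char' for 'widechar':
    -- WITH (CODEPAGE = 'RAW', DATAFILETYPE = 'widechar', FIRSTROW = 2, FORMATFILE = 'C:\IMDbRatings\format.fmt');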

    Luis C.
  • With my ratings.csv set to UCS-2 BE BOM and my format file looking like this:

    11.0

    16

    1 SQLCHAR 0 1000 "\",\"" 1 Position SQL_Latin1_General_CP1_CI_AS

    2 SQLCHAR 0 1000 "\",\"" 2 Const SQL_Latin1_General_CP1_CI_AS

    3 SQLCHAR 0 1000 "\",\"" 3 Created SQL_Latin1_General_CP1_CI_AS

    4 SQLCHAR 0 1000 "\",\"" 4 Modified SQL_Latin1_General_CP1_CI_AS

    5 SQLCHAR 0 1000 "\",\"" 5 Description SQL_Latin1_General_CP1_CI_AS

    6 SQLCHAR 0 1000 "\",\"" 6 Title SQL_Latin1_General_CP1_CI_AS

    7 SQLCHAR 0 1000 "\",\"" 7 TitleType SQL_Latin1_General_CP1_CI_AS

    8 SQLCHAR 0 1000 "\",\"" 8 Directors SQL_Latin1_General_CP1_CI_AS

    9 SQLCHAR 0 1000 "\",\"" 9 YouRated SQL_Latin1_General_CP1_CI_AS

    10 SQLCHAR 0 1000 "\",\"" 10 IMDbRating SQL_Latin1_General_CP1_CI_AS

    11 SQLCHAR 0 1000 "\",\"" 11 Runtime SQL_Latin1_General_CP1_CI_AS

    12 SQLCHAR 0 1000 "\",\"" 12 Year SQL_Latin1_General_CP1_CI_AS

    13 SQLCHAR 0 1000 "\",\"" 13 Genres SQL_Latin1_General_CP1_CI_AS

    14 SQLCHAR 0 1000 "\",\"" 14 NumVotes SQL_Latin1_General_CP1_CI_AS

    15 SQLCHAR 0 1000 "\",\"" 15 ReleaseDate SQL_Latin1_General_CP1_CI_AS

    16 SQLCHAR 0 1000 "\"" 16 URL SQL_Latin1_General_CP1_CI_AS

    And my BULK INSERT is:

    BULK INSERT [Test].[dbo].[rawData]

    FROM 'C:\IMDbRatings2\Files\ratings.csv' WITH (CODEPAGE = 'RAW' ,DATAFILETYPE = 'widechar', FIRSTROW = 2, FORMATFILE= 'C:\IMDbRatings2\format3.fmt');

    However, when I run this I see 0 rows returned.

    EDIT: Same result when I change the last terminator to "\"\r\n".

    EDIT 2: I'm able to pull data in when it is set to UTF-8, but it has improper collation.

    When I change to SQLNCHAR with the file saved as UTF-8, it throws a bunch of errors. With SQLNCHAR and the file saved as UCS-2 BE BOM, it pulls in 0 rows.

  • Hi,

    I used to do this all the time --

    My understanding has always been that UTF8 was not supported by BCP. Therefore, if the incoming file was UTF8, I rewrote the file out as Unicode and did the bulk load against the Unicode copy. Obviously this is done in a script task in SSIS, although I've written a small EXE to do the same thing. I don't know of a way to change the encoding of the text file in SQL directly.

    Disclaimer: no idea if this dodge is still needed in SQL2014+, since the last time I did this was in SQL2012.

    >L<

  • Lisa Slater Nicholls (6/22/2016)


    My understanding has always been that UTF8 was not supported by BCP. Therefore, if the incoming file was UTF8, I rewrote the file out as Unicode and did the bulk load against the Unicode copy.

    I believe you're on the right track. I converted the file to ANSI and it worked perfectly. It appears that UTF-8 is only compatible on 2016.

    Luis C.
  • Luis Cazares (6/23/2016) It appears that UTF-8 is only compatible on 2016.

    Wow, good to know. Do you have a doc reference for this being supported in 2016? Thanks!

    >L<

  • Lisa Slater Nicholls (6/23/2016)


    Wow, good to know. Do you have a doc reference for this being supported in 2016? Thanks!

    That's implied on BOL: https://msdn.microsoft.com/en-us/library/ms188365.aspx

    This is what implies it:

    ** Important ** Versions prior to SQL Server 2016 do not support code page 65001 (UTF-8 encoding).
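
    Based on that note, on SQL Server 2016 or later the UTF-8 file could presumably be loaded directly by naming code page 65001. A minimal sketch against the original statement (assuming a 2016+ instance and the original paths; untested here):

    -- Requires SQL Server 2016+; earlier versions reject code page 65001 (UTF-8)
    BULK INSERT [Test].[dbo].[rawData]
    FROM 'C:\IMDbRatings\Files\ratings.csv'
    WITH (CODEPAGE = '65001', FIRSTROW = 2, FORMATFILE = 'C:\IMDbRatings\format.fmt');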

    Luis C.
  • Thanks so much Luis. I want to pass this back to my ex-colleagues who are heavily reliant on my framework code that does this.

    It would be a significant perf boost not to have to do the file copy -- most of the customers' files do come in as UTF-8.

    >L<
