SQLServerCentral Article

Skipping Columns (& other tricks) Using BCP Format Files (SQL Spackle)


"SQL Spackle" is a collection of short articles written based on multiple requests for similar code or to share short methods for getting certain things done. These short articles are NOT meant to be complete solutions. Rather, they are meant to "fill in the cracks."

--Phil McCracken

Introduction

There have been a lot of basic questions about BCP, BULK INSERT, and BCP Format Files and I thought I'd tackle the more common questions in a nice, short, all-in-one-place "SQL Spackle" article. Notice that, although written specifically for BULK INSERT usage, everything applies to BCP (different syntax, same functionality).

Here are some of the questions I aimed to answer with this short article.

  1. How do I BULK INSERT a file that has fewer fields than my table has columns?
  2. How do I import a file where the fields are out of order compared to the columns in my table?
  3. How do I only load certain fields from a file? (Similar to having more fields in the file than columns in the table or just unwanted fields in the file)
  4. I got a truncation (or other) error during the import. How do I find the cause and where the bad data is in the file? Can I narrow it down to a row and column?
  5. How do I load all the good data even if there's some bad data in a file?
  6. What on this good, green Earth is a BCP Format File?

The subjects of ETL, BCP, BULK INSERT, and BCP Format Files are each huge in their own right. Full coverage of any one of them is certainly much too broad in scope for a simple "SQL Spackle" article.

I can, however, give you a quick overview that will answer the questions above.

And before you even ask, no... BULK INSERT can't be used to import spreadsheets or database files. It's just for text files (the file extension doesn't matter and includes "CSV") or SQL Native data files.

The Table

Let's start off with creating a simple test table that we want to import a file into using BULK INSERT. If you're following along, open your favorite test database in SSMS and run the following code to create the table.

 CREATE TABLE dbo.SomeTable
        (
         SomeTableID INT IDENTITY(1,1) NOT NULL --This is column "1"
        ,ColA INT --This is column "2"
        ,ColB INT --This is column "3"
        ,ColC INT --This is column "4"
        )
;

The Text File

The following text file example contains fewer fields than the table has columns. More specifically, it contains info for "ColC" (field 1 in the file, column 4 in the table) and "ColA" (field 3 in the file, column 2 in the table) AND they are NOT in the same order as the table's columns. (Note that I've used commas instead of tabs just for visibility purposes, but tabs would also work just fine.) It also has an "extra" field (field 2 in the file) that we don't actually want to import. The names of the columns are also different from the table's, just as in many of the files we receive.

If you save this file as C:\Temp\SomeFile.txt, you'll be able to run this whole example without any changes. Note that "C:\Temp\" must be on your server or on some other drive that SQL Server can actually see. One could write an entire chapter on how to make it so SQL Server can see a directory, but that's well beyond the scope of this short article.

Column_C,Extra,Column_A
1,2,3
4,5,6
7,8,9

The BCP Format File

Here's the non-XML BCP format file that will import the 1st and 3rd "fields" of the file into the correct "ColC" and "ColA" columns (respectively, according to the file structure) of "SomeTable". If you save this BCP Format File as C:\Temp\BCP Format File.fmt (even including the annotations near the bottom of the file), you'll be able to run this whole example without any changes.

8.0
3
1     SQLCHAR   0     8000     ","         4   ColC        ""
2     SQLCHAR   0     8000     ","         0   NotUsed     ""
3     SQLCHAR   0     8000     "\r\n"      2   ColA        ""
----- ------- ------ ------ ------------ ----- -------- ---------
File   Data   Prefix  Data  End Of Field Table  Table   Collation
Order  Type   Length Length   Delimiter  Col # Col Name   Name

Just a quick walk-through on the format file...

The first line identifies the rev of "BCP". For most text files, it's backwards and forwards compatible. "8.0" is for SQL Server 2000, "9.0" is for SQL Server 2005, etc, etc. You can set it to "8.0" and probably never have to worry about what the rev is ever again.

Contrary to popular belief, the second line does NOT identify the number of fields in the file. Rather, it identifies the number of "format lines" that will follow in the BCP Format file. Typically, there is a 1-to-1 match of "format lines" to fields in the file, but not always. There are some tricks that can be done (WAY beyond the scope of this post) where this might be different.

Note that anything that follows the "format lines" isn't considered by BCP or BULK INSERT. This is a great place to document the format file with embedded comments (much more extensively than I did).

The "File Order" column should be in numeric sequence starting at "1". It also makes it easier to think about the file "from left to right". Again, this is normally 1-to-1 with the file but can be logically different in some very special cases (again, WAY beyond the scope of this article).

The "Data Type" column is used to identify the data type of the field you're importing. You could specifiy the datatype of each filed but I think that's overkill because the staging table should be setup with the correct datatypes.  I only use SQLCHAR or SQLNCHAR for that reason.  It helps keep things simple.  (See the "References" section at the end of this article for more information)

The "Prefix Length" column is used in conjunction with the "Data Type".  For me, it's a "0" for SQLCHAR and "2" for SQLNCHAR.  The other datatypes have specific requirements for "Prefix Length" if NULLs are possible in the data file. Again, I use a properly defined table to take care of all of this instead of defining it in the BCP Format File for simplicity and maintenance sake. (See the "References" section at the end of this article for more information)

The "Data Length" column identifies the data length of the field in the file. This only has to match for "fixed length" fields in the data file. For most delimited files (like the one we're working on), the only thing that matters is that it should be bigger than the largest expected width of the given field. With that thought in mind, I typically just use "8000" (the max for SQLChar) for everything. That will also allow for some "auto-magic" error checking because, if the data in the file is too wide for the column in the table, it will give you a "truncation" error letting you know something bad happened in the file. More on that in a minute.

The "End of Field Delimiter" column contains the delimiter (in quotes) that is just to the right of the respective field (that's why they call it a "FIELDTERMINATOR" in Books Online). Obviously the last field in each line of the file also contains the "End of Record" or "Row Terminator" delimiters and, for many text files, it's "\r\n", which respectively stand for "Carriage Return and Line Feed (or "newline"). Be advised that some text files are setup just for one or the other. For "Tab Delimited" files, you would use "\t" as the delimiter for all fields except the last one. I've used a comma just because that's the example I've setup for visibility purposes. There are some more tricks that can be played here for multi-line records but, again, that's WAY beyond the scope of this post.

The "Table Col #" column is what allows you to control which file field goes to which table column. Notice the first "format line" in the format file where this value is "4". The first field of the file contains data for "ColC" and "ColC" is the 4th column in the table. If you look at the 2nd "format line", that's for a field in the file that we don't actually want to import so we give it the value of "0" in this column so it won't be loaded. The 3rd "format line" contains data for "ColA" and "ColA" is the 2nd column in the table so that "format line" contains the number "2" in this column. You get the idea.

The "Table Col Name" column of the format file is a bit of a misnomer. It doesn't actually matter what's in this column so long as you follow the formatting rules for this column and the formatting rules are simply to use double-quotes around anything you want if what you want contains spaces or special characters. For example, the first "format line" is supposed to be for "ColC" of the table but I've used "Doesn't matter" here because it truly doesn't matter what's in this column. I will say that I'll generally use the actual name of the column in the table or the file just for documentation purposes and ease of troubleshooting, but you can use whatever you want'. I'll usually use "Not Used" for any field that I'm not importing.

The last column of the BCP Format File is for "Collation Name". Two double quotes should be used for any non-character based columns in the table. For character based columns, you could include the collation name for the particular column if it's other than the server default. Typically and unless there's some really strange collation stuff going on between the file and the table, I use "" for this entire column just to keep the visual clutter in check and to make it so special cases stick out like a sore thumb.
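Just to illustrate, if a character-based column did need an explicit collation, a single "format line" might look like the following sketch (the column name and collation here are made-up examples rather than part of our demo table):

1     SQLCHAR   0     8000     ","         4   SomeCharCol   SQL_Latin1_General_CP1_CI_AS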

The BULK INSERT Statement

OK, how do you use the BCP Format File with BULK INSERT? Here's what I would use for the particular example I've laid out above. Again, if you've saved the files to the directory/file names I've used in all the above, this will work without modification.

   BULK INSERT dbo.SomeTable
   FROM 'C:\Temp\SomeFile.txt'
   WITH (
         BATCHSIZE      = 50000
        ,CODEPAGE       = 'RAW'
        ,DATAFILETYPE   = 'char'
        ,FIRSTROW       = 2
        ,FORMATFILE     = 'C:\Temp\BCP Format File.fmt'
        ,MAXERRORS      = 2000000000
        ,ERRORFILE      = 'C:\Temp\SomeFile.err'  
        ,TABLOCK
        )
;

A couple of things to notice.

The first row of the command identifies which table to insert into. This cannot be a variable or other expression; it has to be hard-coded. If you need it to change and you've properly sanitized the input, you could use Dynamic SQL here to make it so.

The second row of the command identifies which file to import. Again, this has to be a string literal so, observing the rules to avoid SQL Injection, the command would have to be Dynamic SQL. Again, I've used "C:\Temp\" and that means that the file has to be on the server itself. Use a full UNC path (\\MachineName\ShareName\Path\FileName.FileExtension) that SQL Server can "see" if the file resides on another machine.
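Just as a sketch of that Dynamic SQL idea (and ONLY a sketch... the simple check below is nowhere near a complete defense against SQL Injection), here's one way it could look using the table and files from this article:

   DECLARE @FilePath NVARCHAR(500) = N'C:\Temp\SomeFile.txt' --Sanitize/validate this!
          ,@SQL      NVARCHAR(MAX)
   ;
   --A very simple sanity check. Real validation should be much stricter.
   IF @FilePath LIKE N'%''%' OR @FilePath LIKE N'%;%'
      RAISERROR('Suspicious characters in the file path.',16,1);
   ELSE
    BEGIN
      SELECT @SQL = N'
      BULK INSERT dbo.SomeTable
      FROM ''' + @FilePath + N'''
      WITH (FORMATFILE = ''C:\Temp\BCP Format File.fmt'', FIRSTROW = 2, TABLOCK);'
      ;
      EXEC sys.sp_executesql @SQL;
    END
   ;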

Moving into the options of the BULK INSERT command, BATCHSIZE controls the number of rows to import as a single transaction. The only reason why I don't make this a huge number, like 2 billion, is because I'm easily bored and need something to do. I don't want to use a default of "1000" because that will actually slow the import down (and, for the size of the files I load, performance really counts) but I do want to see the progress of the load. So, I'll typically use 50000 or 100000 depending on what the size of the file is. Feel free to play with this number to get the best performance for the files that you are loading.

I don't normally have to work with UNICODE files so I use "RAW" for a CodePage (which also allows me to import certain "special" characters). You can change that or even eliminate it if you're just working with "normal ASCII" text files. I like to have it there as a reminder as to what kind of file I'm working with. The same goes for the DATAFILETYPE. Note that if you are working with Unicode files, you'll need to change these. Since that's also beyond the scope of this short article, please refer to Books Online for what to change them to.

FIRSTROW effectively allows you to skip rows at the beginning of a file such as the header of a file. Just be advised that it's actually a misnomer and should have been named "FirstRECORD". It doesn't skip "rows" in the classic sense. Instead, it skips "delimiter sets" as defined in the "format lines" of the BCP Format File. The reason why they did this is so that you could import multi-line rows as a single record (an easy thing to do with a BCP Format File). What that means is that the first row (in this case), must have precisely the same delimiters as all of the other rows in the file or you'll get an error or unexpected results that may take you a month of Sundays to troubleshoot without knowing this tidbit. Obviously, "multi-line records" would change the rules for what "FIRSTROW" actually is and could be the topic of an article on more advanced usage of BCP Format Files.

FORMATFILE is pretty obvious. It contains the path to and name of the format file with all the same caveats as the file path of the file being loaded (Dynamic SQL, etc). Again, the one really big thing that trips a lot of people up is that if the path is based on drive letters, that path must be a local drive of the server and usually cannot include a "mapped drive". If you need to import from a file located on a different box, you must use the full UNC path (\\MachineName\ShareName\Path\FileName.FileExtension) for the import to be successful. There are exceptions like certain drives (usually, SAN) setup for SQL Server that the server thinks of as "local storage".

The next two options (MAXERRORS and ERRORFILE) are two very powerful parameters that many people simply overlook. One of the largest complaints is that if a bad line shows up in the file, it will stop the import when it hits that line and what they'd really like to do is to import whatever it can and then troubleshoot/repair what it can't. That's what MAXERRORS and ERRORFILE are for. I'll typically set MAXERRORS to some ridiculously high number (2 billion in the example) that I know will be larger than any number of rows that will be in a file so that it never "errors out". The ERRORFILE will contain any and all error rows along with a pretty decent "reason" why the particular row failed to load. For example, if you change the file being imported to look like the following…

Column_C,Extra,Column_A
1,2,3333333333333333333333333
4,5,6
7,8,9

If you then run the code, 4 things will happen.

  1. Since the bad row didn't violate the delimiters (which would produce a different error than what we're about to describe), the two good rows are still imported into the table.
  2. The following error message is displayed on the screen (or the error logging if done in a job) to identify that there was an error (and it clearly identifies where the problem is, in this case).

    Msg 4867, Level 16, State 1, Line 1 Bulk load data conversion error (overflow) for row 2, column 3 (ColA).

  3. The file called (in this case) "SomeFile.err" will be created (the name comes from the ERRORFILE option of the BULK INSERT command, so it will be whatever you defined it as). It contains the bad row from the file that caused the problem so you can rework it if you need to. Here's what will be in the "SomeFile.err" file.

    1,2,3333333333333333333333333

  4. The file called (in this case) "SomeFile.err.Error.Txt" will also be generated; it has the same name as the previous file with "Error.Txt" added to it. It will contain an error message for the bad row contained in the .err file, although it's a bit less helpful and more cryptic than what shows up on the screen. Here's what will be in the "SomeFile.err.Error.Txt" file. The most useful information is "Row 2".

    Row 2 File Offset 25 ErrorFile Offset 0 - HRESULT 0x8002000a

    That's awfully cryptic, but there is hope in the following URL, which contains explanations for all of the HRESULT error codes. It's easy to search for the code, but it's also a case-sensitive search, so make sure that any letters in the HRESULT are uppercase before you search. In this case, the error is an "Out of present range." error.

    http://msdn.microsoft.com/en-us/library/cc704587.aspx

Note that you MUST remove the error files created by a previous run before the next run or you will get an error that prevents the loading of any additional information using the same BULK INSERT command.
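If you're automating reruns, one simple way to handle that (assuming xp_cmdshell is enabled on your server, which is a security discussion all its own) is to delete the old error files first. A minimal sketch:

   --Delete the previous run's error files, if any.
   --Assumes xp_cmdshell is enabled; it's disabled by default for good reason.
   EXEC master.sys.xp_cmdshell 'DEL "C:\Temp\SomeFile.err" "C:\Temp\SomeFile.err.Error.Txt"', no_output;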

Shifting gears a bit, one of the most important things to notice is that, because we're using a BCP Format File to control the "shape" of the import, you must NOT include either the FIELDTERMINATOR or the ROWTERMINATOR options.
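Just for contrast, and only as a sketch, here's what BULK INSERT looks like when a file happens to match the table column-for-column and no format file is needed at all (the table and file names here are hypothetical):

   BULK INSERT dbo.SomeMatchingTable
   FROM 'C:\Temp\SomeMatchingFile.txt'
   WITH (
         FIELDTERMINATOR = ','
        ,ROWTERMINATOR   = '\n' --BULK INSERT quietly treats '\n' as "\r\n" here.
        ,FIRSTROW        = 2
        ,TABLOCK
        )
;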

The Results

Using the text file, BCP Format File, table, and BULK INSERT command that we created, here's the output from the table using the following command.

SELECT * FROM dbo.SomeTable
;

Here are the results.  

SomeTableID ColA        ColB        ColC
----------- ----------- ----------- -----------
1           3           NULL        1
2           6           NULL        4
3           9           NULL        7

(3 row(s) affected)

Some Advice

I always use a staging table for my imports so that I can further validate the data (and mark each row with what type of error I found, for further troubleshooting/repair) and, perhaps, mark each imported row as an "INSERT" or an "UPDATE" when I finally do move the data to the final table. Since I load huge amounts of data, I want things to be "minimally logged" and I also want to improve performance a fair bit more, so I almost always include the TABLOCK option. Yes, there are some other rules to follow (look up "minimal logged bulk copy [SQL Server]" in Books Online for details) but, without the TABLOCK option, it's all for naught.
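As a quick sketch of that "mark each row" idea (the staging/final table names and the key column below are hypothetical), the marking step can be as simple as this:

   --Mark each staged row as an "INSERT" or an "UPDATE" depending on whether
   --the business key already exists in the final table.
   UPDATE stg
      SET stg.Action = CASE WHEN fin.SomeKey IS NULL THEN 'INSERT' ELSE 'UPDATE' END
     FROM dbo.SomeStagingTable stg
     LEFT JOIN dbo.SomeFinalTable fin
       ON fin.SomeKey = stg.SomeKey
;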

As to which style of BCP Format File to use, my recommendation is the non-XML version. It's quite a bit easier to use and maintain for many reasons, but the following are the big ones for me.

  1. Done properly, there's no need to establish the datatypes or widths of each column in either the file or the table column definitions.  If a datatype or width changes in a column of the staging table, there are no changes required in the non-XML format file, MAX datatypes notwithstanding.
  2. It's true that both the non-XML and XML format files can be generated from a table using BCP.  If a customization needs to be made (like we did in this article) in the XML format file, it usually means that you have to make the change in both the file definition section and the table definition section instead of just making a single line change in the non-XML version.  If you have a wide table, that can be a bit of a daunting task to get right because the two lines being changed might not be visible on the same screen at the same time.
  3. Setting up an XML format file to ignore file or table columns is a bit more difficult than just typing a "0" in a column of the non-XML format file. See the sketch just below this list.
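To show what I mean, here's a sketch of an XML format file that should do the same thing as the non-XML one above (double-check the exact schema in Books Online before you rely on it). Notice that skipping the "Extra" field means simply not listing a <COLUMN> for FIELD 2, and that a change to one field usually means touching both the <RECORD> and <ROW> sections:

<?xml version="1.0"?>
<BCPFORMAT xmlns="http://schemas.microsoft.com/sqlserver/2004/bulkload/format"
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
 <RECORD>
  <FIELD ID="1" xsi:type="CharTerm" TERMINATOR="," MAX_LENGTH="8000"/>
  <FIELD ID="2" xsi:type="CharTerm" TERMINATOR="," MAX_LENGTH="8000"/>
  <FIELD ID="3" xsi:type="CharTerm" TERMINATOR="\r\n" MAX_LENGTH="8000"/>
 </RECORD>
 <ROW>
  <COLUMN SOURCE="1" NAME="ColC" xsi:type="SQLINT"/>
  <COLUMN SOURCE="3" NAME="ColA" xsi:type="SQLINT"/>
 </ROW>
</BCPFORMAT>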

Last, and certainly not least, it says right in Books Online that using BULK INSERT to import CSV files is not supported (TSVs are, though), but you can do it if you want. They're quite right; if it were supported, it would be an easy thing to import text-qualified CSV data and the import would be more forgiving of the inconsistent use of text-qualifier quotes. That shouldn't matter to anyone, though. The real key is to get whoever or whatever is providing the data to provide it in a consistent manner, just as you would do in a properly designed table.

Recap

Just to summarize what we've done.

  1. We built a test table with more columns than the file has.
  2. We have a file that doesn't have all those columns BUT it does have an extra column that we don't want to import.
  3. We built a non-XML BCP Format File to do the mapping of the desired columns and the exclusion of the unwanted column.
  4. We built a BULK INSERT command that imports the text file using the BCP Format File and discussed some options you might not have been aware of.
  5. We did a short test to see what some of the built-in error checking and "bad row sequestration" does.

Your Turn... STUDY!

Heh... now, it's your turn. If you think you'll EVER need to use BCP or BULK INSERT to import (or export, in the case of BCP) files and you haven't done a really deep dive in Books Online on all of the wonderful and relatively simple options that either has to offer, then you're missing out on most of what those two wonderful tools can actually do. For example, did you know that you could use a BCP command to create the BCP Format File for a given table instead of building it by hand?
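For example, a command along these lines, run from a command prompt, will generate a non-XML format file for the table we used above (the database and server names are placeholders, and -T assumes a trusted connection):

   bcp YourDatabase.dbo.SomeTable format nul -f "C:\Temp\SomeTable.fmt" -c -t, -T -S YourServerName

Add the -x switch and it will generate the XML flavor instead. Either way, it beats counting columns by hand on a wide table.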

Done correctly, I've not seen anything beat BULK INSERT with a BCP Format File for performance on well formed text files that have a consistent "shape".  The BCP command runs a very close second.

Recommended Reading

The following links are actually just the tip of the proverbial iceberg. The Microsoft articles at these links contain other links that are also worth studying.

BULK INSERT - http://technet.microsoft.com/en-us/library/ms188365.aspx

BCP - http://technet.microsoft.com/en-us/library/ms162802.aspx

Non-XML Format Files - http://technet.microsoft.com/en-us/library/ms191479.aspx

Thanks for listening, folks.

"Crack Filled!"


© Copyright, Jeff Moden, 29 Dec 2013, All Rights Reserved
