Exclude Columns When Using Bulk Insert

  • I'm able to successfully import data in a tab-delimited .txt file using the following statement.

    BULK INSERT ImportProjectDates FROM "C:\tmp\ImportProjectDates.txt"

WITH (FIRSTROW = 2, FIELDTERMINATOR = '\t', ROWTERMINATOR = '\n')

    However, in order to import the text file, I had to add columns to the text file to match the columns that exist in the table. The original file is an export out of another database and contains all but 5 columns from my db.

How would I control which columns BULK INSERT actually imports when working with a .txt file? I've tried using a FORMAT FILE; however, I kept getting errors, which I tracked down to being a case of not using it with a .txt file.

Yes, I could have the DBA add the missing columns to the query from the other DB, but I'd like to know a little bit more about this overall.

  • david.holley (12/28/2013)


    ...however I kept getting errors which I tracked down to being a case of not using it with a .txt file.

    Patently not the case. BCP format files work [font="Arial Black"]very[/font] well with text files that have a consistent "shape".

    As to your problem, let's build a working example. This example assumes that you can store files in the C:\Temp directory ON THE SERVER ITSELF. We'll talk later a bit about what to do when that's not possible.

    [font="Arial Black"]The Table[/font]

Let's say you have a table that you want to import to that looks like the following. I've numbered the columns according to their physical and logical position within the table for easy reference. If you create this table, you'll be able to run this whole example.

    CREATE TABLE dbo.SomeTable

    (

    SomeTableID INT IDENTITY(1,1) NOT NULL --This is column "1"

    ,ColA INT --This is column "2"

    ,ColB INT --This is column "3"

    ,ColC INT --This is column "4"

    )

    ;

    [font="Arial Black"]The Text File[/font]

    Similar to your problem, the following text file example contains fewer columns than the table that you're trying to import to. More specifically, it contains info for "ColC" (field 1 in the file, column 4 in the table) and "ColA" (field 3 in the file, column 2 in the table) of the table AND they are NOT in the same order as the table. (Note that I've used commas instead of tabs just for visibility purposes). It also has an "extra" column (field 2 in the file) that we don't actually want to import. The names of the columns are also different just to complete the picture.

    If you save this file as C:\Temp\SomeFile.txt, you'll be able to run this whole example without any changes.

    Column_C,Extra,Column_A

    1,2,3

    4,5,6

    7,8,9

    [font="Arial Black"]The BCP Format File[/font]

Here's the non-XML BCP format file that will import the 1st and 3rd "fields" of the file to the correct "ColC" and "ColA" columns (respectively, in file order) of "dbo.SomeTable". If you save this BCP file as C:\Temp\BCP Format File.fmt (even including the annotations near the bottom of the file), you'll be able to run this whole example without any changes.

8.0
3
1      SQLCHAR  0       9999    ","           4      "Doesn't matter"  ""
2      SQLCHAR  0       9999    ","           0      "Not Used"        ""
3      SQLCHAR  0       9999    "\r\n"        2      ColA              ""
-----  -------  ------  ------  ------------  -----  ----------------  ---------
File   Data     Prefix  Data    End Of Field  Table  Table             Collation
Order  Type     Length  Length  Delimiter     Col #  Col Name          Name

    Just a quick walk-through on the format file...


    The first line identifies the rev of "BCP". For most text files, it's backwards and forwards compatible. "8.0" is for SQL Server 2000, "9.0" is for SQL Server 2005, etc, etc. Like I said, you can set it to "8.0" and probably never have to worry about what the rev is ever again.

    Contrary to popular belief, the 2nd line does NOT identify the number of fields in the file. Rather, it identifies the number of "format lines" that will follow in the BCP Format file. Typically, there is a 1-to-1 match of "format lines" to fields in the file, but not always. There are some tricks that can be done (WAY beyond the scope of this post) where this might be different.

    Note that anything that follows the "format lines" isn't considered by BCP or BULK INSERT. This is a great place to document the format file with embedded comments (much more extensively than I did).

    The "File Order" column should be in numeric sequence. It also makes it easier to think about the file "from left to right". Again, this is normally 1-to-1 with the file but can be logically different in some very special cases (again, WAY beyond the scope of this post).

    The "Data Type" column is used to identify the data type of the field you're importing. For text files, this should always be "SQLCHAR"... period.

    The "Prefix Length" column is typically used only for "Native" imports. For text files, it should always be "0"... period.

    The "Data Length" column identifies the data length of the field in the file. This only has to match for "fixed length" fields in the data file. For most delimited files, the only thing that matters is that it should be bigger than the largest expected field. With that thought in mind, I typically just use "9999" for everything. That will also allow for some auto-magic error checking because, if the data in the file is too wide for the column in the table, it will give you a "truncation" error letting you know something bad happened in the file. More on that in a minute.

    The "End of Field Delimiter" column contains the delimiter (in quotes) that is just to the right of the respective field. Obviously the last field in the file has the "End of Record" delimiters and, for many text files, it's "\r\ n" (need to remove the extra space between the "\" and the "n", which was included just for display purposes), which stand for "Carriage Return, Line Feed (or "newline"). Be advised that some text files are setup just for one or the other. For you specific requirement, you would use "\t" as the delimiter for all field except the last one. I've used a comma just because that's the example I've setup for visibility purposes. There are some more tricks that can be played here but, again, that's WAY beyond the scope of this post.

    The "Table Col #" column is what allows you to control which file field goes to which table column. Notice the first "format line" in the format file where this value is "4". The first field of the file contains data for "ColC" and "ColC" is the 4th column in the table. If you look at the 2nd "format line", that's for a field in the file that we don't actually want to import so we give it the value of "0" in this column. The 3rd "format line" contains data for "ColA" and "ColA" is the 2nd column in the table so that "format line" contains the number "2" in this column.

    The "Table Col Name" column of the format file is a bit of a misnomer. It doesn't actually matter what's in this column so long as you follow the formatting rules for this column and the formatting rules are simply to use double-quotes around anything you want if what you want contains spaces or special characters. For example, the first "format line" is supposed to be for "ColC" of the table but I've used "Doesn't matter" here because it truly doesn't matter what's in this column. I will say that I'll generally use the actual name of the column in the table just for documentation and ease of troubleshooting, but you can use whatever you want'. I'll also use "Not Used" for any field that I'm not importing.

    The last column of the BCP Format File is for "Collation Name". 2 double quotes should be used for any non-character based columns in the table. For character based columns, you could include the collation name for the particular column if it's other than the server default. Typically and unless there's some really strange collation stuff going on between the file and the table, I use "" for this entire column just to keep the visual clutter in check and to make it so special cases stick out like a sore thumb.
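Before running the actual insert, you can also sanity-check the mapping without touching the table. Here's a sketch (using the same paths as above) that previews the file through the same format file with OPENROWSET; only the mapped fields come back, so the "Not Used" field won't appear:

SELECT *
FROM OPENROWSET(
BULK 'C:\Temp\SomeFile.txt'
,FORMATFILE = 'C:\Temp\BCP Format File.fmt'
,FIRSTROW = 2
) AS src
;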

    [font="Arial Black"]The BULK INSERT Statement[/font]

    Ok, how to use it with BULK INSERT. Here's what I would use for the particular example I've laid out above. Again, if you've saved the files to the directory/file names I've used in all the above, this will work without modification.

    BULK INSERT dbo.SomeTable

    FROM 'C:\Temp\SomeFile.txt'

    WITH (

    BATCHSIZE = 50000

    ,CODEPAGE = 'RAW'

    ,DATAFILETYPE = 'char'

    ,FIRSTROW = 2

    ,FORMATFILE = 'C:\Temp\BCP Format File.fmt'

    ,MAXERRORS = 2000000000

    ,ERRORFILE = 'C:\Temp\SomeFile.err'

    )

    ;

    A couple of things to notice.

First, BATCHSIZE controls the number of rows to import as a single transaction. The only reason why I don't make this a huge number, like 2 billion, is because I'm easily bored and need something to do. I don't want to use a default of "1000" because that will actually slow the import down (and, for the size files I load, performance really counts) but I do want to see the progress of the load. So, I'll typically use 50000 or 100000 depending on the size of the file I expect.

    I don't normally have to work with UNICODE files so I use "RAW" for a CodePage (which also allows me to import certain "special" characters). You can change that or even eliminate it if you're just working with "normal ASCII" text files. I like to have it there as a reminder as to what kind of file I'm working with. The same goes for the DATAFILETYPE.

You already know what FIRSTROW does. Just be advised that it's actually a misnomer. It doesn't skip "rows" in the classic sense. It skips "delimiter sets". What that means is that the first row (in this case) must have precisely the same delimiters as all of the other rows, or you'll get an error or unexpected results that may take you a month of Sundays to troubleshoot without knowing this.
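For illustration (a contrived file, not part of the example above), suppose the format file expects two commas and a "\r\n" per row, but the header row only contains one comma:

Name,Value
1,2,3
4,5,6

With FIRSTROW = 2, the skip hunts for one complete delimiter set: the first comma comes from the header, the second comma comes from the "1," in the first data row, and the row terminator comes from the end of that same row. The import silently starts at "4,5,6" and most of the first data row is lost.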

FORMATFILE is pretty obvious. The one thing that trips a lot of people up is that if the path is based on drive letters, that path must be a local drive of the server and usually cannot include a "mapped drive". If you need to import from a file located on a different box, you must set up a "named share" and use the full UNC path (\\MachineName\ShareName\Path\FileName.FileExtension) for the import to be successful. There are exceptions, like certain drives (usually SAN) set up for SQL Server that the server thinks of as "local storage".

    The next two entries are something that many people never do simply because they haven't taken the time to deeply study everything that you can do with BCP, BULK INSERT, and BCP Format Files. One of the largest complaints is that if a bad line shows up in the file, it will stop the import when it hits that line and what they'd really like to do is to import whatever they can and then troubleshoot/repair what it couldn't. That's what MAXERRORS and ERRORFILE are for. I'll typically set MAXERRORS to some ridiculously high number (2 billion in the example) that I know will be larger than any number of rows that will be in a file so that it never "errors out". The ERRORFILE will contain any and all error rows along with a pretty decent "reason" why the particular row failed to load.

    One of the most important things to notice is that, because we're using a BCP Format File to control the "shape" of the import, you must NOT include either the FIELDTERMINATOR or the ROWTERMINATOR options.

Last but certainly not least, I always use a staging table for my imports so that I can further validate the data (I can mark rows with the type of error I found for further troubleshooting/repair) and, perhaps, mark each imported row as an "INSERT" or an "UPDATE" when I finally do move the data to the final table. Since I load huge amounts of data and I want things to be "minimally logged" and also want to improve performance a fair bit more, I almost always include the TABLOCK option. Yes, there are some other rules to follow (look up "minimal logged bulk copy [SQL Server]" in Books Online for details) but, without the TABLOCK option, it's all for naught.
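For example, here's a minimal sketch of that staging pattern using the objects from this example. The staging table name is made up for illustration and is assumed to have the same column layout as dbo.SomeTable, so the same format file works against it:

--===== Load the staging table. TABLOCK (plus the other rules in
-- Books Online) is what qualifies the load for minimal logging.
BULK INSERT dbo.SomeTableStaging
FROM 'C:\Temp\SomeFile.txt'
WITH (
FIRSTROW = 2
,FORMATFILE = 'C:\Temp\BCP Format File.fmt'
,TABLOCK
)
;

--===== Validate, then move only the rows that pass to the final table.
INSERT INTO dbo.SomeTable (ColA, ColC)
SELECT ColA, ColC
FROM dbo.SomeTableStaging
WHERE ColA IS NOT NULL --Whatever validation applies goes here.
;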

    [font="Arial Black"]Summary[/font]

    Just to summarize what we've done.

    1. We have a table with more columns than the file has.

    2. We have a file that doesn't have all those columns BUT it does have an extra column that we don't want to import.

    3. We built a non-XML BCP Format File to do the mapping.

    4. We built a BULK INSERT command that imports the text file using the BCP Format File and discussed some options you might not have been aware of.

The BCP Format File allows us to easily import fewer columns than there are in the table (which covers your specific request) and also allows us to ignore "extra fields" in the file that we don't want to import (a frequent case).

    [font="Arial Black"]Your Turn... STUDY![/font]

Heh... your turn. If you're going to use BCP or BULK INSERT to import (or export, in the case of BCP) files and you haven't done a really deep dive in Books Online on all of the wonderful and relatively simple options that either has to offer, then you're missing out on most of what those wonderful tools can actually do. Done correctly, I've not seen anything beat BULK INSERT with a BCP Format File for performance on well-formed text files that have a consistent "shape".

    BWAAA-HAAAA!!! For as often as these types of problems come up on these forums, maybe I'll submit this post as an "SQL Spackle" article.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Thank you for that very thorough reply that sounds a lot like an article on here (hint). I think that my frustration is the general lack of practical examples. I think I'm 4/5ths of the way there already. It's more trying to troubleshoot the errors. The data is being pulled into a staging table to massage it before it's committed, hence the discrepancy in the number of columns between the file and the table.

I'm building an ASP.NET front end for it all and seriously entertaining the idea of checking the first row against the columns in the table to confirm that the user uploaded the correct file to the server, even though (I'm assuming) Bulk Insert should fail.

  • Sweet! It's mostly up and running.

  • Cool... Thanks for the feedback.

    --Jeff Moden



  • I don't usually bookmark threads that aren't full of .gifs, but this is an exception.

    Thanks, Jeff.

  • You bet. Thank you very much for the feedback.

    --Jeff Moden



I'm now entertaining the idea of automagically presenting a list of the fields relevant to the import that the user is attempting, along with the status of each field.

Field            Comment
Project Number   Valid
Date             Valid
Description      Error - This column is missing from the uploaded file.
Desrciption      Note - This column is not recognized by the import and will be ignored.
Comment          Valid
Location         Note - This column is not recognized by the import and will be ignored.
CreatedBy        Note - This column is not recognized by the import and will be ignored.
CreatedDateTime  Note - This column is not recognized by the import and will be ignored.

In this example, 'Description' is required but missing, and 'Desrciption' is provided but will be ignored, so the user probably mistyped the name. The last three columns exist in the import file but will be ignored. Although the format file will handle this, there are vicious and unconfirmed rumors going around that users would like to know when information will and will not be imported.

  • So read the first row (the header row) into a table and do a comparison.

    --Jeff Moden



  • That was my first thought. Then it occurred to me that 1) there are columns in the table that aren't required and 2) there could be columns in the file that don't exist in the table at all.

  • That would be the exact reason for reading the column header into a table and doing the comparison. It would show the basic structure to the user and identify which columns were missing from the expected payload and any columns that are considered extra and unusable.
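A minimal sketch of that idea follows. The temp table is made up for illustration, it assumes the file's header names are meant to match the table's column names, and STRING_SPLIT needs SQL Server 2016+ (on older versions, a splitter like DelimitedSplit8K would stand in):

--===== Read just the header row into a one-column holding table.
CREATE TABLE #FileHeader (HeaderRow VARCHAR(8000));

BULK INSERT #FileHeader
FROM 'C:\Temp\SomeFile.txt'
WITH (LASTROW = 1, ROWTERMINATOR = '\n')
;

--===== Split the header on its delimiter and compare with the table.
SELECT FileColumn = LTRIM(RTRIM(split.value))
,Status = CASE
WHEN tbl.COLUMN_NAME IS NULL
THEN 'Note - not recognized and will be ignored'
ELSE 'Valid'
END
FROM #FileHeader hdr
CROSS APPLY STRING_SPLIT(hdr.HeaderRow, ',') split
LEFT JOIN INFORMATION_SCHEMA.COLUMNS tbl
ON tbl.TABLE_NAME = 'SomeTable'
AND tbl.COLUMN_NAME = LTRIM(RTRIM(split.value))
;

Flipping the join the other way (table LEFT JOIN file) catches the "required but missing" case.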

    --Jeff Moden



Jeff, that is an awesome example of using format files with BCP in SQL Server.

    In the past I have only used bcp format files with Sybase BCP.

    Another option is to pre-load the data via BCP into a work table that has a structure that matches the file.

    Then you can validate the data that was in the file before loading it into the "live" table.

    But of course we lose some of the benefits of BCP (speed, etc.).

    But thanks again Jeff.

  • "We always test and when we do, its in production."

Help needed, please.

I am trying to run the query below for a bulk insert:

    SELECT MATNR,CASE WHEN ERSDA='00000000' THEN NULL ELSE CONVERT(datetime, ERSDA, 102) END AS ERSDA,ERNAM, CASE WHEN LAEDA='00000000' THEN NULL ELSE CONVERT(datetime, LAEDA, 102) END AS LAEDA,AENAM,MTART,MBRSH,MATKL,BISMT,MEINS,SPART,EAN11,NUMTP,EXTWG

    FROM OPENROWSET(BULK 'D:\Loads_Test\Data\MARA.txt' ,FORMATFILE = 'D:\Loads_Test\Formats\MARA.fmt') as a

    and the format file used is as below

1   SQLCHAR  0  18  ":"   1   MATNR  ""
2   SQLCHAR  0  10  ":"   2   ERSDA  ""
3   SQLCHAR  0  12  ":"   3   ERNAM  ""
4   SQLCHAR  0  10  ":"   4   LAEDA  ""
5   SQLCHAR  0  12  ":"   5   AENAM  ""
6   SQLCHAR  0  4   ":"   6   MTART  ""
7   SQLCHAR  0  1   ":"   7   MBRSH  ""
8   SQLCHAR  0  9   ":"   8   MATKL  ""
9   SQLCHAR  0  18  ":"   9   BISMT  ""
10  SQLCHAR  0  3   ":"   10  MEINS  ""
11  SQLCHAR  0  2   ":"   11  SPART  ""
12  SQLCHAR  0  18  ":"   12  EAN11  ""
13  SQLCHAR  0  2   ":"   13  NUMTP  ""
14  SQLCHAR  0  18  ":"   14  EXTWG  ""
15  SQLCHAR  0  4   ":"   0   VKORG  ""
16  SQLCHAR  0  18  ":"   0   PRODH  ""
17  SQLCHAR  0  2   ":"   0   KONDM  ""
18  SQLCHAR  0  2   ":"   0   KTGRM  ""
19  SQLCHAR  0  3   ":"   0   MEINH  ""
20  SQLCHAR  0  6   ":"   0   UMREZ  ""
21  SQLCHAR  0  6   ":"   0   UMREN  ""
22  SQLCHAR  0  40  "\r"  0   MAKTX  ""

I keep getting truncation errors like those below. I am assuming it's because it is not able to identify the carriage return.

    Msg 4863, Level 16, State 1, Line 1

    Bulk load data conversion error (truncation) for row 2, column 2 (ERSDA).

    Msg 4863, Level 16, State 1, Line 1

    Bulk load data conversion error (truncation) for row 3, column 2 (ERSDA).

    Msg 4863, Level 16, State 1, Line 1

    Bulk load data conversion error (truncation) for row 4, column 1 (MATNR).

    Msg 4863, Level 16, State 1, Line 1

    Bulk load data conversion error (truncation) for row 5, column 2 (ERSDA).

    Msg 4863, Level 16, State 1, Line 1

    Bulk load data conversion error (truncation) for row 6, column 2 (ERSDA).

    Msg 4863, Level 16, State 1, Line 1

    Bulk load data conversion error (truncation) for row 7, column 1 (MATNR).

    Msg 4863, Level 16, State 1, Line 1

    Bulk load data conversion error (truncation) for row 8, column 2 (ERSDA).

    Msg 4863, Level 16, State 1, Line 1

    Bulk load data conversion error (truncation) for row 9, column 2 (ERSDA).

    Msg 4863, Level 16, State 1, Line 1

    Bulk load data conversion error (truncation) for row 10, column 1 (MATNR).

    Msg 4863, Level 16, State 1, Line 1

    Bulk load data conversion error (truncation) for row 11, column 2 (ERSDA).

    Msg 4863, Level 16, State 1, Line 1

    Bulk load data conversion error (truncation) for row 12, column 2 (ERSDA).

    Msg 4865, Level 16, State 1, Line 1

    Cannot bulk load because the maximum number of errors (10) was exceeded.

    Msg 7399, Level 16, State 1, Line 1

    The OLE DB provider "BULK" for linked server "(null)" reported an error. The provider did not give any information about the error.

    Msg 7330, Level 16, State 2, Line 1

    Cannot fetch a row from OLE DB provider "BULK" for linked server "(null)".

An example of the data in the MARA.txt file is below; a colon (:) is the delimiter between the fields.

    D1122047-001 :20011022:CONVERSION_4:20100323:DELVAV1 :ROH :D:S-BATT : :EA : : : : :0020: : : :EA : 1 : 1 :BATTERY,12V,DURACELL

    D1122047-002 :20011022:CONVERSION_4:20140507:MALONJ6 :ROH :D:S-BATT : :EA : : : : :0020: : : :EA : 1 : 1 :BATTERY,12V,ENERGIZER

    D1122048-001 :20011022:CONVERSION_4:20140912:RYCROM1 :ROH :D:F-GTELEM : :EA : : : : :0020: : : :EA : 1 : 1 :BATTERY,AAA

    D1122049-001 :20011022:CONVERSION_4:20131113:SAMANM2 :ROH :D:F-P511 : :EA : : : : :0020: : : :EA : 1 : 1 :BTRY,RECHARGEABLE, 3V,4.5MAH,ML621S

    D1122050-001 :20011022:CONVERSION_4:20110817:MALONJ6 :ROH :D:F-P511 : :EA : : : : :0020: : : :EA : 1 : 1 :IC,MAX1675EUA

    D1122051-001 :20011022:CONVERSION_4:20130329:FF-URGENT2 :ROH :D:F-P511 : :EA : : : : :0020: : : :EA : 1 : 1 :IC,5 VOLT CHARGE PUMP

The SELECT query in the above bulk insert returns just the single row below, because it is appending the data from the last column of the previous row to the first column of the next row, and because the 1st column's length is only 18 chars it throws the error (e.g. it returns "BATTERY,12V,DURACELL D1122047-002" where it should return only "D1122047-002" for the first column value; I found this out when I increased the data field length from 18 chars to 50 chars).

MATNR ERSDA ERNAM LAEDA AENAM MTART MBRSH MATKL BISMT MEINS SPART EAN11 NUMTP EXTWG

    D1122047-001 2001-10-22 CONVERSION_4 2010-03-23 DELVAV1 ROH D S-BATT EA

I would recommend creating a new thread. The assumption on most forums is that any post to a thread is directly relevant to the specific issue of the original poster, unless the thread is intended for general discussion. In this instance, the thread was started (by me) to address a specific problem. When threads include multiple problems, it can cause confusion on everyone's part, which may prevent you from seeing the solution. I'm not able to address your problem; however, when I saw a new post, I thought that it was relevant to my problem. If this were your thread, would you want to be receiving notifications about new posts which are not relevant to your problem?
