SQL Clone
SQLServerCentral is supported by Redgate
 
Log in  ::  Register  ::  Not logged in
 
 
 


Skipping Columns (& other tricks) Using BCP Format Files (SQL Spackle)


Skipping Columns (& other tricks) Using BCP Format Files (SQL Spackle)

Author
Message
j-1064772
j-1064772
Say Hey Kid
Say Hey Kid (683 reputation)Say Hey Kid (683 reputation)Say Hey Kid (683 reputation)Say Hey Kid (683 reputation)Say Hey Kid (683 reputation)Say Hey Kid (683 reputation)Say Hey Kid (683 reputation)Say Hey Kid (683 reputation)

Group: General Forum Members
Points: 683 Visits: 1207
Apart from the fact that I needed this 4 years ago, excellent article.
Jeff Moden
Jeff Moden
SSC Guru
SSC Guru (86K reputation)SSC Guru (86K reputation)SSC Guru (86K reputation)SSC Guru (86K reputation)SSC Guru (86K reputation)SSC Guru (86K reputation)SSC Guru (86K reputation)SSC Guru (86K reputation)

Group: General Forum Members
Points: 86826 Visits: 41103
Andy DBA (1/23/2014)
Great article! Exellent choice on where to draw the line on its scope.
Also, sometimes it's just as valuable to learn what can't be done as it is to learn what can be done because then you don't waste time trying to figure out how to do the impossible. So, thank you for confirming right away that BULK INSERT can't import spreadsheet files and for the reminder later on that BOL says using BULK INSERT to import CSV files is not supported.

Jeff Moden - Article
The real key is to get whomever or whatever is providing the data to provide it in a consistant manner just as you would do in a properly designed table.

So true, but easier said than done! Sometimes you gotta take what you get. I've recently had to strip text qualifier quotes from TSVs prior to BULK INSERTING and wondered if I was missing out on some new command line or format file option that was available since the last time I read any documentation. Sadly, I see the answer is "No".

I'm only going to mention one tiny correction in case you did it on purpose to see if anyone is paying attention:

]Jeff Moden - Article
For example, the first "format line" is supposed to be for "ColC" of the table but I've used "Doesn't matter" here because it truly doesn't matter what's in this column.

The posted example has "ColC", not "Doesn't matter". (BTW, I did not know this value was arbitrary, so thank you for pointing that out in your article.)


If the quotes are consistent in the column, then you are, in fact, missing out on a format file option that MS didn't document because they "don't support CSVs". \" is what you use as part of the FIELDTEMINATOR column to represent double quotes. Importing such CSVs using BCP and BULK INSERT will be the subject of another article.

Apologies for the error on the "Doesn't Matter" column notation. I had second thoughts about that and thought I had caught all of the places in the article that were based on that. Apparently, I missed one. I'll submit a correction for that. Thanks for bringing it to my attention.

--Jeff Moden

RBAR is pronounced ree-bar and is a Modenism for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column.
If you think its expensive to hire a professional to do the job, wait until you hire an amateur. -- Red Adair

Helpful Links:
How to post code problems
How to post performance problems
Forum FAQs
Andy DBA
Andy DBA
SSC Veteran
SSC Veteran (252 reputation)SSC Veteran (252 reputation)SSC Veteran (252 reputation)SSC Veteran (252 reputation)SSC Veteran (252 reputation)SSC Veteran (252 reputation)SSC Veteran (252 reputation)SSC Veteran (252 reputation)

Group: General Forum Members
Points: 252 Visits: 762
Jeff Moden (1/23/2014)
If the quotes are consistent in the column, then you are, in fact, missing out on a format file option that MS didn't document because they "don't support CSVs". \" is what you use as part of the FIELDTEMINATOR column to represent double quotes. Importing such CSVs using BCP and BULK INSERT will be the subject of another article.
Of course! Yes, I do remember coming across the approach where double quotes are included as part of the field terminator. There's also a way to "eat" the leading double quote from the first column, but I'll wait for your next article on the topic. I was able to take an approach where I didn't need format files, but I guess what I was really hoping for is a Magically-treat-double quotes-like-Excel-does command line option!:-D



davidb-727770
davidb-727770
Grasshopper
Grasshopper (23 reputation)Grasshopper (23 reputation)Grasshopper (23 reputation)Grasshopper (23 reputation)Grasshopper (23 reputation)Grasshopper (23 reputation)Grasshopper (23 reputation)Grasshopper (23 reputation)

Group: General Forum Members
Points: 23 Visits: 8
I'm using SQL2012 and following your steps as carefully as possible.
I have copy and paste your example code and data.
When I run the BULK INSERT in Studio the result is error 'Cannot bulk load. Invalid column number in the format file...'
I have googled for more info but no-one seems to have an answer that makes sense to me.

Suggestions?
davidb-727770
davidb-727770
Grasshopper
Grasshopper (23 reputation)Grasshopper (23 reputation)Grasshopper (23 reputation)Grasshopper (23 reputation)Grasshopper (23 reputation)Grasshopper (23 reputation)Grasshopper (23 reputation)Grasshopper (23 reputation)

Group: General Forum Members
Points: 23 Visits: 8
Ignore my last.
When I copied the BULK INSERT statement I missed the last semi-colon. The error message was misleading.
Yes Jeff. Works like a charm and thanks for posting it.
Jeff Moden
Jeff Moden
SSC Guru
SSC Guru (86K reputation)SSC Guru (86K reputation)SSC Guru (86K reputation)SSC Guru (86K reputation)SSC Guru (86K reputation)SSC Guru (86K reputation)SSC Guru (86K reputation)SSC Guru (86K reputation)

Group: General Forum Members
Points: 86826 Visits: 41103
Andy DBA (1/23/2014)
Jeff Moden (1/23/2014)
If the quotes are consistent in the column, then you are, in fact, missing out on a format file option that MS didn't document because they "don't support CSVs". \" is what you use as part of the FIELDTEMINATOR column to represent double quotes. Importing such CSVs using BCP and BULK INSERT will be the subject of another article.
Of course! Yes, I do remember coming across the approach where double quotes are included as part of the field terminator. There's also a way to "eat" the leading double quote from the first column, but I'll wait for your next article on the topic. I was able to take an approach where I didn't need format files, but I guess what I was really hoping for is a Magically-treat-double quotes-like-Excel-does command line option!:-D


The old 32 bit "Jet" drivers used to work a treat for exactly that using OPENROWSET. Ii've not tried the 64 bit ACE drivers but rumor has it they they work just as well. The problem with OPENROWSET is that it requires some serious privs whereas BULK INSERT does not.

--Jeff Moden

RBAR is pronounced ree-bar and is a Modenism for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column.
If you think its expensive to hire a professional to do the job, wait until you hire an amateur. -- Red Adair

Helpful Links:
How to post code problems
How to post performance problems
Forum FAQs
sqldriver
sqldriver
SSC Eights!
SSC Eights! (978 reputation)SSC Eights! (978 reputation)SSC Eights! (978 reputation)SSC Eights! (978 reputation)SSC Eights! (978 reputation)SSC Eights! (978 reputation)SSC Eights! (978 reputation)SSC Eights! (978 reputation)

Group: General Forum Members
Points: 978 Visits: 2518
This article helped me big time having to load the 18Gb text file here. All I needed was the MD5 column, though I find it somewhat odd that the destination table has to have all the columns in it, as opposed to just the one you're loading into from.

I ran into a similar problem as another poster with the text file having double quotes, but deleting dupes from and then replacing the double quotes in a table performed about as well as can be expected, especially vs. PowerShell, which processes large text files embarrassingly slow, even using the System.IO.StreamReader or System.IO.File. LBAL is just as bad as RBAR :-P

SQL and format file contents are below.


USE [nist]

--CREATE TABLE dbo.nist_stage
-- (
--[SHA1] VARCHAR(10),
--[MD5] CHAR(40),
--[CRC32] VARCHAR(10),
--[FileName] VARCHAR(10),
--[FileSize] VARCHAR(10),
--[ProductCode] VARCHAR(10),
--[OpSystemCode] VARCHAR(10),
--[SpecialCode] VARCHAR(10)
-- )

--DROP TABLE dbo.nist_stage

--TRUNCATE TABLE dbo.nist_stage

--CREATE CLUSTERED INDEX [MD5] ON [dbo].[nist_stage]
--(
-- [MD5] ASC
--)

--DROP INDEX [MD5] ON [dbo].[nist_stage]

BULK INSERT dbo.nist_stage
FROM 'M:\bcp\NSRLFile.txt'
WITH (
BATCHSIZE = 50000
,CODEPAGE = 'RAW'
,DATAFILETYPE = 'char'
,FIRSTROW = 2
,FORMATFILE = 'M:\bcp\Admin\format.fmt'
,MAXERRORS = 2000000000
,ERRORFILE = 'M:\bcp\Admin\error.log'
,TABLOCK
)

--SELECT TOP 1000 * FROM dbo.nist_stage

;WITH dedupe AS (
SELECT MD5, ROW_NUMBER() OVER (PARTITION BY MD5 ORDER BY MD5) Rn
FROM dbo.nist_stage AS ns
)
DELETE D
FROM dedupe D
WHERE D.Rn > 1

UPDATE dbo.nist_stage
SET MD5 = REPLACE(MD5, '"', '')

DROP INDEX [MD5] ON [dbo].[NSRLFiles]

INSERT INTO dbo.NSRLFiles (MD5)
SELECT ns.MD5
FROM dbo.nist_stage AS ns;

CREATE CLUSTERED INDEX [MD5] ON [dbo].[NSRLFiles]
(
[MD5] ASC
)


/*
Times with clustered index present on staging table

Load: 31:51
Delete dupes: 3:53
Remove double quotes: 8:25

Time without clustered index present on staging table

Load: 5:04
Delete dupes: 9:42
Remove double quotes: 1:32


Times with PowerShell processing file:

Delete dupes: Ew
Remove double quotes: Nope
Load: Why?


*/




11.0
8
1 SQLCHAR 0 8000 "," 0 SHA1 ""
2 SQLCHAR 0 8000 "," 2 MD5 ""
3 SQLCHAR 0 8000 "," 0 CRC32 ""
4 SQLCHAR 0 8000 "," 0 FileName ""
5 SQLCHAR 0 8000 "," 0 FileSize ""
6 SQLCHAR 0 8000 "," 0 ProductCode ""
7 SQLCHAR 0 8000 "," 0 OpSystemCode ""
8 SQLCHAR 0 8000 "\r" 0 SpecialCode ""

Go


Permissions

You can't post new topics.
You can't post topic replies.
You can't post new polls.
You can't post replies to polls.
You can't edit your own topics.
You can't delete your own topics.
You can't edit other topics.
You can't delete other topics.
You can't edit your own posts.
You can't edit other posts.
You can't delete your own posts.
You can't delete other posts.
You can't post events.
You can't edit your own events.
You can't edit other events.
You can't delete your own events.
You can't delete other events.
You can't send private messages.
You can't send emails.
You can read topics.
You can't vote in polls.
You can't upload attachments.
You can download attachments.
You can't post HTML code.
You can't edit HTML code.
You can't post IFCode.
You can't post JavaScript.
You can post emoticons.
You can't post or upload images.

Select a forum

































































































































































SQLServerCentral


Search