Skipping Columns (& other tricks) Using BCP Format Files (SQL Spackle)
Posted Thursday, January 23, 2014 1:05 PM
SSC-Enthusiastic


Group: General Forum Members
Last Login: Thursday, September 4, 2014 2:07 PM
Points: 105, Visits: 496
Apart from the fact that I needed this 4 years ago, excellent article.
Post #1534233
Posted Thursday, January 23, 2014 1:36 PM


SSC-Dedicated


Group: General Forum Members
Last Login: Today @ 7:36 AM
Points: 35,218, Visits: 31,677
Andy DBA (1/23/2014)
Great article! Excellent choice on where to draw the line on its scope.
Also, sometimes it's just as valuable to learn what can't be done as it is to learn what can be done because then you don't waste time trying to figure out how to do the impossible. So, thank you for confirming right away that BULK INSERT can't import spreadsheet files and for the reminder later on that BOL says using BULK INSERT to import CSV files is not supported.

Jeff Moden - Article
The real key is to get whomever or whatever is providing the data to provide it in a consistent manner, just as you would do in a properly designed table.

So true, but easier said than done! Sometimes you gotta take what you get. I've recently had to strip text qualifier quotes from TSVs prior to BULK INSERTING and wondered if I was missing out on some new command line or format file option that was available since the last time I read any documentation. Sadly, I see the answer is "No".

I'm only going to mention one tiny correction in case you did it on purpose to see if anyone is paying attention:

Jeff Moden - Article
For example, the first "format line" is supposed to be for "ColC" of the table but I've used "Doesn't matter" here because it truly doesn't matter what's in this column.

The posted example has "ColC", not "Doesn't matter". (BTW, I did not know this value was arbitrary, so thank you for pointing that out in your article.)


If the quotes are consistent in the column, then you are, in fact, missing out on a format file option that MS didn't document because they "don't support CSVs". \" is what you use as part of the FIELDTERMINATOR column to represent double quotes. Importing such CSVs using BCP and BULK INSERT will be the subject of another article.
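As a quick, untested sketch of the idea, a format file for a two-column file whose rows look like "abc","123" could make the quote pairs part of the terminators, something like this:

11.0
2
1   SQLCHAR   0   50   "\",\""    1   ColA   ""
2   SQLCHAR   0   50   "\"\r\n"   2   ColB   ""

That still leaves the leading double quote on the first field of each row to be handled, which I'll also cover in the article.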

Apologies for the error on the "Doesn't Matter" column notation. I had second thoughts about that and thought I had caught all of the places in the article that were based on that. Apparently, I missed one. I'll submit a correction for that. Thanks for bringing it to my attention.


--Jeff Moden
"RBAR is pronounced "ree-bar" and is a "Modenism" for "Row-By-Agonizing-Row".

First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column."

(play on words) "Just because you CAN do something in T-SQL, doesn't mean you SHOULDN'T." --22 Aug 2013

Helpful Links:
How to post code problems
How to post performance problems
Post #1534248
Posted Thursday, January 23, 2014 4:12 PM
SSC-Enthusiastic


Group: General Forum Members
Last Login: Friday, September 12, 2014 8:39 AM
Points: 147, Visits: 539
Jeff Moden (1/23/2014)
If the quotes are consistent in the column, then you are, in fact, missing out on a format file option that MS didn't document because they "don't support CSVs". \" is what you use as part of the FIELDTEMINATOR column to represent double quotes. Importing such CSVs using BCP and BULK INSERT will be the subject of another article.
Of course! Yes, I do remember coming across the approach where double quotes are included as part of the field terminator. There's also a way to "eat" the leading double quote from the first column, but I'll wait for your next article on the topic. I was able to take an approach where I didn't need format files, but I guess what I was really hoping for was a magically-treat-double-quotes-like-Excel-does command line option!
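(For anyone who lands here searching: if memory serves, the leading-quote trick is an extra zero-length field at the top of the format file, terminated by the opening quote and mapped to server column 0 so it gets thrown away. Untested, from memory, for a two-column file like the one above:

11.0
3
1   SQLCHAR   0   0    "\""       0   Dummy   ""
2   SQLCHAR   0   50   "\",\""    1   ColA    ""
3   SQLCHAR   0   50   "\"\r\n"   2   ColB    ""

Treat that as a sketch until the article covers it properly.)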



Post #1534292
Posted Thursday, January 23, 2014 6:29 PM
Forum Newbie


Group: General Forum Members
Last Login: Tuesday, July 22, 2014 5:02 PM
Points: 3, Visits: 6
I'm using SQL2012 and following your steps as carefully as possible.
I copied and pasted your example code and data.
When I run the BULK INSERT in Management Studio, I get the error 'Cannot bulk load. Invalid column number in the format file...'
I have googled for more info, but no one seems to have an answer that makes sense to me.

Suggestions?
Post #1534312
Posted Thursday, January 23, 2014 6:37 PM
Forum Newbie


Group: General Forum Members
Last Login: Tuesday, July 22, 2014 5:02 PM
Points: 3, Visits: 6
Ignore my last.
When I copied the BULK INSERT statement, I missed the trailing semicolon. The error message was misleading.
Yes, Jeff, it works like a charm. Thanks for posting it.
Post #1534313
Posted Thursday, January 23, 2014 7:15 PM


SSC-Dedicated


Group: General Forum Members
Last Login: Today @ 7:36 AM
Points: 35,218, Visits: 31,677
Andy DBA (1/23/2014)
Jeff Moden (1/23/2014)
If the quotes are consistent in the column, then you are, in fact, missing out on a format file option that MS didn't document because they "don't support CSVs". \" is what you use as part of the FIELDTERMINATOR column to represent double quotes. Importing such CSVs using BCP and BULK INSERT will be the subject of another article.
Of course! Yes, I do remember coming across the approach where double quotes are included as part of the field terminator. There's also a way to "eat" the leading double quote from the first column, but I'll wait for your next article on the topic. I was able to take an approach where I didn't need format files, but I guess what I was really hoping for was a magically-treat-double-quotes-like-Excel-does command line option!


The old 32-bit "Jet" drivers used to work a treat for exactly that using OPENROWSET. I've not tried the 64-bit ACE drivers, but rumor has it that they work just as well. The problem with OPENROWSET is that it requires some serious privs, whereas BULK INSERT does not.
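If you want to experiment, the pattern is something like this (folder and file names are made up, and it needs the ACE provider installed plus ad hoc distributed queries enabled):

SELECT *
FROM OPENROWSET(
    'Microsoft.ACE.OLEDB.12.0',
    'Text;Database=C:\Import\;HDR=YES;FMT=Delimited',
    'SELECT * FROM MyQuotedFile.csv');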


--Jeff Moden
"RBAR is pronounced "ree-bar" and is a "Modenism" for "Row-By-Agonizing-Row".

First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column."

(play on words) "Just because you CAN do something in T-SQL, doesn't mean you SHOULDN'T." --22 Aug 2013

Helpful Links:
How to post code problems
How to post performance problems
Post #1534314
Posted Thursday, August 21, 2014 9:46 AM


SSC-Addicted


Group: General Forum Members
Last Login: Today @ 10:29 AM
Points: 432, Visits: 1,755
This article helped me big time when I had to load the 18GB text file here. All I needed was the MD5 column, though I find it somewhat odd that the destination table has to have all of the columns in it, as opposed to just the one you're loading.

I ran into a similar problem as another poster, with the text file having double quotes, but deleting the dupes and then replacing the double quotes in a table performed about as well as can be expected, especially vs. PowerShell, which processes large text files embarrassingly slowly, even using System.IO.StreamReader or System.IO.File. LBAL (Line-By-Agonizing-Line) is just as bad as RBAR.

SQL and format file contents are below.

USE [nist]

--CREATE TABLE dbo.nist_stage
-- (
--[SHA1] VARCHAR(10),
--[MD5] CHAR(40),
--[CRC32] VARCHAR(10),
--[FileName] VARCHAR(10),
--[FileSize] VARCHAR(10),
--[ProductCode] VARCHAR(10),
--[OpSystemCode] VARCHAR(10),
--[SpecialCode] VARCHAR(10)
-- )

--DROP TABLE dbo.nist_stage

--TRUNCATE TABLE dbo.nist_stage

--CREATE CLUSTERED INDEX [MD5] ON [dbo].[nist_stage]
--(
-- [MD5] ASC
--)

--DROP INDEX [MD5] ON [dbo].[nist_stage]

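-- Load the 18GB NSRL file; the format file below maps only the MD5 field into the staging table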
BULK INSERT dbo.nist_stage
FROM 'M:\bcp\NSRLFile.txt'
WITH (
BATCHSIZE = 50000
,CODEPAGE = 'RAW'
,DATAFILETYPE = 'char'
,FIRSTROW = 2
,FORMATFILE = 'M:\bcp\Admin\format.fmt'
,MAXERRORS = 2000000000
,ERRORFILE = 'M:\bcp\Admin\error.log'
,TABLOCK
)

--SELECT TOP 1000 * FROM dbo.nist_stage

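-- Delete duplicate MD5 values, keeping one row per hash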
;WITH dedupe AS (
SELECT MD5, ROW_NUMBER() OVER (PARTITION BY MD5 ORDER BY MD5) Rn
FROM dbo.nist_stage AS ns
)
DELETE D
FROM dedupe D
WHERE D.Rn > 1

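-- Strip the text-qualifier double quotes that came in with the load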
UPDATE dbo.nist_stage
SET MD5 = REPLACE(MD5, '"', '')

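-- Drop the target's clustered index for the insert, then rebuild it afterward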
DROP INDEX [MD5] ON [dbo].[NSRLFiles]

INSERT INTO dbo.NSRLFiles (MD5)
SELECT ns.MD5
FROM dbo.nist_stage AS ns;

CREATE CLUSTERED INDEX [MD5] ON [dbo].[NSRLFiles]
(
[MD5] ASC
)


/*
Times with clustered index present on staging table

Load: 31:51
Delete dupes: 3:53
Remove double quotes: 8:25

Time without clustered index present on staging table

Load: 5:04
Delete dupes: 9:42
Remove double quotes: 1:32


Times with PowerShell processing file:

Delete dupes: Ew
Remove double quotes: Nope
Load: Why?


*/


11.0
8
1 SQLCHAR 0 8000 "," 0 SHA1 ""
2 SQLCHAR 0 8000 "," 2 MD5 ""
3 SQLCHAR 0 8000 "," 0 CRC32 ""
4 SQLCHAR 0 8000 "," 0 FileName ""
5 SQLCHAR 0 8000 "," 0 FileSize ""
6 SQLCHAR 0 8000 "," 0 ProductCode ""
7 SQLCHAR 0 8000 "," 0 OpSystemCode ""
8 SQLCHAR 0 8000 "\r" 0 SpecialCode ""
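
In the format file, the sixth column is the server column number: the 2 on the MD5 line maps that field to the second column of the staging table, and the 0 on every other line tells BULK INSERT to read the field and discard it, which is the skipping trick from the article.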
Post #1605905