SQL Bulk Insert - using " to wrap text fields

  • I want to use BULK INSERT to load data. The data is a CSV file, with text fields surrounded by double quotes (") to prevent errors from commas appearing within field values.

    I am unable to work out how to achieve this using the T-SQL BULK INSERT command. Is it possible, or is a DTS package the way to go?

  • BULK INSERT (and BCP) will not honour quotes; it treats them as ordinary data.

    Your best bet is to use DTS, which will let you handle the quoted fields correctly.

    Far away is close at hand in the images of elsewhere.
    Anon.

  • Hello,

    You may want to read about format files. Both bcp and BULK INSERT let you use so-called "format files", in which you can specify that the delimiter between fields is ",".

    The only problem here is that you would need to write a format file for each file you import.
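
    For example, a minimal sketch (not from the original post) of a format file for a plain two-field CSV, where the columns are host field number, data type, prefix length, field length, terminator, target column number, a documentation name, and collation:

    8.0
    2
    1  SQLCHAR  0  100  ","     1  FirstField   ""
    2  SQLCHAR  0  100  "\r\n"  2  SecondField  ""

    It could then be used like this (table and file names are hypothetical):

    BULK INSERT dbo.ImportTest
    FROM 'c:\temp\data.csv'
    WITH (FORMATFILE = 'c:\temp\data.fmt')

    Note that this simple version only sets the delimiters; it does not deal with the surrounding quotes (a later post in this thread extends the idea to do that).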


    Andrey

  • You would have to strip the double quotes after the data is loaded, as in the following example:

    CREATE TABLE #tblTest
    (
      id int IDENTITY(1,1)
    , data varchar(70)
    )

    SET NOCOUNT ON
    INSERT #tblTest (data) VALUES ('"This is line 1."')
    INSERT #tblTest (data) VALUES ('"This is line 2."')
    INSERT #tblTest (data) VALUES ('Since this line ends with a ", we want to keep the "')
    INSERT #tblTest (data) VALUES ('This is line four (4)')
    SET NOCOUNT OFF

    SELECT * FROM #tblTest

    UPDATE #tblTest
       SET data = CASE
                    WHEN data = '""' THEN ''
                    WHEN LEFT(data,1) = '"' AND RIGHT(data,1) = '"' THEN SUBSTRING(data,2,LEN(data)-2)
                    ELSE data
                  END

    SELECT * FROM #tblTest

    DROP TABLE #tblTest

  • Thanks for the responses.  It looks like a SQL script is a no-no.  Stripping the quotes is ok, except I can't get the data into the table in the first place using a SQL script.

  • If you want to do this in T-SQL then you could do it this way:

    CREATE TABLE test (F1 varchar(100))

    BULK INSERT test FROM 'c:\temp\test.csv' WITH ( ROWTERMINATOR = '\r\n' )

    DECLARE @updct int
    SET @updct = 1
    WHILE (@updct > 0)
    BEGIN
      UPDATE test
         SET F1 = STUFF(F1,
                        PATINDEX('%,"%', F1) + 1,
                        PATINDEX('%",%', F1) - PATINDEX('%,"%', F1),
                        REPLACE(REPLACE(SUBSTRING(F1, PATINDEX('%,"%', F1) + 1,
                                                  PATINDEX('%",%', F1) - PATINDEX('%,"%', F1)),
                                        '"', ''), ',', ' '))
       WHERE CHARINDEX('"', F1) > 0
      SET @updct = @@ROWCOUNT
    END

    This replaces any comma within a set of quotes with a space and removes the quotes. You can then write a function (there are some on this site) to parse the comma-separated data into columns.
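
    For instance, once the quotes are gone, a minimal parsing sketch along these lines works (this assumes exactly three comma-separated values per row; the column aliases are hypothetical):

    -- assumes embedded commas were already replaced by the loop above
    SELECT
        SUBSTRING(F1, 1, CHARINDEX(',', F1) - 1) AS Col1,
        SUBSTRING(F1, CHARINDEX(',', F1) + 1,
                  CHARINDEX(',', F1, CHARINDEX(',', F1) + 1) - CHARINDEX(',', F1) - 1) AS Col2,
        SUBSTRING(F1, CHARINDEX(',', F1, CHARINDEX(',', F1) + 1) + 1, LEN(F1)) AS Col3
    FROM test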

     

    Far away is close at hand in the images of elsewhere.
    Anon.

  • Thanks for that suggestion.  I'm going to have a play with this, because I think this might be just what I need 🙂

     

  • You are a genius - a bit of tweaking and I think we have a winner 🙂

    Thank you.

  • I have used code similar to what follows to import comma- and quote-delimited text. The number of rows and columns is usually small.

    BULK INSERT dbname.dbo.tablename FROM '\\servername\directoryname\filename.csv'
    WITH
    (
      DATAFILETYPE    = 'char',
      FIELDTERMINATOR = '","', -- uses "," as the field delimiter
      ROWTERMINATOR   = '"\n'  -- uses " plus the end-of-line marker as the end of row
    )

    The foregoing addresses all of the comma-quote delimiters except the first quote. A simple update such as what follows will rectify that:

    UPDATE tablename
    SET field1 = REPLACE(field1, '"', '')
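
    Put together, a runnable sketch of the whole approach might look like this (table, column, and file names are hypothetical):

    CREATE TABLE dbo.ImportStage
    (
      City  varchar(50),
      State varchar(50),
      Notes varchar(200)
    )

    BULK INSERT dbo.ImportStage FROM 'c:\temp\quoted.csv'
    WITH
    (
      DATAFILETYPE    = 'char',
      FIELDTERMINATOR = '","',
      ROWTERMINATOR   = '"\n'
    )

    -- each row still carries the stray leading quote on its first field
    UPDATE dbo.ImportStage SET City = REPLACE(City, '"', '')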

  • All good suggestions, and I like the simplicity of Fred Wadley's solution. However, sometimes you can't use BULK INSERT because the DBA won't give you either "SA" permissions or "Bulk Admin" permissions, which are required to use BULK INSERT. DTS is almost always slower than either BULK INSERT or BCP, and there's that permissions thing, again.

    With all of that in mind, I usually use BCP, especially when loading millions of records. If you read up on all of the nifty parameters in Books Online, BCP can be set up to capture "failed" records to a separate file for further examination and possible correction, as well as other things.
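
    As a sketch, such a command might look like this (server, database, and file names are hypothetical; -f names the format file, -e names the error file, -m sets the error tolerance, and -T uses a trusted connection):

    bcp MyDb.dbo.StagingTable in \\servername\share\data.csv -f c:\temp\data.fmt -e c:\temp\data.err -m 10 -S servername -T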

    As most have found, and some have pointed out in the postings above, BCP requires a format file to handle fields wrapped in double quotes (which is actually a very common format). Here's an example format file to do such a thing...

    8.0
    10
    1  SQLCHAR 0   1 ""        0 FirstTerm    ""
    2  SQLCHAR 0 LLL "\",\""  NNN FirstField   ""
    3  SQLCHAR 0 LLL "\",\""  NNN SecondField  ""
    4  SQLCHAR 0 LLL "\",\""  NNN ThirdField   ""
    5  SQLCHAR 0 LLL "\",\""  NNN FourthField  ""
    6  SQLCHAR 0 LLL "\",\""  NNN FifthField   ""
    7  SQLCHAR 0 LLL "\",\""  NNN SixthField   ""
    8  SQLCHAR 0 LLL "\",\""  NNN SeventhField ""
    9  SQLCHAR 0 LLL "\",\""  NNN EighthField  ""
    10 SQLCHAR 0 LLL "\"\r\n" NNN NinthField   ""

    There are a couple of things about this format file that you need to know...

    1. The field count marked on the second row will always be 1 more than the actual number of fields in the file, to accommodate the first double quote.
    2. The third line (the first format line) handles the first double quote by assigning it to table column #0, meaning that it's simply discarded. Notice that there is no terminator assigned to this field because it is a single character.
    3. The "LLL" in each format line should be replaced with the field length; my recommendation is to always make this value 999 (see the filled-in example after this list). That way, if an incoming field is too long for the datatype of the related column in the target table, the row will be correctly rejected instead of the field being silently truncated. Along with the -m parameter (number of errors before failure; the default is 10), the -e parameter (location of the error file), and a few other BCP parameter settings, the rejected rows will be captured in an error file and all of the good rows will load successfully.
    4. The "NNN" represents the column number of the target table. A "0" in this column means that the field will simply be discarded. These column numbers do not need to be in the same order as the incoming file, which allows you to do a bit of field mapping between the incoming file and the target table, as you would with DTS.
    5. The column containing all of the "NthField" info is more for documentation than anything else... it will NOT affect which table column is loaded, but it IS very helpful if you use the correct column name.
    6. The last column of "" is for the collation sequence; as you can tell, no collation is actually required, but something must be there. Hence, the "".
    7. Since the first double quote is dropped, all we have to worry about is the terminator at the end of each field. Since we need "," for all of those, we have to build a terminator value. \" identifies a single double quote. We need two of those separated by a comma, so we get \",\". That, in itself, must be wrapped in double quotes, so we get "\",\"" for the intermediate terminators.
    8. The terminator of the last column is a double quote followed by the normal end-of-record marker (carriage return) and the new-line marker. Enclosed in double quotes, that gives "\"\r\n".
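
    To make the placeholders concrete, here is the same format file with "LLL" set to 999 and "NNN" mapped one-to-one to columns 1 through 9 of a staging table whose columns happen to be in file order (as noted later in this thread, BOL for SQL2K says the \" escape may not be honoured, so test this against your own build):

    8.0
    10
    1  SQLCHAR 0   1 ""        0 FirstTerm    ""
    2  SQLCHAR 0 999 "\",\""   1 FirstField   ""
    3  SQLCHAR 0 999 "\",\""   2 SecondField  ""
    4  SQLCHAR 0 999 "\",\""   3 ThirdField   ""
    5  SQLCHAR 0 999 "\",\""   4 FourthField  ""
    6  SQLCHAR 0 999 "\",\""   5 FifthField   ""
    7  SQLCHAR 0 999 "\",\""   6 SixthField   ""
    8  SQLCHAR 0 999 "\",\""   7 SeventhField ""
    9  SQLCHAR 0 999 "\",\""   8 EighthField  ""
    10 SQLCHAR 0 999 "\"\r\n"  9 NinthField   ""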

    Of course, you can add as many intermediate fields as you need. And, the BCP batch job can be scheduled using the Windows Task Scheduler (or any other scheduler) instead of having to bug the DBAs to set up a Job in SQL Server.

    It takes a minute or two per field (including figuring out the mapping of each field) but, in the long run, it's really worth it. It will be the fastest bulk insert you can write (especially if you temporarily set the "Select Into/Bulk Copy" option to "TRUE"), it provides error logging and error-record capture, and it really doesn't require much in the way of permissions, especially if the target table (which should usually be a staging table) has been blessed by the DBAs.
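
    On SQL Server 2000, that option can be toggled with sp_dboption (the database name here is hypothetical):

    -- enable minimally logged bulk operations for the duration of the load
    EXEC sp_dboption 'MyDb', 'select into/bulkcopy', 'TRUE'
    -- ... run the BCP load here ...
    EXEC sp_dboption 'MyDb', 'select into/bulkcopy', 'FALSE'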

    Even if you do have the necessary permissions to use BULK INSERT, it is almost always better to define the input file with a format file, so you can do things like ignore the first double quote and set up logging of input lines that are incorrectly formatted or that don't meet the datatype requirements of each column of the target table. Since I've done the "hard work" for you, all you have to do is a little CPR (cut, paste, replace) on the format file I posted above.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Thanks to everyone who has responded: I've got a lot of info now (and one working prototype).

     

  • Jeff,

    Nice solution, if only it were possible. As stated in BOL (SQL2K SP3), and borne out in practice, escaping double quotes (\") is not allowed with BCP.

    Far away is close at hand in the images of elsewhere.
    Anon.

  • Hi David,

    The example I gave was made generic from production code that I currently have working in the same fashion offered. Dunno what to say about the note in BOL. I'd never read that before, so it may be that it only works because I didn't know it couldn't be done, kinda like the bumble-bee that isn't supposed to be able to fly.

    --Jeff Moden



  • I can't get it to work at all

    Far away is close at hand in the images of elsewhere.
    Anon.

  • I'll find the production code and see if I did anything different...  sorry for the confusion.

    --Jeff Moden


