SQL Bulk Insert - using " to wrap text fields

  • I want to use BULK INSERT to load data. The data is a CSV file, with text fields surrounded by double quotes (") to prevent errors from commas appearing within field values.

    I am unable to work out how to achieve this using the T-SQL BULK INSERT command. Is it possible, or is a DTS package the way to go?

  • BULK INSERT (and BCP) will not honour quotes; it treats them as ordinary data.

    Your best bet is to use DTS, which will let you handle the quoted fields correctly.

    Far away is close at hand in the images of elsewhere.
    Anon.

  • Hello,

    You may want to read about format files. Both bcp and BULK INSERT let you use so-called "format files", in which you can specify that the delimiter between fields is ",".

    The only problem here is that you would need to write a format file for each file you import.
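
    For example, a minimal sketch (not from the original post) of a format file for a plain two-field CSV, where the columns are host field number, data type, prefix length, field length, terminator, target column number, a documentation name, and collation:

    8.0
    2
    1  SQLCHAR  0  100  ","     1  FirstField   ""
    2  SQLCHAR  0  100  "\r\n"  2  SecondField  ""

    It could then be used like this (table and file names are hypothetical):

    BULK INSERT dbo.ImportTest
    FROM 'c:\temp\data.csv'
    WITH (FORMATFILE = 'c:\temp\data.fmt')

    Note that this simple version only sets the delimiters; it does not deal with the surrounding quotes (a later post in this thread extends the idea to do that).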


    Andrey

  • You would have to strip the double quotes after the data is loaded, as in the following example:

    CREATE TABLE #tblTest
    (
      id int IDENTITY(1,1)
    , data varchar(70)
    )

    SET NOCOUNT ON
    INSERT #tblTest (data) VALUES ('"This is line 1."')
    INSERT #tblTest (data) VALUES ('"This is line 2."')
    INSERT #tblTest (data) VALUES ('Since this line ends with a ", we want to keep the "')
    INSERT #tblTest (data) VALUES ('This is line four (4)')
    SET NOCOUNT OFF

    SELECT * FROM #tblTest

    UPDATE #tblTest
       SET data = CASE
                    WHEN data = '""' THEN ''
                    WHEN LEFT(data,1) = '"' AND RIGHT(data,1) = '"' THEN SUBSTRING(data,2,LEN(data)-2)
                    ELSE data
                  END

    SELECT * FROM #tblTest

    DROP TABLE #tblTest

  • Thanks for the responses.  It looks like a SQL script is a no-no.  Stripping the quotes is ok, except I can't get the data into the table in the first place using a SQL script.

  • If you want to do this in T-SQL then you could do it this way:

    CREATE TABLE test (F1 varchar(100))

    BULK INSERT test FROM 'c:\temp\test.csv' WITH ( ROWTERMINATOR = '\r\n' )

    DECLARE @updct int
    SET @updct = 1
    WHILE (@updct > 0)
    BEGIN
      UPDATE test
         SET F1 = STUFF(F1,
                        PATINDEX('%,"%', F1) + 1,
                        PATINDEX('%",%', F1) - PATINDEX('%,"%', F1),
                        REPLACE(REPLACE(SUBSTRING(F1, PATINDEX('%,"%', F1) + 1,
                                                  PATINDEX('%",%', F1) - PATINDEX('%,"%', F1)),
                                        '"', ''), ',', ' '))
       WHERE CHARINDEX('"', F1) > 0
      SET @updct = @@ROWCOUNT
    END

    This replaces any comma within a set of quotes with a space and removes the quotes. You can then write a function (there are some on this site) to parse the comma-separated data into columns.
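
    For instance, once the quotes are gone, a minimal parsing sketch along these lines works (this assumes exactly three comma-separated values per row; the column aliases are hypothetical):

    -- assumes embedded commas were already replaced by the loop above
    SELECT
        SUBSTRING(F1, 1, CHARINDEX(',', F1) - 1) AS Col1,
        SUBSTRING(F1, CHARINDEX(',', F1) + 1,
                  CHARINDEX(',', F1, CHARINDEX(',', F1) + 1) - CHARINDEX(',', F1) - 1) AS Col2,
        SUBSTRING(F1, CHARINDEX(',', F1, CHARINDEX(',', F1) + 1) + 1, LEN(F1)) AS Col3
    FROM test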

     

    Far away is close at hand in the images of elsewhere.
    Anon.

  • Thanks for that suggestion.  I'm going to have a play with this, because I think this might be just what I need 🙂

     

  • You are a genius - a bit of tweaking and I think we have a winner 🙂

    Thank you.

  • I have used code similar to what follows to import comma- and quote-delimited text. The number of rows and columns is usually small.

    BULK INSERT dbname.dbo.tablename FROM '\\servername\directoryname\filename.csv'
    WITH
    (
      DATAFILETYPE    = 'char',
      FIELDTERMINATOR = '","', -- uses "," as the field delimiter
      ROWTERMINATOR   = '"\n'  -- uses " plus the end-of-line marker as the end of row
    )

    The foregoing addresses all of the comma-quote delimiters except the first quote. A simple update such as what follows will rectify that:

    UPDATE tablename
    SET field1 = REPLACE(field1, '"', '')
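
    Put together, a runnable sketch of the whole approach might look like this (table, column, and file names are hypothetical):

    CREATE TABLE dbo.ImportStage
    (
      City  varchar(50),
      State varchar(50),
      Notes varchar(200)
    )

    BULK INSERT dbo.ImportStage FROM 'c:\temp\quoted.csv'
    WITH
    (
      DATAFILETYPE    = 'char',
      FIELDTERMINATOR = '","',
      ROWTERMINATOR   = '"\n'
    )

    -- each row still carries the stray leading quote on its first field
    UPDATE dbo.ImportStage SET City = REPLACE(City, '"', '')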

  • All good suggestions, and I like the simplicity of Fred Wadley's solution. However, sometimes you can't use BULK INSERT because the DBA won't give you either "SA" permissions or "Bulk Admin" permissions, which are required to use BULK INSERT. DTS is almost always slower than either BULK INSERT or BCP, and there's that permissions thing, again.

    With all of that in mind, I usually use BCP, especially when loading millions of records. If you read up on all of the nifty parameters in Books Online, BCP can be set up to capture "failed" records to a separate file for further examination and possible correction, as well as other things.
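
    As a sketch, such a command might look like this (server, database, and file names are hypothetical; -f names the format file, -e names the error file, -m sets the error tolerance, and -T uses a trusted connection):

    bcp MyDb.dbo.StagingTable in \\servername\share\data.csv -f c:\temp\data.fmt -e c:\temp\data.err -m 10 -S servername -T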

    As most have found, and some have pointed out in the postings above, BCP requires a format file to handle fields wrapped in double quotes (which is actually a very common format). Here's an example format file to do such a thing...

    8.0
    10
    1  SQLCHAR 0   1 ""        0 FirstTerm    ""
    2  SQLCHAR 0 LLL "\",\""  NNN FirstField   ""
    3  SQLCHAR 0 LLL "\",\""  NNN SecondField  ""
    4  SQLCHAR 0 LLL "\",\""  NNN ThirdField   ""
    5  SQLCHAR 0 LLL "\",\""  NNN FourthField  ""
    6  SQLCHAR 0 LLL "\",\""  NNN FifthField   ""
    7  SQLCHAR 0 LLL "\",\""  NNN SixthField   ""
    8  SQLCHAR 0 LLL "\",\""  NNN SeventhField ""
    9  SQLCHAR 0 LLL "\",\""  NNN EighthField  ""
    10 SQLCHAR 0 LLL "\"\r\n" NNN NinthField   ""

    There are a couple of things about this format file that you need to know...

    1. The field count marked on the second row will always be 1 more than the actual number of fields in the file, to accommodate the first double quote.
    2. The third line (the first format line) handles the first double quote by assigning it to table column #0, meaning that it's simply discarded. Notice that there is no terminator assigned to this field because it is a single character.
    3. The "LLL" in each format line should be replaced with the field length; my recommendation is to always make this value 999 (see the filled-in example after this list). That way, if an incoming field is too long for the datatype of the related column in the target table, the row will be correctly rejected instead of the field being silently truncated. Along with the -m parameter (number of errors before failure; the default is 10), the -e parameter (location of the error file), and a few other BCP parameter settings, the rejected rows will be captured in an error file and all of the good rows will load successfully.
    4. The "NNN" represents the column number of the target table. A "0" in this column means that the field will simply be discarded. These column numbers do not need to be in the same order as the incoming file, which allows you to do a bit of field mapping between the incoming file and the target table, as you would with DTS.
    5. The column containing all of the "NthField" info is more for documentation than anything else... it will NOT affect which table column is loaded, but it IS very helpful if you use the correct column name.
    6. The last column of "" is for the collation sequence; as you can tell, no collation is actually required, but something must be there. Hence, the "".
    7. Since the first double quote is dropped, all we have to worry about is the terminator at the end of each field. Since we need "," for all of those, we have to build a terminator value. \" identifies a single double quote. We need two of those separated by a comma, so we get \",\". That, in itself, must be wrapped in double quotes, so we get "\",\"" for the intermediate terminators.
    8. The terminator of the last column is a double quote followed by the normal end-of-record marker (carriage return) and the new-line marker. Enclosed in double quotes, that gives "\"\r\n".
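
    To make the placeholders concrete, here is the same format file with "LLL" set to 999 and "NNN" mapped one-to-one to columns 1 through 9 of a staging table whose columns happen to be in file order (as noted later in this thread, BOL for SQL2K says the \" escape may not be honoured, so test this against your own build):

    8.0
    10
    1  SQLCHAR 0   1 ""        0 FirstTerm    ""
    2  SQLCHAR 0 999 "\",\""   1 FirstField   ""
    3  SQLCHAR 0 999 "\",\""   2 SecondField  ""
    4  SQLCHAR 0 999 "\",\""   3 ThirdField   ""
    5  SQLCHAR 0 999 "\",\""   4 FourthField  ""
    6  SQLCHAR 0 999 "\",\""   5 FifthField   ""
    7  SQLCHAR 0 999 "\",\""   6 SixthField   ""
    8  SQLCHAR 0 999 "\",\""   7 SeventhField ""
    9  SQLCHAR 0 999 "\",\""   8 EighthField  ""
    10 SQLCHAR 0 999 "\"\r\n"  9 NinthField   ""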

    Of course, you can add as many intermediate fields as you need. And, the BCP batch job can be scheduled using the Windows Task Scheduler (or any other scheduler) instead of having to bug the DBAs to set up a Job in SQL Server.

    It takes a minute or two per field (including figuring out the mapping of each field) but, in the long run, it's really worth it. It will be the fastest bulk insert you can write (especially if you temporarily set the "Select Into/Bulk Copy" option to "TRUE"), it provides error logging and error-record capture, and it really doesn't require much in the way of permissions, especially if the target table (which should usually be a staging table) has been blessed by the DBAs.
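
    On SQL Server 2000, that option can be toggled with sp_dboption (the database name here is hypothetical):

    -- enable minimally logged bulk operations for the duration of the load
    EXEC sp_dboption 'MyDb', 'select into/bulkcopy', 'TRUE'
    -- ... run the BCP load here ...
    EXEC sp_dboption 'MyDb', 'select into/bulkcopy', 'FALSE'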

    Even if you do have the necessary permissions to use BULK INSERT, it is almost always better to define the input file with a format file, so you can do things like ignore the first double quote and set up logging of input lines that are incorrectly formatted or that don't meet the datatype requirements of each column of the target table. Since I've done the "hard work" for you, all you have to do is a little CPR (cut, paste, replace) on the format file I posted above.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Thanks to everyone who has responded: I've got a lot of info now (and one working prototype).

     

  • Jeff,

    Nice solution, if only it were possible. As stated in BOL (SQL2K SP3), and borne out in practice, escaping double quotes (\") is not allowed with BCP.

    Far away is close at hand in the images of elsewhere.
    Anon.

  • Hi David,

    The example I gave was made generic from production code that I currently have working in the same fashion offered. Dunno what to say about the note in BOL. I'd never read that before, so it may be that it only works because I didn't know it couldn't be done, kinda like the bumble-bee that isn't supposed to be able to fly.

    --Jeff Moden



  • I can't get it to work at all

    Far away is close at hand in the images of elsewhere.
    Anon.

  • I'll find the production code and see if I did anything different...  sorry for the confusion.

    --Jeff Moden


