Problems using BULK INSERT with a CSV file that has inconsistent quotes

  • I have data that looks like this (the first row contains headers):

    id, company, rep, employees

    729216,INGRAM MICRO INC.,"Stuart, Becky",523

    729235,"GREAT PLAINS ENERGY, INC.","Nelson, Beena",114

    721177,GEORGE WESTON BAKERIES INC,"Hogan, Meg",253

    In other words, only values that contain a comma are wrapped by quotes.

    Because the quotes aren't consistent, I can't use '","' as a delimiter.

    I tried using ',' as a delimiter and loading the data into a temporary table where every column is a varchar, then using some post-processing to strip out the quotes, but that doesn't work either, because the fields that contain ',' are split into multiple columns.
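That split failure is easy to reproduce outside the database; a quick sketch (Python used here purely for illustration):

```python
# A naive split on ',' ignores the quoting, so a quoted value that
# contains a comma is broken across two columns.
line = '729216,INGRAM MICRO INC.,"Stuart, Becky",523'

fields = line.split(',')
print(len(fields))   # 5 fields instead of the expected 4
print(fields[2:4])   # ['"Stuart', ' Becky"'] - the rep value is split
```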

    Unfortunately, I don't have the ability to manipulate the CSV file beforehand. And I can't use DTS, because this needs to be part of an automated process.

    In short, it's part of an application where users can upload CSV files to the server, and have those files loaded into a database.

    Is this hopeless?

    Many thanks in advance for any advice.

  • It looks like the quotes are around a consistent set of columns, in your example. If that's so, then OpenRowSet with an XML-type configuration file should be able to deal with that correctly. You can define the column terminator for each column individually in that case. Might do exactly what you need.
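For reference, the per-column-terminator idea maps to an OPENROWSET(BULK ...) XML format file. The sketch below is illustrative only: it assumes the quoting is fixed per column (rep always quoted, the others never), and the column lengths are made up.

```xml
<?xml version="1.0"?>
<BCPFORMAT xmlns="http://schemas.microsoft.com/sqlserver/2004/bulkload/format"
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <RECORD>
    <!-- terminators that include the quote characters consume them,
         so the quotes never reach the destination columns -->
    <FIELD ID="1" xsi:type="CharTerm" TERMINATOR=","       MAX_LENGTH="12"/>
    <FIELD ID="2" xsi:type="CharTerm" TERMINATOR=",&quot;" MAX_LENGTH="50"/>
    <FIELD ID="3" xsi:type="CharTerm" TERMINATOR="&quot;," MAX_LENGTH="50"/>
    <FIELD ID="4" xsi:type="CharTerm" TERMINATOR="\r\n"    MAX_LENGTH="12"/>
  </RECORD>
  <ROW>
    <COLUMN SOURCE="1" NAME="id"        xsi:type="SQLINT"/>
    <COLUMN SOURCE="2" NAME="company"   xsi:type="SQLVARYCHAR"/>
    <COLUMN SOURCE="3" NAME="rep"       xsi:type="SQLVARYCHAR"/>
    <COLUMN SOURCE="4" NAME="employees" xsi:type="SQLINT"/>
  </ROW>
</BCPFORMAT>
```

If a column is only sometimes quoted, a fixed terminator like this can't describe it, which is where this approach can break down.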

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • GSquared,

    Many thanks for your reply.

    Actually, the quotes aren't quite that consistent.

    Some columns will ALWAYS be wrapped in quotes, since their values will always contain a comma.

    However, a few columns won't be consistent - their values will only occasionally contain commas, so they'll only occasionally have quotes.

    Does that mean I can't use OpenRowSet?

    Cheers,

    Matt

  • In that case, you're almost going to have to handle it row-by-row. You might be able to build an algorithm for splitting it that would work on the whole set at once, but it's probably going to be more efficient to use .NET regex functions or something of that sort.
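As an illustration of that regex approach (sketched in Python here rather than .NET, but the pattern carries over unchanged): split on a comma only when an even number of quote characters remains ahead of it, i.e. only when the comma sits outside any quoted value.

```python
import re

# A comma is a real delimiter only if the rest of the line contains an
# even number of '"' characters; otherwise the comma is inside quotes.
SPLIT_RE = re.compile(r',(?=(?:[^"]*"[^"]*")*[^"]*$)')

def split_csv_line(line: str) -> list[str]:
    """Split one CSV line with optional quoting; strips the outer quotes.
    Note: does not handle embedded escaped quotes ("")."""
    return [field.strip('"') for field in SPLIT_RE.split(line)]

print(split_csv_line('729235,"GREAT PLAINS ENERGY, INC.","Nelson, Beena",114'))
# ['729235', 'GREAT PLAINS ENERGY, INC.', 'Nelson, Beena', '114']
```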

    - Gus "GSquared"

  • Try this (you'll need to change the hard-coded path to your text file to import):

    -- the final results end up in this table variable
    declare @Import TABLE (
        ID int,
        Company varchar(50),
        Rep varchar(50),
        Employees int)

    -- create a temporary table to do all the dirty work
    if OBJECT_ID('tempdb..#BulkImport') is not null DROP TABLE #BulkImport
    CREATE TABLE #BulkImport(
        LineText varchar(100)
    )

    -- perform a bulk insert of the text file.
    --******* CHANGE THE HARD-CODED PATH\FILENAME FOR YOUR SYSTEM
    BULK INSERT #BulkImport FROM 'C:\SQL\Test.txt'

    -- get rid of the header line
    delete from #BulkImport where LineText like 'id%'

    -- add necessary columns to the temporary table
    -- (if these columns were part of the table definition, the bulk insert would fail)
    ALTER TABLE #BulkImport ADD FirstOne int
    ALTER TABLE #BulkImport ADD SecondOne int
    ALTER TABLE #BulkImport ADD ThirdOne int
    ALTER TABLE #BulkImport ADD FourthOne int

    -- declare some working variables
    declare @first int, @second int, @third int, @fourth int

    -- perform Phil Factor's "quirky update"
    -- see http://www.simple-talk.com/content/print.aspx?article=446
    update #BulkImport
       set @first  = FirstOne  = CHARINDEX(',', LineText),
           @second = SecondOne = case when SUBSTRING(LineText, @first+1, 1) = '"'
                                      then CHARINDEX('"', LineText, @first+2)
                                      else CHARINDEX(',', LineText, @first+2) end,
           @third  = ThirdOne  = CHARINDEX('",', LineText, @second+1),
           @fourth = FourthOne = CHARINDEX(',', LineText, @third+1)

    -- put this all into the table variable
    insert into @Import
    select LEFT(LineText, FirstOne-1),
           case when SUBSTRING(LineText, FirstOne+1, 1) = '"'
                then SUBSTRING(LineText, FirstOne+2, SecondOne-FirstOne-2)
                else SUBSTRING(LineText, FirstOne+1, SecondOne-FirstOne-1) end,
           SUBSTRING(LineText,
                     SecondOne + case when SUBSTRING(LineText, SecondOne, 1) = '"' then 3 else 2 end,
                     ThirdOne - SecondOne - case when SUBSTRING(LineText, SecondOne, 1) = '"' then 3 else 2 end),
           SUBSTRING(LineText, FourthOne+1, LEN(LineText))
      from #BulkImport

    -- show the results
    select * from @Import

    -- drop the temp table
    DROP TABLE #BulkImport

    Thank you Phil and Jeff for teaching this to me! 🙂

    Wayne
    Microsoft Certified Master: SQL Server 2008
    Author - SQL Server T-SQL Recipes


    If you can't explain to another person how the code that you're copying from the internet works, then DON'T USE IT on a production system! After all, you will be the one supporting it!
    Links:
    For better assistance in answering your questions
    Performance Problems
    Common date/time routines
    Understanding and Using APPLY Part 1 & Part 2

  • Actually.... let's just say the data from the original post was stored on the server in C:\Temp and the filename was called SomeFile.txt

    Then, you can do a little SQL prestidigitation using Linked Servers...

    --===== Create a text-based linked server.
    EXEC dbo.sp_AddLinkedServer TempTextServer,
         'Jet 4.0',
         'Microsoft.Jet.OLEDB.4.0',
         'C:\Temp', --LOOK!!! THIS IS THE DIRECTORY YOU WANT TO POINT AT!!!!
         NULL,
         'Text'
    GO

    --===== Set up the login (change the first NULL to a valid login name if you don't want SA)
    EXEC dbo.sp_AddLinkedSrvLogin TempTextServer, FALSE, NULL, NULL
    GO

    --===== List the file names available in the new text-based linked server.
    --      Notice that the "dot" in file names has been replaced by a # sign.
    EXEC dbo.sp_Tables_Ex TempTextServer
    GO

    --===== Query the file as if it were a table. Notice that the quotes and commas
    --      are handled automatically.
    SELECT *
      FROM TempTextServer...[SomeFile#txt]

    Here's what the output of the original file looks like from the last SELECT above:

    id          company                      rep            employees
    ----------- ---------------------------- -------------- -----------
    729216      INGRAM MICRO INC.            Stuart, Becky  523
    729235      GREAT PLAINS ENERGY, INC.    Nelson, Beena  114
    721177      GEORGE WESTON BAKERIES INC   Hogan, Meg     253

    (3 row(s) affected)

    Of course, since it works like a table, you can do SELECT/INTO, INSERT/SELECT, or whatever you need to do.

    There's no doubt about it... it WILL be slower than Bulk Insert... but it will work without having to write special parsing code.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Leave it to Jeff to find another way with less code.:rolleyes:

    Wayne

  • Jeff and WayneS,

    Just wanted to thank you both for these fantastic solutions.

    They both work great (but you knew that already).

    The other option I've been messing with is using RegExp to pre-process the file - loading the entire file into memory, replacing the comma delimiters with tabs, removing unnecessary quotes, and rewriting it. This allows me to use the standard "Bulk Insert".

    So, I guess the question is - which is the most efficient, and places the least stress on the server?

    FWIW - I expect the file will typically be between 5,000 and 20,000 rows, about 20 fields, and typically between 500 KB and 5 MB.

    Cheers,

    Matt

  • stuehler (4/23/2009)


    Jeff and WayneS,

    Just wanted to thank you both for these fantastic solutions.

    You're welcome.

    So, I guess the question is - which is the most efficient, and places the least stress on the server?

    FWIW - I expect the file will typically be between 5,000 and 20,000 rows, about 20 fields, and typically between 500 KB and 5 MB.


    Matt,

    20 fields... my way would require significant changes (though it should be straightforward). Jeff's way should handle the file without changes.

    5k-20k rows... Jeff said that my way would be faster and more efficient (to answer the first question). But I'd set up a file with 20k entries and try his way out first. If that speed is acceptable, it's a lot easier. Plus, my way uses a feature of the UPDATE statement that isn't seen frequently, so it would be confusing to many people. (Which is why I included the link to explain it.)

    Wayne

  • WayneS (4/23/2009)


    Leave it to Jeff to find another way with less code.:rolleyes:

    Heh... I'm just lazy... I hate typing lots of code.

    I learned how to do this just for the occasional ad hoc import of data for test purposes. I needed a way for it to be repeatable without taking the time to write a "system" to handle it. For production systems, I typically launch pork chops at the folks providing the data until they can provide it in a correct and easily consumable format.

    --Jeff Moden



  • Jeff Moden (4/23/2009)


    Actually.... let's just say the data from the original post was stored on the server in C:\Temp and the filename was called SomeFile.txt

    Then, you can do a little SQL prestidigitation using Linked Servers...


    Sorry to necrobump this thread, but Jeff, this is a lifesaver. It has been working for me pretty well, but I just ran into a file wherein this system loads one particular value ("1,237") as NULL. I cannot figure out why. There's another comma-containing number that loads just fine ("1,559"). Any thoughts on troubleshooting I could do to see what is happening?

    Also, since Jet is deprecated, I was using the following command to create the server, but I don't know if that makes a difference:

    EXEC dbo.sp_AddLinkedServer TempTextServer,
         'MSDASQL',
         'Microsoft.ACE.OLEDB.12.0',
         'C:\inetpub\uploads',
         NULL,
         'Text'

    Thanks!

    -- Dave
