Problems importing CSV files using SSIS.

  • Thanks. This solved my problem!

  • namakakiweyho (10/22/2011)


I just spent a whole weekend on this problem. My CSV file contains text data enclosed in double quotes, e.g. "blah",1,1/10/2011,"blah", along with numeric and date data. All fields are delimited by a comma, as the CSV format dictates. If your field contains a comma but is enclosed in double quotes, then you must specify (") in the Text Qualifier box on the Flat File Connection Manager Editor... BUT, if you have already experienced problems with the file import, then you must delete the flat file connection and create a new instance; otherwise, no matter what you do, SSIS will not recognise the text qualifier. It works!!

    Awesome dude thanks for posting this!

  • Okay, I know this is an old post, but I recently had the same issue and this was one of the first threads I found that described the issue. Now that my import is running, here is what I did.

The original issue in this thread is that SSIS has a hard time importing data that is only sometimes text qualified. This seems to be a pretty common way for Excel or MS Access to export data.

    In my case, I had a comma-delimited file with a field of numbers. Numbers less than 1000 were not text qualified. Numbers >= 1000 were written containing commas and text qualified with double quotes.

    So a sample set of records looked like this:

Color,Date,Purchases
Red,"Sep 1, 2014",500
Blue,"Sep 2, 2014",750
Green,"Sep 3, 2014","1,000"

    If you tell SSIS that the Purchases field is text qualified, it looks for the qualifier in the first two rows and fails to parse the data correctly. If you tell it the data is not text qualified, it finds the "extra" comma in the third row, and fails to parse correctly.
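    (As an aside, a parser that applies the text qualifier field by field copes with this layout fine. For instance, PowerShell's Import-Csv reads the sample above correctly; the file path here is just a hypothetical stand-in.)

    # Hypothetical path. Import-Csv applies the double-quote qualifier per
    # field, so a mix of qualified and unqualified Purchases values parses fine.
    Import-Csv 'C:\data\purchases.csv' | Format-Table Color, Date, Purchases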

    The solution that got me most of the way there was posted here:

    http://geekswithblogs.net/sathya/articles/how-to-import-and-export-csv-files-directly-in-ssis.aspx

    Steps:

Create a new Connection Manager.

    Select New OLE DB Connection.

    Select the provider Microsoft Jet 4.0 OLE DB Provider.

    After choosing that, on the left of the same window you will have two tabs to switch between: one is All (which lists the properties of the connection) and the other is the connection info.

    Choose All. You will see a property called ExtendedProperties.

    Into that, paste this, without the quotes: "text;HDR=Yes;FMT=Delimited" (I actually used FMT=CSVDelimited).

    There will be another property called Mode, which will show 16 (read-only); if you want to make it read/write, change it to 19.

    In the Server or file name text box on the connection tab, give the path of the CSV file, but only up to the folder name, not the CSV file itself. (Ex: if your file is c:\sathya\contact.csv, give it as c:\sathya.)

    You can use this OLE DB connection in the OLE DB Source task and the Lookup task, and also in the OLE DB Destination task if you have changed the mode to 19, i.e. read/write.

    When writing the query to fetch from the CSV, since you have given only up to the folder as the data source, you need to write the query like "select * from contact.csv".
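    If you want to sanity-check that connection string outside of SSIS, here is a minimal PowerShell sketch against the same Jet text driver. The folder and file names are just the example ones from the steps above, and Jet 4.0 is 32-bit only, so this needs a 32-bit host.

    # Jet 4.0 is 32-bit only: run this from a 32-bit PowerShell host.
    # Folder/file names are the example ones from the steps above.
    $connStr = "Provider=Microsoft.Jet.OLEDB.4.0;Data Source=c:\sathya;" +
               "Extended Properties='text;HDR=Yes;FMT=Delimited'"
    $conn = New-Object System.Data.OleDb.OleDbConnection $connStr
    $conn.Open()
    $cmd = $conn.CreateCommand()
    $cmd.CommandText = 'select * from contact.csv'   # the "table" is the file name
    $table = New-Object System.Data.DataTable
    $table.Load($cmd.ExecuteReader())
    $conn.Close()
    $table | Format-Table -AutoSize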

    At this point, my import ran. However, my Purchases data still came in as NULL values where the source contained commas. Progress.

    Then, based on this:

    http://kb.tableau.com/articles/knowledgebase/jet-incorrect-data-type-issues

    I went into the registry under HKLM\SOFTWARE\Wow6432Node\Microsoft\Jet\4.0\Engines\Text

    and changed the string value for ImportMixedTypes to Text.

For good measure, I also set the MaxScanRows value to 0 (scan all rows to determine the type).
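    In PowerShell terms, the same registry change looks like this (run elevated; the Wow6432Node path assumes 64-bit Windows hosting the 32-bit Jet engine):

    $key = 'HKLM:\SOFTWARE\Wow6432Node\Microsoft\Jet\4.0\Engines\Text'
    Set-ItemProperty -Path $key -Name ImportMixedTypes -Value 'Text'   # treat mixed columns as text
    Set-ItemProperty -Path $key -Name MaxScanRows -Value 0             # 0 = scan all rows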

    Data now imports as a string (with the commas).

In my case, I used a Derived Column transformation to strip the commas back out on the way in (something like REPLACE(Purchases,",","") in the SSIS expression language) so I can treat the field as a number too.

    Anyway, my 2 cents. Good luck.

I have added a PowerShell step that removes the extra " characters.

I use the | symbol as field separator, which is a regex special character, so the regex expression has a \ in front of the | to escape it.

Replace both escaped | occurrences if your delimiter character is not a regex special character (no escape is needed then).

    [font="Courier New"]$ExportFileLocation = 'C:\MyCsvFilesLocation\Exports'

    Get-ChildItem $ExportFileLocation -Filter *.csv | `

    Foreach-Object{

    #write-host $_.Fullname

    (get-content $_.Fullname -ReadCount 0) -replace '(?<!\|)"(?!\||$)',''| set-content $_.Fullname

    }

    [/font]
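    To see what the pattern keeps and strips, here is a quick check on a made-up line: the embedded quote goes, while the qualifiers at the | boundaries stay.

    # Made-up sample: the quote inside the first field is removed,
    # the qualifier quotes around the middle field survive.
    'ab"cd|"ef"|gh' -replace '(?<!\|)"(?!\||$)', ''
    # Output: abcd|"ef"|gh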

  • Rob de Vos (5/29/2015)


I have added a PowerShell step that removes the extra " characters.

    I use the | symbol as field separator, which is a regex special character, so the regex expression has a \ in front of the | to escape it.

    Replace both escaped | occurrences if your delimiter character is not a regex special character (no escape is needed then).

    $ExportFileLocation = 'C:\MyCsvFilesLocation\Exports'

    # Rewrite every CSV export in the folder.
    Get-ChildItem $ExportFileLocation -Filter *.csv |
    ForEach-Object {
        # Write-Host $_.FullName   # uncomment to trace progress
        # Strip each double quote that is not immediately preceded by |
        # and not immediately followed by | or end of line, i.e. keep the
        # qualifier quotes at the delimiter boundaries.
        (Get-Content $_.FullName -ReadCount 0) -replace '(?<!\|)"(?!\||$)', '' |
            Set-Content $_.FullName
    }

    So what happens if the full name is supposed to have double quotes in it?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

$_.FullName is the file name.

    It's the content of the file that the script converts, not the name.

    PowerShell will handle a filename with double quotes in it correctly.

We do lots of CSV imports and have found that | isn't a good delimiter character, as it can easily be mistyped into data when hitting the Shift key, or even used to "draw pictures" in data text fields. Now we use the character §, which can't be easily typed.

  • P Jones (6/2/2015)


We do lots of CSV imports and have found that | isn't a good delimiter character, as it can easily be mistyped into data when hitting the Shift key, or even used to "draw pictures" in data text fields. Now we use the character §, which can't be easily typed.

    Good tip.

    If you haven't even tried to resolve your issue, please don't expect the hard-working volunteers here to waste their time providing links to answers which you could easily have found yourself.

  • Phil Parkin (6/2/2015)


    P Jones (6/2/2015)


We do lots of CSV imports and have found that | isn't a good delimiter character, as it can easily be mistyped into data when hitting the Shift key, or even used to "draw pictures" in data text fields. Now we use the character §, which can't be easily typed.

    Good tip.

I'm still amazed that so much of what has been done with CSVs and TSVs is there to support spreadsheets and the fact that humans want to be able to view such files in the likes of Notepad.

    Way back when a lot of the folks on these fine forums weren't even born yet (I'm 62 and that includes me), there were simple non-printable single characters that were used VERY effectively for character-based transmissions such as CSV files, et al.

    Take a look at the ASCII table, a decent copy of which can be found at the following URL.

    http://www.asciitable.com/

    In particular, have a look at characters 28 through 31 (according to the DEC column). You have a file separator, a group separator, a record separator (a record being a line or row, as humans think of them), and a unit separator (which divides the record into units; we know these as "delimiters"). A CSV-like file could easily be created using the record separator (CHAR(30)) to replace the "newline" or CrLf combinations and the unit separator (CHAR(31)) to replace the comma delimiters. The neat thing about doing this is that you would NOT have to worry about whether or not commas were embedded in the actual data, and you wouldn't need text qualifiers (double quotes, usually) at all.

    The group separator (CHAR(29)) could be used to build hierarchical data, if you wanted to screw up a good thing, and the file separator (CHAR(28)) could be used to include a layout file (think header, with datatypes like the old dBASE III files used to have), followed by the data "file", followed by a summary file (think footer) that would give you the total number of rows, a total byte count, and an end-of-transmission character (CHAR(4) or CHAR(23)) to ensure that you got the whole file.

    Back in the old days, when transmission speeds were incredibly slow (I was thrilled to finally go from 110 baud to 300 baud), integer values could be sent as byte values, which would have the effect of greatly "compressing" the transmission for integers with more than 2 places.

    Personally, I think it makes the likes of XML and JSON look silly and really bulky for character-based transmissions. And, yeah... it even works with {gasp!} Unicode if you need it to.
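    For illustration, here is a quick PowerShell sketch of the idea (made-up rows, nothing SSIS-specific): fields joined by the unit separator, records joined by the record separator, no qualifiers needed even with commas in the data.

    # Unit separator (31) between fields, record separator (30) between records.
    $US = [char]31
    $RS = [char]30

    # Made-up rows; the embedded commas need no quoting at all.
    $rows = @(
        @('Red',   'Sep 1, 2014', '500'),
        @('Green', 'Sep 3, 2014', '1,000')
    )

    # Build the "file": fields joined by US, records joined by RS.
    $payload = ($rows | ForEach-Object { $_ -join $US }) -join $RS

    # Round-trip check: split records on RS, then fields on US.
    $payload -split $RS | ForEach-Object { ($_ -split $US) -join "`t" }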

    --Jeff Moden



Your argument against XML, JSON, and even CSV with delimiters available on a 101-key keyboard sounds very similar to that of the assembler programmer who used to sit down the hall from me and who rejected the likes of COBOL, SQL, and OO languages like C++. He's certainly retired by now, but his argument went that people were giving up control and pure speed by not coding in assembler, close to the CPU, and that moving towards all the "gook" sitting on top of assembler, acting as an abstraction layer to make computers easier to program, was never going to be as good. His codebase was slowly migrated to distributed systems running C#, T-SQL, and SOAP (XML) web services, and while performance did not measure up 1:1, it was comparable. More importantly, it opened a lot of doors for the business to extend their systems and widened the potential labor pool significantly.

I recently had a user make us write an SSIS loader for an XML file so they could review the data. This was requested primarily because Excel doesn't do a good job rendering XML files with hierarchical data structures 😛 We didn't spend a lot of time writing the loader, and the story is somewhat anecdotal, but in the majority of shops I have been around, if the file won't open in Excel, or the data can't easily be loaded into a database (like Access) with a full-featured UI allowing accept, edit, and reject functionality, then it's labeled "too techy" for most business users to deal with and gets kicked back to IT to make it more consumable.

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • Jeff Moden (7/3/2015)


    Phil Parkin (6/2/2015)


    P Jones (6/2/2015)


We do lots of CSV imports and have found that | isn't a good delimiter character, as it can easily be mistyped into data when hitting the Shift key, or even used to "draw pictures" in data text fields. Now we use the character §, which can't be easily typed.

    Good tip.

    I'm still amazed that so much of what has been done with CSVs and TSVs is there to support spreadsheets and the fact that humans want to be able to view such files in the likes of Notepad.

    Way back when a lot of the folks on these fine forums weren't even born yet (I'm 62 and that includes me), there were simple non-printable single characters that were used VERY effectively for character-based transmissions such as CSV files, et al.

    Take a look at the ASCII table, a decent copy of which can be found at the following URL.

    http://www.asciitable.com/

    In particular, have a look at characters 28 through 31 (according to the DEC column). You have a file separator, a group separator, a record separator (a record being a line or row, as humans think of them), and a unit separator (which divides the record into units; we know these as "delimiters"). A CSV-like file could easily be created using the record separator (CHAR(30)) to replace the "newline" or CrLf combinations and the unit separator (CHAR(31)) to replace the comma delimiters. The neat thing about doing this is that you would NOT have to worry about whether or not commas were embedded in the actual data, and you wouldn't need text qualifiers (double quotes, usually) at all.

    The group separator (CHAR(29)) could be used to build hierarchical data, if you wanted to screw up a good thing, and the file separator (CHAR(28)) could be used to include a layout file (think header, with datatypes like the old dBASE III files used to have), followed by the data "file", followed by a summary file (think footer) that would give you the total number of rows, a total byte count, and an end-of-transmission character (CHAR(4) or CHAR(23)) to ensure that you got the whole file.

    Back in the old days, when transmission speeds were incredibly slow (I was thrilled to finally go from 110 baud to 300 baud), integer values could be sent as byte values, which would have the effect of greatly "compressing" the transmission for integers with more than 2 places.

    Personally, I think it makes the likes of XML and JSON look silly and really bulky for character-based transmissions. And, yeah... it even works with {gasp!} Unicode if you need it to.

    ... look at characters 28 through 31 ...

You're reading my mind again, Jeff. 🙂

    This would make it hard to look at the data file and see what's going on, and would be hard to do manually, but it would have saved me lots of hours over the past few years.



    Alvin Ramard
Memphis PASS Chapter

    All my SSC forum answers come with a money back guarantee. If you didn't like the answer then I'll gladly refund what you paid for it.

For best practices on asking questions, please read the following article: Forum Etiquette: How to post data/code on a forum to get the best help

Indeed, I had a similar experience: a Fortran IV programmer, when confronted with Fortran 77, said the string type was useless; it's basically an integer array, so you might as well use that!

  • Orlando Colamatteo (7/7/2015)


Your argument against XML, JSON, and even CSV with delimiters available on a 101-key keyboard sounds very similar to that of the assembler programmer who used to sit down the hall from me and who rejected the likes of COBOL, SQL, and OO languages like C++. He's certainly retired by now, but his argument went that people were giving up control and pure speed by not coding in assembler, close to the CPU, and that moving towards all the "gook" sitting on top of assembler, acting as an abstraction layer to make computers easier to program, was never going to be as good. His codebase was slowly migrated to distributed systems running C#, T-SQL, and SOAP (XML) web services, and while performance did not measure up 1:1, it was comparable. More importantly, it opened a lot of doors for the business to extend their systems and widened the potential labor pool significantly.

I recently had a user make us write an SSIS loader for an XML file so they could review the data. This was requested primarily because Excel doesn't do a good job rendering XML files with hierarchical data structures 😛 We didn't spend a lot of time writing the loader, and the story is somewhat anecdotal, but in the majority of shops I have been around, if the file won't open in Excel, or the data can't easily be loaded into a database (like Access) with a full-featured UI allowing accept, edit, and reject functionality, then it's labeled "too techy" for most business users to deal with and gets kicked back to IT to make it more consumable.

    When I first heard of "XML", it was supposed to be a data "god-send" because no one would ever have to sit down and figure out how to store the data in a database, it would be 100% self documenting, and anyone would be able to look at it and instantly understand the data. Heh... might just be me but I'm thinking that's a whole lot further from the truth than anyone hearing those original "promises" could possibly have imagined.

    --Jeff Moden



  • Jeff Moden (7/8/2015)


    Orlando Colamatteo (7/7/2015)


Your argument against XML, JSON, and even CSV with delimiters available on a 101-key keyboard sounds very similar to that of the assembler programmer who used to sit down the hall from me and who rejected the likes of COBOL, SQL, and OO languages like C++. He's certainly retired by now, but his argument went that people were giving up control and pure speed by not coding in assembler, close to the CPU, and that moving towards all the "gook" sitting on top of assembler, acting as an abstraction layer to make computers easier to program, was never going to be as good. His codebase was slowly migrated to distributed systems running C#, T-SQL, and SOAP (XML) web services, and while performance did not measure up 1:1, it was comparable. More importantly, it opened a lot of doors for the business to extend their systems and widened the potential labor pool significantly.

I recently had a user make us write an SSIS loader for an XML file so they could review the data. This was requested primarily because Excel doesn't do a good job rendering XML files with hierarchical data structures 😛 We didn't spend a lot of time writing the loader, and the story is somewhat anecdotal, but in the majority of shops I have been around, if the file won't open in Excel, or the data can't easily be loaded into a database (like Access) with a full-featured UI allowing accept, edit, and reject functionality, then it's labeled "too techy" for most business users to deal with and gets kicked back to IT to make it more consumable.

    When I first heard of "XML", it was supposed to be a data "god-send" because no one would ever have to sit down and figure out how to store the data in a database, it would be 100% self documenting, and anyone would be able to look at it and instantly understand the data. Heh... might just be me but I'm thinking that's a whole lot further from the truth than anyone hearing those original "promises" could possibly have imagined.

I'm with you on that one, Jeff. The self-documenting aspect of XML is great. What else can I say is great? Hmmm....

What I really don't like about XML is that there are too many valid ways to store the same data in an XML file. Some of those ways don't seem to be ETL-friendly.
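    For example, here is one made-up record in two shapes, attribute-centric and element-centric; both are perfectly valid XML, and an ETL process has to be ready for either.

    # Two valid shapes for the same record.
    [xml]$a = '<row color="Red" purchases="500"/>'
    [xml]$b = '<row><color>Red</color><purchases>500</purchases></row>'
    $a.row.color   # "Red" from an attribute
    $b.row.color   # "Red" from an element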



    Alvin Ramard

  • Jeff Moden (7/8/2015)


    Orlando Colamatteo (7/7/2015)


Your argument against XML, JSON, and even CSV with delimiters available on a 101-key keyboard sounds very similar to that of the assembler programmer who used to sit down the hall from me and who rejected the likes of COBOL, SQL, and OO languages like C++. He's certainly retired by now, but his argument went that people were giving up control and pure speed by not coding in assembler, close to the CPU, and that moving towards all the "gook" sitting on top of assembler, acting as an abstraction layer to make computers easier to program, was never going to be as good. His codebase was slowly migrated to distributed systems running C#, T-SQL, and SOAP (XML) web services, and while performance did not measure up 1:1, it was comparable. More importantly, it opened a lot of doors for the business to extend their systems and widened the potential labor pool significantly.

I recently had a user make us write an SSIS loader for an XML file so they could review the data. This was requested primarily because Excel doesn't do a good job rendering XML files with hierarchical data structures 😛 We didn't spend a lot of time writing the loader, and the story is somewhat anecdotal, but in the majority of shops I have been around, if the file won't open in Excel, or the data can't easily be loaded into a database (like Access) with a full-featured UI allowing accept, edit, and reject functionality, then it's labeled "too techy" for most business users to deal with and gets kicked back to IT to make it more consumable.

    When I first heard of "XML", it was supposed to be a data "god-send" because no one would ever have to sit down and figure out how to store the data in a database,

    If you bought this one then the bridge pitch would have come right on its heels...

it would be 100% self-documenting,

    Ah, a technical point! This is what XSD is for and it delivers on the self-documenting promise in my opinion.

    and anyone would be able to look at it and instantly understand the data.

    If the bridge pitch didn't appear earlier, it certainly would after you bought into this one. How could anyone realistically make these claims about any technology? The claims are too subjective and primarily depend on the person trying to understand the data.

    Heh... might just be me but I'm thinking that's a whole lot further from the truth than anyone hearing those original "promises" could possibly have imagined.

Maybe it was oversold to you; I wasn't there. But XML does what it was designed to do. If you have to ship relational or hierarchical data structures around in files or data streams, I think XML is a wonderful option, one of the best going, actually.

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato
