Importing a CSV into SQL Server Shouldn't Be This Hard

  • Dawesi wrote:

    Jeff Moden wrote:

    On the subject of "sometimes" quoted CSV's...

    I totally agree that it shouldn't be necessary.  MS has had more that 2 decades to make this "simple" stuff right as have a great many others.

    Still, it's a bloody damned shame that we have to such silly workarounds in this day and age but consider how long we've also been looking for a BULK EXPORT tool.

    Microsoft Log Parser... updated in 2020, been around for decades.... #dropsmic

    Thanks.  I'll check it out.  It is callable from the command line, correct?

     

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Stopped using the import wizard back in SQL 2005, when they BROKE the text qualifier.  It worked in SQL 2000, was broken in 2005, I stopped testing for it in 2012 (and it was still broken).

    I too get the truncation error you mention with the screenshot.  Just another reason to not use their tools.  However, in attempting to reproduce your experience with the import wizard, I see they finally fixed the text qualifier issue in SQL 2017 ... so there's that.  *eye-roll*

    • This reply was modified 3 years, 2 months ago by  thisisfutile.
  • Lots of great comments.

    I completely understand poorly formatted CSVs or varying structures, like ragged right, causing issues.

    These were simple, consistently formatted files. I can't believe this was hard.

  • Steve Jones - SSC Editor wrote:

    I completely understand poorly formatted CSVs or varying structures, like ragged right, causing issues.

    These were simple, consistently formatted files. I can't believe this was hard.

    No they weren't... at least not by my standard of "consistently formatted files'.  Consider what you said in the article...

    When I looked at the Venmo stuff, it was a lot of fields, with a header row containing the starting balance and a final row with the ending balance. There were also some lovely sections in the middle with interesting characters. Here's an example of a field inside the CSV.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • I'm a little late reading this article, but I've had the same problem in the past. One of my solutions was to import the CSV to my PC, then under regional settings, change the comma to a pipe and save as a CSV document. The file is now pipe delimited. I can then import the file much more quickly.

    In windows 10 this setting is under Region-> Numbers-> List Seperator. Change this from a comma to a pipe, then save your imported file back out as CSV. The data is now pipe delimited. Not ideal for large flat files, but for one off's , this has saved me a lot of headaches. Once completed, change your list seperator back to comma.

    Thanks,

    Steve M

  • sslsm51 wrote:

    The file is now pipe delimited. I can then import the file much more quickly.

    How does just changing the delimiter from one character to another make anything easier?  The only time it would would be when you do special handling for "field embedded delimiters" and, if you have code that can do that, it won't need to make such a change because it'll already be able to handle it.

    If you're the creator of the file, then yes... that'll help someone bigtime.  Just remember that Pipe characters can also be used in field data.

     

    p.s.  Welcome aboard!

     

     

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • I supppose that I should have stated that I worked for a medical client service provider. We would recieve all kinds of client files from IS teams to Marketing, Finance departments and anything in between. The problem I faced was malformed CSV exports, where the file did not use text for fields like address and name that might contain a comma. Also incorrect dollar formatting. For one offs, I would adjust the file to pipe delimited. This did not solve the problem of too many splits in certain lines, but we had a script to seperate the rows with column length > len field names. The errors were easier to spot and correct with a pipe in them. With a one off file, most of the time the corrections were done manually, unless we expected to get the request again. I quess that after importing files daily for years, I never gave it much thought about the lengthy process required for flat files

    • This reply was modified 3 years, 1 month ago by  sslsm51.
  • If it's done right on both the part of the sender and the receiver, it's actually not a "lengthy process".  It can be nasty fast and super easy to setup and execute.  It's when people either get lazy or try to get "creative" that "flat files" cause any pain.  Oddly enough, I blame most of that on people that use and abuse spreadsheets and then let that bleed over into other areas.

    I also miss the old days when people actually did care about how my bytes they needed to transmit and did things like sending what some called "packed fields" (1, 2, 3, or 4 byte ints, for example) for dates, times, dollars, and a bunch of other stuff instead of sending all these spreadsheet compatible, human readable, junk.  And, yeah... we didn't use Cr/LF or tabs or any of that.  We used ASCII characters 28 through 31 and a lot of "fixed width fields".  And we never had to work around column headers in the data file or any of that junk because it was either contained in a separate very short meta-data file that contained the "Record Layout" (which even included the datatypes).  There was also a separate manifest file that contained all the particulars like number of rows, contact info, etc, etc.  And all of those files followed a strict but flexible format that could easily be imported and parsed, as well,

    What's absolutely astounding to me is that when people are sending data back and forth between SQL Server instances (if they don't want or need replication, which actually does the same thing), they still do crazy stuff like converting the data to XML or Jason or TSV or CSV or you name the atrocity du jour instead of send the data in SQL "native" format, which has that "packed" stuff I was talking about (it's really all the binary representation of the data/datatype) and it'll auto-magically create a BCP format file (the "Record Layout" meta-data file) for you.

    So, getting back to the subject, none of this stuff is harder than it should be.  Most of it, if you let gravity happen, is as easy as falling off a proverbial wet moss log.  The problem occurs when people get involved.  Heh.... funny how that's the case with a whole lot of computer-related stuff. 😀

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • It still blows my mind that this is so hard with the standard tools.

    Importing data from large CSVs - sometimes hundreds of them - all with different unknown file layouts was at one point a major part of my job because I did data conversions from various other legacy systems, not all of which would or could just give you a database backup of the old system. The tools Microsoft provides are pretty terrible, and the company was a nonprofit which flirted with the idea of buying a special tool but never ended up selecting & paying for one, and the hosting servers I was working on didn't allow me to install extra tools -- even, believe it or not, SSIS! -- because it was against the security policy. The process had to be repeatable a couple of times for usually two test conversions and a final live conversion, but once the client was converted, the next client would have a whole different set of unpredictable files to deal with.

    One thing I ended up doing was writing a Python script to combine the CSVs into Excel files with tabs that were named after the file names (but not too long). I had the script do them in batches so a reasonable number of tabs/amount of data would end up in each Excel file, so the Excel file didn't become insanely huge. I would do this on my computer before uploading the Excel to the server where I wasn't allowed to install programs. Then I would import from the Excel, which Microsoft's tool was better at - and it would automatically name the tables after each tab of the Excel, which had the CSV file names which were usually the table names I wanted.

    The Excel importer had its own quirks - like cutting off long notes fields, for example, and some weirdness where dates would sometimes end up importing as the Excel number representation of the date. The date thing could often be fixed by going into the Excel before importing it and letting Excel know the field's data type. And for zip codes, you often wanted to make sure you told Excel it was a text field so it wouldn't cut off leading zeroes. (That's one reason to keep the Excel files not-insanely-huge.) But the Excel importer was way smarter & faster and could do many tabs to many tables.

    • This reply was modified 3 years, 1 month ago by  Andrea C..
  • @andrea - your issues were not actually related to importing CSV files into SQL Server.  This all goes back to the sender and how the files are generated.  If the files are generated in a standardized format and are consistent - then it is quite simple to setup and create an import process that works every time.

    I have seen this problem - a lot.  It is the expectation of the sender that they can send you whatever they want, in whatever format - change it at any time and it will somehow magically be imported correctly.  What they never realized and are never told is that it takes many, many, many man hours to 'fix' the files so they can be imported.

    It doesn't even matter what format - you can define a fixed width file, ragged-right, CSV, TSV, XML, JSON or any other file format and if the data isn't consistent and the format changes it will be difficult to import that data.  But if the format is consistent and doesn't change - then once you have built the import process it just works.

    Notice I didn't identify a tool - that is because the tool doesn't matter.  BCP/SSIS/Powershell/Informatica/Seebeyond/Ensemble - every single one of them will break if the format changes.  In every single one of these tools, you can accommodate some level of change - but change the structure of the file and you will have issues.

    Jeffrey Williams
    “We are all faced with a series of great opportunities brilliantly disguised as impossible situations.”

    ― Charles R. Swindoll

    How to post questions to get better answers faster
    Managing Transaction Logs

  • Asking for and getting a consistent format for incoming data from multiple clients is often wishful thinking. Sure, if you've done previous conversions from a system(s) that is the source of the incoming data then you already have this in place, but if the incoming data is from a proprietary or obscure system, then the rules change. It is worse if the prior vendor/provider are no longer in business and the client is lucky to be able to export anything. On top of this, if the company you work for really needs new business, they will relax requirements for incoming data and eat the time. Data conversions have never been a profit center in my experience, but that is more of a management issue. So, yes, what was written wasn't really related to SQL Server import tools as much as it is related to the pitfalls of having to do conversions. On that, I can relate.

    Cheers

  • jfogel wrote:

    Asking for and getting a consistent format for incoming data from multiple clients is often wishful thinking. Sure, if you've done previous conversions from a system(s) that is the source of the incoming data then you already have this in place, but if the incoming data is from a proprietary or obscure system, then the rules change. It is worse if the prior vendor/provider are no longer in business and the client is lucky to be able to export anything. On top of this, if the company you work for really needs new business, they will relax requirements for incoming data and eat the time. Data conversions have never been a profit center in my experience, but that is more of a management issue. So, yes, what was written wasn't really related to SQL Server import tools as much as it is related to the pitfalls of having to do conversions. On that, I can relate.

    Heh... it reminds me of the old adage that "The database would be perfect if it weren't for the damned users". 😀  It's the same with everything... it takes humans to screw up computers.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeffrey Williams wrote:

    It doesn't even matter what format - you can define a fixed width file, ragged-right, CSV, TSV, XML, JSON or any other file format and if the data isn't consistent and the format changes it will be difficult to import that data.  But if the format is consistent and doesn't change - then once you have built the import process it just works.

     

    At least when you're dealing with XML or JSON the average end users is too intimidated by the format(and usually lacks the tools anyways) to directly mess and generate the files....  which usually forces some kind of automation into things.  Excel let's anyone think they can create csv and it'll work just as well for any tool as it does in excel.

    • This reply was modified 3 years, 1 month ago by  ZZartin.
  • I know I don't need to tell you, but this could turn in to a whole other thread. Of course, working for smaller companies, I've contributed to this by making the database do things that should have been done in software/apps, but we all find ourselves having to please our masters at some point. Even if they are crazy/clueless.

    Cheers

  • Hello,

    usually I import my txt-files with the "Import flat file..." option in SSMS. But in versions later than 17.9.1 SSMS recognizes the first data row as header. Duplicating the header or inserting empty lines/rows in the txt-file didn't help. The file originally is saved by Excel or SPSS but it makes no difference if it is saved again with e.g. notepad++.

    Has anyone an idea what is wrong with flat file import? I don't find nothing on the internet, I searched several times, in vain.

    Thanks a lot for this article! I know now that there are other options (including the one from jcasement) than "Import Data..." what might work well if settings for up to 400 columns are made correctly...

Viewing 15 posts - 46 through 60 (of 61 total)

You must be logged in to reply to this topic. Login to reply