Importing only certain named columns from CSV

  • I have a client that is terrible about giving me consistent data sets, so there could be any number of columns in the data set, but the data sets always have the fourteen columns I need and the columns are labeled correctly. It's not a negotiable issue, the client is just going to give me however much data the client gives me and I have to pull out what I need. Unfortunately it has to be possible for someone else at my company to run this process if I'm not around, and the other person at my company who's been trying to do it has had difficulty using SSMS and the import/export tool consistently.

    I've been trying to figure out if there's some way I can do a front end (in Visual Studio) where the user can select a set of import files and click a button to pull them into a specific table. But every solution I can see (SSIS, Bulk Insert, Openrowset) seems to need a fixed number and order of columns. Thought about dynamically creating a temp table, bulk inserting into that and then pulling the data into the permanent table, but that seems unnecessarily complicated especially if I need to pull the first row of the CSV files to know the number and names of the columns.

    Anybody have experience doing something like this, or any ideas as to how I might go about it? Thanks.

  • I don't think it's unnecessarily complicated to import the data into an table, validate it and extract what you need.  It's called a staging table and that's the way I do all my importing of external data.  If you insert directly to your destination table, you don't have the ability to validate it or clean out the garbage you don't need.  I'd definitely go this route.

    If the columns aren't consistent, then you are looking at some type of dynamic solution.  I don't know what you're up against, so I can't know what to recommend.

  • Provided that at least the columns you need are always called the same name, I would recommend using a script component as your source. You would be able to loop through your header columns and find out what columns are the ones you need and then extract only those columns so that your source doesn't need dynamic output columns (which are a nightmare in SSIS).

    But, honestly, the best solution is to get the person supplying your data to agree a format and stick to it. I know you have mentioned it's unlikely, but it stops most of the problems.

    Thom~

    Excuse my typos and sometimes awful grammar. My fingers work faster than my brain does.
    Larnu.uk

  • achilleid - Friday, April 28, 2017 3:25 PM

    I have a client that is terrible about giving me consistent data sets, so there could be any number of columns in the data set, but the data sets always have the fourteen columns I need and the columns are labeled correctly. It's not a negotiable issue, the client is just going to give me however much data the client gives me and I have to pull out what I need. Unfortunately it has to be possible for someone else at my company to run this process if I'm not around, and the other person at my company who's been trying to do it has had difficulty using SSMS and the import/export tool consistently.

    I've been trying to figure out if there's some way I can do a front end (in Visual Studio) where the user can select a set of import files and click a button to pull them into a specific table. But every solution I can see (SSIS, Bulk Insert, Openrowset) seems to need a fixed number and order of columns. Thought about dynamically creating a temp table, bulk inserting into that and then pulling the data into the permanent table, but that seems unnecessarily complicated especially if I need to pull the first row of the CSV files to know the number and names of the columns.

    Anybody have experience doing something like this, or any ideas as to how I might go about it? Thanks.

    Heh.... you say that dynamically creating a temp table seems unnecessarily complicated but stop and think about the code you'd need to write to have it done by the front end.  I also find occasional find it much more pleasurable to pummel the client into submission to do things consistently by negotiating the proper data format and making it a part of the contract so that I'm not forever making free changes to fix their junk. 😉

    That being said, it is what it is and I've had to tolerate seemingly random column changes from data providers that I've had to depend on.  A lot of people don't like my methods because I only use what's available in T-SQL and some mistakenly call it a security risk.

    To make a much longer story shorter, I read the first row from the file as a blob and use a CSV splitter to separate and enumerate the column names.  Then I'll write a bit of T-SQL to create a BCP Format File (the old style, not the newer double-entry, unnecessarily-complicated XML style) which is setup to ignore (Server Column Order is set to "0" for the "ignore" columns) all but the columns I actually want to import (you can use either a CmdExec step in a job or xp_CmdShell directly to write the BCP Format File out to disk).  Then, I just do a simple BULK INSERT from the known file using the known BCP Format File into a standard staging table, which only contains the columns I want (and I'm with Ed... I ALWAYS use a staging to).

    Be aware that BCP, BULK INSERT, and many of the other methods default to allowing 10 errors before aborting the import.  If you want to capture the "humanized" error reporting that provide line #, field number, and a decent description of what went wrong, that's pretty simple to do, as well.  Post back if you can use xp_CmdShell.  I don't want to waste either of our time if you can't.  It would be a shame though... it works VERY well and I've trained the folks at work how to use it.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Thanks for the advice. While I'd prefer something that could magically read the file and pull only the columns I want, pulling the column names and building a BCP file to import to a staging table seems like a simple enough way to accomplish my goals here. I'll give it a shot and let you know how it goes!

  • While I agree that what you are looking to implement as a solution isn't horrible, I would agree with Thom that getting the client to commit to a repeatable process is the best solution. 
    You may need to break it down to dollars for them.  It will cost X for me to write this dynamic sql package vs. Y for us to do it manually every time vs. Z for your team to use the same query for export.

    Just my $.02.

    Regards,
    Matt

Viewing 6 posts - 1 through 5 (of 5 total)

You must be logged in to reply to this topic. Login to reply