Best approach to download variably named CSV file from FTP and insert into existing table?

  • I need to create a daily transaction that does the following:

    1: Downloads a file from a client's FTP site (note the directory will contain files named FILE_20150101.txt, FILE_20150102.txt, FILE_20150103.txt - so I'll need to figure out how to include logic so that the file with today's date is the file that's downloaded.

    2: The pipe separated values in the file will then be inserted into a table with today's date

    3: This data will then be appended to an existing table which contains all the data from all previous dates.

    Any guidance/advice is greatly appreciated. Thank you for your time.

  • This can all be done in SSIS (if that's the tool that you choose to use). Here's a general list of steps.

    variables:

    Current Date - as an expression to return current date formatted as you need

    Source File Name - as an expression and concatenate the date with the base file name that you are using

    Destination file name - as an expression and concatenate the file name and the date with the directory that you are going to download to.

    Dated table name - as an expression and concatenate your table name and date together to create the dated table name that you are importing to.

    Create table SQL - as an expression, concatenate your table name and the "CREATE TABLE" SQL coded needed to create the dated SQL table

    Connection Managers: FTP connection manager and SQL Server connection manager

    Tasks:

    FTP task: set the source as a variable and map it to your file name variable.Set the destination as a variable and map it your destination file name variable.

    SQL task: Create the dated table name here using your create table SQL variable

    Data Flow Task: task to import the downloaded file to the dated table name. Your destination table will be set to the table name variable. Add a step in here to copy the dated table name data to the single archive table. (or you can do this in a SQL task with the SQL statement loaded from a variable)

    I'd add in file archiving steps too. And I'd set up my SSIS package to use a configuration table. Make sure you set your package properties to delay validation since your objects won't exist at start of runtime.

    I'm pretty sure that there are several lengthy tutorials on how to do this process. I'd recommend more Googling on your part.

    Thanks,

    MWise

  • Oddly enough, I've never used SSIS for such a thing. Here are the steps I've taken in the past.

    1. Today's date is turned into the expected file name... using T-SQL.

    2. A batch file is created to retrieve the file from the FTP site and store it in a staging directory... using T-SQL and xp_CmdShell.

    3. The batch file is deleted... using T-SQL and xp_CmdShell.

    4. The pipe delimited file is properly parsed, imported, and partially cleansed/validated during the import into a staging table in SQL Server... using T-SQL and Bulk Insert.

    5. Final cleansing/validation is done on the staging table including marking rows as updates to existing rows, new rows, or garbage that needs repair... using T-SQL.

    6. Data is moved to the final table based on the marks established in step 5... using T-SQL.

    7. The garbage rows identified in step 5 are moved to a rework table or file and reported on... using T-SQL.

    8. The original file is zipped and moved to permanent archive storage... using T-SQL and xp_CmdShell.

    9. Since I didn't have to be present for any of that to happen, I was at the pub drinking beer and talking with friends... about T-SQL and xp_CmdShell. 😀

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Thank you both for your assistance. I'm probably going to go the SSIS route since we've a good amount of existing packages today (and xp_cmdshell is presently disabled). Thanks guys!

  • @MWise - I've a question I'm hoping you can assist with. I've done some searching but have been unable to come up with a solution so far.

    Thus far I've created the FTP Task to download the variable-named file, and then I've a BULK INSERT task afterward. The issue I'm having is with the BULK INSERT task - I don't see the ability to use variables to identify the file for the bulk insert.

    I'm thinking that either A) There's an option with the bulk insert task that I'm just not seeing, or B) I shouldn't be using the actual Bulk Insert task but should instead just rely on an Execute SQL task and put in a bulk insert statement with T-SQL to the tune of something like SELECT 'MYFILE_' + convert(varchar(500),GetDate(),112) + '.txt' for a @filename variable or somesuch.

  • You can set the SourceConnection property of the task to an expression which would just be your variable. There are two way to get there in the designer.

    The Expressions page of the Task and then click ellipses button to bring up the Properties Expressions Editor. Or via the Properties pane of the task. Click on the plus sign next to Expressions, then the ellipses button.

    Then select SourceConnection as the property to set and then you can set it to an expression. Click the ellipses button on the Expression side and click/drag your variable name to the Expression window.

    Again you'll want to make sure that you have the DelayValidation property set to True for the task if your file won't exist at the start of runtime.

    Thanks,

    MWise

  • Jeff Moden (1/2/2015)


    Oddly enough, I've never used SSIS for such a thing. Here are the steps I've taken in the past.

    1. Today's date is turned into the expected file name... using T-SQL.

    2. A batch file is created to retrieve the file from the FTP site and store it in a staging directory... using T-SQL and xp_CmdShell.

    3. The batch file is deleted... using T-SQL and xp_CmdShell.

    4. The pipe delimited file is properly parsed, imported, and partially cleansed/validated during the import into a staging table in SQL Server... using T-SQL and Bulk Insert.

    5. Final cleansing/validation is done on the staging table including marking rows as updates to existing rows, new rows, or garbage that needs repair... using T-SQL.

    6. Data is moved to the final table based on the marks established in step 5... using T-SQL.

    7. The garbage rows identified in step 5 are moved to a rework table or file and reported on... using T-SQL.

    8. The original file is zipped and moved to permanent archive storage... using T-SQL and xp_CmdShell.

    9. Since I didn't have to be present for any of that to happen, I was at the pub drinking beer and talking with friends... about T-SQL and xp_CmdShell. 😀

    Hey - that's my exact process except for #7 and #8 because I don't process the file if it contains any garbage. I especially enjoy #9, knowing that everything works. 😉

  • Herpington_McDerpington (1/6/2015)


    @MWise - I've a question I'm hoping you can assist with. I've done some searching but have been unable to come up with a solution so far.

    Thus far I've created the FTP Task to download the variable-named file, and then I've a BULK INSERT task afterward. The issue I'm having is with the BULK INSERT task - I don't see the ability to use variables to identify the file for the bulk insert.

    I'm thinking that either A) There's an option with the bulk insert task that I'm just not seeing, or B) I shouldn't be using the actual Bulk Insert task but should instead just rely on an Execute SQL task and put in a bulk insert statement with T-SQL to the tune of something like SELECT 'MYFILE_' + convert(varchar(500),GetDate(),112) + '.txt' for a @filename variable or somesuch.

    I don't know about SSIS, but in T-SQL you can't use a variable for the filename. I use a known, constant filename. You may be able to use the same approach in SSIS. Here are my steps:

    1. If the known file already exists, delete it with xp_cmdshell.

    2. Rename the variably-named file to the known filename with xp_cmdshell.

    3. Fire the BULK INSERT with the known filename.

    I'm left with the original file (variably-named) and the known one. The next time the procedure runs, I'm left with the same thing, so all downloaded files continue to reside on the server if I ever need to refer to them. The archive process is separate.

  • Herpington_McDerpington (1/6/2015)


    Thank you both for your assistance. I'm probably going to go the SSIS route since we've a good amount of existing packages today (and xp_cmdshell is presently disabled). Thanks guys!

    That's a shame. Done correctly, only the system and DBAs can use it directly and it's an incredibly powerful tool. Disabling it does nothing for security if a bad guy gets into the system with "SA" privs (their attack software takes microseconds to turn it on) and, done correctly, having it on doesn't provide any non-"SA" users with privs they wouldn't normally have.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Herpington_McDerpington (1/6/2015)


    @MWise - I've a question I'm hoping you can assist with. I've done some searching but have been unable to come up with a solution so far.

    Thus far I've created the FTP Task to download the variable-named file, and then I've a BULK INSERT task afterward. The issue I'm having is with the BULK INSERT task - I don't see the ability to use variables to identify the file for the bulk insert.

    I'm thinking that either A) There's an option with the bulk insert task that I'm just not seeing, or B) I shouldn't be using the actual Bulk Insert task but should instead just rely on an Execute SQL task and put in a bulk insert statement with T-SQL to the tune of something like SELECT 'MYFILE_' + convert(varchar(500),GetDate(),112) + '.txt' for a @filename variable or somesuch.

    In all its wisdom, MS never provided such a nicety with BULK INSERT, OpenRowSet, or any of the more powerful tools. Instead, you have to build the command using dynamic SQL and then execute the dynamic SQL. So, no... you're not missing anything.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Double post - removed.

  • Excellent, I've now got it set up with the SourceConnection expression and it's showing the correct path/filename....however...upon saving the project it immediately throws a non-fatal error indicating that it cannot find the sourceconnection (the file) and quotes the file path/name.

    I double checked the path and the file is definitely present in the directory and matches what the expression is targeting as the SourceConnection exactly.

    I will note that the sourceconnection file path is on a drive separate from the drive that the SSIS package is saved on - so I've concatenated a few variables for the expression to come out as E:\Folder1\FOLDER2\Filename.txt

    Any thoughts on that?

  • I'm just now realizing what you said, Jeff - that I simply can't use a variable name with BULK INSERT - in T-SQL or SSIS (since it's simply executing the same code).

    That being said - T-SQL and XP_CMDShell it is.

  • You need to make sure that the package setting and connection settings for DelayValidation is set to TRUE. Also, I would stay away from mapped drives unless your server has the same drive mappings as your DEV machine.

    MWise

  • Herpington_McDerpington (1/8/2015)


    I'm just now realizing what you said, Jeff - that I simply can't use a variable name with BULK INSERT - in T-SQL or SSIS (since it's simply executing the same code).

    That being said - T-SQL and XP_CMDShell it is.

    Just checking. How are things going on this?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 15 posts - 1 through 15 (of 17 total)

You must be logged in to reply to this topic. Login to reply