What's the best practice to auto import files from SFTP

  • This is a real challenge for me: I've been assigned folders (yes, plural, more than one folder) on an SFTP server, each corresponding to one client. Each folder is used to hold the spreadsheets or CSV files that client sends.

    What I need to do is grab those files and import them into our own SQL Server (2012).

    I'm able to automate the sync process that retrieves files from the SFTP server to a NAS location.

    What I'm more concerned about is whether the import of those spreadsheets/CSVs into SQL can be automated. If that's too complex to write as an SSIS package, is there a third-party product that can do it?

    When dealing with the spreadsheets, the sheet names and column names are just too diverse. I've seen a client spreadsheet with 30 columns where some of the names contain special characters like ' @ # $ %, so that also has to be handled in the import. It's tedious but doable when I import them manually.

    Thank you very much.

  • halifaxdal (11/11/2016)

    Some thoughts (read: rant) on the subject (I have done this kind of task many times).

    😎

    In my experience, when dealing with a diverse set of files, ELT (Extract, Load, Transform) tends to lead to a much more manageable solution than ETL (Extract, Transform, Load). The two variants of ELT I use most often are either to load each line of the import, without any splitting, into a single column, or to split the lines into a generic multi-column table. The latter was made a lot easier by the introduction of sparse columns, especially when dealing with thousands of columns.
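
    To make the first variant concrete, here is a minimal T-SQL sketch of the single-column load. Every object name, the file path and the terminator byte are illustrative assumptions, not anything from this thread:

        -- Land whole lines unsplit: FIELDTERMINATOR is set to a byte (0x0B)
        -- that should never appear in the data, so nothing gets split.
        CREATE TABLE dbo.ImportRaw (RawLine nvarchar(max) NULL);

        BULK INSERT dbo.ImportRaw
        FROM '\\nas\client_a\orders.csv'
        WITH (ROWTERMINATOR = '\n', FIELDTERMINATOR = '0x0b', TABLOCK);

        -- Then copy into a staging table that records where each line came
        -- from, so the transform step can always trace a row back to its file.
        CREATE TABLE dbo.ImportStage
        (
            ImportStageID bigint IDENTITY(1,1) PRIMARY KEY,
            SourceFile    nvarchar(260) NOT NULL,
            RawLine       nvarchar(max) NULL
        );

        INSERT dbo.ImportStage (SourceFile, RawLine)
        SELECT N'\\nas\client_a\orders.csv', RawLine
        FROM dbo.ImportRaw;

    The transform step then parses RawLine inside SQL Server, where string handling, error logging and set-based corrections are all available.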

    There can be many variations within Excel spreadsheets (different versions, multiple sheets, column names, leading blank rows and columns, etc.) that make the import challenging. To reduce the problem, first look into how you can limit the variation by applying rules to the business, e.g. "Excel spreadsheets must be saved as CSV". This will save you a lot of effort.

    Excel spreadsheets can be "read" in T-SQL for sheet names, see this post here; the method may not be for the faint-hearted ;-).
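
    For reference, one way this is commonly done is via a temporary linked server. This sketch assumes the Microsoft ACE OLE DB provider is installed on the server and that you have permission to create linked servers; the server name and file path are placeholders:

        -- Point a temporary linked server at the workbook...
        EXEC sp_addlinkedserver
             @server     = N'XL_CLIENT_A',
             @srvproduct = N'Excel',
             @provider   = N'Microsoft.ACE.OLEDB.12.0',
             @datasrc    = N'\\nas\client_a\orders.xlsx',
             @provstr    = N'Excel 12.0;HDR=YES';

        -- ...then each sheet appears as a "table" named like Sheet1$.
        EXEC sp_tables_ex N'XL_CLIENT_A';

        EXEC sp_dropserver N'XL_CLIENT_A';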

    It is very important to build error and exception handling into the process proactively, i.e. validate the document and fail it before starting the load. It is much easier to queue a file for correction than to clean up a partial load after a failure.
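
    As an example of failing a file up front, a pre-load header check might look like the sketch below; the expected header, error number and file path are all assumptions for illustration:

        DECLARE @expected  nvarchar(4000) = N'OrderID,OrderDate,Amount';
        DECLARE @firstLine nvarchar(4000);

        -- Read the whole file as one blob and keep only the first line.
        SELECT @firstLine = LEFT(doc.BulkColumn,
                   CHARINDEX(CHAR(10), doc.BulkColumn + CHAR(10)) - 1)
        FROM OPENROWSET(BULK '\\nas\client_a\orders.csv', SINGLE_CLOB) AS doc;

        IF REPLACE(@firstLine, CHAR(13), '') <> @expected
            THROW 50001, 'Header mismatch; file queued for correction.', 1;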

    The logical correctness of the import must also be asserted, e.g. the number of transactions in the incoming data versus the number in the output of the process. Being able to tie any data entry in the target database back to the actual lines in the incoming files will save your bacon; it is not a question of if but when.
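
    Building on the staging sketch above, a reconciliation check could be as simple as the following, where dbo.Transactions stands in for whatever the real target table is:

        -- Compare staged line counts to loaded row counts per source file
        -- (remember to allow for header lines when comparing the counts).
        SELECT s.SourceFile,
               s.StagedRows,
               t.LoadedRows,
               CASE WHEN s.StagedRows = t.LoadedRows
                    THEN 'OK' ELSE 'INVESTIGATE' END AS Status
        FROM      (SELECT SourceFile, COUNT(*) AS StagedRows
                   FROM dbo.ImportStage GROUP BY SourceFile) AS s
        LEFT JOIN (SELECT SourceFile, COUNT(*) AS LoadedRows
                   FROM dbo.Transactions GROUP BY SourceFile) AS t
               ON t.SourceFile = s.SourceFile;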

  • halifaxdal (11/11/2016)

    If this is to be a part of your business data model, then the business needs to stop treating it like a kitty box. Yes, you can pull off some miracles with incorrectly formed data but don't. The absolute best practice would be to impart some standards on the form, fit, and function of the data, compel the clients to comply, and then enforce the standards rigidly.

    It's easy for them and it's easy for you, and rigid enough to easily detect errors. That makes it good for both you and your clients. It'll also lessen the likelihood of your company getting sued for making unauthorized changes to the data in order to load their junk, and, I have to tell you, that's a whole lot more likely than you might imagine.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (11/12/2016)

    Unfortunately, our clients come from everywhere, and I don't believe we can put much extra "regulation" on them requiring their data to be in some sort of compliance; each business is different, and trying might even risk losing the client.

    It's really a catch-22 here: tedious, with likely no automation solution.

  • Heh... some clients are worth losing. 😉

    The title of your post did ask "What's the best practice to auto import files" and what I wrote in my previous post is the best practice.

    The next "best practice" is to write an interface that resolves the junk from each client and translates it into a single, manageable form. If they "junk up their junk" by continuously changing their format and bastardizing column names, it will take a continuous custom effort to fix things every time you receive a file from them. Yes, you can automate some "de-junking", but as fast as you can write some nearly-artificial intelligence to fix their junk, their lack of intelligence will easily overwhelm your attempts, and the extra processing will slow things down and possibly introduce its own set of errors.

    The reason for creating the standard interface is so that you don't have to change all of your code every time a customer farts out yet another abomination in the form of "data". You'll only need to handle the up-front translation to the standard form, even if that translation necessarily becomes human-driven and tedious.
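
    One plausible shape for that up-front translation layer (all names here are invented for illustration) is a per-client column map, so that when a client changes a header, only mapping rows change, not code:

        CREATE TABLE dbo.ClientColumnMap
        (
            ClientCode     varchar(20)   NOT NULL,
            SourceColumn   nvarchar(128) NOT NULL, -- whatever the client calls it
            StandardColumn sysname       NOT NULL, -- your canonical name
            CONSTRAINT PK_ClientColumnMap
                PRIMARY KEY (ClientCode, SourceColumn)
        );

        INSERT dbo.ClientColumnMap VALUES
            ('CLIENT_C', N'Ord# @Date', N'OrderDate'),
            ('CLIENT_C', N'$ Amt',      N'Amount');

    A small dynamic-SQL step can then read this table to rename incoming columns into the standard form before the main load ever sees them.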

    Step one, though, would be to talk with the customers and see if you can agree on something more standard. If you explain to them the many ways that it will actually help THEM, you might find them surprisingly willing to comply. Remember that "The answer is always NO unless you ask".

    --Jeff Moden

  • To build on what Jeff said, I'd at least hold each client to their own standard. The more clients adhere to your standard, the better. If you can't get them to adhere to your standard, you can write an import routine just for them. However, I'd require that each client provide data in the same format every time. In other words, clients A, B, D and E use your standard. Client C has their own format and client F has their own format, but they must provide the data the same way every time. If they constantly change their format (like Jeff said, "junk up their junk") then you're going to have to write something new every time. Custom work every time is going to be expensive.
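
    A cheap way to enforce "the same format every time" is to store each client's agreed header and reject anything that drifts. This is a sketch only; the table, values and error number are invented:

        CREATE TABLE dbo.ClientFileContract
        (
            ClientCode     varchar(20)    NOT NULL PRIMARY KEY,
            ExpectedHeader nvarchar(4000) NOT NULL  -- the agreed first line
        );

        -- At load time, @header holds the first line read from the file
        -- (see the header-check sketch earlier in the thread).
        DECLARE @client varchar(20)    = 'CLIENT_F',
                @header nvarchar(4000) = N'OrderRef|Date|Total';

        IF NOT EXISTS (SELECT 1
                       FROM dbo.ClientFileContract
                       WHERE ClientCode     = @client
                         AND ExpectedHeader = @header)
            THROW 50002, 'File does not match this client''s agreed format.', 1;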
