Processing multiple files at the same time

  • Hi,

    There are about 100 files placed on an FTP server.

    All the files need to be processed at the same time, i.e. parallel processing.

    Each file is at most about 200 MB in size.

    The data needs to be inserted into a table ASAP. How can this be done?

    If there is an error in some of the files, say 10 of them, the remaining 90 should still be processed successfully.

    Please share your suggestions.

    Regards
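The failure-isolation requirement above (a bad file must not stop the other 90) could be sketched like this; `load_file` here is a hypothetical stand-in for the real per-file parse-and-insert logic:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def load_file(path):
    # Placeholder for the real per-file parse-and-insert logic.
    if "bad" in path:
        raise ValueError(f"cannot parse {path}")
    return path

def process_all(paths, workers=8):
    """Process files in parallel; collect failures instead of aborting."""
    ok, failed = [], []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(load_file, p): p for p in paths}
        for fut in as_completed(futures):
            path = futures[fut]
            try:
                fut.result()
                ok.append(path)
            except Exception:
                failed.append(path)  # a bad file does not stop the others
    return ok, failed
```

Failed files end up in their own list, which matches the "capture failures separately, keep going" requirement.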

  • yuvipoy (8/6/2015)



    You need to break the problem down into smaller pieces:

    1. Get the files from FTP.

    2. Load the files as they arrive.

    Those are simple steps, but each raises a lot of questions that you must think about. For example, what do you do with a file after downloading it? Do you immediately delete it from the remote server? If not, how will you set things up so you only process each file once? If you do delete it from the remote server immediately after processing, what happens if the load into your table fails because of a database availability issue? How will you ensure you do not miss loading that data?

    Any time you are dealing with remote files, questions like this come up around recoverability and making sure you process each file at least once, and only once.

    You have your work cut out for you.

    Answer my questions about the FTP step and we'll go from there.
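One way to picture the "process each file once, and only once" concern is a small ledger of processed files that survives restarts. This is only a sketch under assumptions: the ledger is a local JSON file, and `download_and_load` is a hypothetical callable that raises on failure (a real system would likely keep the ledger in a database table):

```python
import json
import os

LEDGER = "processed_files.json"

def load_ledger():
    """Read the set of already-processed file names, if any."""
    if os.path.exists(LEDGER):
        with open(LEDGER) as f:
            return set(json.load(f))
    return set()

def mark_processed(name, ledger):
    """Persist the ledger only AFTER a successful load."""
    ledger.add(name)
    with open(LEDGER, "w") as f:
        json.dump(sorted(ledger), f)

def process_once(remote_names, download_and_load):
    ledger = load_ledger()
    for name in remote_names:
        if name in ledger:
            continue  # already handled in a previous run
        download_and_load(name)  # may raise; ledger is untouched on failure
        mark_processed(name, ledger)
```

Because a file is recorded only after its load succeeds, a crash mid-run means the file is retried on the next run rather than lost, which addresses the recoverability question above.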

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • Orlando Colamatteo (8/8/2015)


    Orlando,

    Yes, these are the questions I am thinking through as a second step.

    Dealing with files is always a big challenge.

    Regarding reading the files from FTP: in general, how is this usually done?

    How do large organizations deal with FTP issues?

    Is there an article on how they approached this, or could you provide some information on how it can be done?

    Thanks

  • yuvipoy (8/10/2015)



    My first question to you is: how do you plan on loading the files? What technology do you have access to, or have as an option?



    Alvin Ramard
    Memphis PASS Chapter

    All my SSC forum answers come with a money back guarantee. If you didn't like the answer then I'll gladly refund what you paid for it.

    For best practices on asking questions, please read the following article: Forum Etiquette: How to post data/code on a forum to get the best help

  • Will one of these be the solution?

    Solution 1: Once a file reaches the FTP server, it should be read and processed. Records that fail to process should be captured in a separate table; this applies when there is an issue with the data length or invalid parameters within the file.

    Solution 2: For each file we need a file_id, and we should know how many records are in the file before processing. While inserting the data into the table, we update the number of records that succeeded or failed and match it against the number of records in the file. If successes plus failures does not match the file's record count, we conclude that the file has an issue.

    Thanks
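Solution 2's reconciliation check can be sketched like this; `insert_record` and `expected_count` are hypothetical stand-ins (the expected count would come from a file header, trailer record, or a pre-count of the file):

```python
def load_with_reconciliation(file_id, expected_count, records, insert_record):
    """Insert records one by one, then reconcile counts against the file total."""
    success = failed = 0
    for rec in records:
        try:
            insert_record(rec)
            success += 1
        except Exception:
            failed += 1  # e.g. bad data length or invalid parameters
    # If success + failed != expected_count, the file was truncated or misread.
    file_ok = (success + failed == expected_count)
    return {"file_id": file_id, "expected": expected_count,
            "success": success, "failed": failed, "file_ok": file_ok}
```

The per-file summary row this returns is exactly what you would persist in a control table to decide whether a file needs to be reprocessed.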

  • If you need to process each remote file once and only once, then a workflow like this would make sense to me:

    1. Create a table in one of your databases with the columns "server name", "remote file path", "remote file name", "last modified date", and "date downloaded".

    2. Create two interface stored procedures to access the new table:

    ....proc 1 selects a row based on "server name", "remote file path" and "remote file name"

    ....proc 2 inserts a new row

    3. Perform a directory listing on the remote FTP server.

    4. Iterate over the list of remote files and check for the existence of a row in your table using your select proc.

    5. If the row exists in your table and the "last modified date" is the same, you processed it already. If the row exists but the "last modified date" is different, a different file was posted; what to do then depends on your business rules. If the row does not exist, you haven't processed it yet.

    6. Once you successfully download the file, add a row to your table using the insert interface proc you wrote.

    A program like this would be easier to implement outside SQL Server, for example as a .NET Console Application. I would recommend leveraging a tool like WinSCP for the FTP or SFTP logic. The developer provides a very solid implementation and a nice .NET wrapper class to help you interface with the WinSCP executable from a .NET programming context. Commercial tools from GlobalScape are fantastic too, but they cost a bit of money.

    That would be all I would do in terms of obtaining the files. I would keep the "file getting" piece separate from the ETL layer.
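The decision logic in steps 4-6 above could be sketched like this; as an assumption, an in-memory dict stands in for the tracking table and its two interface procs:

```python
# Stands in for the tracking table; key: (server, path, name) -> last modified.
tracking = {}

def decide(server, path, name, last_modified):
    """Steps 4-5: decide what to do with one remote file."""
    key = (server, path, name)
    seen = tracking.get(key)       # "select proc"
    if seen is None:
        return "download"          # never processed
    if seen == last_modified:
        return "skip"              # already processed this exact file
    return "business-rule"         # same name, but a different file was posted

def record_download(server, path, name, last_modified):
    """Step 6: add a row after a successful download ("insert proc")."""
    tracking[(server, path, name)] = last_modified
```

Keying on server, path, name, and last-modified date is what lets the loop distinguish "already done" from "a new file reusing an old name", which is the crux of the once-and-only-once guarantee.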

