Control Flow: differentiating between source files already processed/not processed

  • I have to import zip files every day from a vendor's FTP site; the vendor leaves the last 7 days' worth of files on the site at all times. I run the ETL every day and just need the latest zip file.

    For example:

    Jun1

    tbl1_060114.txt, tbl2_060114.txt, tbl3_060114.txt

    Jun2

    tbl1_060214.txt, tbl2_060214.txt, tbl3_060214.txt

    Jun3

    tbl1_060314.txt, tbl2_060314.txt, tbl3_060314.txt

    and so on.

    What task and approach do I need to take to get just the latest version of tbl1, tbl2 and tbl3 from the FTP site? Other considerations: I must not accidentally reprocess a differential that I've already imported, and I need to be able to handle a situation where a database backup is lost and I have to reprocess 2 or more days' worth of differentials in the right order.

    Does this make sense?

  • If you move a file to another folder after processing it, will the vendor put it back if it is within the seven-day period?


  • The vendor makes a running seven days' worth of zipped differential files available on their FTP site, three per day (one per table). I just have to go get them before any one of them gets more than 7 days old.

    Those are the only rules. I plan to get whatever they've just posted, on the day they post it. This means that there will always be six other files per table still sitting there. I guess it doesn't matter to anybody right now whether I move or copy the files from the FTP source, so maybe I should just move the files I'm processing. In that case I will just need, per table, the differential with a time stamp one day greater than the last differential I processed the day before (a rough sketch of that lookup follows below).

    so simple yet so complex.
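    For what it's worth, here is a minimal sketch of that 'last processed + 1 day' lookup, assuming a hypothetical log table dbo.ProcessedDiff (TableName, DiffDate) holding one row per differential already imported:

    -- Hypothetical log of imported differentials
    SELECT TableName,
           DATEADD(DAY, 1, MAX(DiffDate)) AS NextExpectedDate
    FROM dbo.ProcessedDiff
    GROUP BY TableName;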

  • As per my understanding, the approach below should be able to handle this (a sketch of the key pieces follows the list):

    1. Create a table 'Time' which holds all the calendar dates and a column 'IsProcessed' with a 'Y'/'N' flag.

    2. Create a user variable to hold the values from the query below:

    SELECT TimeColumn FROM dbo.Time WHERE IsProcessed = 'N' AND TimeColumn <= GETDATE()

    This should give you all the dates for which the files are yet to be picked up, e.g. 060114, 060214, etc.

    3. Use a Foreach Loop container with the above-mentioned variable to loop through all the required dates.

    4. Create another Foreach Loop container within the outer one to pick up all the files for a given date. You can create variable(s) to dynamically generate the file names, so that the required file is picked up from the FTP server for each table.

    5. Move each file to an Archive folder so that a copy remains available to you even after the 7-day period.

    6. After all the files have been picked up for a particular date, set the corresponding IsProcessed flag in the Time table to 'Y'.

    If you have to reprocess a previous day's data, you can move the file back to the FTP server and reset the IsProcessed flag in the Time table.

    Just a high-level idea of what I could think of 🙂
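    A minimal sketch of steps 1, 2 and 6 above, assuming the date column is called TimeColumn (all of these names are illustrative rather than prescriptive):

    -- Step 1: calendar table with a processed flag
    CREATE TABLE dbo.[Time]
    (
        TimeColumn  DATE    NOT NULL PRIMARY KEY,
        IsProcessed CHAR(1) NOT NULL DEFAULT 'N' -- 'Y' once that day's files are imported
    );

    -- Step 2: dates whose files are yet to be picked up
    SELECT TimeColumn
    FROM dbo.[Time]
    WHERE IsProcessed = 'N'
      AND TimeColumn <= GETDATE();

    -- Step 6: run once all files for the current date have been imported
    -- (@CurrentDate would be mapped from the loop variable)
    UPDATE dbo.[Time]
    SET IsProcessed = 'Y'
    WHERE TimeColumn = @CurrentDate;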

  • I agree that keeping a record of the files that you've processed is the best way to accommodate your requirements, but it doesn't have to be as complex as the previous suggestion. You could simply pick up each file and check the file name against your ProcessedFiles (for example) table; if there is no matching entry, continue to process the file; if there is a matching entry, do nothing further and move on to the next file.

    You can also record other potentially useful information in your ProcessedFiles table, such as row counts and timing information. It also serves as a basic audit table.
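    For illustration, such a table might look something like this (the table and column names are just suggestions, and the @-variables would be supplied by the package):

    CREATE TABLE dbo.ProcessedFiles
    (
        FileName     VARCHAR(255) NOT NULL PRIMARY KEY, -- e.g. tbl1_060114.txt
        RowsImported INT          NULL,
        StartedAt    DATETIME     NULL,
        FinishedAt   DATETIME     NULL
    );

    -- Only process a file that hasn't been seen before
    IF NOT EXISTS (SELECT 1 FROM dbo.ProcessedFiles WHERE FileName = @FileName)
    BEGIN
        -- ... import the file, then log it:
        INSERT INTO dbo.ProcessedFiles (FileName, RowsImported, StartedAt, FinishedAt)
        VALUES (@FileName, @RowsImported, @StartedAt, @FinishedAt);
    END;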

    Regards

    Lempster

  • Hi Marinder and Lempster, thank you both. I think you're actually in agreement; it's just that Marinder gave me concrete SSIS steps. Thanks for the details, Marinder. I will ponder and get to work on a POC! Thanks so much.

  • How would you control the order of processing? The diff from Jun 1 has to be processed before the diff from Jun 2.

    Would you put a query in the second Foreach loop to do something like:

    SELECT TOP (1) DateColumn
    FROM TimeTbl
    ORDER BY DateColumn ASC

    ??

  • Sure, ORDER BY would handle that scenario, but that query belongs in the first Foreach Loop container (see the sketch below).

    The second Foreach Loop container should be used for looping through the files for a particular date.
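    Putting the two together, the first loop's source query might look like this (same illustrative names as the earlier sketch):

    SELECT TimeColumn
    FROM dbo.[Time]
    WHERE IsProcessed = 'N'
      AND TimeColumn <= GETDATE()
    ORDER BY TimeColumn ASC; -- guarantees Jun 1 is handed out before Jun 2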

  • Is it correct that in the second Foreach Loop container, the filename will be built up using an expression built from two variables, filepart and datepart?

  • M, do you think you could look at what I did in this post?

    http://www.sqlservercentral.com/Forums/Topic1592460-364-1.aspx?Update=1

  • That's correct. The complete filename (FilePart + DatePart) should be built using the expression builder.
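    For example, assuming two string variables named User::FilePart and User::DatePart (hypothetical names), the expression might look something like:

    @[User::FilePart] + "_" + @[User::DatePart] + ".txt"

    With FilePart = "tbl1" and DatePart = "060114", this evaluates to tbl1_060114.txt.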

    I have also replied to the other post you referred to above.

