Architecture for High-Volume File Processing

  • I'm looking for advice on the best architecture to handle the following scenario:

    We'll have a file drop folder for flat files coming in from a variety of sources. They may be in any one of several different formats -- each will have something in the file name or extension to denote its format. Files are relatively small (typically under 50 KB), but the folder may see a hundred or more of these per minute.

    I need to create a process that will monitor the folder for the arrival of new files (though we could also simply poll every 10 seconds or so), determine the type of each file found, and launch a file transformation routine that will put each of the different formats into a common format and/or a staging table in SQL, from which it will be loaded into several different production tables.

    We have a simpler version of this already created as a C# windows service that will spawn a new thread with an appropriate file conversion routine based on the file format. It is working well, but the number of formats to be supported is growing from 2 to about 6, and the volume of files is growing from about 3 every 10 minutes to, as indicated above, about 100 per minute. I'd like to try to use SSIS to handle this, but I have questions about whether it can do so. (And the C# developers are saying it can't.)

    I've created a bunch of simple, straightforward SSIS packages in the past, but this would be an order of magnitude more complex. I'm wondering if those more experienced with SSIS would consider tackling this scenario using SSIS, or if we should simply beef up our existing C# service.

    If we used SSIS, we would likely have to create multiple custom data flow transformation components to handle the different formats (none are simple, but involve unpacking binary data and variable numbers of columns in a 'row'). The real questions would be:

    a) Is there a way within a ForEach Loop container (or something else?) to branch to different data flow tasks depending on a file extension or name? I can't see a way...

    b) If so, could each branch run on a new thread so that we could handle multiple files simultaneously, or does SSIS only process one file at a time?

    c) Could we run multiples of each branch so as to process multiples of the same file format at the same time?

    d) Are there other ways of handling this with SSIS that I'm not even thinking of (but should be)?

    I'm currently using SQL 2005, but we're expecting to upgrade to 2008 in the next 6 weeks, so if there are capabilities in 2008 that aren't in 2005, those will be available to me.

    I realize this is a pretty high-level question. Just hoping for some input. Thanks in advance!

    Rob Schripsema
    Propack, Inc.

  • I haven't done this, but a few thoughts.

    First, an SSIS package can handle multiple threads and process things in parallel, but it would have to know which files to pick up. I can see that being a confusing piece to get right.

    Second, are you looking to programmatically pick up files and determine which type they are and import them? Or would you create an SSIS package for each format and then pass in the file as a parameter and let SSIS pull in the data?

    I know SSIS excels at large volumes of data, but large volumes of files could be an issue. If you started 100 instances of SSIS packages every minute, I'm not sure what the overhead would be.

  • Here's the scenario I'd like to be able to create with SSIS -- similar to what I know I can do in a C# app:

    1) I'd begin monitoring a folder, looking for files being dropped into it. If monitoring doesn't work, I'd just start the app/package every 30 seconds and process files found in the folder -- until there were none left.

    2) Using something like a ForEach Loop, I'd get the file name and determine the type of file it is -- say, a Log1, Log2, Log3 or Log4 type.

    3) Using something that works like the Conditional Split transform does at the row level in a Data Flow, I'd like to pass that file to one of 4 different file processor data flows based on the type of file discovered in step 2.

    4) Once a data flow process had started, the control flow would look at the next file found in the drop folder, examine it, and pass it off to another data flow designed for that file type. (In C#, it would be like spawning a new worker thread.) Ideally, more than one instance of a given type of data flow could be running at once, rather than having to process each type sequentially, one file at a time.

    5) Once all file processing had finished, and if there were no more files in the drop folder to process, the app would either silently continue to monitor the folder, or exit, to be restarted 30 seconds later.

    I know, too, that SSIS can handle large amounts of data through the pipeline. As you say, the question is whether it can handle large numbers of files of different types in this fashion (or an alternative fashion, if there's a different architecture I should consider). Seems like it ought to be a common enough scenario that somebody in the community would have been through this before.

    Rob Schripsema
    Propack, Inc.

  • Rob,

    I have a question. How many different transformations do you have? That is to say, of the '100' different incoming files, how many SSIS packages would be needed to process them?

    Josef Richberg
    2009 Exceptional DBA
    http://www.josef-richberg.squarespace.com
    http://twitter.com/sqlrunner

  • Here's how I'd design it:

    1. Create a simple "control" table in SQL. Identity PK, filename, file "type", Execution instance GUID, Start Time, End Time.

    2. Create a "control" package. This package will loop infinitely (for loop) with a pause on each loop. It'll have a Foreach Loop inside that loops over the files in the folder. For each file, use an Execute SQL Task to check whether that file is in the control table. If not, insert a new row.

    3. Create a "control" Agent job. First step is to run the control package, subsequent steps to report errors, etc...

    4. Create a "worker" package for each file type. Yes - one package per file type. Have each package loop infinitely (for loop) with a small pause on each loop. Inside the loop, use an Execute SQL Task to update the control table. The update would stamp a start time and execution GUID onto the TOP 1 row of its file type that doesn't already have a start time, ordered by the identity column. Follow up with another Execute SQL Task (or just type more into the first one) that retrieves the row for this execution GUID that has NULL in the end time column. Fill variables with the file name. Process the file. Then use an Execute SQL Task to update the row in the control table with an end time. (See the T-SQL sketch after this list.)

    5. Create as many agent jobs as you like for each type of "worker" package.

    The above will allow you to keep each package very simple - each package only processes one file type. It will allow you to dynamically scale up - or out - how many concurrent packages run, and where they run. Because yes - you can start another SSIS server, create jobs using those packages, and point them at the same control table. This also allows you to fail mostly gracefully - if one of your worker packages fails, your whole system doesn't stop; it'll just be that one package due to one file, or all instances of that package type if you get something logically wrong.
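
    To make this concrete, here's a rough T-SQL sketch of the control table and the worker's claim/complete steps. All of the names are made up, and the UPDLOCK/READPAST hints are just one way to keep two concurrent workers from claiming the same row:

    -- Hypothetical control table
    CREATE TABLE dbo.FileControl (
        FileControlID int IDENTITY(1,1) PRIMARY KEY,
        FileName      nvarchar(260) NOT NULL,
        FileType      varchar(20) NOT NULL,
        ExecutionGUID uniqueidentifier NULL,
        StartTime     datetime NULL,
        EndTime       datetime NULL
    );

    -- Control package: register a file only if it isn't already queued
    IF NOT EXISTS (SELECT 1 FROM dbo.FileControl WHERE FileName = @FileName)
        INSERT dbo.FileControl (FileName, FileType) VALUES (@FileName, @FileType);

    -- Worker package: claim the oldest unstarted file of this worker's type.
    -- Updating through the CTE stamps exactly one row, so two workers
    -- can't grab the same file.
    WITH NextFile AS (
        SELECT TOP (1) ExecutionGUID, StartTime
        FROM dbo.FileControl WITH (ROWLOCK, UPDLOCK, READPAST)
        WHERE FileType = @FileType AND StartTime IS NULL
        ORDER BY FileControlID
    )
    UPDATE NextFile SET ExecutionGUID = @ExecutionGUID, StartTime = GETDATE();

    -- Worker package: fetch the claimed file name into package variables
    SELECT FileName FROM dbo.FileControl
    WHERE ExecutionGUID = @ExecutionGUID AND EndTime IS NULL;

    -- Worker package: stamp the end time once the file is processed
    UPDATE dbo.FileControl
    SET EndTime = GETDATE()
    WHERE ExecutionGUID = @ExecutionGUID AND EndTime IS NULL;

    Each statement would live in its own Execute SQL Task (or be combined), with @FileName, @FileType and @ExecutionGUID mapped to package variables.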

    Todd McDermid - SQL Server MVP, MCTS (SQL 08 BI), MCSD.Net
    My Blog - Dimension Merge SCD Component for SSIS - SSIS Community Tasks and Components

  • sqlrunner (11/18/2010)


    Rob,

    I have a question. How many different transformations do you have? That is to say, of the '100' different incoming files, how many SSIS packages would be needed to process them?

    At present, we need to process 4 different incoming file layouts/types. The destination for each is the same; it's just that each has to be shredded differently. Ultimately, there could be 6 or 7 different types.

    We need to develop this so that it will handle up to 100 individual files per minute (each between 10 and 50 KB) in any combination of the 4+ types. I could set up the incoming processes such that each type will go to a separate folder.

    I had originally been thinking this would all go into a single package, but your question prompted me to wonder if I shouldn't just create a package per file type. I could have each of the packages running simultaneously, each processing files of a given type. In that case, I still have the question of whether it is possible to set up a package such that it can process multiple files at the same time on separate threads, much like I can do in a C# app....

    Rob Schripsema
    Propack, Inc.

  • tmcdermid (11/18/2010)


    Here's how I'd design it:

    1. Create a simple "control" table in SQL.

    ....

    The above will allow you to keep each package very simple - each package only processes one file type. It will allow you to dynamically scale up - or out - how many concurrent packages run, and where they run. Because yes - you can start another SSIS server, create jobs using those packages, and point them at the same control table. This also allows you to fail mostly gracefully - if one of your worker packages fails, your whole system doesn't stop; it'll just be that one package due to one file, or all instances of that package type if you get something logically wrong.

    Todd -- thanks, I think I'm starting to see how SSIS would need to be set up for this scenario. Not as clean as I would have hoped (lots of moving parts each having to be coordinated), but certainly workable and conceptually simple.

    Rob Schripsema
    Propack, Inc.

  • I have a script component that watches one or more directories for a file mask (e.g. log*). When it finds a file, the information is passed to the rest of the package for processing. Once that file is done, it loops again, and it will loop infinitely. I put a 5-second delay in the file check so as not to have it consume so much CPU. You would have the watch process in each of your packages.

    Josef Richberg
    2009 Exceptional DBA
    http://www.josef-richberg.squarespace.com
    http://twitter.com/sqlrunner

  • Rob,

    The method you've described will not allow for multi-threading the process. You will not be able to achieve multi-threading from within one package as you've described, because the Foreach Loop through the directory will always run sequentially.

    The solution that Todd suggested will work fine and indeed be scalable. Also, having the load logic for each file type contained in its own package would be a big help come maintenance time. I, too, would create a specific package for each file type.

    I know we are in the SSIS forum, but another alternative would be to use a Service Broker-oriented process. You could keep your C# developers engaged by having their process continue to monitor the directory with the file watcher. Have them create a process for each file type that converts each type into a common XML document representing the logical entities in the data. Then the C# thread will insert the XML document into the SB queue (via a view -- I can explain this in more detail if you have interest).

    Create a stored procedure with the logic to handle the common XML document and set it as the activation SP for the queue. You can use the MAX_QUEUE_READERS setting on the queue to tell it how many instances of your activation procedure it can have running simultaneously. I've used this method to create a common importer process from multiple sources and it worked really well.
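
    For illustration, here's a minimal sketch of the queue side, assuming a single common message type; all object names are hypothetical:

    CREATE MESSAGE TYPE [//FileImport/Document] VALIDATION = WELL_FORMED_XML;
    CREATE CONTRACT [//FileImport/Contract]
        ([//FileImport/Document] SENT BY INITIATOR);
    CREATE QUEUE dbo.FileImportQueue;
    CREATE SERVICE [//FileImport/Service]
        ON QUEUE dbo.FileImportQueue ([//FileImport/Contract]);
    GO

    -- Activation SP: shreds the common XML document into the staging tables
    CREATE PROCEDURE dbo.usp_ProcessFileImport
    AS
    BEGIN
        DECLARE @handle uniqueidentifier, @doc xml;
        WHILE 1 = 1
        BEGIN
            WAITFOR (
                RECEIVE TOP (1)
                    @handle = conversation_handle,
                    @doc    = CAST(message_body AS xml)
                FROM dbo.FileImportQueue
            ), TIMEOUT 5000;
            IF @@ROWCOUNT = 0 BREAK;
            -- (a real proc would also check message_type_name here)

            -- ...shred @doc into the staging/production tables...

            END CONVERSATION @handle;
        END
    END;
    GO

    -- Let Service Broker spin up to 8 readers as queue volume demands
    ALTER QUEUE dbo.FileImportQueue
    WITH ACTIVATION (
        STATUS = ON,
        PROCEDURE_NAME = dbo.usp_ProcessFileImport,
        MAX_QUEUE_READERS = 8,
        EXECUTE AS OWNER
    );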

    Pros/cons... I like having the import logic built into a stored procedure and stored in the database layer as opposed to inside a package. This makes maintenance easier: if you have schema changes, you only have to change your import SP and not 1-n packages. SB handles scaling up as many SPs as it needs to process the queue based on queue volume. This common importer can be used elsewhere outside of SSIS. Any process that can create the XML document in the right format can use the importer.

    As for the cons, using SSIS offloads some of the overhead to the SSIS instance, so if you have a separate SSIS server, you are lessening the resource contention with your database instance. The SB/SP method will do all of the work within the database engine. Also, larger files will create larger XML documents, which may be more resource-intensive to process. XML has come a long way since SS2K, but there will always be limitations in parsing out large XML docs. I have not done a comparison between an SSIS data flow and parsing an XML doc for large files, so I don't really know how much throughput difference you would see.

    I'm sure there are more pros/cons than I listed here in a quick minute, but those are the main areas I can think of. Todd's method would work fine too; I just thought I'd put some food for thought out there. :-)

    John Rowan

    ======================================================
    ======================================================
    Forum Etiquette: How to post data/code on a forum to get the best help - by Jeff Moden

  • Rob,

    You may be interested in reading this article:

    http://www.bidn.com/articles/integration-services/119/parallel-loads-in-ssis

    which seems to be similar to Todd's idea, and additionally describes how to affinitize an SSIS process to a core, so that you could run, say, 16 SSIS processes on a given SQL Server.

    -Steve

  • John Rowan (11/18/2010)


    ...

    I'm sure there are more pros/cons than I listed here in a quick minute, but those are the main areas I can think of. Todd's method would work fine too; I just thought I'd put some food for thought out there. :-)

    John -- the more input the better at this point. I like your idea -- but as you said, I might be concerned about the load on the SQL process. Still, the few times I've worked with Service Broker I've been impressed with how well it works and what one can do with it.

    I just might explore this further....

    Rob Schripsema
    Propack, Inc.

  • sqlrunner (11/18/2010)


    I have a script component that watches one or more directories for a file mask (e.g. log*). When it finds a file, the information is passed to the rest of the package for processing. Once that file is done, it loops again, and it will loop infinitely. I put a 5-second delay in the file check so as not to have it consume so much CPU. You would have the watch process in each of your packages.

    Thanks for the input. I've seen a FileWatcher component (from Konesans?) that seems to also do what you are saying. Depending on how I end up putting this together, that could prove useful.

    Rob Schripsema
    Propack, Inc.

  • Rob Schripsema (11/18/2010)


    sqlrunner (11/18/2010)


    I have a script component that watches one or more directories for a file mask (e.g. log*). When it finds a file, the information is passed to the rest of the package for processing. Once that file is done, it loops again, and it will loop infinitely. I put a 5-second delay in the file check so as not to have it consume so much CPU. You would have the watch process in each of your packages.

    Thanks for the input. I've seen a FileWatcher component (from Konesans?) that seems to also do what you are saying. Depending on how I end up putting this together, that could prove useful.

    I used that originally, but it was a little awkward for us since we were watching multiple folders for the same set of files. If you are going to watch a single folder for multiple file masks, then it's perfect. If you need to watch multiple directories, a simple C# script is the way to go. I could post the example or simply send you the file if you need it.

    Josef Richberg
    2009 Exceptional DBA
    http://www.josef-richberg.squarespace.com
    http://twitter.com/sqlrunner

  • Rob Schripsema (11/18/2010)


    John Rowan (11/18/2010)


    ...

    I'm sure there are more pros/cons than I listed here in a quick minute, but those are the main areas I can think of. Todd's method would work fine too; I just thought I'd put some food for thought out there. :-)

    John -- the more input the better at this point. I like your idea -- but as you said, I might be concerned about the load on the SQL process. Still, the few times I've worked with Service Broker I've been impressed with how well it works and what one can do with it.

    I just might explore this further....

    Rob,

    Give me some time to get something written up. I can mock up an example of how this works, including a view-based interface into the SB queue from outside processes. I designed a process just like this at my previous position, but I don't have the code, so I'll create a mock-up to show you a proof of concept. It shouldn't be too hard to take the mock-up, fit it to your own database entities, and run a load test to see how it scales. I think a load test would be worth the time before you go about designing the SSIS solution, or in parallel with it.

    I'll get something out either tomorrow or over the weekend.
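
    In the meantime, here's the rough shape of the view-based interface, reusing the hypothetical names from my queue sketch above. An INSTEAD OF INSERT trigger on a view wraps the SEND, so the C# process just does a plain INSERT of the XML document:

    -- Shape-only view: it never returns rows, it just gives outside
    -- processes something to INSERT into
    CREATE VIEW dbo.FileImportFeed
    AS
        SELECT CAST(NULL AS xml) AS ImportDocument WHERE 1 = 0;
    GO

    CREATE TRIGGER dbo.trg_FileImportFeed
    ON dbo.FileImportFeed
    INSTEAD OF INSERT
    AS
    BEGIN
        DECLARE @handle uniqueidentifier, @doc xml;

        -- Assumes one document per INSERT; a cursor would handle batches
        SELECT @doc = ImportDocument FROM inserted;

        -- A single service on both ends keeps the sketch short
        BEGIN DIALOG CONVERSATION @handle
            FROM SERVICE [//FileImport/Service]
            TO SERVICE   '//FileImport/Service'
            ON CONTRACT  [//FileImport/Contract]
            WITH ENCRYPTION = OFF;

        SEND ON CONVERSATION @handle
            MESSAGE TYPE [//FileImport/Document] (@doc);
    END;

    From there, any caller that can build the XML just inserts into dbo.FileImportFeed, with the Service Broker plumbing hidden behind the view.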

    John Rowan

    ======================================================
    ======================================================
    Forum Etiquette: How to post data/code on a forum to get the best help - by Jeff Moden

  • John Rowan (11/18/2010)


    Rob,

    Give me some time to get something written up. I can mock up an example of how this works, including a view-based interface into the SB queue from outside processes. I designed a process just like this at my previous position, but I don't have the code, so I'll create a mock-up to show you a proof of concept. It shouldn't be too hard to take the mock-up, fit it to your own database entities, and run a load test to see how it scales. I think a load test would be worth the time before you go about designing the SSIS solution, or in parallel with it.

    I'll get something out either tomorrow or over the weekend.

    Thanks, John. Much appreciated -- and definitely above and beyond the call of duty. I'll certainly give it a try.

    Rob Schripsema
    Propack, Inc.
