Importing Files in Parallel with SSIS

  • richardmgreen1

    SSCrazy Eights

    Points: 9886

    I wouldn't call 160 massive and this works for us.

     

    We do spawn up to 32 copies of the same SSIS package to do the loading but, if one fails, it doesn't fail the rest.

    E.g. if file34 fails, the other 159 files still load.

    What then happens is we get an email telling us which file failed so we can do something about it (it's usually a change of layout by the supplier, but that happens rarely).

     

    At that point, we restart the process which ignores the 159 correctly processed files (see the IsLoaded flag) and concentrates on the one failed file.

     

    Once all the files have loaded, we archive them (just in case) into a zip file and then delete the originals (7-zip doesn't seem to do a zip-and-delete like other compression programs).

     

    Thanks for the link, I'll take a look and see if it's useful for us.
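    The IsLoaded restart logic described above can be sketched roughly like this. All the names here (control_table, load_file, run_load) are hypothetical; in the real system this state lives in a SQL Server control table rather than a Python dict, but the flow is the same.

```python
# Minimal sketch of the IsLoaded restart pattern: try every file not yet
# flagged as loaded, record failures, and leave the flags so that a rerun
# only touches the files that failed.

def run_load(control_table, load_file):
    """control_table maps filename -> IsLoaded flag; load_file does the work."""
    failures = []
    for filename, is_loaded in control_table.items():
        if is_loaded:
            continue  # skip the files that succeeded on a previous run
        try:
            load_file(filename)
            control_table[filename] = True  # set the IsLoaded flag
        except Exception:
            failures.append(filename)  # report later; keep loading the rest
    return failures
```

    On the first run, one bad file out of 160 yields one entry in the failure list; a rerun after the fix skips the 159 already-flagged files and retries only the failed one.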

  • iposner

    Ten Centuries

    Points: 1161

    Rather than having a master package distribute the list of files to process, another approach is to give each package the rules it needs to process the subset of files it needs and then kick off the parallel execution of each.

    I call this pattern the "Modulo Shredder". You can read about it here: https://ianposner.wordpress.com/2012/12/14/etl-parallelization-modulo-shredder/

    I've used this pattern to load over one million rows per second.
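    As I read the linked post, the core of the idea is that each of N parallel workers applies the same deterministic rule to derive its own subset of files, so no controller has to build or distribute file lists. A rough sketch (the function name is mine, and the hash-on-filename rule is one possible modulo rule; a sequence number embedded in the filename would work just as well):

```python
# Sketch of the "Modulo Shredder" idea: worker k of N claims the files whose
# stable hash falls in its modulo bucket. Every worker runs the same rule,
# so the buckets are disjoint and together cover every file -- with no
# coordination between workers.
import zlib

def my_files(all_filenames, worker_id, worker_count):
    """Return the subset of files that worker `worker_id` (0-based) owns."""
    return [f for f in all_filenames
            if zlib.crc32(f.encode()) % worker_count == worker_id]
```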

  • Jeff Moden

    SSC Guru

    Points: 996436

    iposner wrote:

    Rather than having a master package distribute the list of files to process, another approach is to give each package the rules it needs to process the subset of files it needs and then kick off the parallel execution of each.

    I call this pattern the "Modulo Shredder". You can read about it here: https://ianposner.wordpress.com/2012/12/14/etl-parallelization-modulo-shredder/

    I've used this pattern to load over one million rows per second.

    How wide were the rows in field count and bytes?  When you were done, how large did the log file grow?  Once you had the data loaded, what post-processing (clustered index addition, data validation by datatype, etc) was necessary?  And do you have a readily duplicated test that anyone could run?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Forum FAQ

  • iposner

    Ten Centuries

    Points: 1161

    In answer to your question, it is about 10 years since I did this load, but from memory the load went into a heap table that subsequently had indexes built on it - that was what took the time (5 hours)! The table was then switched into the main fact table. I may have performed the load under the BULK_LOGGED recovery model to minimise logging too, but I can't remember. As you've rightly guessed, the table wasn't very wide, but I managed to load 2.5bn rows in just over 40 minutes - just over 1m rows per second - by executing the same package 12 times in parallel over a set of 10,000 files. All the sub-packages loaded fact table data and included foreign key lookups in their logic. The server was a regular HP DL-380 connected to a corporate SAN.

    However, none of the technical points above is really relevant to the technique itself. In any parallel execution scenario, you can have a controlling process assign the files one at a time (which acts as a bottleneck across all executing sub-packages); you can assign all the files to the sub-packages as lists upfront (which is better, but slows down if you have to enumerate a lot of files); or, as I have suggested, you can give each sub-package the rules it needs to derive its own subset of files.

    The advantage of the Modulo Shredder approach is that no lists of files need to be sent to each sub-package, and once each sub-package starts executing, no further coordination is necessary between them: there is no reporting back to the mothership for the next file to process, and no lengthy list to wait for before execution starts.

    Unfortunately I don't have an example, but you can test the two main points for yourself:

    • Test how quickly you can bulk insert data into a target heap table in parallel.
    • Test the Modulo Shredder approach

    Hope this helps

  • frederico_fonseca

    SSChampion

    Points: 14632

    richardmgreen1 wrote:

    I wouldn't call 160 massive and this works for us.

    Once all the files have loaded, we then archive the files (just in case) into a zip file and then delete the original files (7-zip doesn't seem to a zip-and-delete like other compression programs).

    Recent versions of 7-Zip do have the option to move files into the archive (i.e. delete them after compression, via the `-sdel` switch). The feature first appeared in version 9.30 alpha (2012-10-26), although the most common version at the time was 9.20; it has been available in the stable production releases since 15.12 (released 2015-11-19).

     

  • Jeff Moden

    SSC Guru

    Points: 996436

    @iposner ,

    I'm curious...  Did all 10,000 files have precisely the same format?  And what was that format?  CSV?  TSV?  Fixed Width Fields?  And can you make a guess as to how many fields there were and how many may have been used to populate variable width columns in the target table?

    Last but not least, what was included in the 40 minutes of load time? Was it just the loading of the data contained in the source files into the original heap table? Obviously it didn't include the 5 hours of index build time, but I'm curious whether the 40 minutes included anything other than populating the original heap staging table. For example, was any form of de-duplication or other data validation included during the load time? Were the FKs in place on the heap table?

     

    --Jeff Moden



  • iposner

    Ten Centuries

    Points: 1161

    In that project, I had the luxury of liaising with the person creating the files to define a file format. It was a TSV format, and the same format for all 10,000 files. There were probably no more than 10 columns in the file. The target table, being a fact table, consisted of just foreign keys and numerical measures, so it was completely fixed width. Foreign keys were defined but not enforced, as enforcing them would have hindered the load time. The data was machine-generated, so there was no need to validate it (any errors in computation would prevent the load from starting), but there was a need to derive new dimension entries from the files.

    Again, all of this is irrelevant to the parallel processing: you have to realise that a single non-blocking SSIS dataflow is limited to roughly 50,000 to 100,000 rows per second for most common row widths. Therefore, to beat this speed, one has to go parallel. How you do it is up to you.
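    As a quick sanity check, the figures quoted earlier in the thread line up with this per-dataflow limit:

```python
# Back-of-envelope check of the numbers quoted in this thread.
rows = 2_500_000_000          # 2.5bn rows loaded
minutes = 40                  # "just over 40 minutes"
packages = 12                 # parallel copies of the same package

total_rate = rows / (minutes * 60)   # rows/second overall
per_package = total_rate / packages  # rows/second per dataflow

# Roughly 1.04m rows/s overall and ~87k rows/s per package -- squarely
# inside the 50k-100k rows/s range quoted for a single dataflow.
```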

  • Jeff Moden

    SSC Guru

    Points: 996436

    Some of the details may be irrelevant to parallel processing but, in my mind anyway, parallel processing during the initial load is just one part of the equation.  As you said, it took 5 hours to index the 2.5Bn new rows.  There's a nice trick that can be done on the initial load of tables with the Clustered Index already in place that takes significantly less time than doing a load to a heap and then adding the Clustered Index, but it only works on the first load.  The details you've provided are good info to help think about other possibilities on the very interesting problem that you used "Modulo Shredding" to solve, which is also a very interesting subject.

    --Jeff Moden



  • iposner

    Ten Centuries

    Points: 1161

    Your suggestion about loading into a clustered index being faster than loading a heap and then indexing requires the data to be pre-sorted. So somewhere you still have to sort the data; the problem doesn't go away, it just gets shifted elsewhere.

    The downside to loading even pre-sorted data into a table with a clustered index on it is that the presence of the index forces you to load a single file at a time into the table/partition (ensuring lock escalation is set to AUTO for partitioned tables).

    To combine the benefits of parallelization with clustered indexes in place, you would have to run parallel dataflows of pre-sorted, non-overlapping ranges of data into different partitions/tables.
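    The partition-aligned idea above can be sketched as follows, assuming SQL Server's RANGE LEFT boundary semantics (a boundary value belongs to the partition on its left). The function names and the list-based batching are hypothetical; the point is that pre-sorted data split on the partition boundaries gives each parallel stream exactly one partition to load, with no contention between streams.

```python
# Split pre-sorted rows on partition boundary values so each parallel
# stream inserts into exactly one partition.
import bisect

def partition_for(key, boundaries):
    """RANGE LEFT-style mapping: partition i holds keys <= boundaries[i];
    keys above the last boundary fall into the final partition."""
    return bisect.bisect_left(boundaries, key)

def split_sorted_rows(sorted_rows, boundaries, key=lambda r: r):
    """Group already-sorted rows into per-partition batches."""
    batches = [[] for _ in range(len(boundaries) + 1)]
    for row in sorted_rows:
        batches[partition_for(key(row), boundaries)].append(row)
    return batches  # each batch can be loaded by its own parallel stream
```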

  • Jeff Moden

    SSC Guru

    Points: 996436

    You're correct in saying that loading a table with a Clustered Index in place means you can only use a single thread if you want to take advantage of it (and there are a couple of serious advantages to doing so).  For the situation you've described, it has no particular advantage.  But it does have serious advantages for larger loads than the ones you're describing.

    Also, while you are correct that there's still a sorting requirement for what I said, large file loads with a Clustered Index in place (if you do it correctly) will be much faster than loading the heap and then adding the Clustered Index.  The heap/add-clustered-index method also requires at least twice the disk space of the data itself, where (properly structured/executed) inserts into a table with the Clustered Index in place will not.

    With that, I'm also suggesting that while parallelization of loads can certainly increase table load speed dramatically, it's not the only thing to consider during massive loads.  Your comment about it taking 5 hours to build the necessary indexes on the loaded data is proof that other things must be considered simultaneously as part of the overall throughput, as well.

     

     

    --Jeff Moden


