In answer to your question, it has been about 10 years since I did this load, but from memory the load went into a heap table that subsequently had indexes built on it - that was what took the time (5 hours)! The table was then switched into the main fact table. I may also have used the BULK_LOGGED recovery model to minimise logging, but I can't remember. As you've rightly guessed, the table wasn't very wide, but I managed to load 2.5bn rows in just over 40 minutes - just over 1m rows per second - by executing the same package 12 times in parallel over a set of 10,000 files. All the sub-packages loaded fact table data and included foreign key lookups in their logic. The server was a regular HP DL380 connected to a corporate SAN.
However, none of the technical points above is really relevant to the technique itself. In any parallel execution scenario you have three broad options for assigning files to the workers:
- Have a controlling process hand out the files, which, if done on a file-by-file basis, makes the controller a bottleneck across all executing sub-packages.
- Build the complete file lists upfront and pass one list to each sub-package, which is better, but slows down when you have a lot of files to enumerate.
- Or, as I have suggested, give each sub-package the rule it needs to work out its own subset of files.
The advantage of the Modulo Shredder approach is that no list of files needs to be sent to each sub-package, and once the sub-packages start executing, no further coordination between them is necessary: there is no "reporting back to the mothership" for the next file to process and no lengthy list to build before execution starts.
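To make the rule concrete, here is a minimal sketch in Python - the original load was an SSIS package, so treat this purely as an illustration of the selection rule; the hash-of-filename modulo rule, the directory path and the parameter names are my assumptions, not details from the original load. Each instance enumerates the same directory independently and keeps only the files that fall into its bucket, so no coordinator and no pre-built lists are needed.

```python
import os
import zlib

def files_for_instance(source_dir, instance_index, instance_count):
    """Yield only the files this instance is responsible for.

    Every instance applies the same deterministic rule to the same directory,
    so no central controller or pre-built file list is required.
    """
    for name in sorted(os.listdir(source_dir)):
        # CRC32 of the file name gives a stable bucket number; taking it
        # modulo the instance count spreads the files across the instances.
        if zlib.crc32(name.encode("utf-8")) % instance_count == instance_index:
            yield os.path.join(source_dir, name)

# Example: instance 3 of 12 picks up roughly 1/12th of the 10,000 files.
for path in files_for_instance(r"\\san\landing\facts", instance_index=3, instance_count=12):
    print(path)  # hand this file to the bulk-load step
```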
Unfortunately I don't have an example, but you can test the two main points for yourself:
- Test how quickly you can bulk insert data into a target heap table in parallel (see the sketch after this list).
- Test the Modulo Shredder approach (sketched above).
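For the first test, a rough sketch of the kind of harness I mean is below. It is Python rather than SSIS purely to keep the example short, and it assumes pyodbc with a SQL Server ODBC driver; the connection string, table name (dbo.FactStage) and file paths are placeholders. With TABLOCK and no indexes on the heap, SQL Server takes BU locks, so the concurrent loads don't block each other.

```python
from concurrent.futures import ProcessPoolExecutor
import pyodbc  # assumption: pyodbc and a SQL Server ODBC driver are installed

CONN_STR = (
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver;DATABASE=Staging;Trusted_Connection=yes;"
)  # placeholder connection details

def bulk_insert(path):
    """Load a single file into the target heap with a minimally logged BULK INSERT."""
    conn = pyodbc.connect(CONN_STR, autocommit=True)
    try:
        conn.cursor().execute(
            f"BULK INSERT dbo.FactStage FROM '{path}' "
            "WITH (TABLOCK, FIELDTERMINATOR = ',', ROWTERMINATOR = '\\n');"
        )
    finally:
        conn.close()
    return path

if __name__ == "__main__":
    # Placeholder file names; in the real test these would come from the
    # modulo rule shown earlier.
    files = [rf"\\san\landing\facts\file_{i:05d}.csv" for i in range(12)]
    # One worker per parallel stream, mirroring the 12 parallel sub-packages.
    with ProcessPoolExecutor(max_workers=12) as pool:
        for done in pool.map(bulk_insert, files):
            print("loaded", done)
```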
Hope this helps