large dataset, standard ingestion scheme (merge join, conditional split...), best practice?

  • I've got a table with 40MM records. I've got an incoming data set (~_stage) with about the same number of rows (abbreviated for testing) that will need to insert, update, or delete against the ~_master table. I've done this dozens of times, and I've attacked the issue using the tried-and-true Merge Join => Conditional Split setup shown in the attached "not_sorting.png".

    I've already slimmed the ~_master columns down to the bare essentials, and the Sort component still refuses to finish. I've got 36 GB of memory and it's using it all. I'm running in 64-bit mode, and I've pointed the data flow's BufferTempStoragePath at multiple non-system drives.

    The second screenshot, "attempt_num_2.png", shows an alternate attempt at the same thing. In this case, I'm issuing a custom SELECT ... ORDER BY on that same slimmed-down column set, but it's still crawling.

    40MM is actually an abbreviated set. Production will be 200MM, and could double by the end of 2012.

    What's the "right" way to do this SSC? I'm at the keys all day and will respond quickly. THANKS!

  • Even though the slowdown appeared to be at the Merge Join in those PNGs, I changed the OLE DB destinations to use fast load ("Table or view - fast load") and got a 100:1 speedup: throughput went from 1,000 rows/sec to 100,000 rows/sec.

    YAHTZEE.

  • The "right way" in SSIS:

    * use fast-load in the destination, but I see you already figured that one out

    * do not use the SORT component. 200 million rows, and it could double? Your server won't be able to handle such a load in memory. Sort in the source query using ORDER BY; the DB engine will probably do this task quicker than SSIS. Just remember to mark the source output as sorted so the Merge Join accepts it (see the first sketch after this list).

    * You can do the lookup in T-SQL using a LEFT OUTER JOIN. If your tables are indexed properly and you join on integer columns, this could work (see the second sketch after this list).

    * You can use the Lookup component if you only need integer columns. Say, for example, you only need 2 integer columns from your reference dataset: 2 * 4 bytes * 400 million rows is 3.2 GB. SSIS should be able to hold that in memory, as you have 36 GB available. Use a Lookup with Full Cache and your SSIS package doesn't need to sort the data.

    * do not use the OLE DB Command, especially with such a huge dataset (but I don't see it in your data flow, so I guess you're OK). Do not use it. Ever. 🙂
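
    A minimal sketch of the sorted-source approach; dbo.the_master, CustomerID, StatusCode, and LastModified are hypothetical stand-ins for the real slimmed-down column set:

        -- Source query for the OLE DB Source; sort on the Merge Join key.
        SELECT CustomerID, StatusCode, LastModified   -- bare-essential columns only
        FROM dbo.the_master
        ORDER BY CustomerID;

        -- Then, in the source's Advanced Editor (Input and Output Properties),
        -- set IsSorted = True on the output and SortKeyPosition = 1 on CustomerID,
        -- or the Merge Join will still demand a Sort.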
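
    And a sketch of the LEFT OUTER JOIN lookup, again with hypothetical table and column names; unmatched stage rows are inserts, matched-but-changed rows are updates:

        SELECT  s.CustomerID,
                CASE WHEN m.CustomerID IS NULL         THEN 'INSERT'
                     WHEN s.StatusCode <> m.StatusCode THEN 'UPDATE'
                END AS RowAction,
                s.StatusCode
        FROM dbo.the_stage AS s
        LEFT OUTER JOIN dbo.the_master AS m
               ON m.CustomerID = s.CustomerID          -- indexed integer key
        WHERE m.CustomerID IS NULL
           OR s.StatusCode <> m.StatusCode;            -- unchanged rows drop out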

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP

  • Ah, I forgot:

    To do the inserts into that multimillion-row table, you're probably best off doing partition switching instead of direct inserts into the table (see the sketch below).

    http://technet.microsoft.com/en-us/library/ms191160.aspx

    (I hope you have Enterprise edition!)
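
    A minimal sketch of the switch itself, assuming the target is already partitioned; every name and the partition number here are hypothetical:

        -- Load the new rows into a staging table on the same filegroup, with the
        -- same schema and indexes as the target, plus a check constraint matching
        -- the partition boundary. The switch is metadata-only and near-instant:
        ALTER TABLE dbo.the_master_stage
            SWITCH TO dbo.the_master PARTITION 42;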

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP

  • fast-load... never saw it have such a dramatic effect.

    SORT in SSIS... yeah, learned that pretty quickly. I've ORDERED BY at the source to overcome this.

    Lookup/FullCache... didn't know sorting was non-compulsory, thanks for that.

    OLE DB command inside flows... Yeah, I can't tell you how many times I've "inherited" this crappy design approach.

    Thanks for the help Koen.

  • Greg J (12/20/2011)


    fast-load... never saw it have such a dramatic effect.

    OLE DB command inside flows... Yeah, I can't tell you how many times I've "inherited" this crappy design approach.

    That OLE DB Command stuff is not exactly helped by the fact that MS provided the damn thing in the first place! I have some sympathy with people who use it in good faith. But the sympathy does not last very long, mind you 🙂

    Good to hear that you've been able to speed things up so dramatically.

    If you haven't even tried to resolve your issue, please don't expect the hard-working volunteers here to waste their time providing links to answers which you could easily have found yourself.

  • Greg J (12/20/2011)


    fast-load... never saw it have such a dramatic effect.

    It's because fast load does its inserts in bulk mode, while the regular mode does them row by row (yup, exactly as bad as the OLE DB Command). And they made the slowest mode the default :rolleyes:

    With your gigantic table, it might also be beneficial to turn off the indexes (drop or disable them) before the inserts and rebuild them afterwards (if you're not using partitioning).

    Also, set the recovery model to SIMPLE to minimize the logging overhead.
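
    A sketch of both steps, with hypothetical database and index names; note that only nonclustered indexes should be disabled this way, since disabling the clustered index takes the whole table offline:

        -- Before the load: cut logging overhead and switch off index maintenance.
        ALTER DATABASE the_warehouse SET RECOVERY SIMPLE;
        ALTER INDEX IX_the_master_StatusCode ON dbo.the_master DISABLE;

        -- ... run the SSIS load here ...

        -- After the load: rebuild the disabled index.
        ALTER INDEX IX_the_master_StatusCode ON dbo.the_master REBUILD;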

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP
