Do It Yourself Load Balancing with SSIS

  • Comments posted to this topic are about the item Do It Yourself Load Balancing with SSIS

  • If the goal is to divide the dataset for another SSIS package why would you put it back in the database so the other package has to read it again vs using a raw data type destination?

    That way you are not adding load and Io to the database again to read the same rows, plus with raw data type as the source it would read faster since the metadata is already validated and each package could be on a server with a local source file.

    If there is a reason why you should not use raw data type I would like to know.

    Thank you,

    Paul

  • This example is nice, but it still has one performance bottleneck - the sourcefile is still running on a single thread, so the dataflow can only run at the speed of the datasource - less than 100,000 rows/sec.

    I designed and implemented a set of packages for ETL, processing in parallel which I call the Modulo Shredder. You can read it about it here[/url]. It ran inserting rows into a SQL Heap table at over 1,000,000 rows/second.

    This it did by reading 1/Nth of the files on each thread, using a modulo algorithm to skip files of no interest to the current thread.

  • Is there a particular reason why you did not just use a random number generator to select ProcessNumber? For example:

    select case when rand() < 0.5 then 0 else 1 end as ProcessNumber

  • Hi Badatthis,

    This is a direct load into a database, not a data staging approach - it just loads faster due to parallel loading.

    All the best,

    Adam

  • Hi iposner,

    You are absolutely right, the source is a bottleneck.

    Watch out for an upcoming article where I show some solutions to handle this for both database and file data sources.

    Regards,

    Adam

  • Hi John,

    I am used to using modulo as it is nicely deterministic and easy to extend to several parallel processing IDs.

    But anything is good if it works!

    Adam

  • Hi,

    you can use also the Balanced Data Distributor, see https://www.microsoft.com/en-us/download/details.aspx?id=4123

    regards,

    Alex

  • Hi Alexander,

    Absolutely right - in fact I described this (really cool) tool in the previous article referred to at the start of this one.

    All the best,

    Adam

  • If the goal is to Load balance then BDD is better option. This is Microsoft component and easily available and it will divide into as many parallel flow as you want.

Viewing 10 posts - 1 through 9 (of 9 total)

You must be logged in to reply to this topic. Login to reply