SSIS 2008 - Scalability of Sort & Aggregation Transforms

  • I'm trying to develop an SSIS package to populate a SQL Server 2008 data warehouse from several MySQL sources.

    The size of the data being read from MySQL is in excess of 100 million rows.

    Questions:

    1) Is the supplied SSIS Aggregate transform implemented as a memory-only function, so that when running in 32-bit SSIS I could exceed the 2 GB process limit?

    2) Is there an alternative that uses non-memory resources to perform aggregation operations?

    3) Would I be better off writing an asynchronous Script Component transformation that BCPs the aggregation keys into a SQL Server temp table, has SQL Server do the aggregation work, and then reads the result set and sends it to the script's output data flow? (A sketch of the server-side half of this idea follows below.)

    Any thoughts appreciated.
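
    For what it's worth, a minimal sketch of the server-side half of option 3, assuming hypothetical table and column names (the real keys and the BCP plumbing would differ):

        -- Hypothetical staging table for the aggregation keys
        -- bulk-copied in from the script component via BCP.
        CREATE TABLE #AggKeys
        (
            DimKey1 INT NOT NULL,
            DimKey2 INT NOT NULL
        );

        -- ...rows arrive here via BCP / bulk insert...

        -- Let the SQL Server engine do the aggregation instead of the
        -- in-memory Aggregate transform; the script component would
        -- then read this result set back into its output data flow.
        SELECT DimKey1, DimKey2, COUNT(*) AS RowCnt
        FROM #AggKeys
        GROUP BY DimKey1, DimKey2;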

  • My first question is: why wouldn't you do your sorting and aggregation on the database engine side? I know it would be much faster in SQL.

    1. As far as your in-memory question goes, I *believe* that it can and will use temp disk space.

    2. Not that I know of.

    3. You could do this, but having the source do the work up front, before SSIS even sees the records, is really the best option (see the sketch after this list).
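
    As a rough illustration of what I mean (the table and column names are made up), pushing the aggregation into the MySQL source query means the data flow only ever receives the already-reduced rows:

        -- Hypothetical MySQL source query: the GROUP BY runs on the
        -- MySQL engine, so SSIS receives one row per key combination
        -- instead of 100M+ detail rows.
        SELECT dim_key_1,
               dim_key_2,
               COUNT(*)    AS row_cnt,
               SUM(amount) AS total_amount
        FROM fact_source
        GROUP BY dim_key_1, dim_key_2;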

    CEWII

  • Thank you Elliot.

    The other things that I have to consider are:

    1) The load that I place on the MySQL system

    2) The aggregation is being used as a distinct operator within dimension analysis, so there would be many of these going on in parallel

  • jm99 (8/31/2010)

    Thank you Elliot.

    The other things that I have to consider are:

    1) The load that I place on the MySQL system

    2) The aggregation is being used as a distinct operator within dimension analysis, so there would be many of these going on in parallel

    My first thoughts are simple.

    Yes, the load on your MySQL system is a valid concern; however, the database engine is by far the place best suited for the work you need. You can do what you are talking about on the SSIS side, *BUT* you are going to take quite a performance hit that is not needed. As far as aggregations in parallel go, I think both the Sort and Aggregate components are blocking components, so each one has to consume its entire input before it produces any output rows.

    You talked about 100M rows of data; how big are we talking in GB? It might just be easier to take a complete copy on the SQL Server side and periodically pick up deltas. 100M rows at an average of 100 bytes/row would be about 10 GB (excluding indexes), and the delta approach would be the cheapest operation for MySQL. But it would depend a lot on the data. (A rough sketch of a delta pickup follows.)
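
    Just to make the delta idea concrete, a minimal sketch, assuming the source rows carry a last-modified timestamp (the table, column, and variable names are hypothetical):

        -- Hypothetical incremental pull: only rows changed since the
        -- last successful load. @last_load_time would come from a
        -- load-control table maintained in the warehouse.
        SELECT *
        FROM fact_source
        WHERE last_modified > @last_load_time;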

    CEWII

  • The MySQL database is part of a real-time data monitoring system, so the load we place on it for extraction to a data warehouse cannot be great.

    However, part of the ETL process is to both translate the MySQL FK references into the DW references and apply some intelligence to filling holes in the data.

    My initial thought was to read from MySQL and then multicast each record to separate dimension analysis flows. This worked fine until we scaled up and one dimension reached a ridiculous number of keys - BANG!

    I've taken you up on the "do it on the MySQL side" challenge, and it appears that we might be able to get away with the main select and the distinct dimension queries running concurrently, with the dimension queries completing before the main extraction (something like the sketch below).
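
    For the record, each per-dimension query is just a distinct-key pull along these lines (names changed; one such query runs per dimension):

        -- Hypothetical dimension-key query pushed down to MySQL:
        -- returns only the distinct keys, so the heavy DISTINCT work
        -- happens in the engine rather than in an SSIS Aggregate.
        SELECT DISTINCT dim_key_1
        FROM fact_source;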

    Thanks for making me challenge the assumptions!

  • You are very welcome. As I have said many times, there are usually a lot of good ways to accomplish any task.

    CEWII
