SSIS Lookup transform - populating the cache is done with one CPU core

  • Hi all,

    I am currently building the ETLs for my Data Vault Business DWH. I have huge transactional data volumes, i.e. each transaction is saved as a record in the corresponding hub. As we have several operational source systems, the hub is sourced from each of them.

    I have implemented the hub load pattern in SSIS, i.e. I am using a Lookup transform to check whether the sourced record is already present in my hub. I am using full cache mode to avoid an index seek/scan on such large data, as I have 768 GByte of RAM available on my machine (shared with SQL Server 2016, though) and the transaction hub contains about 750 million records; I would assume that an in-memory lookup performs much better. Monthly data feeds bring about 50 million new records, so I have to check against all 750+ million existing records whether each of the new 50 million records should be put into the hub.

    Starting the package, I notice that loading the cache takes an eternity: looking at the activity monitor I can see that only 1 of my 48 CPUs is used to load the cache and the I/O is about 1 MByte/s.

    Is there a faster way to populate the cache?

    Do you have a better load pattern in mind?

    Thanks for your input :-).

     

    Bye,

    Sven

  • You could try to partition the import and load the cache in parallel, but I have never done that and don't know if that will really eliminate the problem.

    To be honest, I would just stage the data and then use some indexes on your staging table to help with this. I think SQL Server will deal with this a bit more efficiently than SSIS. You could still partition the import (into staging) part of the process to get those 750 million rows in as quickly as possible.

  • I did it first in pure SQL, using a left join with a WHERE ... IS NULL condition to keep only the new records. For some reason the cardinality estimator was not able to predict the expected rows and always gave me an estimate of 1. Because of that, even though my target table is a heap and I am using INSERT INTO ... WITH (TABLOCK), all of my inserts are logged at row level and not minimally as expected. Thus, I was looking for a bulk insert solution, which SSIS delivers with its fast load option.
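
    Roughly, what I tried looks like this (just a sketch; dbo.HubTransaction, dbo.StgTransaction and the column names are placeholders for my actual objects):

        -- Sketch of the pure-SQL hub load I tried (placeholder object names).
        -- Target is a heap, so TABLOCK should allow minimal logging
        -- (under SIMPLE or BULK_LOGGED recovery).
        INSERT INTO dbo.HubTransaction WITH (TABLOCK)
               (HashKey, BusinessKey, LoadDate, SourceSystem)
        SELECT s.HashKey, s.BusinessKey, s.LoadDate, s.SourceSystem
        FROM   dbo.StgTransaction AS s
        LEFT JOIN dbo.HubTransaction AS h
               ON h.HashKey = s.HashKey
        WHERE  h.HashKey IS NULL;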

    Going with your idea: my problem is that my hub table has a hash key (binary(16)) and a char business key. I could partition on source system, but that would give me only 4 partitions. I cannot partition on load date, as my initial load has the 750 million rows and all subsequent loads only about 50 million.

    What I need is hash partitioning like in Oracle.
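
    In Oracle I would simply write something like this (just to illustrate what I mean; placeholder names, and this syntax does not exist in SQL Server):

        -- Oracle-style hash partitioning (illustration only, placeholder names).
        CREATE TABLE hub_transaction
        (
            hash_key      RAW(16)      NOT NULL,
            business_key  CHAR(32)     NOT NULL,
            load_date     DATE         NOT NULL,
            source_system VARCHAR2(20) NOT NULL
        )
        PARTITION BY HASH (hash_key) PARTITIONS 48;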

    Do you have any idea how to work around that?

  • Just to make it clear: staging brings only 50 million records; the lookup is done against the 750+ million records. So partitioning the staged data only makes the 50 million part smaller.

  • From a partitioning point of view, just think of a way in which you can structure your source query to get a somewhat equal distribution. Dates work well for this, because you can write a query that returns data for a given year (or multiple years, depending on how many partitions you'd like to have). We're talking about logical partitioning of your source queries here, not physical partitioning, although that could also help with performance.
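
    For example, something along these lines (just a sketch with placeholder names and dates), with one such query per parallel source:

        -- Sketch: one source query per year slice (placeholder names/dates).
        SELECT HashKey, BusinessKey, LoadDate, SourceSystem
        FROM   dbo.SourceTransaction
        WHERE  LoadDate >= '2015-01-01' AND LoadDate < '2016-01-01';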

  • Ok gotcha, so it's really only the lookup for existence that's an issue.

    How about assigning surrogate keys in your destination table, and then building either filtered indexes or partitioning the table based on the surrogate key value?
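
    As a sketch (placeholder names, ranges and sizes): an identity surrogate on the hub plus a filtered index per surrogate-key range, so each parallel cache source reads its own slice:

        -- Sketch only: identity surrogate plus filtered indexes over fixed
        -- surrogate-key ranges (placeholder names and boundaries).
        ALTER TABLE dbo.HubTransaction
            ADD HubSeq bigint IDENTITY(1, 1) NOT NULL;

        CREATE NONCLUSTERED INDEX IX_HubTransaction_Slice01
            ON dbo.HubTransaction (HashKey)
            INCLUDE (BusinessKey, HubSeq)
            WHERE HubSeq >= 1 AND HubSeq < 100000000;

        -- Each SSIS cache source then reads one slice:
        -- SELECT HashKey, BusinessKey
        -- FROM   dbo.HubTransaction
        -- WHERE  HubSeq >= 1 AND HubSeq < 100000000;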

    So much for your data vault implementation 😉

  • Here is something I have in mind (just thinking)

    I could extend the current hub structure

    • Hash Key
    • Business Key
    • Load Date
    • Source System

    with a new, purely technical column running from 1 to 48 (the number of CPU cores). I would just use a ROW_NUMBER() function during hub loading and physically and logically partition my hub into 48 equal parts. That way I split the 750 million records into about 15 million records per partition, and I could use 48 sources to load the cache in parallel (using a cache connection manager).
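
    Roughly like this (just a sketch; placeholder names, 48 hard-coded):

        -- Sketch of the technical bucket column idea (placeholder names).
        -- Assign each new hub row to one of 48 buckets during the hub load ...
        INSERT INTO dbo.HubTransaction WITH (TABLOCK)
               (HashKey, BusinessKey, LoadDate, SourceSystem, CacheBucket)
        SELECT s.HashKey, s.BusinessKey, s.LoadDate, s.SourceSystem,
               (ROW_NUMBER() OVER (ORDER BY s.HashKey) - 1) % 48 + 1
        FROM   dbo.StgTransaction AS s
        LEFT JOIN dbo.HubTransaction AS h ON h.HashKey = s.HashKey
        WHERE  h.HashKey IS NULL;

        -- ... and give each of the 48 parallel cache sources its own slice:
        -- SELECT HashKey, BusinessKey FROM dbo.HubTransaction WHERE CacheBucket = 1;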

    The downside of such an approach is that the hardware dictates the number of sources I have to define. On dev it's only 4, on integration it's 24 and on production 48.

    This could work, but it is not a nice solution.

  • BIML could be your friend in this case, to generate all of these packages and save your sanity.

  • Side remark: I do not understand why Microsoft is not able to build source and destination components that can handle parallel threads. It is bad design to have to change the package workflow and content just because of the hardware of the machine it runs on.

  • I had never heard of BIML before; I just googled it. Can you recommend any free tool to generate the dtsx packages? A command-line tool would work, too.

  • BIML is that free tool. Use BIML Express in Visual Studio.

  • Hi, thanks, I will look into that.

    Last question: is this really the intended way to solve such issues, i.e. using a metadata tool to generate objects that would otherwise be too painful to create by hand? I also noticed that I cannot define a script function globally and call it from several places, so whenever I change the code I have to update all copies to keep them in sync. BIML would help me there, too.

    Just curious 🙂

  • To an extent, yes. Some tools, as good as they are, require a lot of monotonous and repetitive work, and we're now seeing tools in some of those cases that help standardize and prevent errors.

    BIML has been a great addition for me, and PowerShell too. I've created a process whereby I capture the metadata in Excel, and use BIML to generate all my staging packages (and table scripts). Takes a bit of time to set up initially, but takes away the monotony of making changes to individual packages and ultimately saves a ton of time.

    You've just found a good reason to learn a new technology 🙂

  • Could you please explain in more detail the layout of your DV with regard to this transaction data (hubs, links and sats)? Having their DDL, including indexes, would help.

    From my previous experience with this, you should be able to do it better in SQL alone, and your loads should be fast enough and can be processed in parallel. You can certainly create partitions based on date, although how they are set up depends on how your 50 million new records are split across dates.

    Using SSIS with lookups is really not a good option for these volumes. It would most likely be a lot faster to load into a staging table first and then use left outer joins to extract the new hub BKs to load into the hub. This could still be done in SSIS, just to take advantage of the bulk load in smaller batches than the 50 million in one go.
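
    For the batching part, a pure T-SQL variant could look roughly like this (just a sketch; placeholder names, arbitrary batch size) - with SSIS you would instead rely on the destination's bulk load settings:

        -- Sketch: insert new business keys into the hub in batches
        -- (placeholder names; batch size is arbitrary).
        DECLARE @rows int = 1;
        WHILE @rows > 0
        BEGIN
            INSERT INTO dbo.HubTransaction WITH (TABLOCK)
                   (HashKey, BusinessKey, LoadDate, SourceSystem)
            SELECT TOP (5000000)
                   s.HashKey, s.BusinessKey, s.LoadDate, s.SourceSystem
            FROM   dbo.StgTransaction AS s
            LEFT JOIN dbo.HubTransaction AS h ON h.HashKey = s.HashKey
            WHERE  h.HashKey IS NULL;

            SET @rows = @@ROWCOUNT;
        END;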

  • I am currently working on getting you the exact DDL. I will come back on Monday, if that is ok with you.
