Incrementally loading a very large Dimension table - Performance problems

  • Scenario:  I have a very large type-1 dimension table (50+ million rows and 15+ GB with page compression) in a SQL Server 2016 SP1 64-bit Enterprise Edition database, loaded from an SAP data source using Theobald Xtractor.  I am required to perform both INSERTs and UPDATEs on this dimension daily, which I do by taking the highest timestamp in the SQL Server table and then pulling all records from the source system with a timestamp greater than or equal to it.  After that I use a MERGE statement to identify which records need to be inserted and which updated.  All of this is done in SSIS, with the MERGE statement run from an Execute SQL task (a rough sketch of the current pattern is at the end of this post).

    Restrictions:  The table must remain compressed and I am limited to SQL Server and SSIS components (there are no 3rd party components with the exception of Theobald Xtractor).

    Problem:  The MERGE statement runs for nearly five hours.  The table has minimal fragmentation (less than 1%) and, other than the clustered index from the primary key (which is used in the MERGE), there are no other indexes.  All other package components (getting the incremental data, finding the highest timestamp, logging, etc.) complete within minutes, but it's the MERGE that's killing the performance.

    My question is:  Is there a better design pattern to perform an incremental load on such a big dimension table without resorting to MERGE?  In my experience MERGE isn't the greatest tool when it comes to big data loads.  The table is not partitioned because I have to use both INSERTs and UPDATEs, but if partitioning can help with UPDATEs please let me know how.  I also recall a Kimball design pattern in which extra tables are used to perform an incremental load, but I can't remember the details.
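    For reference, here is a rough sketch of the pattern I use today (the table, column, and data-type names below are made up for illustration):

    DECLARE @LastLoaded datetime2(3);

    -- highest timestamp already present in the dimension
    SELECT @LastLoaded = MAX(LastChangedOn)
    FROM dbo.DimCustomer;

    -- the SSIS data flow stages every source record with a timestamp >= @LastLoaded
    -- into dbo.stg_DimCustomer, then the Execute SQL task runs something like:
    MERGE dbo.DimCustomer AS tgt
    USING dbo.stg_DimCustomer AS src
        ON tgt.CustomerKey = src.CustomerKey
    WHEN MATCHED THEN
        UPDATE SET tgt.Name          = src.Name,
                   tgt.Region        = src.Region,
                   tgt.LastChangedOn = src.LastChangedOn
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (CustomerKey, Name, Region, LastChangedOn)
        VALUES (src.CustomerKey, src.Name, src.Region, src.LastChangedOn);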

  • Yes, there are a few things you can try.

    I'd get rid of the Merge first, splitting inserts and updates into separate statements. It's clearly the Updates that are slowing things down here, so I would try to split the updates into multiple parallel updates based on some column you could use for "partitioning" (using table locks to avoid a deadlock).

    Physical partitioning will also help you here...the age old adage of breaking the monster into smaller pieces. If those physical partitions align with your update "partitions" it'll be even better.

    If you're going down the road of physical partitioning, another option would also be to swap partitions in and out while performing the updates.

  • TheComedian wrote:

    The table must remain compressed


    That is very likely your biggest problem.  If you're updating pages and the updates are "ExpAnsive" (the rows get bigger), then you run into rampant page splits.  Compression makes the problem much worse because each page has to be decompressed, split, and then both resulting pages recompressed, and that overhead would be a problem even if the rows weren't expanding.


    It says right in the documentation for compression that it's not well suited for heavy inserts/updates especially if they're occurring in the logical middle of the index.


    Also, even without compression, INSERTs into the logical middle of an index can cause rampant page splits regardless of FILL FACTOR (with only one exception, INSERTs try to fill each page to 100% and are compelled to go into the page dictated by logical key order) and, of course, "ExpAnsive" UPDATEs will cause similar problems with page splits.


    According to some nice, simple experiments that Paul Randal did, "bad" page splits can cause 4.5 to 43 times more log file activity and it's all done in a system transaction that can and will cause lots of extended blocking.
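    If you want to see whether that's actually happening on your table, here's a rough sketch (the table name is just a placeholder) using the index operational stats DMV; leaf_allocation_count includes the new pages created by page splits:

    -- new-page / page-split activity accumulated since the last restart
    SELECT  i.name AS index_name,
            os.partition_number,
            os.leaf_insert_count,
            os.leaf_update_count,
            os.leaf_allocation_count   -- leaf page allocations, which include page splits
    FROM    sys.dm_db_index_operational_stats(DB_ID(), OBJECT_ID('dbo.DimCustomer'), NULL, NULL) AS os
    JOIN    sys.indexes AS i
            ON  i.object_id = os.object_id
            AND i.index_id  = os.index_id;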


    I also agree that MERGE sucks so bad that it has its own gravitational field and would split things up using INSERT and UPDATE as previously mentioned.  I'd also use a staging table for the new data and "pre-validate" and mark the rows in the staging table for whether they will be used as INSERTs or UPDATEs.
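    The "mark the rows" step could be as simple as this; treat it as a sketch only, with dbo.stg_DimCustomer, CustomerKey, and the RowAction flag all being made-up names:

    -- assumes the staging table already has a RowAction char(1) column
    UPDATE stg
    SET    stg.RowAction = CASE WHEN tgt.CustomerKey IS NULL THEN 'I' ELSE 'U' END
    FROM   dbo.stg_DimCustomer AS stg
    LEFT JOIN dbo.DimCustomer  AS tgt
           ON tgt.CustomerKey = stg.CustomerKey;

    -- the load then becomes:  INSERT ... WHERE RowAction = 'I'  and  UPDATE ... WHERE RowAction = 'U'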


    As a bit of a sidebar, you also have to be careful not to write an "illegal" update.  The target table of the UPDATE absolutely MUST be in the FROM clause if any joins are involved, or you end up with a "Halloweening" situation that you cannot predict.  It can take a 2-second, single-CPU UPDATE and turn it into a multi-CPU-consuming monster that takes hours instead of seconds to execute.  I've seen it and repaired it many times.
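    In other words, when a join is involved, the UPDATE should be written against the aliased target in the FROM clause, something like this (names made up):

    UPDATE tgt
    SET    tgt.Name   = src.Name,
           tgt.Region = src.Region
    FROM   dbo.DimCustomer     AS tgt        -- target table present in the FROM clause
    JOIN   dbo.stg_DimCustomer AS src
           ON src.CustomerKey = tgt.CustomerKey;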

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Another possibility along with the known problems of compression and page splits is that you may have a shedload of indexes that are all doing the same deadly dance.

    --Jeff Moden

  • Change the MERGE to be an UPDATE followed by an INSERT. This will probably give you a 40% performance improvement.

    Check if the merge is updating any indexes, as this could slow it down. Also, check if adding an index could help with the performance.

    I've found using compression really slows down some queries, so I'd certainly experiment with the compression removed to see what difference it makes.
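    If you do experiment, something along these lines (the table name is a placeholder, and I'd only do the rebuild on a test copy given the compression restriction) will estimate the size change and then strip the page compression:

    -- estimate how big the table would be without compression
    EXEC sys.sp_estimate_data_compression_savings
         @schema_name      = 'dbo',
         @object_name      = 'DimCustomer',
         @index_id         = NULL,
         @partition_number = NULL,
         @data_compression = 'NONE';

    -- remove page compression (test copy only)
    ALTER TABLE dbo.DimCustomer REBUILD WITH (DATA_COMPRESSION = NONE);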

  • Thanks for the replies, everybody. Here are my follow-up questions.

    Martin Schoombee wrote:

    so I would try to split the updates into multiple parallel updates based on some column you could use for "partitioning" (using table locks to avoid a deadlock). Physical partitioning will also help you here...the age old adage of breaking the monster into smaller pieces. If those physical partitions align with your update "partitions" it'll be even better. If you're going down the road of physical partitioning, another option would also be to swap partitions in and out while performing the updates.

    How would you implement this?  Or is there a link with an example?  The way I implement partitioning for INSERTs is that I insert everything into an empty table (after setting up a partition function and scheme) with the same columns, indexes, etc. as the main partitioned table, and then I simply switch that partition into the main partitioned table.  I don't see how this would work with UPDATEs, unless you're thinking of deleting the old record and inserting the updated one, or did I misunderstand?
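    For context, my current INSERT-side switch is roughly this (heavily simplified, with made-up names; it only works because the staging table sits on the same partition scheme and the target partition is empty):

    -- load the new rows into an identically structured, partition-aligned staging table
    INSERT INTO dbo.DimCustomer_Switch (CustomerKey, Name, Region, LastChangedOn)
    SELECT CustomerKey, Name, Region, LastChangedOn
    FROM   dbo.stg_DimCustomer;

    -- 42 is a placeholder for whichever partition the new rows land in
    ALTER TABLE dbo.DimCustomer_Switch
        SWITCH PARTITION 42 TO dbo.DimCustomer PARTITION 42;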

    Jeff Moden wrote:

    I also agree that MERGE sucks so bad that it has its own gravitational field and would split things up using INSERT and UPDATE as previously mentioned.  I'd also use a staging table for the new data and "pre-validate" and mark the rows in the staging table for whether they will be used as INSERTs or UPDATEs.

    So you would suggest the following:

    • INSERT all delta records from the source table into the staging table.
    • Mark the rows in the staging table as INSERT or UPDATE by scanning through the main table and seeing if the primary key matches.
    • INSERT all the rows to be inserted.
    • UPDATE all the rows to be updated.

    Isn't step 2 redundant here?  Can't you just determine which rows are to be inserted/updated by a JOIN during the insert/update?  Or am I missing something?


  • Martin Schoombee wrote:

    so I would try to split the updates into multiple parallel updates based on some column you could use for "partitioning" (using table locks to avoid a deadlock). Physical partitioning will also help you here...the age old adage of breaking the monster into smaller pieces. If those physical partitions align with your update "partitions" it'll be even better. If you're going down the road of physical partitioning, another option would also be to swap partitions in and out while performing the updates.

    Suppose you have an attribute "DivisionCode" with a limited number of values and fairly even distribution, you would be able to create separate update statements for each value (or subset of values) of DivisionCode. If this attribute was indexed or the table was partitioned by this attribute (and indexes partition aligned), your parallel updates would benefit. You would have to use the TABLOCK hint to avoid deadlocks, but I've seen dramatic improvements with update statements on large sets of data when using this method. In my case I had a parameterized stored procedure that did more than just the update, but it's the same theory and can definitely improve performance if your updates aren't causing page splits as Jeff mentioned.

    In my opinion it is definitely something worth trying. I'm not suggesting that you try to swap partitions in and out for the updates, but parallel updates using an attribute that's already part of the partition scheme could help improve performance.
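    To make that concrete, each parallel branch (an Execute SQL Task per DivisionCode value or range of values) would run something like this sketch; DivisionCode, the staging table, and the other names are all just examples:

    -- one of N parallel update branches, each responsible for its own slice
    UPDATE tgt
    SET    tgt.Name          = src.Name,
           tgt.Region        = src.Region,
           tgt.LastChangedOn = src.LastChangedOn
    FROM   dbo.DimCustomer     AS tgt WITH (TABLOCK)   -- table lock per the note above
    JOIN   dbo.stg_DimCustomer AS src
           ON src.CustomerKey = tgt.CustomerKey
    WHERE  tgt.DivisionCode = 'D01';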

  • TheComedian wrote:

    Can't you just determine which rows are to be inserted/updated by a JOIN during the insert/update?  Or am I missing something?  

    I think you want an UPDATE followed by an INSERT, something like this:

    UPDATE tgt
    SET    tgt.Col1 = src.Col1
    FROM   myDimensionTable AS tgt
    INNER JOIN myStagingTable AS src
           ON src.[Key] = tgt.[Key];


    INSERT INTO myDimensionTable
    (
        Col1,
        Col2,
        ...
    )
    SELECT src.Col1,
           src.Col2,
           ...
    FROM   myStagingTable AS src
    WHERE  NOT EXISTS (SELECT *
                       FROM myDimensionTable AS tgt
                       WHERE tgt.[Key] = src.[Key]);


  • My 2 cents here, which are similar to the above, plus some more.

    Approach 1:  Follow an INSERT-only model.

    Insert all incoming rows, with a flag marking the latest versions as the current ones.

    Scan the table and, if any of the "newly inserted" records existed before, mark the older versions as inactive.

    The disadvantage is that, during the insertions, there could be 2 versions of the same record (1 old and 1 newly inserted).  This can be handled by an ACTIVE/VALID flag, but the reporting application has to know to use this flag.

    You may also have to have a separate step to purge the OLD/INACTIVE records from the table as a maintenance job.
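    A rough sketch of Approach 1, with the IsActive flag and all names invented for illustration:

    -- 1) land every incoming delta row as a new, active version
    INSERT INTO dbo.DimCustomer (CustomerKey, Name, Region, LastChangedOn, IsActive)
    SELECT CustomerKey, Name, Region, LastChangedOn, 1
    FROM   dbo.stg_DimCustomer;

    -- 2) flag the older versions of any key that just arrived again
    UPDATE old
    SET    old.IsActive = 0
    FROM   dbo.DimCustomer     AS old
    JOIN   dbo.stg_DimCustomer AS stg
           ON stg.CustomerKey = old.CustomerKey
    WHERE  old.LastChangedOn < stg.LastChangedOn;

    -- 3) a maintenance job later purges the inactive versions
    -- DELETE FROM dbo.DimCustomer WHERE IsActive = 0;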


    Approach 2:  Use a TEMP table.

    Similar to the discussion above, do a lookup on the TARGET table and set a flag in the ETL to determine whether the record is an INSERT or an UPDATE.  INSERT the new records into the main table, but ROUTE the UPDATE records to a STAGE table.

    UPDATE the main table from this STAGE table using a batched, looped UPDATE procedure.

    If you do not have a key readily available for comparison, a HASH key might help, subject to the table.
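    The batched, looped UPDATE could look roughly like this (the batch size, names, and the "newer timestamp" filter are all assumptions):

    DECLARE @BatchSize int = 50000,
            @Rows      int = 1;

    WHILE @Rows > 0
    BEGIN
        -- update in chunks so each transaction (and its log usage) stays small
        UPDATE TOP (@BatchSize) tgt
        SET    tgt.Name          = src.Name,
               tgt.Region        = src.Region,
               tgt.LastChangedOn = src.LastChangedOn
        FROM   dbo.DimCustomer     AS tgt
        JOIN   dbo.stg_DimCustomer AS src
               ON src.CustomerKey = tgt.CustomerKey
        WHERE  tgt.LastChangedOn < src.LastChangedOn;  -- rows not yet updated

        SET @Rows = @@ROWCOUNT;
    END;

    -- if there is no natural key, HASHBYTES('SHA2_256', ...) over the business columns
    -- could serve as the comparison key instead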


    Approach 3:  SPLIT the tables.

    How frequently do you receive the updates?  Would you be able to split the table, for example into the ORIGINAL table and a NEW RECORD table?  You could then sync the NEW RECORD table to the ORIGINAL table as a weekly/EOD job.  A view on these 2 tables can give the complete set, except you have to get the UPDATED records from the NEW RECORD table.
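    The view over the two tables might be something like this sketch (names invented), with the NEW RECORD table taking precedence when the same key exists in both:

    CREATE VIEW dbo.vw_DimCustomer
    AS
    SELECT CustomerKey, Name, Region, LastChangedOn
    FROM   dbo.DimCustomer_New
    UNION ALL
    SELECT o.CustomerKey, o.Name, o.Region, o.LastChangedOn
    FROM   dbo.DimCustomer_Original AS o
    WHERE  NOT EXISTS (SELECT 1
                       FROM dbo.DimCustomer_New AS n
                       WHERE n.CustomerKey = o.CustomerKey);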

