Merge from Heap to Clustered Index Table.... Improvements?

  • Hello hello!

    • The source will have approx 10 million rows on each daily run: 5% fresh data (inserts), 30% changed data (updates & deletes), and 65% stale data (no action).
    • Target will have 30 to 60 million rows, depending on the configured retention time for that table (old records eventually fall off from both source and target).
    • Wide rows, lots of fields... the code below is just a simulation. 
    • There is some data integrity stuff happening on the heap, hence the [DM_OK_TO_PROCESS] flag.

    Tried various index configurations; the setup below has performed the best so far.

    Query plan is attached. 

    Any ideas how to further improve this?

    Much appreciated, this forum is awesome!
    DN

    These are the stats when I run the merge with dummy data:

    Table 'tbl_Perm_SubjectXYZ'. Scan count 1, logical reads 49945878, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'tbl_Staging_Current_SubjectXYZ'. Scan count 1, logical reads 5027782, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    (3007357 rows affected)


    The merge took 3 min on a DEV server. These are the table sizes after the merge, with simulated 30% row changes and 5% key changes in the source data:
    Source table: 10795993 records; ~ 7GB
    Target table:   18406151 records; ~ 12GB

    MERGE [dbo].[tbl_Perm_SubjectXYZ] WITH(TABLOCK) AS T
    USING
    (
        SELECT *
        FROM [dbo].[tbl_Staging_Current_SubjectXYZ]
        WHERE [DM_OK_TO_PROCESS] = 1 --flag gets set during some initial integrity checks...
    ) AS S
    ON (T.DM_KEY_HASH = S.DM_KEY_HASH)
    -- source table is heap with unique non-clustered index on DM_KEY_HASH,
    --      and includes DM_ROW_HASH and DM_OK_TO_PROCESS
    -- target table has unique clustered index on DM_KEY_HASH
      WHEN MATCHED AND (T.DM_ROW_HASH <> S.DM_ROW_HASH --row data changed in source
           OR T.DM_RECORD_STATUS = 'D') --record re-appeared in source
      THEN UPDATE SET
           T.[DM_TENANT] = 'Tenant1'
          , T.[FieldFromFile1] = S.[FieldFromFile1]
          , T.[FieldFromFile2] = S.[FieldFromFile2]
          , T.[FieldFromFile3] = S.[FieldFromFile3]
          , T.[FieldFromFile4] = S.[FieldFromFile4]
          , T.[FieldFromFile5] = S.[FieldFromFile5]
          , T.[DM_KEY_HASH] = S.[DM_KEY_HASH]
          , T.[DM_ROW_HASH] = S.[DM_ROW_HASH]
          , T.[DM_STAGE_DATE] = S.[DM_STAGE_DATE]
          , T.[DM_TXT_FILE_NAME] = S.[DM_TXT_FILE_NAME]
          , T.[DM_TXT_FILE_DATE] = S.[DM_TXT_FILE_DATE]
          , T.[DM_ZIP_FILE_NAME] = S.[DM_ZIP_FILE_NAME]
          , T.[DM_ZIP_FILE_ARCHIVE] = S.[DM_ZIP_FILE_ARCHIVE]
          , T.[DM_ZIP_FILE_DATE] = S.[DM_ZIP_FILE_DATE]
          , T.[DM_RECORD_STATUS] = 'U'
          , T.[DM_MERGE_DATE] = GETDATE()
          , T.[DM_INSERT_DATE] = GETDATE()
          , T.[DM_UPDATE_DATE] = GETDATE()
          , T.[DM_DELETE_DATE] = NULL
          , T.[DM_RUN_UID] = '121212'
          , T.[DM_USER_ID] = ORIGINAL_LOGIN()
          , T.[DM_SOFT_DELETE] = 'N'
      WHEN NOT MATCHED BY TARGET
      THEN --new record in source
      INSERT([DM_TENANT]
        , [FieldFromFile1]
        , [FieldFromFile2]
        , [FieldFromFile3]
        , [FieldFromFile4]
        , [FieldFromFile5]
        , [DM_KEY_HASH]
        , [DM_ROW_HASH]
        , [DM_STAGE_DATE]
        , [DM_TXT_FILE_NAME]
        , [DM_TXT_FILE_DATE]
        , [DM_ZIP_FILE_NAME]
        , [DM_ZIP_FILE_ARCHIVE]
        , [DM_ZIP_FILE_DATE]
        , [DM_RECORD_STATUS]
        , [DM_MERGE_DATE]
        , [DM_INSERT_DATE]
        , [DM_UPDATE_DATE]
        , [DM_DELETE_DATE]
        , [DM_RUN_UID]
        , [DM_USER_ID]
        , [DM_SOFT_DELETE])
      VALUES
    ('Tenant1'
    , S.FieldFromFile1
    , S.FieldFromFile2
    , S.FieldFromFile3
    , S.FieldFromFile4
    , S.FieldFromFile5
    , S.DM_KEY_HASH
    , S.DM_ROW_HASH
    , S.DM_STAGE_DATE
    , S.DM_TXT_FILE_NAME
    , S.DM_TXT_FILE_DATE
    , S.DM_ZIP_FILE_NAME
    , S.DM_ZIP_FILE_ARCHIVE
    , S.DM_ZIP_FILE_DATE
    , 'I'
    , GETDATE()
    , GETDATE()
    , GETDATE()
    , NULL
    , '121212'
    , ORIGINAL_LOGIN()
    , 'N'
    )
      WHEN NOT MATCHED BY SOURCE AND T.DM_RECORD_STATUS <> 'D'
    --not in source anymore, and hasn't been flagged as delete yet in target
      THEN UPDATE SET
           T.DM_RECORD_STATUS = 'D'
          , T.[DM_DELETE_DATE] = GETDATE()
           -- , T.[DM_UPDATE_DATE] = GETDATE()
          , T.[DM_MERGE_DATE] = GETDATE()
          , T.DM_RUN_UID = '121212'
    OPTION (RECOMPILE);

  • Plan not attached. -- saw it now.

    The main performance improvement on this type of process is using the hash compare as you do.

    Following are some possibilities; none is right or wrong. As always... it depends.
    Changing the heap to a clustered table will likely improve things: cluster on DM_OK_TO_PROCESS, DM_KEY_HASH.

    Sometimes, not always, doing the INSERT/UPDATE/DELETE separately is faster than MERGE; you would need to try it out.
    Also, because you are talking about wide rows, determining which rows to delete/update/insert into a temp table first is sometimes faster (one can use just PK and hash index processing before going to the data); a rough sketch of that follows.
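
    Using the table and column names from the original post (the #delta table and the action codes are invented for illustration):

    -- Resolve the action per key using only the narrow hash columns,
    -- so the wide rows are touched only once, in the follow-up statements.
    SELECT S.DM_KEY_HASH,
           CASE WHEN T.DM_KEY_HASH IS NULL THEN 'I' ELSE 'U' END AS DM_ACTION
    INTO #delta
    FROM dbo.tbl_Staging_Current_SubjectXYZ AS S
    LEFT JOIN dbo.tbl_Perm_SubjectXYZ AS T
           ON T.DM_KEY_HASH = S.DM_KEY_HASH
    WHERE S.DM_OK_TO_PROCESS = 1
      AND (T.DM_KEY_HASH IS NULL              -- new in source -> insert
           OR T.DM_ROW_HASH <> S.DM_ROW_HASH  -- row data changed -> update
           OR T.DM_RECORD_STATUS = 'D');      -- re-appeared in source -> update

    -- then join #delta back to the staging table for the wide UPDATE and INSERT,
    -- and run a separate UPDATE for the soft deletes (target keys no longer in source)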

  • frederico_fonseca - Wednesday, February 20, 2019 12:13 PM

    Plan not attached. -- saw it now.

    The main performance improvement on this type of process is using the hash compare as you do.

    Following are some possibilities; none is right or wrong. As always... it depends.
    Changing the heap to a clustered table will likely improve things: cluster on DM_OK_TO_PROCESS, DM_KEY_HASH.

    Sometimes, not always, doing the INSERT/UPDATE/DELETE separately is faster than MERGE; you would need to try it out.
    Also, because you are talking about wide rows, determining which rows to delete/update/insert into a temp table first is sometimes faster (one can use just PK and hash index processing before going to the data).

    Thanks Frederico!

    I tried converting the heap first (it originates from a super-fast bulk insert) and then doing the merge, but that led to a longer duration and more reads overall. 
    Lots of stale data.
    BTW, when I run this without changes in the source data, it finishes in 30 sec with:
    Table 'tbl_Perm_SubjectXYZ'. Scan count 1, logical reads 1517502, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'tbl_Staging_Current_SubjectXYZ'. Scan count 1, logical reads 5027782, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Also tried adding a filter for DM_OK_TO_PROCESS = 1 to the NC index on the heap, but that added another sort to the plan and more reads overall.
    I still included DM_OK_TO_PROCESS as a column as well, so stats would be created.
    Surprised me a little... I thought it would eliminate work if anything. The index looked roughly like the sketch at the end of this post.

    I used to have an extra "change tracking table" in other projects; I'll check that out.
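
    For reference, the filtered variant I tried looked roughly like this (index name invented for the sketch):

    -- unique NC index on the heap, filtered to the processable rows,
    -- still including the filter column so statistics get created on it
    CREATE UNIQUE NONCLUSTERED INDEX IX_Staging_KeyHash_Filtered
        ON dbo.tbl_Staging_Current_SubjectXYZ (DM_KEY_HASH)
        INCLUDE (DM_ROW_HASH, DM_OK_TO_PROCESS)
        WHERE DM_OK_TO_PROCESS = 1;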

  • Rather than hash key matching, consider implementing Change Tracking on the source table, and then leverage that to perform a differential merge.
    https://youtu.be/ZNUN9qejCfc
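
    A minimal sketch of enabling it, assuming the staging table could be persisted between loads (the retention value is illustrative, and the tracked table needs a primary key):

    -- enable Change Tracking at the database level, then per table
    ALTER DATABASE CURRENT
        SET CHANGE_TRACKING = ON (CHANGE_RETENTION = 2 DAYS, AUTO_CLEANUP = ON);

    ALTER TABLE dbo.tbl_Staging_Current_SubjectXYZ
        ENABLE CHANGE_TRACKING WITH (TRACK_COLUMNS_UPDATED = OFF);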

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

    You should also cluster the staging table on DM_KEY_HASH, assuming you can compute the hash key as the row is being INSERTed into the staging table.  SQL is having to do 5M lookups to avoid sorting the staging table, and then still sorts the data later, so you'd almost certainly be better off clustering the staging table as well.  But naturally check the new query plan and total I/O to be sure.
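
    For illustration, something like this (index name is assumed; UNIQUE matches the existing NC index described in the post):

    CREATE UNIQUE CLUSTERED INDEX CIX_Staging_KeyHash
        ON dbo.tbl_Staging_Current_SubjectXYZ (DM_KEY_HASH);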

    As noted by frederico, sometimes MERGE is not the fastest/best way to do things, but 3 minutes for this much data seems very good anyway, so you probably don't need to worry about that.

    SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

  • Eric M Russell - Wednesday, February 20, 2019 12:39 PM

    Rather than hash key matching, consider implementing Change Tracking on the source table, and then leverage that to perform a differential merge.
    https://youtu.be/ZNUN9qejCfc

    Thanks Eric! 
    The source table gets truncated and loaded from file each time.
    We have no control over the data sources/ generation of the files. 
    Would that still work?

  • ScottPletcher - Wednesday, February 20, 2019 12:42 PM

    You should also cluster the staging table on DM_KEY_HASH, assuming you can compute the hash key as the row is being INSERTed into the staging table.  SQL is having to do 5M lookups to avoid sorting the staging table, and then still sorts the data later, so you'd almost certainly be better off clustering the staging table as well.  But naturally check the new query plan and total I/O to be sure.

    As noted by frederico, sometimes MERGE is not the fastest/best way to do things, but 3 minutes for this much data seems very good anyway, so you probably don't need to worry about that.

    Thanks Scott.
    I thought about adding the hashing to the SSIS package which stages the files into the current staging table. 
    It will still be unsorted, and I need to do some de-duping based on the hashes, flag the duplicates, and report them back to the data source. 
    The synthetic key will comprise a few fields from the source data.
    I'll test this a little more... might be a good way to shave off time. Something like the sketch below.
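
    Roughly what I have in mind for the hash computation (a sketch only: the field split for the key is a placeholder, tbl_Raw_SubjectXYZ is a hypothetical pre-hash landing table, and CONCAT_WS needs SQL Server 2017+):

    SELECT
        HASHBYTES('SHA2_256',
            CONCAT_WS('|', FieldFromFile1, FieldFromFile2)) AS DM_KEY_HASH  -- synthetic key fields (placeholder choice)
      , HASHBYTES('SHA2_256',
            CONCAT_WS('|', FieldFromFile1, FieldFromFile2, FieldFromFile3,
                           FieldFromFile4, FieldFromFile5)) AS DM_ROW_HASH  -- full row, for change detection
    FROM dbo.tbl_Raw_SubjectXYZ; -- hypothetical pre-hash landing table

    -- de-duping could then flag duplicates with ROW_NUMBER() OVER (PARTITION BY DM_KEY_HASH ORDER BY ...)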

  • My assumption, from reading the SQL posted, is that the hash is computed as part of the load of tbl_Staging_Current_SubjectXYZ.

    So one of the approaches I was suggesting is:

    load HEAP tbl_Staging_Current_SubjectXYZ (ideally with the hash calculated on load by SSIS)

    perform validation

    create clustered index xxx on tbl_Staging_Current_SubjectXYZ
        (DM_OK_TO_PROCESS
        , DM_KEY_HASH)
        with (data_compression = page) -- optional: may improve or slow down, so it requires testing with full prod data

    merge ...

  • nslupi - Wednesday, February 20, 2019 12:43 PM

    Thanks Eric! 
    The source table gets truncated and loaded from file each time.
    We have no control over the data sources/ generation of the files. 
    Would that still work?

    Earlier you said:

    ... The source will have approx 10 million rows on each daily run: 5% fresh data (inserts), 30% changed data (updates & deletes), and 65% stale data (no action)...

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Eric M Russell - Wednesday, February 20, 2019 3:00 PM

    Earlier you said:

    ... The source will have approx 10 million rows on each daily run: 5% fresh data (inserts), 30% changed data (updates & deletes), and 65% stale data (no action)...

    Yeah, it comes in on files... sorry, should have described this better. 
    Basically a 5-year data window (full files each time) from the source. And no, they cannot switch to incrementals, etc.

  • It might be my lack of understanding of the MERGE statement; I've always found performance better when replacing a MERGE with UPDATEs followed by INSERTs. But at the top of the statement you have:
    USING
    (
        SELECT *
          FROM [dbo].[tbl_Staging_Current_SubjectXYZ]
        WHERE [DM_OK_TO_PROCESS] = 1 --flag gets set during some initial integrity checks...
    ) AS S
    ON (T.DM_KEY_HASH = S.DM_KEY_HASH)

    Then you have:
    WHEN MATCHED AND(T.DM_ROW_HASH <> S.DM_ROW_HASH --row data changed in source

    How can they be matched on DM_ROW_HASH and also have different DM_ROW_HASH?

  • nslupi - Wednesday, February 20, 2019 3:24 PM

    Yeah, it comes in on files... sorry, should have described this better. 
    Basically a 5-year data window (full files each time) from the source. And no, they cannot switch to incrementals, etc.

    If the source table must be truncated daily, then Change Tracking won't help.

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Jonathan AC Roberts - Thursday, February 21, 2019 6:40 AM

    It might be my lack of understanding of the MERGE statement; I've always found performance better when replacing a MERGE with UPDATEs followed by INSERTs. But at the top of the statement you have:
    USING
    (
        SELECT *
          FROM [dbo].[tbl_Staging_Current_SubjectXYZ]
        WHERE [DM_OK_TO_PROCESS] = 1 --flag gets set during some initial integrity checks...
    ) AS S
    ON (T.DM_KEY_HASH = S.DM_KEY_HASH)

    Then you have:
    WHEN MATCHED AND(T.DM_ROW_HASH <> S.DM_ROW_HASH --row data changed in source

    How can they be matched on DM_ROW_HASH and also have different DM_ROW_HASH?

    The records are matched on the key hash (a synthetic key built from the natural keys in the data).
    When the row hash is different, it is an update (the key still matches, but the data in that record is different).

  • nslupi - Thursday, February 21, 2019 9:43 AM

    The records are matched on the key hash (a synthetic key built from the natural keys in the data).
    When the row hash is different, it is an update (the key still matches, but the data in that record is different).

    Ah yes, thanks, I missed the KEY vs. ROW difference in the names.
