February 20, 2019 at 11:38 am
Hello hello!
Tried various index configurations; the one below has performed the best so far.
The query plan is attached.
Any ideas on how to improve this further?
Much appreciated, this forum is awesome!
DN
These are the stats when I run the merge with dummy data:
Table 'tbl_Perm_SubjectXYZ'. Scan count 1, logical reads 49945878, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'tbl_Staging_Current_SubjectXYZ'. Scan count 1, logical reads 5027782, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
(3007357 rows affected)
The merge took 3 minutes on a DEV server, with simulated 30% row changes and 5% key changes in the source data. Table sizes after the merge:
Source table: 10795993 records; ~ 7GB
Target table: 18406151 records; ~ 12GB
MERGE [dbo].[tbl_Perm_SubjectXYZ] WITH(TABLOCK) AS T
USING
(
SELECT *
FROM [dbo].[tbl_Staging_Current_SubjectXYZ]
WHERE [DM_OK_TO_PROCESS] = 1 --flag gets set during some initial integrity checks...
) AS S
ON(T.DM_KEY_HASH = S.DM_KEY_HASH)
-- source table is heap with unique non-clustered index on DM_KEY_HASH,
-- and includes DM_ROW_HASH and DM_OK_TO_PROCESS
-- target table has unique clustered index on DM_KEY_HASH
WHEN MATCHED AND(T.DM_ROW_HASH <> S.DM_ROW_HASH --row data changed in source
OR T.DM_RECORD_STATUS = 'D') --record re-appeared in source
THEN UPDATE SET
T.[DM_TENANT] = 'Tenant1'
, T.[FieldFromFile1] = S.[FieldFromFile1]
, T.[FieldFromFile2] = S.[FieldFromFile2]
, T.[FieldFromFile3] = S.[FieldFromFile3]
, T.[FieldFromFile4] = S.[FieldFromFile4]
, T.[FieldFromFile5] = S.[FieldFromFile5]
, T.[DM_KEY_HASH] = S.[DM_KEY_HASH]
, T.[DM_ROW_HASH] = S.[DM_ROW_HASH]
, T.[DM_STAGE_DATE] = S.[DM_STAGE_DATE]
, T.[DM_TXT_FILE_NAME] = S.[DM_TXT_FILE_NAME]
, T.[DM_TXT_FILE_DATE] = S.[DM_TXT_FILE_DATE]
, T.[DM_ZIP_FILE_NAME] = S.[DM_ZIP_FILE_NAME]
, T.[DM_ZIP_FILE_ARCHIVE] = S.[DM_ZIP_FILE_ARCHIVE]
, T.[DM_ZIP_FILE_DATE] = S.[DM_ZIP_FILE_DATE]
, T.[DM_RECORD_STATUS] = 'U'
, T.[DM_MERGE_DATE] = GETDATE()
, T.[DM_INSERT_DATE] = GETDATE()
, T.[DM_UPDATE_DATE] = GETDATE()
, T.[DM_DELETE_DATE] = NULL
, T.[DM_RUN_UID] = '121212'
, T.[DM_USER_ID] = ORIGINAL_LOGIN()
, T.[DM_SOFT_DELETE] = 'N'
WHEN NOT MATCHED BY TARGET
THEN --new record in source
INSERT([DM_TENANT]
, [FieldFromFile1]
, [FieldFromFile2]
, [FieldFromFile3]
, [FieldFromFile4]
, [FieldFromFile5]
, [DM_KEY_HASH]
, [DM_ROW_HASH]
, [DM_STAGE_DATE]
, [DM_TXT_FILE_NAME]
, [DM_TXT_FILE_DATE]
, [DM_ZIP_FILE_NAME]
, [DM_ZIP_FILE_ARCHIVE]
, [DM_ZIP_FILE_DATE]
, [DM_RECORD_STATUS]
, [DM_MERGE_DATE]
, [DM_INSERT_DATE]
, [DM_UPDATE_DATE]
, [DM_DELETE_DATE]
, [DM_RUN_UID]
, [DM_USER_ID]
, [DM_SOFT_DELETE])
VALUES
('Tenant1'
, S.FieldFromFile1
, S.FieldFromFile2
, S.FieldFromFile3
, S.FieldFromFile4
, S.FieldFromFile5
, S.DM_KEY_HASH
, S.DM_ROW_HASH
, S.DM_STAGE_DATE
, S.DM_TXT_FILE_NAME
, S.DM_TXT_FILE_DATE
, S.DM_ZIP_FILE_NAME
, S.DM_ZIP_FILE_ARCHIVE
, S.DM_ZIP_FILE_DATE
, 'I'
, GETDATE()
, GETDATE()
, GETDATE()
, NULL
, '121212'
, ORIGINAL_LOGIN()
, 'N'
)
WHEN NOT MATCHED BY SOURCE AND T.DM_RECORD_STATUS <> 'D'
--not in source anymore, and hasn't been flagged as delete yet in target
THEN UPDATE SET
T.DM_RECORD_STATUS = 'D'
, T.[DM_DELETE_DATE] = GETDATE()
-- , T.[DM_UPDATE_DATE] = GETDATE()
, T.[DM_MERGE_DATE] = GETDATE()
, T.DM_RUN_UID = '121212' OPTION(RECOMPILE);
February 20, 2019 at 12:13 pm
Plan not attached. -- saw it now.
The main performance improvement on this type of process comes from using the hash compare, as you are already doing.
The following are possibilities - none is inherently right or wrong; as always... it depends...
Changing the heap to a clustered table will likely improve things - cluster on DM_OK_TO_PROCESS, DM_KEY_HASH.
Sometimes, though not always, doing the insert/update/delete as separate statements is faster than MERGE - you would need to try it out (see the sketch below).
Also, because you are talking about wide rows, it is sometimes faster to first determine which rows to delete/update/insert into a temp table (one can work with just the PK and hash indexes before touching the full data rows).
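For illustration only, a minimal sketch of the separate-statement approach, reusing the column names and status values from the posted MERGE (only a couple of columns are shown in the UPDATE and INSERT lists; the full lists would mirror the MERGE):

-- Updates: key matches, but the row hash differs or the row had been soft-deleted
UPDATE T
SET T.[FieldFromFile1] = S.[FieldFromFile1] -- ...remaining columns as in the MERGE...
, T.[DM_ROW_HASH] = S.[DM_ROW_HASH]
, T.[DM_RECORD_STATUS] = 'U'
, T.[DM_UPDATE_DATE] = GETDATE()
FROM [dbo].[tbl_Perm_SubjectXYZ] T
JOIN [dbo].[tbl_Staging_Current_SubjectXYZ] S ON S.DM_KEY_HASH = T.DM_KEY_HASH
WHERE S.DM_OK_TO_PROCESS = 1
AND (T.DM_ROW_HASH <> S.DM_ROW_HASH OR T.DM_RECORD_STATUS = 'D');

-- Inserts: key present in source but not in target
INSERT [dbo].[tbl_Perm_SubjectXYZ] ([DM_KEY_HASH], [DM_ROW_HASH]) -- ...remaining columns...
SELECT S.DM_KEY_HASH, S.DM_ROW_HASH -- ...remaining columns...
FROM [dbo].[tbl_Staging_Current_SubjectXYZ] S
WHERE S.DM_OK_TO_PROCESS = 1
AND NOT EXISTS (SELECT 1 FROM [dbo].[tbl_Perm_SubjectXYZ] T WHERE T.DM_KEY_HASH = S.DM_KEY_HASH);

-- Soft deletes: key no longer present among the processable source rows
UPDATE T
SET T.[DM_RECORD_STATUS] = 'D'
, T.[DM_DELETE_DATE] = GETDATE()
FROM [dbo].[tbl_Perm_SubjectXYZ] T
WHERE T.DM_RECORD_STATUS <> 'D'
AND NOT EXISTS (SELECT 1 FROM [dbo].[tbl_Staging_Current_SubjectXYZ] S
WHERE S.DM_KEY_HASH = T.DM_KEY_HASH AND S.DM_OK_TO_PROCESS = 1);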
February 20, 2019 at 12:35 pm
frederico_fonseca - Wednesday, February 20, 2019 12:13 PM
Plan not attached. -- saw it now. The main performance improvement on this type of process comes from using the hash compare, as you are already doing.
The following are possibilities - none is inherently right or wrong; as always... it depends...
Changing the heap to a clustered table will likely improve things - cluster on DM_OK_TO_PROCESS, DM_KEY_HASH. Sometimes, though not always, doing the insert/update/delete as separate statements is faster than MERGE - you would need to try it out.
Also, because you are talking about wide rows, it is sometimes faster to first determine which rows to delete/update/insert into a temp table (one can work with just the PK and hash indexes before touching the full data rows).
Thanks Frederico!
I tried converting the heap first (it originates from a super fast bulk insert) and then doing the merge, but that led to a longer duration and more reads overall.
Lots of stale data.
BTW, when I run this without changes in the source data, it finishes in 30 seconds with:
Table 'tbl_Perm_SubjectXYZ'. Scan count 1, logical reads 1517502, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'tbl_Staging_Current_SubjectXYZ'. Scan count 1, logical reads 5027782, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
I used to have an extra "change tracking table" for other projects; I will check that out.
February 20, 2019 at 12:39 pm
Rather than hash key matching, consider implementing Change Tracking on the source table, and then leverage that to perform a differential merge.
https://youtu.be/ZNUN9qejCfc
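For anyone not familiar with the feature, a minimal sketch of what implementing that could look like (the database name and @last_sync_version are placeholders, and Change Tracking requires a primary key on the tracked table - here assumed to be DM_KEY_HASH):

ALTER DATABASE [YourDatabase]
SET CHANGE_TRACKING = ON (CHANGE_RETENTION = 2 DAYS, AUTO_CLEANUP = ON);

ALTER TABLE [dbo].[tbl_Staging_Current_SubjectXYZ]
ENABLE CHANGE_TRACKING WITH (TRACK_COLUMNS_UPDATED = OFF);

-- On each run, pull only the rows changed since the last synced version
DECLARE @last_sync_version BIGINT = 0; -- would be persisted between runs
SELECT CT.DM_KEY_HASH, CT.SYS_CHANGE_OPERATION
FROM CHANGETABLE(CHANGES [dbo].[tbl_Staging_Current_SubjectXYZ], @last_sync_version) AS CT;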
"Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho
February 20, 2019 at 12:42 pm
You should cluster the staging table on the DM_KEY_HASH also, assuming you can compute the hash key as the row is being INSERTed into the staging table. SQL is having to do 5M lookups to avoid sorting the staging table, and then still sort the data later, so you'd almost certainly be better off clustering the staging table as well. But naturally check the new query plan and total I/O to be sure.
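A sketch of what that could look like, assuming DM_KEY_HASH is unique within the staging data as the existing nonclustered index suggests (the index name here is made up):

CREATE UNIQUE CLUSTERED INDEX CIX_Staging_Current_SubjectXYZ_DM_KEY_HASH
ON [dbo].[tbl_Staging_Current_SubjectXYZ] (DM_KEY_HASH);
-- the existing unique nonclustered index on DM_KEY_HASH could then be dropped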
As noted by frederico, sometimes MERGE is not the fastest/best way to do things, but 3 minutes for this much data seems very good anyway, so you probably don't need to worry about that.
SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".
February 20, 2019 at 12:43 pm
Eric M Russell - Wednesday, February 20, 2019 12:39 PM
Rather than hash key matching, consider implementing Change Tracking on the source table, and then leverage that to perform a differential merge.
https://youtu.be/ZNUN9qejCfc
Thanks Eric!
The source table gets truncated and loaded from file each time.
We have no control over the data sources/ generation of the files.
Would that still work?
February 20, 2019 at 12:52 pm
ScottPletcher - Wednesday, February 20, 2019 12:42 PM
You should cluster the staging table on the DM_KEY_HASH also, assuming you can compute the hash key as the row is being INSERTed into the staging table. SQL is having to do 5M lookups to avoid sorting the staging table, and then still sort the data later, so you'd almost certainly be better off clustering the staging table as well. But naturally check the new query plan and total I/O to be sure.
As noted by frederico, sometimes MERGE is not the fastest/best way to do things, but 3 minutes for this much data seems very good anyway, so you probably don't need to worry about that.
Thanks Scott.
I thought about adding the hashing to the SSIS package that stages the files into the current staging table.
The data will still be unsorted, and I need to do some de-duping based on the hashes, flag the duplicates, and report them back to the data source (see the sketch below).
The synthetic key will be composed of a few fields in the source data.
I'll test this a little more... might be a good way to shave off time.
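Purely as an illustration of that de-duping step, a sketch that keeps one row per key hash and flags the rest by clearing DM_OK_TO_PROCESS (the ORDER BY choice and the reuse of that flag for duplicates are assumptions):

;WITH dupes AS
(
    SELECT DM_KEY_HASH
         , DM_OK_TO_PROCESS
         , ROW_NUMBER() OVER (PARTITION BY DM_KEY_HASH ORDER BY DM_STAGE_DATE DESC) AS rn
    FROM [dbo].[tbl_Staging_Current_SubjectXYZ]
)
UPDATE dupes
SET DM_OK_TO_PROCESS = 0 -- flagged duplicates can then be reported back to the data source
WHERE rn > 1;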
February 20, 2019 at 1:39 pm
My assumption, from reading the SQL posted, is that the hash is computed as part of the load of tbl_Staging_Current_SubjectXYZ.
So one of the approaches I was suggesting is:
load the HEAP tbl_Staging_Current_SubjectXYZ - ideally with the hashes calculated on load by SSIS
perform validation
create clustered index xxx on tbl_Staging_Current_SubjectXYZ
(DM_OK_TO_PROCESS
, DM_KEY_HASH
)
with (data_compression = page) -- optional - may improve or slow down so requires testing with full prod data
merge ...
February 20, 2019 at 3:00 pm
nslupi - Wednesday, February 20, 2019 12:43 PM
Thanks Eric!
The source table gets truncated and loaded from file each time.
We have no control over the data sources / generation of the files.
Would that still work?
Earlier you said:
... The source will have approx 10 million rows on each daily run: 5% fresh data (inserts), 30% changed data (updates & deletes), and 65% stale data (no action)...
"Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho
February 20, 2019 at 3:24 pm
Eric M Russell - Wednesday, February 20, 2019 3:00 PM
Earlier you said:
... The source will have approx 10 million rows on each daily run: 5% fresh data (inserts), 30% changed data (updates & deletes), and 65% stale data (no action)...
Yeah - comes in on files... sorry, should have described this better.
Basically a 5-year data window (full files each time...) from the source. And no, they cannot switch to incrementals, etc.
February 21, 2019 at 6:40 am
It might be my lack of understanding of the MERGE statement; I've always found performance better when replacing a MERGE with UPDATEs followed by INSERTs. But at the top of the statement you have:
USING
(
SELECT *
FROM [dbo].[tbl_Staging_Current_SubjectXYZ]
WHERE [DM_OK_TO_PROCESS] = 1 --flag gets set during some initial integrity checks...
) AS S
ON (T.DM_KEY_HASH = S.DM_KEY_HASH)
Then you have:
WHEN MATCHED AND (T.DM_ROW_HASH <> S.DM_ROW_HASH --row data changed in source
How can they be matched on DM_ROW_HASH and also have different DM_ROW_HASH?
February 21, 2019 at 7:35 am
nslupi - Wednesday, February 20, 2019 3:24 PM
Yeah - comes in on files... sorry, should have described this better.
Basically a 5-year data window (full files each time...) from the source. And no, they cannot switch to incrementals, etc.
If the source table must be truncated daily, then Change Tracking won't help.
"Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho
February 21, 2019 at 9:43 am
Jonathan AC Roberts - Thursday, February 21, 2019 6:40 AM
It might be my lack of understanding of the MERGE statement; I've always found performance better when replacing a MERGE with UPDATEs followed by INSERTs. ...
How can they be matched on DM_ROW_HASH and also have different DM_ROW_HASH?
The records are matched on the key hash (a synthetic key built from the natural keys in the data).
When the row hash is different, it is an update... (the key still matches, but the data in that record is different...)
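Purely hypothetical, to illustrate the distinction for anyone following along (the actual hashes are computed upstream during staging; the column choices, algorithm, and delimiter here are assumptions, and CONCAT_WS needs SQL Server 2017+):

SELECT
    -- key hash: only the natural/business key columns (the synthetic key)
    HASHBYTES('SHA2_256', CONCAT_WS('|', FieldFromFile1, FieldFromFile2)) AS DM_KEY_HASH
    -- row hash: all data-bearing columns, used to detect changed rows for the same key
  , HASHBYTES('SHA2_256', CONCAT_WS('|', FieldFromFile1, FieldFromFile2, FieldFromFile3,
                                         FieldFromFile4, FieldFromFile5)) AS DM_ROW_HASH
FROM [dbo].[tbl_Staging_Current_SubjectXYZ];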
February 21, 2019 at 9:45 am
nslupi - Thursday, February 21, 2019 9:43 AM
The records are matched on the key hash (a synthetic key built from the natural keys in the data).
When the row hash is different, it is an update... (the key still matches, but the data in that record is different...)
Ah yes, thanks, I missed the KEY vs ROW difference in the names.