Merge vs other options

  • astrid 69000

    Hall of Fame

    Points: 3137

    Hi,

    I need to load a table that has 5,000,000 records. I usually use a MERGE to populate the tables, but on big tables it can take quite a while.

    I was considering splitting it into an UPDATE and an INSERT in two different procedures and using a ROWLOCK hint, but I am not sure that is a good idea, and on top of that I am afraid the table will be scanned twice (once for the update and once for the insert) if I use two different procedures.

    I am open to opinions and ideas. Here is a sample of the merge we use.

    thanks

     

    CREATE PROCEDURE [Raw].[LoadLinkOpportunity_sp]
        @LoadDate DATETIME
    WITH RECOMPILE
    AS
    -- Copy the parameter into a local variable
    DECLARE @_LoadDate DATETIME = @LoadDate;

    BEGIN
        SET NOCOUNT ON;

        DECLARE @SummaryOfChanges TABLE (Change VARCHAR(20));

        MERGE DVDB1.Raw.LinkOpportunity AS Target
        USING
        (
            -- Zero-key "ghost" record required by the Data Vault pattern
            SELECT
                  CONVERT(CHAR(40), REPLICATE(0, 40)) AS LinkOpportunityHashKey
                , CONVERT(DATETIME, '1753-01-01 00:00:00.00') AS LoadDate
                , @_LoadDate AS LastSeenDate
                , 'SYSTEM' AS RecordSource
                , CONVERT(CHAR(40), REPLICATE(0, 40)) AS OpportunityHashKey
                , CONVERT(CHAR(40), REPLICATE(0, 40)) AS SalesforceUserHashKey
                , CONVERT(CHAR(40), REPLICATE(0, 40)) AS CRMAccountHashKey
                , CONVERT(CHAR(40), REPLICATE(0, 40)) AS CampaignHashKey
            UNION
            SELECT DISTINCT
                  LinkOpportunityHashKey
                , LoadDate
                , LoadDate AS LastSeenDate
                , RecordSource
                , OpportunityHashKey
                , SalesforceUserHashKey
                , CRMAccountHashKey
                , CampaignHashKey
            FROM StageDB.LAN.Opportunity
            WHERE LoadDate = @_LoadDate
        ) AS Source
        (
              LinkOpportunityHashKey
            , LoadDate
            , LastSeenDate
            , RecordSource
            , OpportunityHashKey
            , SalesforceUserHashKey
            , CRMAccountHashKey
            , CampaignHashKey
        )
        ON
        (
            Target.RecordSource = Source.RecordSource AND
            Target.OpportunityHashKey = Source.OpportunityHashKey AND
            Target.SalesforceUserHashKey = Source.SalesforceUserHashKey AND
            Target.CRMAccountHashKey = Source.CRMAccountHashKey AND
            Target.CampaignHashKey = Source.CampaignHashKey
        )
        WHEN MATCHED THEN
            UPDATE SET LastSeenDate = Source.LastSeenDate
        WHEN NOT MATCHED THEN
            INSERT (LinkOpportunityHashKey, LoadDate, LastSeenDate, RecordSource, OpportunityHashKey, SalesforceUserHashKey, CRMAccountHashKey, CampaignHashKey)
            VALUES (Source.LinkOpportunityHashKey, Source.LoadDate, Source.LastSeenDate, Source.RecordSource, Source.OpportunityHashKey, Source.SalesforceUserHashKey, Source.CRMAccountHashKey, Source.CampaignHashKey)
        OUTPUT $action INTO @SummaryOfChanges;

        INSERT INTO MetricsMart.etl.UpdateRawLog
        (
              SourceTable
            , TargetTable
            , Change
            , CountPerChange
            , DateCreated
        )
        SELECT 'StageDB.LAN.Opportunity', 'DVDB1.Raw.LinkOpportunity', Change, COUNT(*), GETDATE()
        FROM @SummaryOfChanges
        GROUP BY Change;

        EXEC DVDB1.dbo.TrackAndLog_sp @ObjectID = @@PROCID;
    END
    GO
  • scdecade

    Mr or Mrs. 500

    Points: 532

    MERGE statements don't have WHERE clauses, which is why the target is typically defined in a CTE. In this case there's no CTE, so you're merging against the entire DVDB1.Raw.LinkOpportunity table. To cut a long explanation short, it seems to me this could all be boiled down to two things: 1) a table with all LoadDates, and 2) an INSERT ... WHERE NOT EXISTS statement. Where your code uses "OUTPUT $action INTO @SummaryOfChanges;", the equivalent would be @@ROWCOUNT from the insert. From @@ROWCOUNT and the DM views, the log tables could be populated with the UPDATE counts. Instead of updating the LastSeenDate column with the LoadDate, you could just do nothing. As long as the history of prior LoadDates is recorded in a table, the number of times a particular record has been seen (by this process) would be queryable.
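
    A minimal sketch of what I mean, reusing the table and column names from your procedure (the ghost record and the UPDATE-count logging are left out for brevity):

    ```sql
    -- Insert only the rows for this LoadDate that are not already in the link table
    INSERT INTO DVDB1.Raw.LinkOpportunity
        (LinkOpportunityHashKey, LoadDate, LastSeenDate, RecordSource,
         OpportunityHashKey, SalesforceUserHashKey, CRMAccountHashKey, CampaignHashKey)
    SELECT DISTINCT
        s.LinkOpportunityHashKey, s.LoadDate, s.LoadDate, s.RecordSource,
        s.OpportunityHashKey, s.SalesforceUserHashKey, s.CRMAccountHashKey, s.CampaignHashKey
    FROM StageDB.LAN.Opportunity AS s
    WHERE s.LoadDate = @_LoadDate
      AND NOT EXISTS
          (SELECT 1
           FROM DVDB1.Raw.LinkOpportunity AS t
           WHERE t.RecordSource          = s.RecordSource
             AND t.OpportunityHashKey    = s.OpportunityHashKey
             AND t.SalesforceUserHashKey = s.SalesforceUserHashKey
             AND t.CRMAccountHashKey     = s.CRMAccountHashKey
             AND t.CampaignHashKey       = s.CampaignHashKey);

    -- Capture the insert count immediately; this replaces OUTPUT $action
    DECLARE @Inserted INT = @@ROWCOUNT;

    INSERT INTO MetricsMart.etl.UpdateRawLog
        (SourceTable, TargetTable, Change, CountPerChange, DateCreated)
    VALUES
        ('StageDB.LAN.Opportunity', 'DVDB1.Raw.LinkOpportunity', 'INSERT', @Inserted, GETDATE());
    ```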

    Aus dem Paradies, das Cantor uns geschaffen, soll uns niemand vertreiben können

  • astrid 69000

    Hall of Fame

    Points: 3137

    Thanks.

    I am not able to change the logic of the query; here they follow the Data Vault philosophy to the letter and I have no way around it.

    My question was: how bad would it be to do a ROWLOCK (with separate UPDATE and INSERT statements) instead of a MERGE on a large table?

    I will set up a test to run for a few days and check the timings, but I wanted to know what other people think.

     

    Thanks

    Astrid

  • Jonathan AC Roberts

    SSCoach

    Points: 16992

    Performance Tip: The conditional behavior described for the MERGE statement works best when the two tables have a complex mixture of matching characteristics. For example, inserting a row if it doesn't exist, or updating a row if it matches. When simply updating one table based on the rows of another table, improve the performance and scalability with basic INSERT, UPDATE, and DELETE statements.

    https://docs.microsoft.com/en-us/sql/t-sql/statements/merge-transact-sql?view=sql-server-ver15
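
    Following that tip, the MERGE in the first post could be split into a plain UPDATE followed by a plain INSERT. A sketch of the UPDATE half, reusing the names from the original procedure (the INSERT half would be a matching INSERT ... WHERE NOT EXISTS on the same five join columns):

    ```sql
    -- Update LastSeenDate for rows that already exist in the link table
    UPDATE t
    SET    t.LastSeenDate = @_LoadDate
    FROM   DVDB1.Raw.LinkOpportunity AS t
    JOIN   StageDB.LAN.Opportunity   AS s
           ON  t.RecordSource          = s.RecordSource
           AND t.OpportunityHashKey    = s.OpportunityHashKey
           AND t.SalesforceUserHashKey = s.SalesforceUserHashKey
           AND t.CRMAccountHashKey     = s.CRMAccountHashKey
           AND t.CampaignHashKey       = s.CampaignHashKey
    WHERE  s.LoadDate = @_LoadDate;
    ```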

  • The Dixie Flatline

    SSC Guru

    Points: 53231

    Row locks run slower than page locks, which are the default locks used by SQL Server (https://littlekendra.com/2016/02/04/why-rowlock-hints-can-make-queries-slower-and-blocking-worse-in-sql-server/). I'm not sure what you think ROWLOCK will buy you.

    MERGE runs best when the source table and the target table have the same clustered index keys as the columns used in the ON clause. You could benefit from doing your UNION into a worktable with clustered index characteristics identical to the target, then using the worktable as your source. (See the "Best Practices" section of https://docs.microsoft.com/en-us/sql/t-sql/statements/merge-transact-sql?view=sql-server-ver15.)
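
    A sketch of the worktable idea. The #LinkOpportunitySource name, its index name, and the RecordSource length are my inventions; the columns come from the original procedure, and the clustered index key should mirror whatever the target's clustered index actually is:

    ```sql
    -- Stage the UNION result in a worktable whose clustered index
    -- matches the join columns used in the MERGE's ON clause
    CREATE TABLE #LinkOpportunitySource
    (
        LinkOpportunityHashKey  CHAR(40),
        LoadDate                DATETIME,
        LastSeenDate            DATETIME,
        RecordSource            VARCHAR(100),  -- length is a guess; match the target
        OpportunityHashKey      CHAR(40),
        SalesforceUserHashKey   CHAR(40),
        CRMAccountHashKey       CHAR(40),
        CampaignHashKey         CHAR(40)
    );

    CREATE CLUSTERED INDEX CIX_LinkOpportunitySource
        ON #LinkOpportunitySource
           (RecordSource, OpportunityHashKey, SalesforceUserHashKey,
            CRMAccountHashKey, CampaignHashKey);

    INSERT INTO #LinkOpportunitySource
    SELECT DISTINCT
        LinkOpportunityHashKey, LoadDate, LoadDate, RecordSource,
        OpportunityHashKey, SalesforceUserHashKey, CRMAccountHashKey, CampaignHashKey
    FROM StageDB.LAN.Opportunity
    WHERE LoadDate = @_LoadDate;

    -- ...then run the MERGE with "USING #LinkOpportunitySource AS Source"
    ```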

    Also, MERGE is unnecessary unless you are doing a mix of operations. If all you want to do is insert rows, just use INSERT. I'm not sure what the purpose of the LastSeenDate column is.

    Finally, inserting 5 million rows all at once may be bogging you down because of logging. Consider writing a loop to load 100,000 rows at a time. If you do this, be careful to write your loop to walk through ranges of keys so you aren't starting at the beginning of your table each time. Since you are outputting 5 million rows to the table variable, you are also hammering tempdb. Again, looping logic can ease the pain here by resetting every 100,000 rows.
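
    One way to write the batching. This sketch assumes the source rows have first been staged into a worktable #LinkOpportunitySource with an IDENTITY column RowID (both names my invention); walking RowID ranges means each pass seeks to new rows instead of rescanning from the start:

    ```sql
    DECLARE @BatchSize INT    = 100000,
            @MinID     BIGINT = 1,
            @MaxID     BIGINT;

    SELECT @MaxID = MAX(RowID) FROM #LinkOpportunitySource;

    WHILE @MinID <= @MaxID
    BEGIN
        -- Each iteration touches one contiguous range of staged rows
        INSERT INTO DVDB1.Raw.LinkOpportunity
            (LinkOpportunityHashKey, LoadDate, LastSeenDate, RecordSource,
             OpportunityHashKey, SalesforceUserHashKey, CRMAccountHashKey, CampaignHashKey)
        SELECT LinkOpportunityHashKey, LoadDate, LastSeenDate, RecordSource,
               OpportunityHashKey, SalesforceUserHashKey, CRMAccountHashKey, CampaignHashKey
        FROM   #LinkOpportunitySource
        WHERE  RowID BETWEEN @MinID AND @MinID + @BatchSize - 1;

        SET @MinID = @MinID + @BatchSize;  -- advance the key range
    END
    ```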

    P.S. I have no idea what another author meant by the target table being defined by a CTE. Could someone enlighten me?

     

     

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • astrid 69000

    Hall of Fame

    Points: 3137

    There is a book my boss uses called Building a Scalable Data Warehouse with Data Vault 2.0, and it gives specific instructions on how to write the code. That is where the LastSeenDate comes from, and how the MERGE is written.

    I don't have a lot of room to do what I think is best; I need to follow that methodology.

    I will try the loop idea, and also doing it through SSIS, which I personally think is not the best idea.

    Has anyone had success with the Data Vault methodology?

     
