Merge vs other options

  • astrid 69000

    Hall of Fame

    Points: 3145


    I need to load a table that has 5,000,000 records. I usually use a MERGE to populate the tables, but with big tables it can take quite a while.

    I was considering doing an UPDATE and an INSERT as two different procedures with a ROWLOCK hint, but I am not even sure it is a good idea, and on top of that I am worried the table will be scanned twice (once to update and once to insert) if I use two different procedures.

    I am open to opinions and ideas; here is a sample of the MERGE we use.



    -- Procedure parameter
    @LoadDate DATETIME

    -- Declare variables
    DECLARE @_LoadDate DATETIME = @LoadDate;

    DECLARE @SummaryOfChanges TABLE(Change VARCHAR(20));

    MERGE DVDB1.Raw.LinkOpportunity AS Target
    USING (
        -- "Ghost" record: all-zero hash keys, minimum DATETIME
        SELECT CONVERT(CHAR(40), REPLICATE(0, 40)) AS LinkOpportunityHashKey
        , CONVERT(DATETIME, '1753-01-01 00:00:00.00') AS LoadDate
        , @_LoadDate AS LastSeenDate
        , 'SYSTEM' AS RecordSource
        , CONVERT(CHAR(40), REPLICATE(0, 40)) AS OpportunityHashKey
        , CONVERT(CHAR(40), REPLICATE(0, 40)) AS SalesforceUserHashKey
        , CONVERT(CHAR(40), REPLICATE(0, 40)) AS CRMAccountHashKey
        , CONVERT(CHAR(40), REPLICATE(0, 40)) AS CampaignHashKey
        UNION ALL
        SELECT LinkOpportunityHashKey
        , LoadDate
        , LoadDate AS LastSeenDate
        , RecordSource
        , OpportunityHashKey
        , SalesforceUserHashKey
        , CRMAccountHashKey
        , CampaignHashKey
        FROM StageDB.LAN.Opportunity
        WHERE LoadDate = @_LoadDate
    ) AS Source (LinkOpportunityHashKey
        , LoadDate
        , LastSeenDate
        , RecordSource
        , OpportunityHashKey
        , SalesforceUserHashKey
        , CRMAccountHashKey
        , CampaignHashKey)
    ON (Target.LinkOpportunityHashKey = Source.LinkOpportunityHashKey AND
        Target.RecordSource = Source.RecordSource AND
        Target.OpportunityHashKey = Source.OpportunityHashKey AND
        Target.SalesforceUserHashKey = Source.SalesforceUserHashKey AND
        Target.CRMAccountHashKey = Source.CRMAccountHashKey AND
        Target.CampaignHashKey = Source.CampaignHashKey)
    WHEN MATCHED THEN
        UPDATE SET LastSeenDate = Source.LastSeenDate
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (LinkOpportunityHashKey, LoadDate, LastSeenDate, RecordSource, OpportunityHashKey, SalesforceUserHashKey, CRMAccountHashKey, CampaignHashKey)
        VALUES (Source.LinkOpportunityHashKey, Source.LoadDate, Source.LastSeenDate, Source.RecordSource, Source.OpportunityHashKey, Source.SalesforceUserHashKey, Source.CRMAccountHashKey, Source.CampaignHashKey)
    OUTPUT $action INTO @SummaryOfChanges;

    -- (log table name was lost from the original post; first column name assumed)
    INSERT INTO ... (SourceTable, TargetTable, Change, CountPerChange, DateCreated)
    SELECT 'StageDB.LAN.Opportunity', 'DVDB1.Raw.LinkOpportunity', Change, COUNT(*), GETDATE()
    FROM @SummaryOfChanges
    GROUP BY Change;

    EXEC DVDB1.dbo.TrackAndLog_sp @ObjectID = @@PROCID;
  • scdecade

    SSC Eights!

    Points: 825

    Merge statements don't have WHERE clauses, which is why the target is typically defined in a CTE. In this case there's no CTE, so you're merging against the entire DVDB1.Raw.LinkOpportunity table. Yada yada yada... to cut through the explanation: it seems to me this could all be boiled down to two things, 1) a table with all LoadDates, and 2) an INSERT ... WHERE NOT EXISTS statement. Where your code uses "OUTPUT $action INTO @SummaryOfChanges;", the equivalent would be @@ROWCOUNT from the insert. From @@ROWCOUNT and the DM views, the log tables could be populated with the UPDATE counts. Instead of updating the LastSeenDate column with the LoadDate, you could just do nothing. As long as the history of prior LoadDates is recorded in a table, the number of times a particular record has been seen (by this process) would be queryable.
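
    A minimal sketch of that INSERT ... WHERE NOT EXISTS pattern, with @@ROWCOUNT standing in for the OUTPUT clause (table and column names taken from the original post; the local variable stands in for the procedure parameter; untested):

    ```sql
    DECLARE @_LoadDate DATETIME = GETDATE();  -- stands in for the @LoadDate parameter
    DECLARE @Inserted INT;

    -- Insert only the link rows that don't exist yet; existing rows are untouched
    INSERT INTO DVDB1.Raw.LinkOpportunity
        (LinkOpportunityHashKey, LoadDate, LastSeenDate, RecordSource,
         OpportunityHashKey, SalesforceUserHashKey, CRMAccountHashKey, CampaignHashKey)
    SELECT s.LinkOpportunityHashKey, s.LoadDate, s.LoadDate, s.RecordSource,
           s.OpportunityHashKey, s.SalesforceUserHashKey, s.CRMAccountHashKey, s.CampaignHashKey
    FROM StageDB.LAN.Opportunity AS s
    WHERE s.LoadDate = @_LoadDate
      AND NOT EXISTS (SELECT 1
                      FROM DVDB1.Raw.LinkOpportunity AS t
                      WHERE t.LinkOpportunityHashKey = s.LinkOpportunityHashKey);

    -- Plays the role of counting INSERT actions from @SummaryOfChanges
    SET @Inserted = @@ROWCOUNT;
    ```

    The ghost record and the logging insert are omitted here for brevity; they would be handled the same way as in the original procedure.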

  • astrid 69000



    I am not able to change the logic of the query; here they follow the Data Vault philosophy to the letter, and I have no way around it.

    My question was: how bad would it be to use a ROWLOCK instead of a MERGE on a large table?

    I will set up a test to run for a few days and check timings, but I wanted to know what other people might think.




  • Jonathan AC Roberts


    Points: 17296

    Performance Tip: The conditional behavior described for the MERGE statement works best when the two tables have a complex mixture of matching characteristics. For example, inserting a row if it doesn't exist, or updating a row if it matches. When simply updating one table based on the rows of another table, improve the performance and scalability with basic INSERT, UPDATE, and DELETE statements.
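
    Against the tables in the original post, the two-statement version would look roughly like this (a sketch, not the book's template; ghost record and logging omitted; untested):

    ```sql
    DECLARE @_LoadDate DATETIME = GETDATE();  -- stands in for the @LoadDate parameter

    BEGIN TRANSACTION;

    -- Update pass: refresh LastSeenDate on link rows that already exist
    UPDATE t
    SET t.LastSeenDate = @_LoadDate
    FROM DVDB1.Raw.LinkOpportunity AS t
    JOIN StageDB.LAN.Opportunity AS s
      ON s.LinkOpportunityHashKey = t.LinkOpportunityHashKey
    WHERE s.LoadDate = @_LoadDate;

    -- Insert pass: add the link rows that don't exist yet
    INSERT INTO DVDB1.Raw.LinkOpportunity
        (LinkOpportunityHashKey, LoadDate, LastSeenDate, RecordSource,
         OpportunityHashKey, SalesforceUserHashKey, CRMAccountHashKey, CampaignHashKey)
    SELECT s.LinkOpportunityHashKey, s.LoadDate, s.LoadDate, s.RecordSource,
           s.OpportunityHashKey, s.SalesforceUserHashKey, s.CRMAccountHashKey, s.CampaignHashKey
    FROM StageDB.LAN.Opportunity AS s
    WHERE s.LoadDate = @_LoadDate
      AND NOT EXISTS (SELECT 1 FROM DVDB1.Raw.LinkOpportunity AS t
                      WHERE t.LinkOpportunityHashKey = s.LinkOpportunityHashKey);

    COMMIT;
    ```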

  • The Dixie Flatline

    SSC Guru

    Points: 53253

    Row locks run slower than page locks, which are the default locks used by SQL Server. I'm not sure what you think ROWLOCK will buy you.

    Merges run best when the source table and the target table have the same clustered indexes used for the MATCHING ON clause. You could benefit from doing your union into a worktable with clustered index characteristics identical to the target. Then use the worktable as your source. (See the section on "BEST PRACTICES".)
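
    As a sketch of the worktable idea (hypothetical temp-table name; column types guessed from the CHAR(40) hash keys in the original MERGE, and the target's clustered index assumed to be on LinkOpportunityHashKey):

    ```sql
    -- Hypothetical worktable; its clustered index mirrors the one assumed
    -- to exist on DVDB1.Raw.LinkOpportunity
    CREATE TABLE #LinkOpportunitySource
    (
        LinkOpportunityHashKey  CHAR(40)     NOT NULL,
        LoadDate                DATETIME     NOT NULL,
        LastSeenDate            DATETIME     NOT NULL,
        RecordSource            VARCHAR(100) NOT NULL,
        OpportunityHashKey      CHAR(40)     NOT NULL,
        SalesforceUserHashKey   CHAR(40)     NOT NULL,
        CRMAccountHashKey       CHAR(40)     NOT NULL,
        CampaignHashKey         CHAR(40)     NOT NULL
    );

    CREATE CLUSTERED INDEX cx_LinkOpportunitySource
        ON #LinkOpportunitySource (LinkOpportunityHashKey);

    -- Populate it with the same ghost-record UNION staging query the MERGE
    -- currently runs inline, then:
    --   MERGE DVDB1.Raw.LinkOpportunity AS Target
    --   USING #LinkOpportunitySource AS Source
    --   ON ...
    ```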

    Also, MERGE is unnecessary unless you are doing a mix of operations.    If all you want to do is insert rows, just use INSERT.   Not sure what the purpose of the [LastSeen] column is.

    Finally, inserting 5 million rows all at once may be bogging you down because of logging. Consider writing a loop to load 100,000 rows at a time. If you do this, be careful to write your loop to walk through ranges of keys, so you aren't starting at the beginning of your table each time. Since you are outputting 5 million rows to the table variable, you are also hammering tempdb. Again, looping logic can ease the pain here by resetting every 100,000 rows.
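
    A rough sketch of the keyed batch loop (untested; assumes the load can walk StageDB.LAN.Opportunity in LinkOpportunityHashKey order and that duplicates are skipped with NOT EXISTS):

    ```sql
    DECLARE @_LoadDate DATETIME = GETDATE();  -- stands in for the @LoadDate parameter
    DECLARE @BatchSize INT = 100000;
    DECLARE @LastKey CHAR(40) = '';           -- '' sorts before any real hash key
    DECLARE @NextKey CHAR(40);

    WHILE 1 = 1
    BEGIN
        -- Find the upper key of the next ~100,000-row slice
        SELECT @NextKey = MAX(k.LinkOpportunityHashKey)
        FROM (SELECT TOP (@BatchSize) LinkOpportunityHashKey
              FROM StageDB.LAN.Opportunity
              WHERE LoadDate = @_LoadDate
                AND LinkOpportunityHashKey > @LastKey
              ORDER BY LinkOpportunityHashKey) AS k;

        IF @NextKey IS NULL BREAK;            -- no rows left

        -- Insert only this key range, so each batch's log stays small
        INSERT INTO DVDB1.Raw.LinkOpportunity
            (LinkOpportunityHashKey, LoadDate, LastSeenDate, RecordSource,
             OpportunityHashKey, SalesforceUserHashKey, CRMAccountHashKey, CampaignHashKey)
        SELECT s.LinkOpportunityHashKey, s.LoadDate, s.LoadDate, s.RecordSource,
               s.OpportunityHashKey, s.SalesforceUserHashKey, s.CRMAccountHashKey, s.CampaignHashKey
        FROM StageDB.LAN.Opportunity AS s
        WHERE s.LoadDate = @_LoadDate
          AND s.LinkOpportunityHashKey > @LastKey
          AND s.LinkOpportunityHashKey <= @NextKey
          AND NOT EXISTS (SELECT 1 FROM DVDB1.Raw.LinkOpportunity AS t
                          WHERE t.LinkOpportunityHashKey = s.LinkOpportunityHashKey);

        SET @LastKey = @NextKey;              -- advance the range marker
    END
    ```

    Walking the range forward with `> @LastKey` means each batch seeks straight to its slice instead of rescanning from the start of the table.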

    P.S. I have no idea what another author meant by the target table being defined by a CTE. Could someone enlighten me?




    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • astrid 69000


    There is a book my boss uses called Building a Scalable Data Warehouse with Data Vault 2.0, and it gives specific instructions on how the code should be written. That is where the LastSeenDate comes from, and how the MERGE is written.

    I don't have a lot of room to do what I think is best; I need to follow that methodology.

    I will try the loop idea, and also doing it through SSIS, which I personally think is not the best idea.

    Has anyone had success with the Data Vault methodology?

