Merge vs other options

  • Hi,

    I need to load a table that has 5,000,000 records. I usually use a MERGE to populate the tables, but with big tables it can take quite a while.

    I was considering splitting the load into an update and an insert in two different procedures and using a ROWLOCK hint, but I am not even sure that is a good idea, and on top of that I am worried the table will be scanned twice (once for the update and once for the insert) if I use two different procedures.

    I am open to opinions and ideas; here is a sample of the MERGE we use.

    thanks


    CREATE PROCEDURE [Raw].[LoadLinkOpportunity_sp]
    @LoadDate DATETIME
    WITH RECOMPILE

    AS
    -- Declare variables
    DECLARE
    @_LoadDate DATETIME = @LoadDate

    BEGIN
    SET NOCOUNT ON;

    DECLARE
    @SummaryOfChanges TABLE(Change VARCHAR(20));
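    -- Upsert the link rows: the source is the all-zero zero-key row UNIONed with the rows staged for @_LoadDate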
    MERGE
    DVDB1.Raw.LinkOpportunity AS Target
    USING
    (
    SELECT
    CONVERT(CHAR(40), Replicate (0,40)) AS LinkOpportunityHashKey
    ,CONVERT(DATETIME,'1753-01-01 00:00:00.00') AS LoadDate
    ,@_LoadDate AS LastSeenDate
    ,'SYSTEM' AS RecordSource
    ,CONVERT(CHAR(40), Replicate (0,40)) AS OpportunityHashKey
    ,CONVERT(CHAR(40), Replicate (0,40)) AS SalesforceUserHashKey
    ,CONVERT(CHAR(40), Replicate (0,40)) AS CRMAccountHashKey
    ,CONVERT(CHAR(40), Replicate (0,40)) AS CampaignHashKey


    UNION
    SELECT
    DISTINCT
    LinkOpportunityHashKey
    ,LoadDate
    ,LoadDate AS LastSeenDate
    ,RecordSource
    ,OpportunityHashKey
    , SalesforceUserHashKey
    , CRMAccountHashKey
    , CampaignHashKey

    FROM
    StageDB.LAN.Opportunity
    WHERE
    LoadDate = @_LoadDate
    ) AS Source
    (
    LinkOpportunityHashKey
    ,LoadDate
    ,LastSeenDate
    ,RecordSource
    ,OpportunityHashKey
    , SalesforceUserHashKey
    , CRMAccountHashKey
    , CampaignHashKey


    )
    ON
    (
    Target.RecordSource = Source.RecordSource AND
    Target.OpportunityHashKey = Source.OpportunityHashKey AND
    Target.SalesforceUserHashKey = Source.SalesforceUserHashKey AND
    Target.CRMAccountHashKey = Source.CRMAccountHashKey AND
    Target.CampaignHashKey = Source.CampaignHashKey

    )
    WHEN MATCHED THEN
    UPDATE
    SET LastSeenDate = Source.LastSeenDate
    WHEN NOT MATCHED THEN
    INSERT (LinkOpportunityHashKey, LoadDate , LastSeenDate, RecordSource, OpportunityHashKey, SalesforceUserHashKey, CRMAccountHashKey, CampaignHashKey)
    VALUES (Source.LinkOpportunityHashKey, Source.LoadDate , Source.LastSeenDate, Source.RecordSource, Source.OpportunityHashKey, Source.SalesforceUserHashKey, Source.CRMAccountHashKey, Source.CampaignHashKey)

    OUTPUT $action INTO @SummaryOfChanges;

    INSERT INTO
    MetricsMart.etl.UpdateRawLog
    (
    SourceTable
    ,TargetTable
    ,Change
    ,CountPerChange
    ,DateCreated
    )
    SELECT 'StageDB.LAN.Opportunity', 'DVDB1.Raw.LinkOpportunity',Change,COUNT(*), GETDATE()
    FROM @SummaryOfChanges
    GROUP BY Change;

    EXEC DVDB1.dbo.TrackAndLog_sp @ObjectID = @@PROCID;
    END
    GO
  • MERGE statements don't have WHERE clauses, which is why the target is typically defined in a CTE. In this case there's no CTE, so you're merging against the entire DVDB1.Raw.LinkOpportunity table. To cut the explanation short, it seems to me this could all be boiled down to two things: 1) a table holding the history of LoadDates, and 2) an INSERT ... WHERE NOT EXISTS statement. Where your code uses "OUTPUT $action INTO @SummaryOfChanges;", @@ROWCOUNT from the INSERT would be the equivalent, and from @@ROWCOUNT (and the DMVs if needed) the log table could still be populated with insert and update counts. Instead of updating the LastSeenDate column with the LoadDate you could simply do nothing: as long as the history of prior LoadDates is kept in a table, the number of times a particular record has been seen by this process stays queryable.
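
    For illustration, a minimal sketch of the INSERT ... WHERE NOT EXISTS piece, using the table and column names from the posted procedure and leaving aside the all-zero zero-key row from the original UNION (the declared load date is only an example value):

    DECLARE @_LoadDate DATETIME = '2021-01-01';   -- example; in the procedure this comes from @LoadDate

    INSERT INTO DVDB1.Raw.LinkOpportunity
        (LinkOpportunityHashKey, LoadDate, LastSeenDate, RecordSource,
         OpportunityHashKey, SalesforceUserHashKey, CRMAccountHashKey, CampaignHashKey)
    SELECT DISTINCT
        s.LinkOpportunityHashKey, s.LoadDate, s.LoadDate, s.RecordSource,
        s.OpportunityHashKey, s.SalesforceUserHashKey, s.CRMAccountHashKey, s.CampaignHashKey
    FROM StageDB.LAN.Opportunity AS s
    WHERE s.LoadDate = @_LoadDate
      AND NOT EXISTS (SELECT 1
                      FROM DVDB1.Raw.LinkOpportunity AS t
                      WHERE t.RecordSource          = s.RecordSource
                        AND t.OpportunityHashKey    = s.OpportunityHashKey
                        AND t.SalesforceUserHashKey = s.SalesforceUserHashKey
                        AND t.CRMAccountHashKey     = s.CRMAccountHashKey
                        AND t.CampaignHashKey       = s.CampaignHashKey);

    -- @@ROWCOUNT now plays the role of the INSERT count captured by OUTPUT $action
    SELECT 'INSERT' AS Change, @@ROWCOUNT AS CountPerChange;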

    No one shall be able to drive us from the paradise that Cantor created for us.

  • Thanks.

    I am not able to change the logic of the query, as they follow the Data Vault philosophy to the letter here and I have no way around it.

    My question was: how bad would it be to use a ROWLOCK instead of a MERGE on a large table?

    I will set up a test to run for a few days and check the timings, but I wanted to know what other people think.


    Thanks

    Astrid

  • Performance Tip: The conditional behavior described for the MERGE statement works best when the two tables have a complex mixture of matching characteristics. For example, inserting a row if it doesn't exist, or updating a row if it matches. When simply updating one table based on the rows of another table, improve the performance and scalability with basic INSERT, UPDATE, and DELETE statements.

    https://docs.microsoft.com/en-us/sql/t-sql/statements/merge-transact-sql?view=sql-server-ver15
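
    Applied to this load, that advice would mean splitting the MERGE into an UPDATE of the link rows that already exist, followed by an INSERT of the missing ones. A rough sketch, reusing the join columns from the posted procedure (the declared load date is only an example value):

    DECLARE @_LoadDate DATETIME = '2021-01-01';   -- example; in the procedure this comes from @LoadDate

    BEGIN TRANSACTION;

    -- Step 1: refresh LastSeenDate on link rows that already exist
    UPDATE t
    SET    t.LastSeenDate = s.LoadDate
    FROM   DVDB1.Raw.LinkOpportunity AS t
    INNER JOIN StageDB.LAN.Opportunity AS s
        ON  t.RecordSource          = s.RecordSource
        AND t.OpportunityHashKey    = s.OpportunityHashKey
        AND t.SalesforceUserHashKey = s.SalesforceUserHashKey
        AND t.CRMAccountHashKey     = s.CRMAccountHashKey
        AND t.CampaignHashKey       = s.CampaignHashKey
    WHERE  s.LoadDate = @_LoadDate;

    -- Step 2: insert the link rows that do not exist yet,
    -- e.g. with the INSERT ... WHERE NOT EXISTS sketched earlier in the thread.

    COMMIT TRANSACTION;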

  • Row locks are more expensive than coarser locks, and ROWLOCK hints can make queries slower and blocking worse (https://littlekendra.com/2016/02/04/why-rowlock-hints-can-make-queries-slower-and-blocking-worse-in-sql-server/). I'm not sure what you think ROWLOCK will buy you.

    MERGE runs best when the source table and the target table have matching clustered indexes on the columns used in the ON clause. You could benefit from doing your UNION into a worktable with clustered index characteristics identical to the target, then using the worktable as your source (see the "Best practices" section of https://docs.microsoft.com/en-us/sql/t-sql/statements/merge-transact-sql?view=sql-server-ver15).
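
    A minimal sketch of that worktable idea, assuming the target is clustered on the five columns in the ON clause and that RecordSource is VARCHAR(100) (both assumptions; match the real definitions):

    -- Worktable clustered the same way as DVDB1.Raw.LinkOpportunity (assumed index definition)
    CREATE TABLE #OpportunitySource
    (
        LinkOpportunityHashKey  CHAR(40)     NOT NULL,
        LoadDate                DATETIME     NOT NULL,
        LastSeenDate            DATETIME     NOT NULL,
        RecordSource            VARCHAR(100) NOT NULL,   -- assumed type
        OpportunityHashKey      CHAR(40)     NOT NULL,
        SalesforceUserHashKey   CHAR(40)     NOT NULL,
        CRMAccountHashKey       CHAR(40)     NOT NULL,
        CampaignHashKey         CHAR(40)     NOT NULL
    );

    CREATE CLUSTERED INDEX CIX_OpportunitySource
        ON #OpportunitySource (RecordSource, OpportunityHashKey, SalesforceUserHashKey,
                               CRMAccountHashKey, CampaignHashKey);

    -- Populate #OpportunitySource with the exact query currently in the USING clause
    -- (the all-zero zero-key row UNIONed with the staged rows for @_LoadDate),
    -- then run the existing MERGE unchanged with "USING #OpportunitySource AS Source".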

    Also, MERGE is unnecessary unless you are doing a mix of operations. If all you want to do is insert rows, just use INSERT. I'm not sure what the purpose of the [LastSeenDate] column is.

    Finally, inserting 5 million rows all at once may be bogging you down because of logging. Consider writing a loop to load 100,000 rows at a time. If you do this, be careful to write your loop to walk through ranges of keys, so you aren't starting at the beginning of your table each time. Since you are outputting 5 million rows to the table variable, you are also hammering tempdb; again, looping logic can ease the pain here by resetting every 100,000 rows.
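
    A rough sketch of that batching loop, walking the staging table in LinkOpportunityHashKey order; the batch size, the example load date, and the assumption that the link hash key alone identifies a link row are all illustrative:

    DECLARE @_LoadDate DATETIME  = '2021-01-01';   -- example; the procedure parameter in practice
    DECLARE @BatchSize INT       = 100000;         -- illustrative batch size
    DECLARE @LastKey   CHAR(40)  = '';             -- resume point: each pass starts after this key
    DECLARE @NextKey   CHAR(40);

    WHILE 1 = 1
    BEGIN
        -- Upper boundary of the next batch, found without rescanning rows already processed
        SELECT @NextKey = MAX(k.LinkOpportunityHashKey)
        FROM (SELECT TOP (@BatchSize) LinkOpportunityHashKey
              FROM   StageDB.LAN.Opportunity
              WHERE  LoadDate = @_LoadDate
                AND  LinkOpportunityHashKey > @LastKey
              ORDER BY LinkOpportunityHashKey) AS k;

        IF @NextKey IS NULL BREAK;                 -- nothing left to load

        -- Insert only the missing link rows in this key range
        INSERT INTO DVDB1.Raw.LinkOpportunity
            (LinkOpportunityHashKey, LoadDate, LastSeenDate, RecordSource,
             OpportunityHashKey, SalesforceUserHashKey, CRMAccountHashKey, CampaignHashKey)
        SELECT DISTINCT
            s.LinkOpportunityHashKey, s.LoadDate, s.LoadDate, s.RecordSource,
            s.OpportunityHashKey, s.SalesforceUserHashKey, s.CRMAccountHashKey, s.CampaignHashKey
        FROM StageDB.LAN.Opportunity AS s
        WHERE s.LoadDate = @_LoadDate
          AND s.LinkOpportunityHashKey >  @LastKey
          AND s.LinkOpportunityHashKey <= @NextKey
          AND NOT EXISTS (SELECT 1
                          FROM DVDB1.Raw.LinkOpportunity AS t
                          WHERE t.LinkOpportunityHashKey = s.LinkOpportunityHashKey);

        SET @LastKey = @NextKey;
    END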

    P.S. I have no idea what another author meant by the target table being defined in a CTE. Could someone enlighten me?


    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • There is a book my boss uses called Building a Scalable Data Warehouse with Data Vault 2.0,

    and it gives specific instructions on how to write the code. That is where the LastSeenDate comes from, and how the MERGE is written.

    I don't have a lot of room to do what I think is best; I need to follow that methodology.

    I will try the loop idea, and also doing it through SSIS, which I personally think is not the best idea.

    Has anyone had success with the Data Vault methodology?

