• The main issue is the cardinality estimation for the Filter in the MERGE plan: estimated rows 32 million, actual 10 thousand.

    The sorts in the plan require a memory grant, which is sized based on the estimated number of rows. Once the merge update actions have been Split into separate deletes and inserts, the estimate at the sort is 57.5 million rows. Ironically, none of the non-clustered indexes are changed in the example plan you gave, so the sorts actually encounter zero rows!

    As you may know, queries cannot start executing until workspace memory has been granted. Your example MERGE plan ended up acquiring over 11GB of memory (!) though it may have asked for even more to begin with. It is highly likely the query had to wait a considerable time before 11GB could be granted. You can monitor memory grants using sys.dm_exec_query_memory_grants. If you had been using SQL Server 2012, the wait time would also have been reported in the execution plan.

    The wait for memory grant is the reason the 80-row run took so long, and the wildly inaccurate cardinality estimate is the reason for the ridiculous memory grant.

    The cardinality estimate is wrong because this Filter eliminates rows for which no merge action (insert, update, or delete) is necessary. The unusual structure of the MERGE statement makes estimating the number of rows that require no changes very difficult, and the cardinality estimator gets it very wrong. The offending part of the MERGE statement is:

    WHEN NOT MATCHED BY SOURCE

    AND Production.ParticipantID IN ( SELECT ParticipantID FROM ETL.ParticipantResponseBuild )

    I understand what you are trying to do here, and why you wrote this clause this way but that doesn't change the fact that it is unusual and makes an already-difficult cardinality estimation problem just about impossible. To the optimizer, it looks very much as if almost all rows will result in an insert, update, or delete.

    If you cannot reformulate the MERGE to handle the DELETE option using more transparent logic, use the MERGE for the INSERT and UPDATE and perform a separate DELETE. Or just use the three separate statements, of course, if you find that performs better. I would not say "never use MERGE" but it can require skilled tuning in many cases, and is rarely the best choice where parallel execution is needed. Cardinality estimation is much easier in the case of separate statements.

    Other things, if you are interested:

    MERGE is not optimized for large set processing, despite the common intuition that one statement ought to be faster than three. On the other hand, MERGE does contain some very specific optimizations for OLTP (in particular, the elimination of Halloween Protection in certain cases). There are particular considerations for MERGE that make it unsuitable for large sets. Some of these are due to implementation decisions, some are optimizer limitations, and others just come down to the immense complexity of the merging operation itself.

    The Compute Scalar that determines the merge action is a Compute Sequence. This operator cannot tolerate parallel execution, so a parallel MERGE plan will stop and start parallel execution either side of it. The costs of stopping and restarting often result in MERGE plans that do not use parallelism, where it might otherwise be expected.

    You can improve the cardinality estimates in some areas of the plans by adding OPTION (RECOMPILE) to the queries that reference the local variable @LastKeyCompleted. This hint allows the optimizer to see the current value of the variable and optimize accordingly. Otherwise, the plan is based on a guess. Recompiling may not take long here compared to the run time of the query, so it could be a price worth paying to get a plan tailored to the current value of @LastKeyCompleted.

    The construct OUTER JOIN ... WHERE NULL is almost never preferable to writing a NOT EXISTS. The logical requirement is an Anti Semi Join, performing a full join and then rejecting NULLs is an odd way to express it. This topic has been written about many times, always with the same results. Use NOT EXISTS.

    HOLDLOCK = SERIALIZABLE. There are good reasons to use this with MERGE for highly concurrent systems, but it would be quite unusual for ETL. Be sure you need this hint.