Blog Post

The Strange Story About the MERGE that Killed the Cluster

,

I don’t like the MERGE statement. The syntax is weird, it’s tricky in terms of locks, and it has a history of weird errors. The other day a developer wanted my help in troubleshooting a MERGE that caused an ACCESS VIOLATION error and a stack dump generation. That led me to tweet this:

I may have exaggerated a bit. It might be limited just to our galaxy (thanks Doc Brown). Anyways, we started troubleshooting and eventually figured out the above error happens because of a specific code section (the whole scenario is dependent on other factors and is very hard to reproduce. We opened a bug and I’ll update when I have more details).

The problematic code block looked like this:

DECLARE @Var int = 0;
 MERGE TargetTable AS target
 USING (Column_A, Column_B, Column_C FROM SourceTable)
 AS source (Column_A, Column_B, Column_C)
 ON (target.Column_A = source.Column_A)
 WHEN MATCHED THEN
 UPDATE SET @Var = 1
 WHEN NOT MATCHED THEN
 INSERT (Column_A, Column_B, Column_C)
 VALUES (Column_A, Column_B, Column_C);

Unlike me, the developer likes MERGE, and what he accomplished here is “Insert where not exists”. But for some reason, updating the variable was problematic.

Meanwhile:

We started getting alerts about an Availability Group that kept failing. I went to Failover Cluster Manager and saw the Availability Group was offline. I brought it back online, but it failed again almost immediately. I did it a few more times, but the Availability Group refused to stay online.

And then I recalled I once saw a video by Jonathan Kehayias, where he talked about “FailureConditionLevel”, a Windows Cluster configuration option that determines the severity at which there will be a cluster Failover:

  • 0 – No automatic failover
  • 1 – Failover/Restart when the primary node is down
  • 2 – Failover/Restart when the SQL Server service is up but not responding
  • 3 – Failover/Restart on critical errors like a stack dump
  • 4 – Failover/Restart when a resource is unhealthy, for example, memory
  • 5 – Failover/Restart on any qualifying condition and when there’s a query processing error

* Based on the output of sp_server_diagnostics. Click here for a more detailed exaplanation

The value in our cluster was the default – 3. Obviously the cluster kept failing and stayed offline because of the stack dumps generated by the MERGE statement. I changed the value to 1, and the cluster resource stopped failing.

We then moved on to fixing the MERGE code, which was very simple: Omitting the WHEN MATCHED part.

Key Takeaways:

  1. Use caution with MERGE
  2. Dig into your clusters. Know the various options and decide when a Failover is needed

Rate

You rated this post out of 5. Change rating

Share

Share

Rate

You rated this post out of 5. Change rating