Introduction
As you may know, in the early days of SQL Server Clustering, Microsoft advised to provide each clustered SQL Server instance with its own Microsoft Distributed Transaction Coordinator (MSDTC) resource. Since SQL Server 2012 this advice has been withdrawn. Microsoft now recommends to provide your instance with its own MSDTC resource, once you notice the Local MSDTC resource cannot cope with the load it receives.
Now we are using Windows 2022 and MS clustering improved massively over time.
How did this escalate?
As many of us, I overlooked this change in guidance. Until now, as I was reading the guidelines once more for SQL Server 2019 clustered instances. Since I have a structured setup procedure handing the initial install of the clustered instance on the needed nodes, one of the pop-ups to reply to is “Do you need a clustered MSDTC resource”.
This has always been confirmed and the clustered resource has been added using PowerShell. As the current guidelines state to only create a clustered MSDTC resource when actually needed, and the instance which was being handled was a low volume instance, I removed the MSDTC resource using the Failover Cluster Manager.
When I handed over the instance to the developers team, they tested the application for the instance, and I got the “GO” for production. At the agreed time frame, we migrated the databases, etc. for production and everything seemed to be running smooth. Happy customers.
All went fine, until the weekend passed and someone noticed that some data was not being refreshed on one of their mainframe reports. Our developers team immediately addressed the incident and found out their synchronization program (SQL Server vs DB2) had crashed on every run ever since the migration.
They escalated the issue to me.
My Troubleshooting steps
First, verify the error using SSMS: Is the error information accurate?
Because I followed the same install procedure on my DBA test cluster and also had remove the DTC resource, the distributed transactions also failed with that instance. I performed further testing on my DBA test cluster, without interfering with the production instance.
Next, a double check: Is the local MSDTC configured to be used?
This seemed to be configured as it should be. Why did it not work?
AI to the rescue? Gemini, Copilot, Perplexity …. All directed to a misconfigured MSDTC … no help at all.
First try for a workaround
Add the clustered MSDTC resource to the instance and configure it to be usable.
This did not fix the problem !
Running Begin Distributed Transaction still failed stating the MSDTC resource was not available.
Then I failed over the instance and ran the test again.
Still the same problem.
A double check of the registry only confirmed it should be working, because that Clustered defaultVirtualServer pointed to the virtual SQL Server computer name !
Then I removed the MSDTC resource again ( using Failover Cluster Manager )
I rebooted the passive node, failed over the instance and rebooted the new passive node. Once again, this did not fix the situation.
Checking the registry we noticed the attributes ware still there, but their values had been cleared.
Then we removed the attribute keys (marked), but to no reveal.
Rebooting, failovers, testing, …. did not solve the problem.
The quick and dirty temporary workaround
As a temporary measure, I asked the dev team to exclude the distributed transaction from their code and monitor that application close in case synchronization problems would arise. Of course, I also suggested to make this startup parameter driven, so they should be able to test this fairly quick, once a solid solution had been provided.
Then we reinstalled my DBA Test cluster and put the SQL Server clustered instance on it, this time directly without the clustered MSDTC resource. Now, on my DBA Test environment, things worked as they should.
So, We prepared for another production instance migration. New VMs, new Clustered SQL Server instance. However, for this instance to be able to migrate, I needed a special firewall – not managed in house – to be modified to allow for the new nodes communication. This could take up to a week to get configured.
In the meantime on my DBA Test environment
As this instance did not have a clustered DTC resrouce bound to the clustered SQL Server instance, I added one. The distributed transactions kept on working. Of course, this was to be expected, because that matched the "old" install procedure.
As soon as I removed the clustered DTC resource again and stopped/started the SQL Server instance, the distributed transactions failed again. Adding the clustered MSDTC resource once more, did not solve the issue.
I removed it again. The first removal must have left some residue causing things to fail.
The Breakthrough
Searching the registry for the SQL Server instance name, I found this registry key hive:
[HKEY_LOCAL_MACHINE\Cluster\MSDTC\TMMapping\Service\MAPinstancename]
"ClusterResourceId"="257adcc8-3075-42d9-a85a-db70a903d1ac"
"ApplicationType"=dword:00000001
"Name"="MSSQL$instancename"
I was working on my DBA TEST instance, so I could allow myself to mess up the whole installation if I wanted to.
I decided to remove this registry hive on both nodes. Even without instance failover or reboot, distributed transactions started working again.
Why did it not work after I reinstalled the DTC clustered resource, bound to the instance?
When installing the Clustered Distributed Transaction Resource for the second time, this map was not refreshed, because a MAP resource was found having the same name as the intended resource name.
In such case, we should drop and recreate that MAP resource, but that would not fix the issue that it is not being removed when the Clustered Distributed Transaction Resource is being removed using the GUI.
Final conclusion
A bug in Failover Cluster Manager caused the Clustered Distributed Transaction Resource not to be uninstalled completely. The Cluster\MSDTC\TMMapping\Service\MAPinstancename hive is not being removed.
Removing this [HKEY_LOCAL_MACHINE\Cluster\MSDTC\TMMapping\Service\MAPinstancename] hive solves the issue.
We planned to apply this fix in production ASAP.
After this fix has been applied, the application parameter to avoid distributed transactions has been removed form the application’s configuration file, and distributed transactions are being used without a problem.
Another story from the trenches of the universe called SQL Server.
Resources