How HADR_SYNC_COMMIT waits can wreak havoc

,

It has been a decade since I have blogged. Or at least it feels like it. Today I would like inform about HADR_SYNC_COMMIT. Even though it looks like a wait type that does not wreak a havoc, it actually does in some cases. Here is one of the cases.

First let me explain our DB server set up. We have AG (Synchronous) set with Primary and Read_Only secondary. Both the servers are 4x10 CPU, with one TB RAM. All the disk are SSD’s. We use the read_only secondary node for reads and some reads hit the primary along with the writes. Our primary DB handles around 12K transactions per second and the CPU usually stays below 10%. We also have a transactional replication set up. Not all tables are replicated to our subscriber database.

Our secondary DB has some queries that has some stored procedures that has high amount of parallelism. Usually it is not a problem since those stored procedures are not called that often. That all changed about three weeks back. Our website got hit by some very aggressive BOT that was scraping the contents of our site. Unfortunately for us, the way this BOT was configured, it was reading the part of the website which was calling the stored procedures that had the parallel plan. The aftermath of this was a huge spike in CPU usage. The CPU usage was spiked at around 90% causing WAITS on CPU.

When the secondary got bogged down by heavy parallel queries, the data updates from the Primary had to wait for resource to write to log. Thus caused blocking on the primary server. These kind of blocks are caused to the HADR_SYNC_COMMIT waits.

Since our website does 12K transactions per second, the blocking started going up exponentially. The exponential growth of these blocks failed to release the locks on the underlying tables. This in turn started causing multiple dead lock. Since our DB server process so many transactions per second, this usually brings all the applications down to its knees.

To resolve this issue, I moved that particular query to execute in our replicated DB (Transactional replication). I hope this lesson I learned will help others and they don’t end up with issues like I faced.

Rate

Share

Share

Rate