This is going to be a quick look at an issue that we recently encountered, I’m not going to go in depth into spinlocks and what they are (perhaps that could be the subject of a future post).
We were enjoying a nice peaceful afternoon when we hear panicked shouting that a SQL Server had become unresponsive and the customers were unable to do anything.
We moseyed on down to the server in question to take a look at it. One thing stood out immediately, CPU was pegged out at 100% but SQL itself didn’t actually seem to be doing anything, transactions\second was on the floor. Unfortunately this happened a while back and I didn’t think to capture any graphs or metrics at the time so you’re just going to have to take my word for this.
After looking into a few different things, the mention of spinlock contention came up. I’ll be honest here, actual spinlock contention is rare and is something that I’ve seen cause an issue only a handful of times so it’s something that I don’t generally get to until I’ve ruled out just about everything else.
A quick look at sys.dm_os_spinlock_stats gave me back the following results.
Remember that DMV shows cumulative data going back to the last time that the SQL Server was started so you do have to take it was a bit of a pinch of salt. The better way to look at it is to look at the change in values over a given time period.
But, look at the number of spins for QUERY_STORE_ASYNC_PERSIST, that is a BIG number and even though the number of backoffs doesn’t look too bad, that’s still a sizable number and at the time of the incident it was ramping up quickly.
So it was looking like our problem was being caused by Query Store related spinlocks.
The next step, switch Query Store to read only on all databases, actually we first tried disabling it but that just hung (I assume, because SQL was just too pegged to be able to clear out the Query Store tables, which is what happens with you disable Query Store. Switching to read only leaves all the data in tact). As soon as that was done, CPU dropped down to reasonable levels and SQL Server suddenly started talking again.
A bit of Googling later on and we stumbled across this from Microsoft, https://support.microsoft.com/en-gb/help/4340759/slow-performance-when-query-store-is-enabled-in-sql-server-2016
Assume that you have a database that has Query Store enabled in Microsoft SQL Server 2016. When the database is under heavy workloads, spinlock contention may occur and server performance may become slow. In particular, you may see heavy contention on the QUERY_STORE_ASYNC_PERSIST spinlock or SPL_QUERY_STORE_STATS_COOKIE_CACHE spinlock.
This is fixed in SQL2016 SP2 CU2 according to that KB article.
The server giving us trouble was running SQL 2016 SP1 and you could certainly call it busy, hosting over 300 databases.
There doesn’t seem to be much out there on this particular issue, except for that KB and the a single forum mention so hopefully this post will be able to point anyone else who hits it in the right direction.