Yesterday morning around 6AM, my Blackberry started going off. The subject line of every email was, “SQL Timeouts”. In the past, this was a direct result of my Index Rebuild or Defrag nightly process. However, since I have implemented a process that is able to recognize when the Index job is blocking, I knew this could not be it. I quickly ran, the following script:
FROM sys.dm_exec_requests r
INNER JOIN sys.dm_exec_sessions s
ON r.session_id = s.session_id
CROSS APPLY sys.dm_exec_sql_text(sql_handle) t
r.session_id <> @@spid --and blocking_session_id = -4
ORDER BY 2 DESC
As expected, there was a lot of blocking. The strange thing was the –4 session_id. I have experienced –2 session_ids and I understand how to fix this problem. Click here for more information on –2 session_ids. I had never experienced a –4. After a few minutes I noticed (2:1:103) in the wait_resource column of the above query, which represents to DatabaseID, FileNum, and PageNum. I stumbled upon KB328551, which suggest adding a Trace Flag 1118 and Increasing the number of data files in TempDB. Increasing the number of data files was not a problem, however I was skeptical about adding the trace flag since the article applied to SQL Server 2000 and we are currently running SQL Server 2008. Adding the trace flag forces SQL Server to use uniform extent allocations instead of mixed page allocations.
SQL Server uses a special allocation page called a Shared Global Allocation Map (SGAM) to find unallocated pages in mixed extents. When there are several concurrent processes that need to access the SGAM, in our case 4000 instead of 100, the SGAM could become the bottleneck. This is because each process will need to access the SGAM several times.
Since this was my production system I decided to place a call to Microsoft. While I was on hold I found out that a new process that increased calls to the database by a magnitude of 100 was placed into production on the previous evening. When the Microsoft Engineer came on he poked around for about 15 minutes and then he suggested adding, what else, Trace Flag 1118 as a startup parameter. I was still a little skeptical about adding the trace flag because the article was so specific to 2000, but the engineer pointed me to a section of the KB that also suggested to use of the trace flag for SQL 2005 and 2008. Even further, I found a page by Microsoft Customer Server Support (CSS) recommending the usage of the trace flag to assist in TEMPDB scalability. We added the trace flag as a startup parameter for SQL Server and restarted the SQL Service. Before we added the trace flag I had the developer stop the process that was making the thousands of calls to the database. When I restarted the Service everything was up and running. I asked the developer to restart his service. After a few minutes the –4’s returned. They were not producing as much contention as before and all the systems were still available, but it was slightly slower than normal. As a result, we disabled only the part of the process that was causing the contention, reworked it and redeployed. Plans have been made to add additional database files to TEMPDB also.
Explaining why this problem occurred and how the Trace Flag and additional data files helped solve the problem is beyond the scope of a blog posting. However, reading the KB article, the Blog posting from CSS and this article http://technet.microsoft.com/en-us/library/cc966545.aspx will provide good information regarding these issues.
Talk to you soon
Patrick LeBlanc Founder TSQLScripts.com and SQLLunch
SQL Down South