Key findings on the troubled system:
- From Perfmon, 15 second intervals
- - System remains at 95%-100% CPU busy
- - SQL Server buffer page lookups < 10000 per/second
- From wait and spinlock DMVs, 5 minute intervals
- - High CMEMTHREAD waiters and wait time
- - High SOS_SUSPEND_QUEUE spins and backoffs
Bob Dorr's CSS Engineer Blog post on trace flag 8048 indicates that systems with more than 8 cores per NUMA node can run into similar symptoms due to bottleneck in query memory grant. Trace flag 8048 will change the strategy to per-core instead of per-NUMA node.
MSSQL was restarted with -T8048 in place. The difference was immediately evident: buffer page lookup rate rose over 1 million and spike to 8 million per second. The troubled batch workload, which previously couldn't complete in 24 hours, completed in less than 4 hours. Another batch workload which was not the focus of investigation or intervention was submitted as part of validating the corrective value of trace flag 8048 (and ensuring that its unwanted side effects were minimal). This report batch previously completed in 2 hours; with trace flag 8048 in place the report batch completed in roughly 20 minutes.
Nightly ETL also encountered a benefit. ETL time dropped from roughly 60 minutes to 40 minutes.
Pulling together information from several places, I speculate that the high degree of report queuing, the concurrent report count greater than hardware thread count, and the single user account for all reports combined to put pressure on one NUMA node until worker thread pressure caused it to be disfavored for the next incoming connection request for the same user account, at which point the next NUMA node would get some number of connections near instantly. Each NUMA node would end up with a high probability of stressing the query memory grant bottleneck.
Opening more lanes for query memory grant removed the bottleneck. But, I'm not sure the cost. Bob Dorr's CSS post makes it clear that there is additional memory overhead with trace flag 8048. Is that overhead within the single-page allocator region governed by MSSQL 2008 R2 max server memory? If so, I guess the system will just have some number fewer database pages in the buffer pool cache. If not, should max server memory be lowered to accomodate?
Finally... trace flag 8048 was a 'fix', but was it the best fix? Would SQL Server ignoring physical NUMA due to trace flag 8015 have accomplished the same thing? What about setting the BIOS to interleave memory, leaving the server with SMP-imitating SUMA behavior instead of NUMA behavior?