We've recently upgraded to Windows 2008 R2, SQL Server 2008 R2 SP + CU4, multi-instance cluster. 32 cores (64 with h/t), 384GB per node. Maxdop 4 server wide, 320GB max server memory. It used to be Win2k3 R2, SQL Server 2005 SP3, 2 x single instance clusters - 8 cores, 96GB RAM, Maxdop 0 server wide.
We have 6 databases on this server, the main one being an Enterprise Data Management Platform that munges data from lots of different sources. It's a 3rd party product that has something like 18000 encrypted stored procedures, 12000 tables, 3000 views. Good, eh ;-). There are triggers on a lot of these tables that send insert/update/deletes to a local staging database, then on to a repository database across a linked server.
For the past few years we have been using a SQL monitoring product (name witheld to protect the innocent) against out production cluster (multi-instance) without any problems. However, on the day we upgraded, when we turned on a particular feature that monitors the types of queries being executed against the server, some data load went from completing in 15 mins, to taking 14 hrs to run. When we turned off this feature, it returned to normal. No other databases appear to be affected.
We contacted the vendor of the monitoring software and found out in essense that this feature runs a SQL trace capturing the following events: SQL:BatchCompleted, SQL:StmtCompleted, SP:Completed for events with a run duration of greater than 5000ms. Sure enough, just running the trace against the server whilst processing data had exactly the same effect as using the query monitoring feature of the software. So, I think we say that although the monitoring software causes the problem, it does so only by leveraging a feature native to SQL Server.
The EDM vendor was aware of the conflict between their product and the monitoring software but have been unable to explain the cause, or reproduce it in their environment. They've also tried against a version of the DB that doesn't have any encrypted procedures (I don't have that copy).
During the slow down, I query sys.dm_exec_requests and all of the sessions associated with the hung jobs have a RUNNABLE status with a 'last_wait_type' of SOS_SCHEDULER_YIELD. There's nothing in sys.dm_os_waiting_tasks. No evident blocking. CPU's are completely idle. I tried changing MAXDOP settings, affinity, hyperthreading, etc. as well (to make it more like the old system), but no effect. As soon as I stop the trace, everything resumes at normal speeds and runs through to completion.
These are the top 10 waitstats:
wait_type wait_time_ms signal_wait_time_ms resource_wait_time_ms percent_total_waits percent_total_signal_waits percent_total_resource_waits
PREEMPTIVE_OS_WRITEFILEGATHER 4729935 0 8553549 0.308682089780738 0.000000000000000 0.308682089780738
BACKUPTHREAD 4721076 2466 8620446 0.311185274808784 0.001188989695061 0.311096281335619
ASYNC_IO_COMPLETION 1543967 532 18797472 0.678385914247604 0.000256505481659 0.678366715331252
OLEDB 1019783 0 93509304 3.374581401551842 0.000000000000000 3.374581401551842
LCK_M_IX 192111 707 458001 0.016553940830775 0.000340882284837 0.016528426481413
MSQL_DQ 167309 0 2099902 0.075781659483654 0.000000000000000 0.075781659483654
LCK_M_IS 94128 69 1338334 0.048300539928959 0.000033268568109 0.048298049843943
LCK_M_U 86708 316 270195 0.009762251994894 0.000152360398880 0.009750848127286
LCK_M_S 77223 1699 245245 0.008911761653415 0.000819178220563 0.008850447820931
LCK_M_X 61582 1207 241601 0.008762500905236 0.000581958865344 0.008718942461557
Any ideas what else I might check? Like I said, this used to work fine under SQL 2005/Win2k3, with the same version of the monitoring software.