I'm having some issues with an active/active SQL 2008R2 cluster that I'd like some thoughts on.
The system is as follows:
OS: Windows Server 2008 R2(x64)
SQL: 2008 R2 Enterprise SP2, (x64)
64 Procs, 768 GB memory per node.
Min memory is set to 256 GB, Max Memory set to 350GB.
After a fail-over (for patching or other maintenance), withing 48 hours one of the instances will go unavailable until such time that it grabs memory up to its max setting (350GB per instance). The system will grab 150-250 GB of memory over a ~30 minute period until it reaches the max memory setting. During this time-frame the server is up, however severe blocking causes it to be very sluggish and unavailable for most application traffic. The connections timeout before the results are returned. A simple sp_who2 may take upwards of a minute to return results.
Once it gets to its max memory setting it returns to normal, the db engine resolves the blocking and everything is ok until the next maintenance window.
My theory is that a poorly performing query causes the system to grab resources until it hits its max. During this time-frame a DBA typically intercedes and kills the offending spid(s), however it never completes the rollback until the max memory setting is reached.
Also, we end up with Latchwait timeouts of type 2 and 4 for class: 'ACCESS_METHODS_DATASET_PARENT'
and Buffer Latch timeouts of type 2 for tempdb and type 4 for MSDB.
From what I've read those basically indicate a resource issue of some kind. CPU utilization ramps up to 60%, but i/o is fairly low, never really going above 1500 iops. This system pushes over 50K iops at times.
With the exception of lowering the max memory setting does anyone have any thoughts on possible causes and/or ways to remediate this issue?
Thanks in advance,
To help us help you read this[/url]For better help with performance problems please read this[/url]