Unexplained Failovers

• We have a three-node SQL Server 2012 SP1 cluster on Windows Server 2008 R2 that has nine named instances on it. Normally, one node (let's call it Node C) should always be dormant, and the workload is split across the remaining two nodes, A and B. Node A (which hosts the same 5 instances all the time) does not have any failover issues. Node B, however, hosts the same 4 instances all the time and has periodic failovers to Node C. If the workload stays on Node C long enough (roughly 49 hours), all of the instances fail back to Node B. So we have a slow ping-pong game between Node B and Node C, which disrupts some apps that don't always reconnect properly after a failover. We have a ticket open with Microsoft, but so far we do not have a resolution. The failures started abruptly around 02/25/2015. There were no changes made by the DBAs in that time period. However, we are a patch-happy shop with HBSS in the mix, and there were likely a number of patches applied in this time frame.

We capture the contents of sys.dm_os_sys_memory to a table every ~10 minutes, and it appears that available_page_file_kb starts out around 132,000,000 (roughly the amount of physical RAM in the server) and drops about 3 GB per hour until the failover of the four instances occurs. Once the failover occurs, those instances *cannot* be failed back manually until the (now dormant) server is rebooted.
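In case it helps, the capture is essentially just a scheduled insert along these lines (the table name and column list below are illustrative, not our exact job):

-- Illustrative capture of sys.dm_os_sys_memory to a history table,
-- run by a SQL Agent job every ~10 minutes.
-- dbo.SysMemoryHistory is a placeholder name.
IF OBJECT_ID('dbo.SysMemoryHistory') IS NULL
    CREATE TABLE dbo.SysMemoryHistory
    (
        capture_time                 DATETIME NOT NULL DEFAULT GETDATE(),
        total_physical_memory_kb     BIGINT,
        available_physical_memory_kb BIGINT,
        total_page_file_kb           BIGINT,
        available_page_file_kb       BIGINT,
        system_memory_state_desc     NVARCHAR(256)
    );

INSERT INTO dbo.SysMemoryHistory
    (total_physical_memory_kb, available_physical_memory_kb,
     total_page_file_kb, available_page_file_kb, system_memory_state_desc)
SELECT total_physical_memory_kb,
       available_physical_memory_kb,
       total_page_file_kb,
       available_page_file_kb,
       system_memory_state_desc
FROM sys.dm_os_sys_memory;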

    We also see some periodic messages about SCCM (one of the 4 instances cited above) just before the failure:

Date: 3/19/2015 2:03:29 AM
Log: SQL Server (Archive #7 - 3/19/2015 2:07:00 AM)
Source: spid24s
Message: AppDomain 6 (SMS_xxx.dbo[runtime].5) is marked for unload due to memory pressure.
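For what it's worth, the CLR appdomains themselves can be watched with something like this rough sketch against sys.dm_clr_appdomains (the appdomain in the message above appears to belong to the SCCM database's CLR assemblies):

-- Sketch: list SQLCLR appdomains with their state and memory footprint
SELECT appdomain_id,
       appdomain_name,
       state,
       creation_time,
       total_allocated_memory_kb,
       survived_memory_kb
FROM sys.dm_clr_appdomains
ORDER BY total_allocated_memory_kb DESC;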

    Any ideas?

• How much physical RAM does Node C have, and what is the total of the max server memory values assigned to the 4 instances?

    -----------------------------------------------------------------------------------------------------------

    "Ya can't make an omelette without breaking just a few eggs" 😉

• All three nodes are essentially identical blades with 128 GB of RAM. I don't have access to the servers from home, but, as I recall (per Microsoft's recommendation), we have 12 GB reserved for the operating system, and we split the remaining RAM fairly evenly between the 9 instances so that all 9 instances can coexist on the same physical box in an emergency. (And all 9 instances have ended up on the same box a few times.)

• So Node B and Node C both have 128 GB of RAM, and the 4 instances on Node B have max server memory set to 12 GB each?

    -----------------------------------------------------------------------------------------------------------

    "Ya can't make an omelette without breaking just a few eggs" 😉

  • Yes. We are using:

    (128 GB - 12 GB) / 9 instances = ~12 GB max memory for each instance

    We took this route because we did not want to overcommit RAM if all 9 instances landed on the same machine.
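(The setting itself can be double-checked per instance with something like this sketch, run separately on each instance:)

-- Sketch: confirm the configured max server memory (MB) on the current instance
SELECT name, value_in_use
FROM sys.configurations
WHERE name = N'max server memory (MB)';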

• Well, firstly, the message you posted originally details the issue: you're facing memory pressure within the instance. This is consistent with the fact that you have roughly 80 GB of RAM free on Node B (128 GB minus the 4 x 12 GB caps).

Raising the max server memory on each instance would be a sensible option. It should be fairly easy to dynamically alter the max server memory based on how many instances occupy the same cluster node 😉
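Just to illustrate, a rough sketch only, using the RAM figures you've quoted; how you count the instances currently on the node is up to your own monitoring and isn't shown here:

-- Rough sketch: recompute max server memory from a supplied instance count.
-- @InstanceCountOnNode must be supplied (or derived by your own tooling).
DECLARE @PhysicalRAM_MB      INT = 131072,  -- 128 GB node
        @OSReserve_MB        INT = 12288,   -- 12 GB kept back for Windows
        @InstanceCountOnNode INT = 4,       -- e.g. the 4 instances normally on Node B
        @TargetMaxMem_MB     INT;

SET @TargetMaxMem_MB = (@PhysicalRAM_MB - @OSReserve_MB) / @InstanceCountOnNode;

EXEC sys.sp_configure N'show advanced options', 1;
RECONFIGURE;
EXEC sys.sp_configure N'max server memory (MB)', @TargetMaxMem_MB;
RECONFIGURE;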

    -----------------------------------------------------------------------------------------------------------

    "Ya can't make an omelette without breaking just a few eggs" 😉

  • >>Raising the max server memory on each instance would be a sensible option

Yes, I understand your line of thinking. However, the failures *abruptly* started around 02/25/2015. Prior to that date, available_page_file_kb from sys.dm_os_sys_memory (which we save to a SQL Server table) *never* spiraled down to 0 before a failover like it does after that date. I've graphed the older sys.dm_os_sys_memory data in Excel at work, and available_page_file_kb was essentially a reasonably flat line over time until 02/25/2015: it had a slight decline, but not a drastic decline to 0 like we have now. It now looks like we are burning through 3 GB of RAM per hour until all 4 instances fail over. Once they fail over, they will not fail back without a reboot.

    I'm thinking that changing the amount of RAM assigned to the instances at this point may only mask the issue.

    Since we did not make any SQL Server changes in the 02/25/2015 time frame, I'm thinking that either a patch or some sort of application change is responsible for the new behavior.

    Do you know of any way to chase down which spid is consuming the most virtual memory?
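(The closest thing I've found so far is ranking sessions by the memory SQL Server attributes to them; this is just a sketch, and I realize it won't show anything leaking outside the instance:)

-- Sketch: sessions ranked by the memory SQL Server attributes to them
-- (memory_usage is in 8 KB pages; this will not surface leaks outside SQL Server)
SELECT s.session_id,
       s.login_name,
       s.host_name,
       s.program_name,
       s.memory_usage * 8 AS session_memory_kb
FROM sys.dm_exec_sessions AS s
WHERE s.is_user_process = 1
ORDER BY s.memory_usage DESC;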

• By the way, I tried using DBCC MEMORYSTATUS, but I'm not having any luck interpreting the output. Even on my laptop, that command generates 700+ lines of output, and it generated a similar wall of output at work.
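(A rollup of sys.dm_os_memory_clerks seems easier to digest than the raw DBCC output; this is only a sketch:)

-- Sketch: top memory clerks inside the current instance (SQL Server 2012 column names)
SELECT TOP (10)
       [type],
       SUM(pages_kb)                    AS pages_kb,
       SUM(virtual_memory_committed_kb) AS virtual_committed_kb
FROM sys.dm_os_memory_clerks
GROUP BY [type]
ORDER BY SUM(pages_kb) DESC;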

  • Please see the attached Excel graph of available_page_file_kb from sys.dm_os_sys_memory. We have two scheduled failovers per week to make sure that every node can be patched quickly. This explains the peaks and valleys until around 02/20/2015.

Prior to 02/20/2015, available_page_file_kb from sys.dm_os_sys_memory was always around 120,000,000 or higher. At this point, we are looking for assistance in determining what is eroding available_page_file_kb. When it reaches a near-zero value, all instances fail over and will not fail back unless the original node (prior to the failover) is rebooted.

    For such a drastic change in the graph, I would expect that there is a memory leak somewhere, but we have not been able to find it yet.
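(For the burn rate, we can estimate it straight from the captured history; the sketch below reuses the placeholder table name from earlier:)

-- Sketch: approximate drop in available_page_file_kb within each hour
-- (dbo.SysMemoryHistory is the same placeholder name used above)
SELECT CONVERT(DATE, capture_time)     AS capture_date,
       DATEPART(HOUR, capture_time)    AS capture_hour,
       MIN(available_page_file_kb)     AS min_available_page_file_kb,
       MAX(available_page_file_kb)
         - MIN(available_page_file_kb) AS drop_within_hour_kb
FROM dbo.SysMemoryHistory
GROUP BY CONVERT(DATE, capture_time), DATEPART(HOUR, capture_time)
ORDER BY capture_date, capture_hour;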

• So what patches were applied since the date you identified?

    -----------------------------------------------------------------------------------------------------------

    "Ya can't make an omelette without breaking just a few eggs" 😉

• Hmm... I finally found the syntax to identify the installed patches. There were more patches installed than I thought, but the patch lists match between Node B and Node C. It looks like Node C was patched at about the same time the issue started. We'll look into the patches to see what we can learn.

# Node B

Get-HotFix -ComputerName NodeB |
    Select-Object CSName, hotfixid, installedby,
        @{l="InstalledOn";e={[DateTime]::Parse($_.psbase.properties["installedon"].value,$([System.Globalization.CultureInfo]::GetCultureInfo("en-US")))}} |
    Where-Object { $_.InstalledOn -gt "02/12/2015" } |
    Sort-Object InstalledOn, hotfixid

CSName hotfixid  installedby         InstalledOn
------ --------  -----------         -----------
NodeB  KB3000483 NT AUTHORITY\SYSTEM 2/13/2015 12:00:00 AM
NodeB  KB3004361 NT AUTHORITY\SYSTEM 2/13/2015 12:00:00 AM
NodeB  KB3004375 NT AUTHORITY\SYSTEM 2/13/2015 12:00:00 AM
NodeB  KB3004394 NT AUTHORITY\SYSTEM 2/13/2015 12:00:00 AM
NodeB  KB3013455 NT AUTHORITY\SYSTEM 2/13/2015 12:00:00 AM
NodeB  KB3020338 NT AUTHORITY\SYSTEM 2/13/2015 12:00:00 AM
NodeB  KB3021952 NT AUTHORITY\SYSTEM 2/13/2015 12:00:00 AM
NodeB  KB3023562 NT AUTHORITY\SYSTEM 2/13/2015 12:00:00 AM
NodeB  KB3029944 NT AUTHORITY\SYSTEM 2/13/2015 12:00:00 AM
NodeB  KB3031432 NT AUTHORITY\SYSTEM 2/13/2015 12:00:00 AM
NodeB  KB3006137 NT AUTHORITY\SYSTEM 3/16/2015 12:00:00 AM
NodeB  KB3030377 NT AUTHORITY\SYSTEM 3/16/2015 12:00:00 AM
NodeB  KB3032323 NT AUTHORITY\SYSTEM 3/16/2015 12:00:00 AM
NodeB  KB3032359 NT AUTHORITY\SYSTEM 3/16/2015 12:00:00 AM
NodeB  KB3033889 NT AUTHORITY\SYSTEM 3/16/2015 12:00:00 AM
NodeB  KB3033929 NT AUTHORITY\SYSTEM 3/16/2015 12:00:00 AM
NodeB  KB3034344 NT AUTHORITY\SYSTEM 3/16/2015 12:00:00 AM
NodeB  KB3035126 NT AUTHORITY\SYSTEM 3/16/2015 12:00:00 AM
NodeB  KB3035131 NT AUTHORITY\SYSTEM 3/16/2015 12:00:00 AM
NodeB  KB3035132 NT AUTHORITY\SYSTEM 3/16/2015 12:00:00 AM
NodeB  KB3039066 NT AUTHORITY\SYSTEM 3/16/2015 12:00:00 AM
NodeB  KB3046049 NT AUTHORITY\SYSTEM 3/16/2015 12:00:00 AM

# Node C

Get-HotFix -ComputerName NodeC |
    Select-Object CSName, hotfixid, installedby,
        @{l="InstalledOn";e={[DateTime]::Parse($_.psbase.properties["installedon"].value,$([System.Globalization.CultureInfo]::GetCultureInfo("en-US")))}} |
    Where-Object { $_.InstalledOn -gt "02/19/2015" } |
    Sort-Object InstalledOn, hotfixid

CSName hotfixid  installedby         InstalledOn
------ --------  -----------         -----------
NodeC  KB3000483 NT AUTHORITY\SYSTEM 2/24/2015 12:00:00 AM
NodeC  KB3004361 NT AUTHORITY\SYSTEM 2/24/2015 12:00:00 AM
NodeC  KB3004375 NT AUTHORITY\SYSTEM 2/24/2015 12:00:00 AM
NodeC  KB3004394 NT AUTHORITY\SYSTEM 2/24/2015 12:00:00 AM
NodeC  KB3013455 NT AUTHORITY\SYSTEM 2/24/2015 12:00:00 AM
NodeC  KB3020338 NT AUTHORITY\SYSTEM 2/24/2015 12:00:00 AM
NodeC  KB3021952 NT AUTHORITY\SYSTEM 2/24/2015 12:00:00 AM
NodeC  KB3023562 NT AUTHORITY\SYSTEM 2/24/2015 12:00:00 AM
NodeC  KB3029944 NT AUTHORITY\SYSTEM 2/24/2015 12:00:00 AM
NodeC  KB3031432 NT AUTHORITY\SYSTEM 2/24/2015 12:00:00 AM
NodeC  KB3006137 NT AUTHORITY\SYSTEM 3/17/2015 12:00:00 AM
NodeC  KB3030377 NT AUTHORITY\SYSTEM 3/17/2015 12:00:00 AM
NodeC  KB3032323 NT AUTHORITY\SYSTEM 3/17/2015 12:00:00 AM
NodeC  KB3032359 NT AUTHORITY\SYSTEM 3/17/2015 12:00:00 AM
NodeC  KB3033889 NT AUTHORITY\SYSTEM 3/17/2015 12:00:00 AM
NodeC  KB3033929 NT AUTHORITY\SYSTEM 3/17/2015 12:00:00 AM
NodeC  KB3034344 NT AUTHORITY\SYSTEM 3/17/2015 12:00:00 AM
NodeC  KB3035126 NT AUTHORITY\SYSTEM 3/17/2015 12:00:00 AM
NodeC  KB3035131 NT AUTHORITY\SYSTEM 3/17/2015 12:00:00 AM
NodeC  KB3035132 NT AUTHORITY\SYSTEM 3/17/2015 12:00:00 AM
NodeC  KB3039066 NT AUTHORITY\SYSTEM 3/17/2015 12:00:00 AM
NodeC  KB3046049 NT AUTHORITY\SYSTEM 3/17/2015 12:00:00 AM

• When I googled the KBs that were installed on 02/24/2015, every one of them came back with hits on "memory leak":

    KB3000483

    KB3004361

    KB3004375

    KB3004394

    KB3013455

    KB3020338

    KB3021952

    KB3023562

    KB3029944

    KB3031432

    We're still looking...

• Back them out then, via Add/Remove Programs.

    -----------------------------------------------------------------------------------------------------------

    "Ya can't make an omelette without breaking just a few eggs" 😉

  • I think we found the issue.

    A while back, I started tracking output from sys.dm_os_sys_memory over time based on code from Glenn Berry (http://www.sqlservercentral.com/blogs/glennberry/2010/08/23/dmv-emergency-queries-from-sql-saturday-_2300_51-in-nashville/).

    Despite Microsoft's best efforts, the data we were already collecting (i.e., available_page_file_kb from sys.dm_os_sys_memory) was the only reliable predictor of an upcoming unplanned instance failover.

    We decided to ask our security folks to turn off HBSS (Host Based Security System), and available_page_file_kb jumped up by 50 GB as if a light switch had been flipped. HBSS appears to be the issue.

• By the way, we think the reason the instances would not fail back to the original node *without* a reboot of that node is that the IBM SVC software we use to interface with the SAN could no longer present the LUN to Windows once virtual memory was depleted.
