Issue with NUMA, Buffer Cache, and Connection pooling on 4S Westmere

  • Running SQL Server 2012 SP1 CU4 (yep), with HADR across 3 nodes, no FCI. This issue seems to happen on all nodes, but the primary node takes the worst beating, perhaps for obvious reasons. Servers are HP DL580 G7s with Westmere Xeon E7-4850s and 256GB of RAM (32x 8GB DIMMs spread evenly across the four sockets). SQL is configured with MAXDOP = 2, cost threshold for parallelism = 5, and max server memory = 200000 MB, with lock pages in memory enabled.
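
    For clarity, here is roughly how those settings map to sp_configure (max server memory is set in MB; lock pages in memory is a Windows user-rights policy, not an sp_configure option):

        -- Instance settings as described above (sketch).
        EXEC sp_configure 'show advanced options', 1;
        RECONFIGURE;
        EXEC sp_configure 'max degree of parallelism', 2;
        EXEC sp_configure 'cost threshold for parallelism', 5;
        EXEC sp_configure 'max server memory (MB)', 200000;
        RECONFIGURE;
        -- Lock pages in memory is granted to the SQL Server service account
        -- via the Windows "Lock pages in memory" local security policy.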

    We're experiencing a condition where buffer pages and the associated PLE seem to vacate one of the four NUMA nodes on a fairly regular (but unpredictable) basis. We use SQL Sentry for monitoring, which helps us visualize when memory allocations aren't right. The issue manifests as one node (or two, in some cases) showing a very low number of buffer pages allocated, while one or more of the other NUMA nodes show more buffer pages allocated than they even have locally. For example, NODE000 has 1.5GB, NODE001 has 50.0GB, NODE002 has 75.5GB, and NODE003 has 25.5GB. If we let this run long enough, there is a remote chance that SQL will actually get near 0GB on NODE000 and stop accepting connections. During this behavior, PLE for node 0 hovers in the 1-5 minute range, yet disk IO doesn't seem to suffer much.
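
    If you want to reproduce those per-node numbers without SQL Sentry, here's a sketch of a query against sys.dm_os_buffer_descriptors that groups buffer pages by NUMA node (be warned: scanning this DMV on a ~200GB buffer pool can take a while):

        -- Buffer pool pages per NUMA node, converted from 8KB pages to GB.
        SELECT
            numa_node,
            COUNT_BIG(*)                 AS buffer_pages,
            COUNT_BIG(*) * 8 / 1048576.0 AS buffer_gb
        FROM sys.dm_os_buffer_descriptors
        GROUP BY numa_node
        ORDER BY numa_node;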

    We maintain a very heavily used eCommerce site with a number of web servers, service engines, and other applications that make the business work. They're all MS .NET 4 apps running from a couple dozen VMs scattered across several states over low-latency MPLS circuits. Performance, for what it's worth, is par or better. One thing we do that I find very intriguing, especially in this scenario, is use both connection pooling and multiple active result sets (MARS). So, for our frequent callers, if you will, the traffic always comes in over the same connections, and many of those connections will be kicking off a dozen-plus queries at the same time. Another intriguing detail is that we generally deploy our servers in groups of three: three of this, three of that, and so on. Since the apps are wonderful .NET deals, we recycle app pools twice daily; we stagger those recycling processes over an hour or so, so as not to shock any system or create any web delays.
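
    On the connection-pooling hunch, one thing worth checking (a quick sketch, not something we've fully dug through yet) is whether those long-lived pooled connections are landing unevenly across memory nodes; sys.dm_exec_connections exposes a node_affinity column for exactly that:

        -- Current connection count per memory node, to see whether the
        -- pooled connections skew toward one node.
        SELECT
            node_affinity,
            COUNT(*) AS connection_count
        FROM sys.dm_exec_connections
        GROUP BY node_affinity
        ORDER BY node_affinity;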

    We have only identified three ways of handling this situation:

    1) Do nothing; ride till we die. This can take many hours or many days; zero known predictability. Given our eCommerce stature, we don't have any interest in failure. We do use AGs for availability, but we've come to learn that they never take over correctly when things go to shit: even though SQL isn't accepting new connections, the connections the AG already has open keep working just fine, so no failover. Yay.

    2) Issue a DBCC DROPCLEANBUFFERS (sometimes twice, just to get a good punch in; sketch below). It takes 3-5 minutes to execute, but it works. After 15-20 minutes the buffers are filled again and relatively well balanced, with 35-40GB allocated per node. We don't drop any connections in doing this, so the skewed memory allocations come back within 8-16 hours.

    3) Restart the box and/or fail over to another AG node. This seems to help a lot; we get anywhere from 1-2 weeks in before things seemingly go south again. The sync node generally doesn't have a picture-perfect memory landscape either, so simply failing over doesn't buy us a crapload of time. The restart itself seems to help a great deal.
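
    For option 2, what we issue is essentially the documented command; adding a CHECKPOINT first (arguably optional, and not necessarily what we run every time) flushes dirty pages so DROPCLEANBUFFERS has more clean buffers to release:

        -- Flush dirty pages to disk, then release clean buffers from the pool.
        -- This empties the buffer cache instance-wide, so expect cold reads afterward.
        CHECKPOINT;
        DBCC DROPCLEANBUFFERS;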

    Has anyone else experienced this, or even paid attention to per-node PLE instead of just per-instance PLE? (Thanks, perfmon and Spotlight, for blinding us with meaningless data!) I think there's a strong chance that our connection pooling habits encourage an unequal distribution of connection weight. But I could be wrong, or at least not picking out what matters. It feels like a memory leak, but at the same time we still have access to all of the bytes we thought we should. When I look at this from a performance perspective, I get wrapped up in the fact that one or more NUMA nodes are holding remote pages hostage indefinitely.
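
    For per-node PLE specifically, the Buffer Node counters are exposed in sys.dm_os_performance_counters, so no third-party tool is required to watch them; a sketch:

        -- Page life expectancy and database pages per NUMA node.
        -- instance_name is the node number; the Buffer Manager object
        -- only shows the instance-wide value.
        SELECT
            counter_name,
            instance_name AS numa_node,
            cntr_value
        FROM sys.dm_os_performance_counters
        WHERE [object_name] LIKE '%Buffer Node%'
          AND counter_name IN ('Page life expectancy', 'Database pages')
        ORDER BY counter_name, instance_name;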

    What do you think?

    Update 1 - we have a package/process that we know has a fault in that it leaves connections open. We run an hourly process that looks for connections that have been idle for more than 4 hours and kills them. It puts a lot of noise in our errorlog, but at least we're not hoarding connection SPIDs. We've had issues in the past where we'd get to 10,000 connections and start having out-of-memory problems - we also had issues pre-SP1.
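
    For the curious, that hourly cleanup is conceptually along these lines (a simplified sketch rather than the actual job; in practice you'd want to filter it down to the offending application/login):

        -- Find user sessions that have been sleeping with no request for over 4 hours
        -- and generate KILL statements for them (review the list before executing).
        DECLARE @cutoff DATETIME = DATEADD(HOUR, -4, GETDATE());

        SELECT
            'KILL ' + CAST(session_id AS VARCHAR(10)) + ';' AS kill_cmd,
            login_name,
            host_name,
            last_request_end_time
        FROM sys.dm_exec_sessions
        WHERE is_user_process = 1
          AND status = 'sleeping'
          AND last_request_end_time < @cutoff;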

    Also posted at: http://www.reddit.com/r/SQLServer/comments/2posov/sql2012_issue_with_numa_buffer_cache_and/

  • Don't have time to dig into it right now, but over the years and versions any number of buffer-flushing bugs have hit SQL Server. Some of them were internal and some came from Windows or even firmware/drivers (HP iLO comes to mind here from way back).

    Do some web searching for such things. Also review every CU/SP released after your current build for fixes to issues like the one you're seeing. There was a huge thread here on a 2012 bug that flushed the buffer pool randomly, IIRC.
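
    A quick sanity check on the exact build before going through the post-CU4 fix lists:

        -- Current version, service pack level, and edition.
        SELECT
            SERVERPROPERTY('ProductVersion') AS build,
            SERVERPROPERTY('ProductLevel')   AS sp_level,
            SERVERPROPERTY('Edition')        AS edition;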

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service
