NUMA metric strangeness

Have any of you seen the following metrics on a NUMA box?

I'm seeing some strange metrics in Perfmon related to our server's NUMA architecture, and they seem to correlate with spikes in import runtimes that we've been having. I've read up on NUMA via BOL and Slava Oks's blog, so I've done the necessary homework. :-)

-----Server Specs-----
Windows 2003 SP2 (not an R2 install), Enterprise Edition
SQL 2005 SP2 (build 3042), Standard Edition
16 GB of memory (all of it visible to both the OS and SQL, of course, via the boot.ini switch and the 'awe enabled' option)
4 Dual-Core AMD Opteron 2.81 GHz processors
4-node NUMA at the hardware level. No additional SQL soft-NUMA installed. No connection affinity set up. (We've done nothing special...essentially just a default SQL 2005 installation).

SQL "Max Server Memory" setting initially set to 15000 MB. "Min Server Memory" was initially set to 14000 MB. (long story)

------Strange Behavior-----
Perfmon-->SQLServer:Buffer Manager (the "global" counters) showed a buffer cache hit ratio (BCHR) of 99% or higher, but a terrible Page Life Expectancy (PLE): consistently below 10 seconds for large portions of the day, even during periods of application inactivity.

After drilling down to Perfmon-->SQLServer:Buffer Node, I discovered the following state of affairs (note that these numbers are consistent trends, not just one snapshot I took):

All Nodes:
Target Pages: ~430k
Total Pages: varies between 380k and 420k

Node 0, 1, 2:
Foreign Pages: 0-10k
PLE: 2k-5k
Free pages: 3k-30k depending on which node

Node 3:
Foreign Pages: 75k
PLE: almost always 0
Free pages: almost always 0

Node 3's PLE was obviously affecting the Server-wide PLE metric. And its free list was just hammered...
Again, note that these metrics stayed true even after the application's business hours when nothing was running.
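
For anyone who wants to check their own box for the same pattern, here is a minimal Python sketch of the comparison described above. The counter values are the ranges quoted in this post, entered as (low, high) pairs, with "almost always 0" approximated as 0. The `PLE_FLOOR` threshold and the `starved_nodes` helper are hypothetical names chosen for illustration; they are not part of any SQL Server tool.

```python
# Per-node Buffer Node readings as (low, high) ranges, taken from the post.
# Nodes 0-2 were reported together, so they share one set of ranges.
# "Almost always 0" on node 3 is approximated here as exactly 0.
HEALTHY = {
    "foreign_pages": (0, 10_000),
    "ple": (2_000, 5_000),
    "free_pages": (3_000, 30_000),
}
NODES = {
    0: HEALTHY,
    1: HEALTHY,
    2: HEALTHY,
    3: {"foreign_pages": (75_000, 75_000), "ple": (0, 0), "free_pages": (0, 0)},
}

# Hypothetical alert threshold in seconds; pick whatever suits your workload.
PLE_FLOOR = 300


def starved_nodes(nodes, ple_floor=PLE_FLOOR):
    """Return ids of nodes whose best-case PLE is under the floor
    and whose free list never rises above zero."""
    return [
        node_id
        for node_id, counters in sorted(nodes.items())
        if counters["ple"][1] < ple_floor and counters["free_pages"][1] == 0
    ]


if __name__ == "__main__":
    print("Starved nodes:", starved_nodes(NODES))
```

With the figures from this post, only node 3 meets both conditions.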

SQL had a global target of 14000 MB for the buffer pool based on my config setting, but it never quite reached that, even during periods when it was obviously under internal memory pressure (for example, there were fewer than 200 plans in cache). It would get to about 13500 MB and then just stop.
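
As a rough sanity check on those numbers, the per-node page counts can be converted to MB. This is only a sketch under two assumptions: that the "All Nodes" Target Pages figure of ~430k applies to each of the 4 nodes individually (the post does not say so explicitly), and that each page is the standard 8 KB SQL Server page. The function and constant names are made up for the example.

```python
PAGE_KB = 8                       # SQL Server data page size in KB
NODE_COUNT = 4                    # 4-node hardware NUMA, per the server specs
TARGET_PAGES_PER_NODE = 430_000   # assumption: the ~430k target is per node


def pages_to_mb(pages, page_kb=PAGE_KB):
    """Convert a count of buffer pool pages to megabytes."""
    return pages * page_kb / 1024


per_node_target_mb = pages_to_mb(TARGET_PAGES_PER_NODE)
total_target_mb = per_node_target_mb * NODE_COUNT

if __name__ == "__main__":
    print(f"Per-node target: {per_node_target_mb:.1f} MB")
    print(f"All-node target: {total_target_mb:.1f} MB")
```

Under those assumptions the four node targets add up to roughly 13,437 MB, which is in the same neighborhood as the ~13500 MB plateau. Whether that is more than a coincidence is not something the counters alone can confirm.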

After I lowered "Max Server Memory" to 13000 MB, the metrics started to improve right away and kept improving gradually. By the end of the day, I had a global PLE of 5k, ~2k plans in the procedure cache, and several thousand Free Pages on Node 3.

So it seems like Node 3 was getting "stuck" as it tried to allocate more memory. Perhaps it was continually dropping pages, thus keeping PLE down, but somehow never actually getting the free list populated, so Free Pages stayed stuck near zero?

Anyone seen this before?

(Please note that there seemed to be no real IO actually occurring, i.e. Pages/Sec and Avg. Disk Queue Length were both near zero. Also, the OS had plenty of available memory: over 1 GB of Available MB.)


