Page Life Expectancy - dmv counters give occasional massive spikes

  • THIS SHOULD REALLY BE IN THE SQL2008 FORUM - DON'T KNOW HOW I POSTED IT HERE! Darn.

    I'm collecting performance stats on one of our servers using sys.dm_os_performance_counters. The output is then passed into the CodePlex PAL tool for display.

    Most counters work fine, but I'm getting weird behaviour with those for "SQLServer:Buffer Node\Page Life Expectancy"

    We have 8 NUMA buffer nodes, and most of the time the PLE for each node is about 85,000 seconds (this is realistic; we have a LOT of memory!). However, occasionally I get a massive spike in the performance counters for one or more nodes (never all nodes simultaneously) that gives a value of several million before immediately dropping back to normal again.

    This is really irritating as it drastically skews the scale on the PAL chart, and most of the meaningful chart data for all nodes is now compressed into the very bottom of the chart, making it unreadable. Has anyone else experienced erratic counter values like this, especially related to Buffer Nodes and PLE?
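
One possible workaround for the chart scaling, sketched under assumptions that do not come from the thread: replace any sample above a plausibility ceiling with a gap before the data reaches the charting tool. The ceiling of 30 days and the name `drop_implausible_ple` are hypothetical; the sample values are the Node 5 readings quoted later in the thread.

```python
from typing import Iterable, List, Optional

# Hypothetical ceiling: 30 days expressed in seconds (2,592,000).
# Choose a value comfortably above any PLE your servers can really reach.
MAX_PLAUSIBLE_PLE = 30 * 24 * 60 * 60


def drop_implausible_ple(
    samples: Iterable[int], ceiling: int = MAX_PLAUSIBLE_PLE
) -> List[Optional[int]]:
    """Replace samples above `ceiling` with None so a chart shows a gap."""
    return [s if s <= ceiling else None for s in samples]


# Node 5 readings around one spike, as quoted in the thread.
node5 = [42377, 42377, 4294963, 20449, 20449]
print(drop_implausible_ple(node5))
# [42377, 42377, None, 20449, 20449]
```

This hides the symptom rather than explaining it, and a fixed ceiling would also discard a genuinely large PLE on a server that has been idle for longer than the ceiling.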

  • No, that's almost the opposite of what I see in most cases. Usually PLE is a constantly increasing value which suddenly drops to near zero once a day or once a week, prompted by a nightly/weekly data load. Problematic PLE is either flat & low or a very fast series of rises & falls. What you're seeing is the value shooting up, not down. That's odd.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • I agree Grant, and it's not something I ever noticed when we used to monitor with Perfmon, so I suspect it's either a) a quirk of sys.dm_os_performance_counters itself, or b) a quirk of the calculations I use to derive the value. Since PLE has cntr_type = 65792 (PERF_COUNTER_LARGE_RAWCOUNT, a raw value that is used as-is with no calculation), it's quite hard to get b) wrong, so I'm definitely puzzled.

    The spurious values only occur 4 or 5 times a day (sampling once every 15 seconds), so they aren't common.

    An example of consecutive counters around one of these spikes looks like:

    (NODE 5)

    42377

    42377

    4294963

    20449

    20449

    Then later on:

    (NODE 6)

    80052

    80066

    4294966

    80085

    80085

    What's interesting is that the peak values, when they occur, are almost (but not quite) the same value, regardless of the node they occur on:

    NODE 0: 4,294,958

    NODE 1: 4,294,966

    NODE 2: 4,294,963

    NODE 5: 4,294,963

    NODE 6: 4,294,966

    Presumably this has something to do with the total memory for each node, but I can't quite grasp why this should give a near-constant peak value for PLE?
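
A possible reading of the near-constant peak, offered as an observation and not as anything confirmed in this thread or by Microsoft: every peak sits just below 2^32 / 1000 = 4,294,967.296. That is the pattern that would appear if a millisecond value went slightly negative, was stored as an unsigned 32-bit integer, and was then divided by 1000 to give seconds. The sketch below only checks that arithmetic; the millisecond representation and both function names are assumptions made for illustration.

```python
# Assumption (NOT confirmed by the thread): PLE is held internally in
# milliseconds and passes through an unsigned 32-bit value.
UINT32_MODULUS = 2 ** 32  # 4,294,967,296


def wrapped_ple_seconds(milliseconds: int) -> int:
    """Seconds reported if `milliseconds` wraps as an unsigned 32-bit value."""
    return (milliseconds % UINT32_MODULUS) // 1000


def negative_ms_range(peak_seconds: int) -> tuple:
    """Range of negative millisecond values that would wrap to `peak_seconds`."""
    low = peak_seconds * 1000 - UINT32_MODULUS
    return low, low + 999


# Peak values quoted in the thread, keyed by NUMA node.
observed_peaks = {0: 4294958, 1: 4294966, 2: 4294963, 5: 4294963, 6: 4294966}

for node, peak in sorted(observed_peaks.items()):
    low, high = negative_ms_range(peak)
    # Confirm both ends of the range wrap back to the observed peak.
    assert wrapped_ple_seconds(low) == peak
    assert wrapped_ple_seconds(high) == peak
    print(f"NODE {node}: {peak} <- {low} ms to {high} ms")
```

Under that assumption, each quoted peak corresponds to a value between roughly 0.3 and 9.3 seconds below zero, which would explain why the peaks are close to each other without being identical. It does not explain why the value would go negative in the first place.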

  • That is a weird one. I think I'd open up a Connect item. Since the raw data is coming from Microsoft metrics using Microsoft tools, surely they need to provide some answers.

  • We have seen the same behavior, and almost the exact same value for PLE, I'm looking at one right now reading 4,294,405. This is on a Hyper-V VM running SQL Server 2008 R2 on Windows 2008 R2.

    We have been able to consistently reproduce the scenario by throwing a heavy workload at the server. Clearly this is an erroneous value, as the workload is designed to put the server under memory pressure, and we see other indicators that it is having the intended effect.

  • Ah - interesting. I'm wondering if this is odd behaviour because of the very high spec of the hardware we use and the workload it's under. I'd rather not give too much away about the specifications, but the server has 1 TB of RAM (SQL has 984 GB of that). That's still not enough for our database (Target Pages = Total Pages), so we don't have more RAM than we need and cache pages are still being turned over. Add to that a CPU core count in 3 figures and a transaction rate (TPS) well into 5 figures, and it's safe to say this is a busy box. We have seen problems in the past with Perfmon not being able to handle very high-end configurations, which is the main reason for exploring the DMV counters as an alternative.

    Can I ask how much memory your server has, Sfallen? Like you, we're on 2008R2 throughout.

    I'm monitoring some other servers now, also very busy but somewhat less so, and with slightly lower specs. It'll be interesting to see if they show the same behaviour.

    I've also got access to a loadtest replica of the big server. I'm about to start monitoring this with no load at all, and then later on with increasing levels of load to see how this affects the counter behaviour.

  • Just noticed - I've posted this in TOTALLY the wrong section! Should be under SQL2008. Damn. Anyone got any idea how I move it?

  • The servers we have encountered this on are much smaller, as the VMs are used for testing and demonstration. I've reproduced this as low as 1GB. It doesn't appear to be a NUMA issue either, as we see this on single-processor quad-core boxes.

    This is also not limited to just the DMV counters. I see the same value in Perfmon.

  • Just to confirm Sfallen - is it just the Buffer Node PLE counter you're seeing this on (even on single-node VMs), or do you see similar behaviour on any other counters? We don't - it's just PLE that's not playing ball.

    Interesting what you say about Perfmon ... I'll set up a Perfmon trace too and see if I get the same. (Buffer Node PLE was never something we previously monitored under Perfmon, just Buffer Manager PLE)

  • Correct, we are only seeing unexpected values on Buffer Node PLE counters. We also see it on the Buffer Manager PLE counter, but as that's just an average of all the nodes, on a single node that is to be expected.

  • I've now found two production servers that are of identical build and specification (built at the same time using the same standard build images, identical hardware and storage specs) where one of them shows the occasional PLE spike, but the other one does not. Over a 3 day continuous monitoring session, one node performed impeccably and gave perfect PLE graphs, the other one spiked several times a day and so gave the very compressed graphs that I've been complaining about.

    We're currently discussing with our Architecture team whether some very subtle difference in the Windows build might explain this behaviour, but so far we can't find anything.

    No news yet on the Perfmon comparison; I did set up a data collector set, but no matter what I try it consistently produces empty output, so I'm still investigating that.

  • Did you find the root cause of this issue? We are seeing exactly the same behavior on a few servers (Win 2008 R2, SQL 2008 R2). PLE spikes to millions of seconds in a very short time. It coincides with the monthly Windows patches and restarts. I would expect PLE to drop when the server is restarted, but instead it spikes to millions of seconds.

    I have opened an MS Connect incident to see if it gets any traction.

    MS Connect Item:

  • srlanka - Thursday, June 9, 2016 11:03 AM

    Did you find the root cause of this issue? We are seeing exactly the same behavior on a few servers (Win 2008 R2, SQL 2008 R2). PLE spikes to millions of seconds in a very short time. It coincides with the monthly Windows patches and restarts. I would expect PLE to drop when the server is restarted, but instead it spikes to millions of seconds. I have opened an MS Connect incident to see if it gets any traction. MS Connect Item:

    Hi srlanka, did you receive any update on the PLE spike issue? I'm seeing the same behavior.
    Thanks!
