Page Life Expectancy - dmv counters give occasional massive spikes


Philip Yale-193937

SSCarpal Tunnel

Points: 4603
June 14, 2012 at 5:04 am

#258070

THIS SHOULD REALLY BE IN THE SQL2008 FORUM - DON'T KNOW HOW I POSTED IT HERE! Darn.
I'm collecting performance stats on one of our servers using sys.dm_os_performance_counters. The output is then passed into the CodePlex PAL tool for display.
Most counters work fine, but I'm getting weird behaviour with those for "SQLServer:Buffer Node\Page Life Expectancy".
We have 8 NUMA buffer nodes, and most of the time the PLE for each node is about 85,000 seconds (this is realistic; we have a LOT of memory!). However, occasionally I get a massive spike in the performance counters for one or more nodes (never all nodes simultaneously) that gives a value of several million before immediately dropping back to normal again.
This is really irritating as it drastically skews the scale on the PAL chart, and most of the meaningful chart data for all nodes is now compressed into the very bottom of the chart, making it unreadable. Has anyone else experienced erratic counter values like this, especially related to Buffer Nodes and PLE?
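As a stopgap for the chart-scaling problem described above, the samples could be cleaned before they reach the charting tool. The following is a minimal sketch, not something taken from the original post: the function name `drop_spikes` and the `window` and `factor` thresholds are illustrative assumptions. It replaces any sample that is far above the median of its neighbours with a gap.

```python
from statistics import median


def drop_spikes(samples, window=5, factor=10.0):
    """Return a copy of samples with isolated upward spikes replaced by None.

    A sample is treated as a spike when it exceeds `factor` times the median
    of up to `window` neighbours on each side (the sample itself excluded).
    Both thresholds are illustrative assumptions, not tuned values.
    """
    cleaned = []
    for i, value in enumerate(samples):
        neighbours = samples[max(0, i - window):i] + samples[i + 1:i + 1 + window]
        if neighbours and value > factor * median(neighbours):
            cleaned.append(None)  # leave a gap in the chart instead of a spike
        else:
            cleaned.append(value)
    return cleaned


# Hypothetical PLE samples (seconds) for one buffer node, 15 seconds apart
ple_samples = [85010, 85025, 4294960, 85055, 85070]
print(drop_spikes(ple_samples))
```

This only hides the symptom for charting purposes. The raw values are still worth keeping so the underlying counter behaviour can be investigated.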

Grant Fritchey SSC Guru Points: 398690 · Answer 1

No, that's almost the opposite of what I see in most cases. Usually PLE is a constantly increasing value which suddenly drops to near zero once a day or once a week, prompted by a nightly/weekly data load. Problematic PLE is either flat & low or a very fast series of rises & falls. What you're seeing is the value shooting up, not down. That's odd.

"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt

Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning

Philip Yale-193937 SSCarpal Tunnel Points: 4603 · Answer 2

I agree Grant, and it's not something I ever noticed when we used to monitor with Perfmon, so I suspect it's either a) a quirk of the dm_os_performance_counters themselves, or b) a quirk of my calculations used to derive the value. Since PLE has a cntr_type = 65792 it's quite hard to get b) wrong, so I'm definitely puzzled.
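To illustrate why option b) is hard to get wrong: a cntr_type of 65792 corresponds to the Windows counter type PERF_COUNTER_LARGE_RAWCOUNT, a point-in-time value that is used as read, whereas a cumulative type such as 272696576 (PERF_COUNTER_BULK_COUNT) needs a delta between two samples. The sketch below is not the poster's actual calculation. The function name `derive_value` and its parameters are hypothetical, and only these two types are handled.

```python
# cntr_type values exposed by sys.dm_os_performance_counters
PERF_COUNTER_LARGE_RAWCOUNT = 65792    # point-in-time value: use as-is
PERF_COUNTER_BULK_COUNT = 272696576    # cumulative count: needs a per-second delta


def derive_value(cntr_type, current, previous=None, elapsed_seconds=None):
    """Turn a raw cntr_value into a displayable number (hypothetical helper)."""
    if cntr_type == PERF_COUNTER_LARGE_RAWCOUNT:
        # No arithmetic at all, so a spike here is present in the raw counter
        return current
    if cntr_type == PERF_COUNTER_BULK_COUNT:
        if previous is None or not elapsed_seconds:
            raise ValueError("bulk counters need a previous sample and an interval")
        return (current - previous) / elapsed_seconds
    raise NotImplementedError(f"cntr_type {cntr_type} is not handled in this sketch")


# PLE sample: returned unchanged
print(derive_value(PERF_COUNTER_LARGE_RAWCOUNT, 80066))
# A cumulative counter sampled 15 seconds apart
print(derive_value(PERF_COUNTER_BULK_COUNT, 1300, previous=1000, elapsed_seconds=15))
```

Because the raw-count branch does no calculation, a spurious PLE value in the output would already have been present in cntr_value when it was sampled.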

The spurious values only occur 4 or 5 times a day (sampling once every 15 seconds), so they aren't common.

An example of consecutive counters around one of these spikes looks like:

(NODE 5)
42377
4294963
20449

Then later on:

(NODE 6)
80052
80066
4294966
80085

What's interesting is that the peak values, when they occur, are almost (but not quite) the same value, regardless of the node they occur on:

NODE 0: 4,294,958
NODE 1: 4,294,966
NODE 2: 4,294,963
NODE 5: 4,294,963
NODE 6: 4,294,966

Presumably this has something to do with the total memory for each node, but I can't quite grasp why this should give a near-constant peak value for PLE?
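One arithmetic observation about these peaks: each of them sits just below 4,294,967, which is 2^32 divided by 1000. That would be consistent with a value held in milliseconds in an unsigned 32-bit integer wrapping around and then being converted to seconds, but that reading is only an assumption; nothing in this thread confirms how the counter is computed internally. A short sketch of the check, using the peak values listed above:

```python
UINT32_RANGE = 2 ** 32                   # 4,294,967,296
ceiling_seconds = UINT32_RANGE // 1000   # 4,294,967

# Peak PLE values (seconds) per buffer node, as listed in the post
observed_peaks = {0: 4_294_958, 1: 4_294_966, 2: 4_294_963, 5: 4_294_963, 6: 4_294_966}


def gap_below_ceiling(peak_seconds):
    """Seconds by which a peak falls short of 2**32 milliseconds."""
    return ceiling_seconds - peak_seconds


for node, peak in sorted(observed_peaks.items()):
    print(f"NODE {node}: {peak:,} is {gap_below_ceiling(peak)} s below {ceiling_seconds:,}")
```

Every peak is within 10 seconds of that ceiling, which would explain why the value looks nearly constant across nodes regardless of how much memory each node has.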

Grant Fritchey SSC Guru Points: 398690 · Answer 3

That is a weird one. I think I'd open up a Connect item. Since the raw data is coming from Microsoft metrics using Microsoft tools, surely they need to provide some answers.

sfallen SSC Veteran Points: 293 · Answer 4

We have seen the same behavior, and almost exactly the same value for PLE; I'm looking at one right now reading 4,294,405. This is on a Hyper-V VM running SQL Server 2008 R2 on Windows Server 2008 R2.

We have been able to consistently reproduce the scenario by throwing a heavy workload at the server. Clearly this is an erroneous value, as the workload is designed to put the server under memory pressure, and we see other indicators that it is having the intended effect.

Philip Yale-193937 SSCarpal Tunnel Points: 4603 · Answer 5

Ah - interesting. I'm wondering if this is odd behaviour because of the very high spec of the hardware we use and the workload it's under. I'd rather not give too much away about the specifications, but the server has 1 TB of RAM (SQL Server has 984 GB of that). That's still not enough for our database (Target Pages = Total pages), so we don't have more RAM than we need and cache pages are still being turned over. Add to that CPU cores in 3 figures and a transaction rate (TPS) well into 5 figures, and it's safe to say this is a busy box. We have seen problems in the past with Perfmon not being able to handle very high-end configurations, which is the main reason for exploring the DMV counters as an alternative.

Can I ask how much memory your server has, Sfallen? Like you, we're on 2008R2 throughout.

I'm monitoring some other servers now, also very busy but somewhat less so, and with slightly lower specs. It'll be interesting to see if they show the same behaviour.

I've also got access to a loadtest replica of the big server. I'm about to start monitoring this with no load at all, and then later on with increasing levels of load to see how this affects the counter behaviour.

Philip Yale-193937 SSCarpal Tunnel Points: 4603 · Answer 6

Just noticed - I've posted this in TOTALLY the wrong section! Should be under SQL2008. Damn. Anyone got any idea how I move it?

sfallen SSC Veteran Points: 293 · Answer 7

The servers we have encountered this on are much smaller, as the VMs are used for testing and demonstration. I've reproduced this with as little as 1 GB of memory. It doesn't appear to be a NUMA issue either, as we see this on single-processor quad-core boxes.

This is also not limited to just the DMV counters. I see the same value in Perfmon.

Philip Yale-193937 SSCarpal Tunnel Points: 4603 · Answer 8

Just to confirm Sfallen - is it just the Buffer Node PLE counter you're seeing this on (even on single-node VMs), or do you see similar behaviour on any other counters? We don't - it's just PLE that's not playing ball.

Interesting what you say about Perfmon ... I'll set up a Perfmon trace too and see if I get the same. (Buffer Node PLE was never something we previously monitored under Perfmon, just Buffer Manager PLE)

sfallen SSC Veteran Points: 293 · Answer 9

Correct, we are only seeing unexpected values on Buffer Node PLE counters. We also see it on the Buffer Manager PLE counter, but as that's just an average of all the nodes, on a single node that is to be expected.

Philip Yale-193937 SSCarpal Tunnel Points: 4603 · Answer 10

I've now found two production servers that are of identical build and specification (built at the same time using the same standard build images, identical hardware and storage specs) where one of them shows the occasional PLE spike, but the other one does not. Over a 3 day continuous monitoring session, one node performed impeccably and gave perfect PLE graphs, the other one spiked several times a day and so gave the very compressed graphs that I've been complaining about.

I'm currently discussing with our Architecture team whether some very subtle difference in the Windows build might explain this behaviour, but so far we can't find anything.

No news yet on the Perfmon comparison; I did set up a data collector set, but no matter what I try it consistently produces empty output, so I'm still investigating that.

srlanka SSC Rookie Points: 49 · Answer 11

Did you find the root cause of this issue? We are seeing exactly the same behavior on a few servers (Win 2008 R2, SQL 2008 R2). PLE spikes to millions of seconds in a very short time. It coincides with the monthly Windows patches and restarts. I would expect the PLE to drop when the server is restarted, but it spikes to millions of seconds.

I have opened an MS Connect item to see if it gets any traction.

MS Connect Item:

1974lg Old Hand Points: 374 · Answer 12

srlanka - Thursday, June 9, 2016 11:03 AM
Did you find the root cause of this issue? We are seeing exactly the same behavior on a few servers (Win 2008 R2, SQL 2008 R2). PLE spikes to millions of seconds in a very short time. It coincides with the monthly Windows patches and restarts. I would expect the PLE to drop when the server is restarted, but it spikes to millions of seconds. I have opened an MS Connect item to see if it gets any traction. MS Connect Item:

Hi srlanka, did you receive any update regarding the PLE spikes issue? I have the same behavior.
thanks!