Page life expectancy diving to sub-10 on a 128 GB server

  • Last weekend we migrated our primary ERP system from a Windows 2003, SQL 2008R2 server to a windows 2008R2, SQL 2012 server.

    The new servers are much beefier than the old ones, including an extra 32 gig of ram.

    The ERP database is ~500 gig, of which a good amount is audit data. On the old server I generally had page life expectancy values around 1000 or more for normal daily activity.

    On the new server I have been capturing perfmon stats for the past week and I see the PLE jumping all over the place. It climbs up into the several hundred but then suiddenly dives to ridiculously low values like 12 or even 7.

    The server usage profile has not changed, it's still the same users and applications doing the same things they were doing last week.

    The instance is set to have min 70 gig, max 110 gig allocated to SQL.

    The service account has lock pages in memory permission.

    Can anyone think of some setting I might have missed during migraiton, or some new option to be set in SQL2012 that I've overlooked, that could explain this?

  • Well, if we could see what was running on the server when PLE drops we might be able to help you out, unfortunately we can't see it.

    One guess would be to ask if you are running CHECKDB on the database at the time PLE drops. Would also help to know if there are any other maintenance routines that might be running then as well.

  • Yeh, right now I'm running a server side trace and a few more measures to see if I can match up activity to counters. As I said, though, there's no new usage pattern on the server, no new apps deployed, just all the same jobs, users and apps that were there before. A sudden usage pattern change doesn't seem as likely as me possibly missing some new setting or other.

  • No server options that could cause this. Probably CheckDB or index rebuilds that move a huge amount of data through the cache.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • A few guesses (based on very limited information provided and a few answered questions):

    1) You have much faster IO now and it is able to "swap out data" in RAM (i.e. read it off disk) much more quickly. This can lead to lower PLE although query performance would still likely be better.

    2) Another possibility is that someone messed with the schema and dropped one or more important indexes from large tables, leading to scans instead of seeks.

    3) Oh, what about MAXDOP server setting?? If you had it set to 1 (not uncommon in the ERP world) and now it is back to default of 0, you would be parallelizing potentially many more queries, which would be consuming data more quickly.

    4) Yet another is that you simply did monitoring while some very large reports were being run for some muckety-muck VP. 🙂

    5) It used to be that upgrading editions meant having to do a full-scan statistics update on all indexes/stats. But I don't think that is required when going from SQL 2008R2 to 2012.

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • It could be that the old server had a different number of system buses (NUMA nodes) and this is affecting PLE. I assume you're looking at the Buffer Manager: PLE counter? Given the specs you've said above, you must have multiple NUMA nodes and this counter is the average across all of these. Buffer Node: PLE gives the stats for individual pools.

    Threads can access foreign memory to their local NUMA node, but there's a preference for locality, which I'm assuming would end up recycling individual (smaller) pools faster than one big one, therefore have a lower average PLE. In some situations, that could be worse for overall system performance, but generally not.

    What are you seeing in terms of overall system performance, e.g. wait stats, average transaction times etc.?

  • Thanks for the responses. I've been looking at this for a few days on and off. Today I've been sitting here watching the counters because I am seeing such weird, weird output.

    Here are some numbers from right this instant:

    Resource Monitor\Memory (Instance allocation is min 70,000 max 100,000)

    In use: 105200

    Standby: 25000

    Free: 70

    Standby and free look weird, I would expect 15 gig less in standby and 15 gig more in free.

    Perfmon:

    Buffer Manager:PLE = 12

    % Disk Time (data) = 92,000 (!!!!)

    Avg Queue Length (data) = 920

    Disk Reads / sec (data) = 6000 (!!!!)

    Disk Writes / sec (data) = 55

    % Disk Time (log) = --

    Avg Queue Length (log) = 0.02

    Disk Reads / sec (log) = 0

    Disk Write / sec (log) = 2

    % Processor Time = 14.5

    Instantaneous output from sp_whoisactive at the same time shows no signficant reads from query activity:

    reads writes physical_reads

    ------ ------ --------------

    54 0 64

    29 0 96

    6 0 0

    75,567 0 28,832

    4 0 0

    4 0 0

    4 0 0

    4 0 0

    4 0 0

    5 0 0

    40 0 14

    18 0 0

    5 0 0

    5 0 0

    4 0 0

    NULL NULL NULL

    13 0 16

    5 0 0

    5 0 0

    NULL NULL NULL

    NULL NULL NULL

    56,959 5 56,780

    NULL NULL NULL

    NULL NULL NULL

    NULL NULL NULL

    69,398 36 3,417

    1,452 0 1,722

    NULL NULL NULL

    NULL NULL NULL

    NULL NULL NULL

    120 3 16

    NULL NULL NULL

    NULL NULL NULL

    4 0 0

    1,756 8 304

    NULL NULL NULL

    7 0 7

    NULL NULL NULL

    NULL NULL NULL

    NULL NULL NULL

    207 2 23

    NULL NULL NULL

    90 0 0

    NULL NULL NULL

    8 0 0

    NULL NULL NULL

    NULL NULL NULL

    NULL NULL NULL

    27 0 0

    98 0 0

    NULL NULL NULL

    8 0 0

    NULL NULL NULL

    NULL NULL NULL

    101 3 92

    224 6 16

    NULL NULL NULL

    NULL NULL NULL

    NULL NULL NULL

    NULL NULL NULL

    NULL NULL NULL

    NULL NULL NULL

    NULL NULL NULL

    NULL NULL NULL

    sys.dm_os_wait_stats

    CXPACKET is at the top (15%), followed by...

    SP_SERVER_DIAGNOSTICS_SLEEP

    XE_TIMER_EVENT

    HADR_FILESTREAM_IOMGR_IOCOMPLETION

    DIRTY_PAGE_POLL

    BROKER_EVENTHANDLER

    XE_DISPATCHER_WAIT

    LOGMGR_QUEUE

    PAGEIOLATCH_SH is down in 12th place (4.3%)

    NUMA

    Unfortunately I know virtually nothing about NUMA yet. I guess this is me learning about it, heh.

    I see 5 nodes in sys.dm_os_nodes, 2 "ONLINE", 1 "ONLINE DAC", 2 "OFFLINE".

    In sys.dm_os_memory_nodes I see:

    memory_node_id virtual_address_space_reserved_kb virtual_address_space_committed_kb locked_page_allocations_kb pages_kb shared_memory_reserved_kb shared_memory_committed_kb cpu_affinity_mask online_scheduler_mask processor_group foreign_committed_kb

    -------------- --------------------------------- ---------------------------------- -------------------------- -------------------- ------------------------- -------------------------- -------------------- --------------------- --------------- --------------------

    0 203592464 2183796 48977664 10796224 1920 1920 1048575 1048575 0 21068516

    1 35712 35732 2784 2784 0 0 1048575 0 1 0

    2 35712 35732 2784 2784 0 0 1099510579200 0 1 0

    3 32768 32788 51128584 11795384 0 0 1099510579200 1099510579200 0 23373076

    64 0 20 0 10797256 0 0 1048575 1048575 0 0

    It looks really weird to me. Query activity is minimal but disk activity is thorugh the roof and PLE is collapsing at semi-regular intervals. Any interpretations would be most welcome.

  • Just curious about a couple of BIOS settings.

    1. What is the hardware reserve memory set at for the server? This could be impacting your "standby" memory count.

    2. Are the energy saving features enabled for this server? I know you are looking at memory and disk. But I am curious if it is being caused by a CPU that is being taxed due to the energy saving features being enabled.

    Jason...AKA CirqueDeSQLeil
    _______________________________________________
    I have given a name to my pain...MCM SQL Server, MVP
    SQL RNNR
    Posting Performance Based Questions - Gail Shaw[/url]
    Learn Extended Events

  • Hardware reserved set at 64

    CPU use is way down at 5 - 15%.

  • Hardware reserved, is that in MB or GB that you are showing?

    NUMA - did you make any changes to CPU affinity or is the default setting still active (SQL Server managing)?

    HowardW asked a question concerning Buffer Node v Buffer Manager in regards to which PLE you are monitoring. Which one is reporting the erratic stats?

    Jason...AKA CirqueDeSQLeil
    _______________________________________________
    I have given a name to my pain...MCM SQL Server, MVP
    SQL RNNR
    Posting Performance Based Questions - Gail Shaw[/url]
    Learn Extended Events

  • SQLRNNR (3/7/2013)


    Hardware reserved, is that in MB or GB that you are showing?

    NUMA - did you make any changes to CPU affinity or is the default setting still active (SQL Server managing)?

    HowardW asked a question concerning Buffer Node v Buffer Manager in regards to which PLE you are monitoring. Which one is reporting the erratic stats?

    Sorry, hardware reserved is 64 MB.

    NUMA is default.

    Pefmon node breakdowns hows PLE on node "000" as 38, node "003" as 43 right now (ie, seems both are equally low)

    Gail:

    Index rebuilds are set for Sunday nights only (and don't appear in sp_whoisactive when I see PLE plummit). CheckDB is not run against this system, I run it on the DR site.

  • What are the maximum values you have seen for PLE on each node?

    Jason...AKA CirqueDeSQLeil
    _______________________________________________
    I have given a name to my pain...MCM SQL Server, MVP
    SQL RNNR
    Posting Performance Based Questions - Gail Shaw[/url]
    Learn Extended Events

  • I haven't been tracking PLE by node before now, but the max I've seen for the buffer manager counter is plenty high: 50,000.

    If I graph the this counter over the last week I see a dozen-and-a-half or so precipitous drops from very high values (20,000+) down to virtually zero. There's no obvious pattern to the timing, even if I round to the nearest hour. Sometimes it's at around 4 am (start of ETL's), but it also can occur at seemingly any time day or night.

  • Is SQL Server the only thing on this box or do you have other apps and things like AntiVirus installed on this box too?

    Jason...AKA CirqueDeSQLeil
    _______________________________________________
    I have given a name to my pain...MCM SQL Server, MVP
    SQL RNNR
    Posting Performance Based Questions - Gail Shaw[/url]
    Learn Extended Events

  • All SQL baby!

    The next highest memory consumer (right now, per resource monitor) is the DTS service with a commit charge of 353 meg (this is a 2 node clustered instance on a 4 node windows cluster).

    Integration services 2008 is (unfortunately) installed in order to run half a dozen legacy packages that weren't migrated prior to the big server move, but the packages are tiny and execute in seconds/minutes, and only run once a day at around 2am.

    Cheers for the continued attention BTW. I hope you're as confused as I am 😛

Viewing 15 posts - 1 through 15 (of 55 total)

You must be logged in to reply to this topic. Login to reply