Flatlining PLE

  • Background. On our new VM infrastructure, I'm seeing a pattern of intermittent episodes of PLE dropping from a reasonably healthy level (several thousand) and just tanking - badly, single figures. This is not a gradual thing, it’s an immediate plummet, and the value slowly rises after that point until it happens again.

    Config. Mostly consolidated 2014/12 VM's, mostly 2 core 32 Gb with 2 node cluster and AG group to remote site on VMware. Also some non AG boxes for lower versions of SQL Server, or where - mostly for security reasons an app needs it's own box. At present, vCPUs < Host CPU's (80 Xeon E5-2640 v3 @ 2.60GHz) and provisioned RAM is about 1/3 of available (1.25 Tb).

    Memory reservation = memory provision

    Ballooning turned off at host level

    VMEMMGMT Start was set to 2 (running) now changed to stop (4) and VM restarted.

    LPIM has been granted to the SQL Server engine service account

    We collect PLE every 30 seconds and CPU use from the DMV's. sp_WhoIsActive is run and the output saved to a table when CPU is high (90%+) or PLE is low (1.2 * 300 * (RAM/4)) with 30s wait before re-triggering. MDW is also running.

    When we look at CPU at the time of the crash it’s generally low and the WhoIsActive output is mostly nothing, occasionally a replication or other trivial task. MDW for the 15 minutes slot around the crash doesn’t show anything significant going on – frequently the highest resource use is the query collecting the CPU stats or the sending of the PLE warning email. So from that perspective, the pattern did look like ballooning was taking place. However, we’re still getting the same pattern (although seemingly less so, so far) and I can’t see where ballooning could be more turned off than it is. Also no nothing in ESX Manager. Can see nothing in the SQL Server or windows logs of interest / pointing to a reason – no “a substantial portion of memory has been … etc.”

    I’m obviously not expecting anyone to automagically fix this via telepathy or something, apart from the difficulty I think that’s more than one can expect from free advice even from a group as generous as you lot. But if anyone has any tips or suggestions where I can/should look, they’d be appreciated. At the moment I’m out of ideas other than a POSH script to log processes running to see if that throws anything up,

    Cheers

    Andrew

    I'm a DBA.
    I'm not paid to solve problems. I'm paid to prevent them.

  • How do total and target server memory behave at the time of the dip in PLE?

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • I don't know Gail - I'll add that to the PLE collection and get back.

    thanks

    I'm a DBA.
    I'm not paid to solve problems. I'm paid to prevent them.

  • Hi Gail,

    both the same - here's an example, immediately before and after PLE being squashed

    RunTimeCounter_Name Cntr_Value

    2016-08-30 07:25:00Page life expectancy 5063

    2016-08-30 07:25:00Target Server Memory (KB) 26214400

    2016-08-30 07:25:00Total Server Memory (KB) 26214400

    2016-08-30 07:24:30Page life expectancy 37426

    2016-08-30 07:24:30Target Server Memory (KB) 26214400

    2016-08-30 07:24:30Total Server Memory (KB) 26214400

    when we do, from time to time, get a variance, as far as I can see they are - surprisingly given my understanding - Total Memory > Target Memory

    RunTime Counter_NameCntr_Value

    2016-08-30 07:57:30Target Server Memory (KB) 26214400

    2016-08-30 07:57:30Total Server Memory (KB) 26214600

    I'm a DBA.
    I'm not paid to solve problems. I'm paid to prevent them.

  • Ok, so that doesn't look like balloon drivers or anything else external. They'd cause target server memory to drop sharply (as Windows came under memory pressure), with total server memory following it closely.

    I would suggest you set up an extended event session and see exactly what's running against the DBs at the time that PLE drops.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Hi, yes, see exactly what you're saying. Most of the time (although not always by any means) the box is actually pretty quiet - CPU low and our WhoIsActive capture and MDW data show little activity in SQL Server either. That was why I though ballooning in the first place, but like you said if total server memory is stable, it isn't (as far as I understand the measure anyway).

    Cheers for the help, much appreciated

    I'm a DBA.
    I'm not paid to solve problems. I'm paid to prevent them.

Viewing 6 posts - 1 through 5 (of 5 total)

You must be logged in to reply to this topic. Login to reply