I have a physical server running SQL 2016 (13.0.5216.0) Enterprise Edition with 2 sockets and 28 cores. Memory is 271gb with 235 allocated to SQL split into two NUMA nodes. Data is stored on SSD drives. There server is involved with several availability groups as both a primary and secondary replica. The server is also a replication publisher.
The server usage is almost entirely OLTP with 3-5 million new records being created daily in about a dozen different databases. The total size of the databases is measured in terabytes but that is data for more than ten years and the vast bulk of active data being queried is for the most recent couple of months.
For several month, I have noticing that PLE drops every dramatically every 50-80 mins - often to zero. There appears to be no performance impact that I can detect; the data and plan cache don't change any more during the drop than they do at any other time, there are no major queries that can be correlated to the PLE plunges either and IO is also relatively static with no spikes that can be aligned with the changes in PLE.
I have been pretty much ignoring this behavior and regarding as some kind of anomaly regarding the PLE count. The PLE is being recorded every 5 minutes and has never exceeded 7500.
The other day, I dug a little deeper into PLE by NUMA mode and found one node appears to establish a ceiling of some kind and PLE for that node remains constant for an extended period - sometimes for close to an hour. The value for this 'ceiling' changes each time it occurs and has varied from 1000-6000 seconds. When the PLE for the other node gets to a few minutes of this ceiling, it plunges dramatically and starts climbing back up and repeating the cycle.
The screenshot below show this pattern over a 4 hour period. The black line is the overall page life expectancy and the other two like are the PLE for each node.
Now it may just be that the data cache is far more volatile than I think it but, other than PLE, there is no evidence of significant memory pressure that the average data page can't stay in cache longer than 90 minutes.
Has anyone got a good explanation for this behavior?