Page life expectancy diving to sub-10 on a 128 GB server

  • That's an awesome discussion of some very deep scalability issues!! I look forward to the blog post!

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • Thanks for the detailed information on those trace flags. I also eagerly await the blog post.

    In an ideal world I would definitely be running replays on our lab environment to observe the effects of these different options. I must say, I am.... suspicious.... of NUMA.

    In the real world... well, we'll see if I can get it prioritized. I've already been drawn off in other directions because the server is running "relatively OK" on RTM.

    Here's a picture of PLE by node, and free memory (orange line) on RTM. Ignore the giant free memory spike, that was me allocating 10 more Gig to SQL.

    In my opinion this is still a pretty wild ride, but it's definitely better than it looked under SP1.

  • That's just FUBAR right there!! :w00t:

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • Er, just noticed I had cut off the X axis markers. That graph represents about 3 days.

  • By way of comparison, here's what it was on SP1 without the private fix. Also three days.

    As a reminder, this is a 40 physical core, 4 NUMA node, 128 GB server, with 110 GB allocated to the SQL instance. So if SQL has consumed all the memory allocated to it and then PLE on one of the nodes crashes to zero (which, as the graphs show, happens far less on RTM than on SP1), that represents a truly massive cache flush: roughly a quarter of the 110 GB, so almost 30 GB. Microsoft is suggesting that the hotfix works (except for the DLL version issues) and that the remaining craziness is probably related to query activity.

    Them's some mighty big queries if you ask me.

    To try to get some quick wins, our next steps are to enable compression (an obvious choice with that much CPU power) and to set up an AlwaysOn replica to act as the source for operational SSRS reports. These are things we always planned to do, but when things went bad after migrating to 2012, attention was, for obvious reasons, shifted elsewhere.
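
    In case it's useful to anyone, here's a rough sketch of how we plan to evaluate and then enable page compression; the table name is just a placeholder, not one of our real objects:

      -- Estimate the space savings first (placeholder object name)
      EXEC sp_estimate_data_compression_savings
          @schema_name      = 'dbo',
          @object_name      = 'BigFactTable',
          @index_id         = NULL,
          @partition_number = NULL,
          @data_compression = 'PAGE';

      -- If the estimate looks worthwhile, rebuild with compression:
      -- ALTER TABLE covers the heap/clustered index, ALTER INDEX ALL covers the indexes
      ALTER TABLE dbo.BigFactTable REBUILD WITH (DATA_COMPRESSION = PAGE);
      ALTER INDEX ALL ON dbo.BigFactTable REBUILD WITH (DATA_COMPRESSION = PAGE);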

  • Just wondering if either of the two posters experiencing this issue got any further with MS support on it?

    I'm having what appears to be a similar issue - PLE crashes randomly during business hours even though the memory available to SQL Server is abundant. I've looked for corresponding workloads to match the times this occurs, but there's generally nothing of note. I've checked and it's not a VMWare ballooning issue, and none of the other metrics from the VMWare console look suspicious. Stats from SQL Sentry show plenty of free memory outside SQL Server (generally 4-6 GB), and regularly show 30-35 GB of the memory allocated to SQL Server as still available too. The buffer pool builds up to about 35 GB a couple of times a day, but then regularly crashes down to as low as 1 GB.

    We're on SQL 2012 SP1/CU2.

    Just trying to get some more information before I determine whether to contact MS, or bring in a more experienced DBA resource. Thanks for any updates you can provide!
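
    In case it helps, this is the query I've been using to sanity-check PLE per NUMA node alongside the instance-wide value (just a monitoring sketch against the standard perf counter DMV, not a fix):

      -- Page life expectancy for the whole instance (Buffer Manager)
      -- and for each NUMA node (Buffer Node; instance_name is the node number)
      SELECT RTRIM([object_name]) AS counter_object,
             RTRIM(instance_name) AS node,
             cntr_value           AS ple_seconds
      FROM   sys.dm_os_performance_counters
      WHERE  counter_name = N'Page life expectancy'
        AND  ([object_name] LIKE N'%Buffer Manager%'
              OR [object_name] LIKE N'%Buffer Node%');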

  • Wasn't SQL 2012 SP1 CU4 supposed to be the build that had the fix?

    In my experience you usually have to be able to tell Microsoft that all BIOS/FIRMWARE/DRIVER/VMWARE/WIN/SQL/ETC patches have been applied. Have you done that? Firmware and drivers have been the most common cause of memory flushes in my past...
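
    And for confirming the SQL side of that list before you open the case, a quick check of what build you're actually on (compare the version against the CU you expect):

      -- Installed SQL Server build, service pack level, and edition
      SELECT SERVERPROPERTY('ProductVersion') AS product_version,
             SERVERPROPERTY('ProductLevel')   AS product_level,
             SERVERPROPERTY('Edition')        AS edition;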

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

    Kevin Dahl - One relevant detail will be whether NUMA pass-through is enabled in VMWare. It is enabled by default with more than 8 vCPUs, but for 8 vCPUs and below it is disabled unless configured manually.

    Here's why that is relevant: when SQL Server is aware of the NUMA configuration, it attempts to allocate a buffer pool of equal size on each NUMA node, the sum of which is the total buffer pool target size. However, especially in a VM, each NUMA node may not have the same number of physical cores, and may not have the same amount of physical RAM. Sizable amounts of "foreign/away pages" can result. In some cases, especially during buffer pool ramp up, a large amount of memory can end up orphaned as memory allocation requests are first granted with foreign memory and almost immediately afterward with local memory. In these cases, the local memory is used (to optimize memory latency), but the foreign memory is not released... in order to prevent repetitive cycles of allocation and release while the buffer pool is growing. While that may be a reasonable strategy, there are scenarios where the result is a large amount of orphaned, unused memory.

    Bob Dorr describes this a bit in this post.

    http://blogs.msdn.com/b/psssql/archive/2012/12/13/how-it-works-sql-server-numa-local-foreign-and-away-memory-blocks.aspx
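
    If you want to see how much foreign memory an instance is actually holding, SQL Server 2012 exposes it per memory node; a quick sketch (node 64 is the DAC memory node and can be ignored):

      -- Committed vs. foreign memory per NUMA memory node (SQL Server 2012+)
      SELECT memory_node_id,
             virtual_address_space_committed_kb / 1024 AS committed_mb,
             foreign_committed_kb / 1024               AS foreign_mb,
             pages_kb / 1024                           AS allocated_pages_mb
      FROM   sys.dm_os_memory_nodes
      WHERE  memory_node_id <> 64;  -- 64 is the DAC memory node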

    SQL Server 2012 SP1 CU3 contains this fix, which addresses some of the scenarios that can lead to orphaned memory and memory stalls.

    http://support.microsoft.com/kb/2819662

    However, in some cases, using trace flag 8048 (to remove spinlock contention among serialized threads during memory allocation) and trace flag 8015 (to ignore the NUMA configuration and manage SQL Server CPU and memory resources in single large pools instead of NUMA node-aligned pools) produces benefits far beyond what the SQL Server 2012 SP1 CU3 fix alone provides.
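
    For anyone trying those flags: both are generally added as -T8048 / -T8015 startup parameters (they influence how the memory nodes are laid out when the instance starts), and you can confirm they are active with:

      -- Report whether the trace flags are enabled globally
      DBCC TRACESTATUS (8048, 8015, -1);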

    All that said, that type of issue has been around since SQL Server 2005. It seems that the other posters in this thread have found a few similar memory management behaviors that may have been newly introduced in SQL Server 2012.

  • In our case, this was remediated in CU4 for SQL Server 2012 SP1.

    An update is available for SQL Server 2012 Memory Management

    http://support.microsoft.com/kb/2845380

    CU4 for SQL Server 2012 SP1

    http://support.microsoft.com/kb/2833645/en-us

    http://www.sqlservercentral.com/Forums/Topic1415833-2799-3.aspx#bm1458905

  • Yep, the link Tommy has provided is the fix for the issue he and I were having.

    We have not yet updated to CU4, as we're still seeing performance issues on RTM. We did an ERP upgrade recently that took 20 hours to run on the production box, but only 8 on a significantly less powerful UAT box. The workload is identical in each case (basically a bunch of .sql files being run against a database), and the configurations are identical except that the UAT box is not NUMA, is not clustered, and has far less CPU and RAM. Both instances are 2012 RTM, so this issue is not related to the fix provided in CU4.

  • Thanks for all the updates. Some very useful information!
