NUMA and PLE on SQL Server 2012

Question

NUMA and PLE on SQL Server 2012

Tommy Bollhofer

SSChampion

Points: 14940
More actions
February 5, 2013 at 7:26 am

#268125

I've read both Paul and Jonathan's blogs regarding this issue (http://www.sqlskills.com/blogs/paul/page-life-expectancy-isnt-what-you-think/) and started looking at the PLE counters for each individual NUMA node. I can't seem to wrap my head around why there is such a widespread discrepancy between the NUMA nodes. We are running SQL Server 2012 Enterprise Core. Any insight would be greatly appreciated.
Thanks,
Tommy
https://www.sqlservercentral.com/Forums/Uploads/image-unavailable.png
Tommy
Follow @sqlscribe

Viewing 15 posts - 1 through 15 (of 30 total)

You must be logged in to reply to this topic. Login to reply

Grant Fritchey SSC Guru Points: 398711 More actions · Answer 1

I'm pretty sure that it just comes down to the fact that while processes are distributed, the work isn't and one node can be done most/all of the work while others are effectively asleep. In general, but absolutely not always, I've seen the distribution between the nodes to be reasonably consistent over time, but weird spikes are fairly normal.

"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt

Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning

Tommy Bollhofer SSChampion Points: 14940 More actions · Answer 2

Tommy Bollhofer

SSChampion

Points: 14940

February 5, 2013 at 9:46 am

#1584488

Thank you Grant -

Tommy

Follow @sqlscribe

Tommy Bollhofer SSChampion Points: 14940 More actions · Answer 3

FYI, we opened an SR for this with MS about a few weeks ago and it is currently escalated to the DEV team. The consensus thus far is that there does appear to be a defect calculating the PLE value in 2012; I will update when more information from MS is available in the event anyone else is experiencing this.

Some additional info:

SQL Server 2012 Core Edition

4 Socket 8 Core Xeon w/ HT enabled, 128GB, 112GB max memory

We have been collecting the ring buffer outputs for RING_BUFFER_RESOURCE_MONITOR and they do not show any new entries around the time when PLE dropped. Moreover during this timeframe, Page Reads/sec value do not fluctuate much indicating that we did not have to read a bunch of new pages into memory. The query with max reads around this time frame was 32K

Tommy

Follow @sqlscribe

Don Halloran SSCertifiable Points: 5937 More actions · Answer 4

I've been watching PLE closely on our new 2012 box (see thread: PLE drops to sub 10 on 128gb server). Grabbing two nodes and graphing them over several days shows that the lines match almost perfectly a lot of the time, but after large drops (eg, from 30,000+ to low hundreds) they tend to deviate from each other, wiggle around a little bit, and then resynch as PLE regrows back into the several hundreds / thousands.

Tommy Bollhofer SSChampion Points: 14940 More actions · Answer 5

Thanks for the info, I will be sure to check out the thread. We are seeing the same exact behavior. Physical memory is always reported high during these large drops which seem to occur at random and oddly enough during periods of low activity with the largest read at 32K. I opened a connect item for this prior to the Sev-B case (which has been open for over a month now) however the product team closed it rather quickly opting to work offline with us on it instead...

Tommy

Follow @sqlscribe

TheSQLGuru SSC Guru Points: 134017 More actions · Answer 6

One thing I don't see mentioned is if PERFORMANCE or actual IO or other metrics change when PLE falls off a cliff. If it doesn't, I would chalk it up as an internal bug that is "meaningless" (and which thus might get a lower priority by MS to fix). If it does, then the issue becomes finding out WHY stuff actually gets flushed out of the buffer pool and address that.

Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service

Tommy Bollhofer SSChampion Points: 14940 More actions · Answer 7

TheSQLGuru (3/14/2013)
One thing I don't see mentioned is if PERFORMANCE or actual IO or other metrics change when PLE falls off a cliff. If it doesn't, I would chalk it up as an internal bug that is "meaningless" (and which thus might get a lower priority by MS to fix). If it does, then the issue becomes finding out WHY stuff actually gets flushed out of the buffer pool and address that.

Hey Kevin, you are correct. For this particular issue, we are not experiencing any performance degradation whatsoever before or after the PLE drop and neither MS nor our team has been able to correlate it with any other metrics, IO or otherwise. So in a sense it is a "meaningless" internal bug that in all likelihood will be a much lower priority for MS to fix.

Tommy

Follow @sqlscribe

Don Halloran SSCertifiable Points: 5937 More actions · Answer 8

Ah, well that's interesting. I am seeing performance degradation with the PLE drops as well as the movement of other related counters (eg disk % jumping to 100,000%... yes... one hundred thousand!) when the collapses occur

Tommy Bollhofer SSChampion Points: 14940 More actions · Answer 9

allmhuran (3/14/2013)
Ah, well that's interesting. I am seeing performance degradation with the PLE drops as well as the movement of other related counters (eg disk % jumping to 100,000%... yes... one hundred thousand!) when the collapses occur

Interesting; perhaps the particular issue you are experiencing is unrelated to ours. Nevertheless, there are a host of collections we ran to rule out other issues, if you would like, please feel free to contact email me offline and I will be more than happy to provide the scripts, collections, etc.

I would start by dumping the ring buffer resource monitor (script below) before, during, and after the event.

SELECT CONVERT (varchar(30), GETDATE(), 121) as [RunTime],

dateadd (ms, (rbf.[timestamp] - tme.ms_ticks), GETDATE()) as [Notification_Time],

cast(record as xml).value('(//Record/ResourceMonitor/Notification)[1]', 'varchar(30)') AS [Notification_type],

cast(record as xml).value('(//Record/MemoryRecord/MemoryUtilization)[1]', 'bigint') AS [MemoryUtilization %],

cast(record as xml).value('(//Record/MemoryNode/@id)[1]', 'bigint') AS [Node Id],

cast(record as xml).value('(//Record/ResourceMonitor/IndicatorsProcess)[1]', 'int') AS [Process_Indicator],

cast(record as xml).value('(//Record/ResourceMonitor/IndicatorsSystem)[1]', 'int') AS [System_Indicator],

cast(record as xml).value('(//Record/MemoryNode/ReservedMemory)[1]', 'bigint')/1024 AS [SQL_ReservedMemory_MB],

cast(record as xml).value('(//Record/MemoryNode/CommittedMemory)[1]', 'bigint')/1024 AS [SQL_CommittedMemory_MB],

cast(record as xml).value('(//Record/MemoryNode/AWEMemory)[1]', 'bigint')/1024 AS [SQL_AWEMemory_MB],

cast(record as xml).value('(//Record/MemoryNode/PagesMemory)[1]', 'bigint') AS [PagesMemory_MB],

cast(record as xml).value('(//Record/MemoryRecord/TotalPhysicalMemory)[1]', 'bigint')/1024 AS [TotalPhysicalMemory_MB],

cast(record as xml).value('(//Record/MemoryRecord/AvailablePhysicalMemory)[1]', 'bigint')/1024 AS [AvailablePhysicalMemory_MB],

cast(record as xml).value('(//Record/MemoryRecord/TotalPageFile)[1]', 'bigint')/1024 AS [TotalPageFile_MB],

cast(record as xml).value('(//Record/MemoryRecord/AvailablePageFile)[1]', 'bigint')/1024 AS [AvailablePageFile_MB],

cast(record as xml).value('(//Record/MemoryRecord/TotalVirtualAddressSpace)[1]', 'bigint')/1024 AS [TotalVirtualAddressSpace_MB],

cast(record as xml).value('(//Record/MemoryRecord/AvailableVirtualAddressSpace)[1]', 'bigint')/1024 AS [AvailableVirtualAddressSpace_MB],

cast(record as xml).value('(//Record/@id)[1]', 'bigint') AS [Record Id],

cast(record as xml).value('(//Record/@type)[1]', 'varchar(30)') AS [Type],

cast(record as xml).value('(//Record/@time)[1]', 'bigint') AS [Record Time],

tme.ms_ticks as [Current Time],

cast(record as xml) as [Record]

from sys.dm_os_ring_buffers rbf

cross join sys.dm_os_sys_info tme

where rbf.ring_buffer_type = 'RING_BUFFER_RESOURCE_MONITOR'

order by rbf.timestamp ASC

Tommy

Follow @sqlscribe

Don Halloran SSCertifiable Points: 5937 More actions · Answer 10

Thanks Tommy. I've already got collectors running and premier support provided me with some additional tools to help diagnose the issue... I sent about 750 meg worth of log data to them yesterday. I hope that's enough 😛

Tommy Bollhofer SSChampion Points: 14940 More actions · Answer 11

FYI, CU3 for SQL Server 2012 SP1 has been released which includes the following hot fix.

SQL Server 2012 experiences performance issues in NUMA environments

http://support.microsoft.com/kb/2819662

CU3 for SQL Server 2012 SP1 is available via:

http://support.microsoft.com/kb/2812412

Tommy

Follow @sqlscribe

sql_handle Say Hey Kid Points: 693 More actions · Answer 12

Here's the thing: the SQL Server 2012 SP1 CU3 fix is probably a good thing - I'm waiting on more details on it. But, the behavior it addresses goes all the way back to SQL Server 2005. No fixes yet for versions SQL Server 2005, 2008, 2008 R2. So for others experiencing similar issues, trace flag 8015 which the kb article http://support.microsoft.com/kb/2819662 may be a solution. But, please consider using startup trace flag 8015 together with startup trace flag 8048 to avoid spinlock contention (and the possibility of a spinlock convoy).

If the MSSQL 2012 SP1 CU3 fix is targeted to the handling of NUMA node foreign and away buffers, there still could be a considerable benefit to TF 8015 + 8048 on that version or later. That's because some workloads just work better with a single large bpool and a single large scheduler group.

Worked for quite a while to reproduce in a test lab the large numbers of persistent foreign pages I was seeing in the field on two, four, and eight NUMA node servers. I wasn't able to*. But, on a four NUMA node server with no foreign pages after achieving max server memory the test workload (thousands of batch queries submitted at concurrency of 120 queries) required ~25% more disk IO and ~10% more elapsed time to complete than the comparative test runs with trace flag 8048 and trace flag 8015 in place.

So, dealing with foreign/away buffers is an important piece of the puzzle, but its far from the only piece of the puzzle. For many systems SQL Server 2008 R2 and before, TF 8048 + 8015 will be the only available resolution until a future CU/SP. But, even then, depending on the scope of the fix therein, TF 8048 + 8015 may offer more performance benefits.

The tradeoff is that systems with TF 8015 in place will have only one lazywriter and one IO completion port, instead of one lazy writer and one IO completion port per NUMA node. But... that's how most SQL Server on VMWare or Hyper-V will also operate. And many systems will not be adversely effected by a single lazy writer or IO completion port (keep watch on checkpoint performance to verify).

Sooner or later I'll post at sql-sasquatch.blogspot.com more details of the test system and perfmon/spin/wait stats from the test workloads. Also diagrams of the mechanics as I understand them, and some deidentified stats from 4 and 8 NUMA node servers in the field. It takes me forever to prepare that stuff, sorry.

*I now believe that the high persistent foreign pages was due to interaction of complex ETL and/or SSIS packages with "right-sized" systems (relative to SQL Server max server memory and other expected SQL Server memory needs). Without much free memory on these systems, SQL Server couldn't continue to perform memory sorting to eliminate the foreign pages. On the other hand, the test lab systems always had super-sized memory, to allow for multiple iterations of tests at various memory configurations. With more free memory available at various max server memory levels, SQL Server was able to sort memory as it wanted and keep the number of foreign memory pages to a minimum.

http://msdn.microsoft.com/en-us/library/ms345403%28v=sql.105%29.aspx

Don Halloran SSCertifiable Points: 5937 More actions · Answer 13

We have been looking at a similar problem, you can find my thread on it here http://www.sqlservercentral.com/Forums/Topic1424826-2799-1.aspx

After working with MS on this for a couple of weeks we have received a hotfix.

The bug was apparently introduced by SP1. All we know so far is that it has something to do with SQL not flipping off the low memory flag, although we haven't heard why it would have been flipped on in our environment yet. For those who haven't seen the other thread, my PLE graph looked like shark's teeth, with gradual climbs into reasonable values and precipitous crashes back to zero happening half a dozen or more times a day.

We've tested the fix in a distributed replay environment and the graph now looks as it should... slow and steady PLE growth throughout the execution of the replay, and we'll be pushing it into prod tonight. I'd expect this to be integrated into CU4.

ewebb 23926 Valued Member Points: 71 More actions · Answer 14

Wow, we have been fighting this same issue and it is really driving us nuts. Is there any reference info on the hotfix you can provide so we can get it? We have the same issue where it appears that sql believes there is a low memory situation and begins dropping the buffer cache with a corresponding drop in PLE. All the memory is released to free memory and not used. The buffer will then try to fill back up then drop to zero again. This will eventually clear itself or if I run DBCC FREESYSTEMCACHE ('ALL') it will return to normal right away.

Again, any help would be appreciated.

Thanks,

Erik