Click here to monitor SSC
SQLServerCentral is supported by Red Gate Software Ltd.
 
Log in  ::  Register  ::  Not logged in
 
 
 
        
Home       Members    Calendar    Who's On


Add to briefcase «««23456»»

Page life expectancy diving to sub-10 on a 128 GB server Expand / Collapse
Author
Message
Posted Wednesday, April 24, 2013 7:29 AM


Old Hand

Old HandOld HandOld HandOld HandOld HandOld HandOld HandOld Hand

Group: General Forum Members
Last Login: 2 days ago @ 4:42 PM
Points: 323, Visits: 1,201
We haven't seen much of a problem with foreign memory requests, but it's an interesting idea. We did set up a lab a couple of weeks ago with a premiere support engineer on site to do some testing and send updates back to the product team, so we could give it a go there.

At the moment we are relatively stable on 2012 RTM with no CU's applied, though, so the priority of the issue has been dropped a notch or two. I still think 2012 is underperforming compared to 2008R2, particularly given it's running on almost double the hardware our old 2008R2 box was.


Sorry you were offended that "I" think it is silly to spend weeks going round and round for a solution to a production database issue


We had a Microsoft premiere support engineer on site working with myself and the bloody product team in the states during that time, which eventually resulted in a private hotfix being sent out. You might want to pull your head in.


allmhuran.com - download the SSMSDeploy addin for SSMS 2008
Blog on sqlservercentral
Post #1445960
Posted Wednesday, April 24, 2013 8:17 AM


SSCarpal Tunnel

SSCarpal TunnelSSCarpal TunnelSSCarpal TunnelSSCarpal TunnelSSCarpal TunnelSSCarpal TunnelSSCarpal TunnelSSCarpal TunnelSSCarpal Tunnel

Group: General Forum Members
Last Login: Today @ 7:16 PM
Points: 4,471, Visits: 6,401
Head pulled, and I have you on my list of posters who really know their SQL engine!

Best,

Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru at GMail
Post #1446005
Posted Wednesday, April 24, 2013 9:00 AM


Old Hand

Old HandOld HandOld HandOld HandOld HandOld HandOld HandOld Hand

Group: General Forum Members
Last Login: 2 days ago @ 4:42 PM
Points: 323, Visits: 1,201
TheSQLGuru (4/24/2013)
Head pulled, and I have you on my list of posters who really know their SQL engine!


My apologies, I have been off the forum for a while and missed a few updates, it looks like you have been in the same boat, but added a new post to Tommy's thread which, on reading it, was obviously more sympathetic to, and understanding of, the situation at hand.

Unfortunately I came to this thread first and got my feathers ruffled.


allmhuran.com - download the SSMSDeploy addin for SSMS 2008
Blog on sqlservercentral
Post #1446047
Posted Wednesday, April 24, 2013 9:10 AM


SSCarpal Tunnel

SSCarpal TunnelSSCarpal TunnelSSCarpal TunnelSSCarpal TunnelSSCarpal TunnelSSCarpal TunnelSSCarpal TunnelSSCarpal TunnelSSCarpal Tunnel

Group: General Forum Members
Last Login: Today @ 7:16 PM
Points: 4,471, Visits: 6,401
allmhuran (4/24/2013)
TheSQLGuru (4/24/2013)
Head pulled, and I have you on my list of posters who really know their SQL engine!


My apologies, I have been off the forum for a while and missed a few updates, it looks like you have been in the same boat, but added a new post to Tommy's thread which, on reading it, was obviously more sympathetic to, and understanding of, the situation at hand.

Unfortunately I came to this thread first and got my feathers ruffled.


No apology needed. Disconnected comms like a forum are rife with opportunites for misunderstandings/misinterpretations/etc.

Sure hope the SQL Server team gets the scalability kinks worked out of SQL 2012, and quickly!! I am not liking what I am seeing and hearing about on a number of fronts honestly...


Best,

Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru at GMail
Post #1446054
Posted Wednesday, April 24, 2013 9:10 AM
Grasshopper

GrasshopperGrasshopperGrasshopperGrasshopperGrasshopperGrasshopperGrasshopperGrasshopper

Group: General Forum Members
Last Login: Monday, November 3, 2014 12:54 PM
Points: 10, Visits: 109
"I still think 2012 is underperforming compared to 2008R2, particularly given it's running on almost double the hardware our old 2008R2 box was."

This is a big clue. Have you tested the workflows on SQL Server 2008 R2 on the new hardware, or was it SQL Server 2008 R2 on old hardware, SQL Server 2012 on new hardware only? SQL Server 2008 R2 might struggle just as hard as SQL Server 2012 on "twice the hardware", depending on what that means.

This is what I learned based on internal testing and production system observation across numerous 2, 4, and 8 NUMA node systems. The workloads within scope of our tests:
1) high concurrency ETL (typically twice as many client connections than logical CPUs)
2) higher concurrency batch report workload (driven by reports scheduled on report app servers, concurrency typically at least 4 times the logical CPU count).



1) assuming same core count, increase core count per NUMA node (or drive up query memory allocation concurrency per NUMA node in any manner) = poke the bear of per-NUMA-node query memory allocation spinlock contention. Indicated by CMEMTHREAD waits, and high CPU during low page lookups/logical reads per CPU % utilized. Can lead to high CPU utilized per core, aligned along NUMA node boundaries (if not across all cores). Resolved by trace flag 8048.

2) assuming same core count, increase NUMA node count = decrease the database cache efficiency for workloads whose queries/connections don't benefit from NUMA node affinity. Basically, imbalanced cache churn rates cause more disk IO than necessary*. Seems to vary with the square of NUMA node count according to my tests. This case is especially relevant when transitioning between Intel servers at one NUMA node per socket and the two-NUMA node per socket AMD servers. Resolved by trace flag 8015. (Don't use without trace flag 8048 or you'll poke the #1 bear even harder.)

3) for a given amount of RAM, increase NUMA node count = increased exposure to persistent high foreign memory page count on multiple SQL Server buffer nodes. I haven't seen confirmation of this anywhere, but especially after the SQL Server 2012 SP1 CU3 hotfix for NUMA behavior, I'm convinced that persistent high foreign page counts triggers a reaction within SQLOS, trying to sort away the high foreign page count. If the count can't be sorted away (perhaps because the amount of free memory on each NUMA node doesn't allow it, even if the amount of physical memory on each physical NUMA node is balanced) a significant amount of overhead is added to some SQL Server memory operations. That overhead seems to be hard to pinpoint with typical wait and spinlock measures. (It may be related to a spinlck category that isn't documented anywhere available to me yet.) This is also resolved with trace flag 8015.

4) for a given core count, increase NUMA node count = increased likelihood of imbalanced CPU utilization, aligned with NUMA node boundaries, due to the tendency of parallel workers for a given query to be assigned within a NUMA node. Closely related to number 2. Increases the CXPACKET and SOS_SUSPEND_QUEUE waits for a given workload, as well as signal wait time across all wait types. This is also resolved by trace flag 8015.

So, during your targeted workloads, you may want to monitor per CPU CPU busy, per buffer node PLE, stolen pages, database cache pages, and free pages, and overall stolen pages, and reserved memory, and granted workspace memory.

Use a stored procedre to capture and diff waits and spinlocks in small increments. At live production sites I was using 5 minute intervals, in-house testing we decreased the intervals.

Rohit Nayak's post on debugging CMEMTHREAD waits can really help, too - especially if you are skilled with extended events (I'm no good at them yet).
http://blogs.msdn.com/b/psssql/archive/2012/12/20/how-it-works-cmemthread-and-debugging-them.aspx

You may be in a situation where trace flag 8048 helps but 8015 provides no extra benefit. If 8015 helps, I'd warn against using it without 8048. Of course, monitoring CMEMTHREAD waits, spinlocks, per-node PLE and free memory, stolen memory, reserved memory, etc, might just show that neither of these will help at all.

Or, you may be in a situation where 8015 + 8048 could help, but going to a single lazy writer or a single IO completion port makes it a no-go.

I posted this at length here not to bore or attempt to impress you :) but because as far as I know, I'm the only person that's talking much about this potential optimization for "SMP type" batch workloads that don't require NUMA node connection affinity. Many data warehouses fall into this description. And, attempting to explain it here allows me to practice for an upcoming blog post where I'll include perfmon stats and maybe a graph or two. :)


*I often see an assumption that the main considerations in widely swinging PLE (when the PLE calculation is legit) are database block physical read rate and/or first database block write rate. When stolen memory (this includes the "active/used" portion of query memory grants - the not-yet-used query memory which may still cache database blocks is in "reserved memory") goes up, the database cache typically gets smaller - then PLE goes down even if physical read rate and first write rate remain the same. Same pace of blocks marching down the line, but its a shorter line til they age out :)
Post #1446055
Posted Wednesday, April 24, 2013 10:05 AM


SSCarpal Tunnel

SSCarpal TunnelSSCarpal TunnelSSCarpal TunnelSSCarpal TunnelSSCarpal TunnelSSCarpal TunnelSSCarpal TunnelSSCarpal TunnelSSCarpal Tunnel

Group: General Forum Members
Last Login: Today @ 7:16 PM
Points: 4,471, Visits: 6,401
That's an awesome discussion of some very deep scalability issues!! I look forward to the blog post!

Best,

Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru at GMail
Post #1446095
Posted Wednesday, April 24, 2013 5:43 PM


Old Hand

Old HandOld HandOld HandOld HandOld HandOld HandOld HandOld Hand

Group: General Forum Members
Last Login: 2 days ago @ 4:42 PM
Points: 323, Visits: 1,201
Thanks for the detailed information on those trace flags. I also eagerly await the blog post.

In an ideal world I would definitely be running replays on our lab environment to observe the effects of these different options. I must say, I am.... suspicious.... of NUMA.

In the real world... well, we'll see if I can get it prioritized. I've already been drawn off in other directions because the server is running "relatively OK" on RTM.

Here's a picture of PLE by node, and free memory (orange line) on RTM. Ignore the giant free memory spike, that was me allocating 10 more Gig to SQL.

In my opinion this is still a pretty wild ride, but it's definitely better than it looked under SP1.





allmhuran.com - download the SSMSDeploy addin for SSMS 2008
Blog on sqlservercentral
Post #1446262
Posted Wednesday, April 24, 2013 6:01 PM


SSCarpal Tunnel

SSCarpal TunnelSSCarpal TunnelSSCarpal TunnelSSCarpal TunnelSSCarpal TunnelSSCarpal TunnelSSCarpal TunnelSSCarpal TunnelSSCarpal Tunnel

Group: General Forum Members
Last Login: Today @ 7:16 PM
Points: 4,471, Visits: 6,401
That's just FUBAR right there!!

Best,

Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru at GMail
Post #1446263
Posted Wednesday, April 24, 2013 6:20 PM


Old Hand

Old HandOld HandOld HandOld HandOld HandOld HandOld HandOld Hand

Group: General Forum Members
Last Login: 2 days ago @ 4:42 PM
Points: 323, Visits: 1,201
Er, just noticed I had cut off the X axis markers. That graph represents about 3 days.

allmhuran.com - download the SSMSDeploy addin for SSMS 2008
Blog on sqlservercentral
Post #1446265
Posted Wednesday, April 24, 2013 6:35 PM


Old Hand

Old HandOld HandOld HandOld HandOld HandOld HandOld HandOld Hand

Group: General Forum Members
Last Login: 2 days ago @ 4:42 PM
Points: 323, Visits: 1,201
By way of comparison, here's what it was on SP1 without the private fix. Also three days.

As a reminder, this is a 40 physical core, 4 NUMA node, 128 gig server, 110 allocated to the SQL instance. Which suggests to me that if SQL has consumed all the memory allocated to it and then PLE on one of the nodes crashes to zero (which, as the graphs show, happens far less on RTM than SP1), it represents a truly massive amount of cache flush. Almost 30 gig. Microsoft is suggesting that the hotfix works (except for the dll version issues) and that the remaining craziness is probably related to query activity.

Them's some mighty big queries if you ask me.

To try and get some quick wins, our next steps are to enable compression (obvious choice with that much CPU power) and set up an always-on replica to act as the source for operational SSRS reports. These are things we always planned to do, but when we saw things go bad after migrating to 2012 attention was, for obvious reasons, shifted.



allmhuran.com - download the SSMSDeploy addin for SSMS 2008
Blog on sqlservercentral
Post #1446266
« Prev Topic | Next Topic »

Add to briefcase «««23456»»

Permissions Expand / Collapse