New SQL 2012 R2 SP2 Cluster - Slow Performance - Bad Query Plan - High CPU

  • We built a SQL 2012 Enterprise R2 SP2 cluster under Windows Server 2012. The cluster consists of 2 blades, each having 2 procs with 12 cores per proc, and 256GB of memory. I ran a basic query and noticed it took anywhere from 29 to 32 seconds to complete. I failed over to the secondary node and saw the same results. Nothing else is running on the server. I then realized the per CAL exe we used limits the processors to 20 so I ran msconfig and set the procs to 20. Microsoft noticed a different query plan being used on the cluster as compared to a stand alone 2012 build that has 12 processors. And I noticed SQL Server was using individual processors much more than I would expect, with one proc spiking to 100% for a few seconds, then dropping to zero while another proc jumped to 100% for a few more seconds. I do not see this same behavior on the stand alone SQL 2012 server, or on our SQL 2008 R2 servers. Microsoft is researching why a different plan was chosen, and I am about to escalate the case. Does anyone have any idea what may be happening?

    Select DISTINCT

    SP.SPID,

    LoginId = LEFT(SP.Loginame,50),

    HostName=LEFT(SP.HostName,20),

    DBName=DB_NAME(DB.database_id),

    SP.Program_Name,

    SP.Last_Batch

    From master.sys.sysprocesses SP

    Join master.sys.databases DB

    On SP.dbid = DB.database_id

    Left Join master.sys.dm_tran_locks L

    On SP.spid = L.request_session_id

    Where DB_NAME(L.resource_database_id) = 'DBA_HOME' <-- Introducing this line causes the query to go from under 1 second to 32 seconds.

    The wait type is SOS_SCHEDULER_YIELD

    The difference in the query plans is the bad plan uses DISTINCT SORT, where the good plan uses a SORT earlier in the plan (not a DISTINCT SORT) and includes a Stream Aggregate

    Thanks

  • I somehow forced the sort to be earlier by doing this

    DECLARE @id INT = DB_ID('MyDB')

    SELECT DISTINCT

    SP.spid, LoginId = LEFT(SP.loginame, 50), HostName = LEFT(SP.hostname, 20), DBName = DB_NAME(DB.database_id),

    SP.program_name, SP.last_batch

    FROM sys.sysprocesses SP

    INNER JOIN sys.databases DB ON SP.dbid = DB.database_id

    INNER JOIN ( SELECT TOP 100000000000

    request_session_id

    FROM sys.dm_tran_locks L

    WHERE L.resource_database_id = @id ) window ON SP.spid = window.request_session_id

    So if you need to guarantee that it runs this way only, then try it out on both servers and see what happens.

    Catch-all queries done right [/url]
    Gail Shaw's Performance Blog[/url]

  • Try this

    Where L.resource_database_id = DB_ID('DBA_HOME')

    -----------------------------------------------------------------------------------------------------------

    "Ya can't make an omelette without breaking just a few eggs" 😉

  • I'm not concerned with getting the query to run faster, but rather why it runs in under 1 second on non-clustered 2012 servers, and clustered 2008 R2 servers, but over 30 seconds on the 2012 R2 SP2 cluster. Microsoft doesn't understand why a different query plan is being used for the same query. But I'm wondering if this has something to do with the number of procs and the binaries associated with the 2012 R2 CAL installation. There used to be an issue in SQL 2005 with 24 proc servers, where you had to set the number of procs to 1 in order to install SQL Server, and then set it back to 24 after the service pack installs were finished. So I'm wondering if the CAL binaries have an issue with 12 core servers, or 24 total CPUs. Or perhaps I need to review the bootstrap log file to see if an unreported error occurred during the installation.

    Thanks

  • Have a look at these two articles, particularly the latter:

    http://sqlblog.com/blogs/aaron_bertrand/archive/2012/04/27/a-cautionary-tale-about-grandfathering-CAL-licenses-in-SQL-Server-2012-Enterprise.aspx

    http://sqlperformance.com/2012/11/system-configuration/2012-cal-problems

    In your case, the imbalance from doing round-robin between a socket using 12 cores and a socket that only has 8 is probably at least making the problem worse.

    What's the MAXDOP on the clustered instance and on the stand-alone instance?

    At any rate, it's not particularly surprising that an instance with 20 cores exposed would come up with a different plan than one with 12, and it's not all that unusual for the plan exploiting more cores to actually perform worse.

    My guess is SQL Server is seeing that it has 20 cores available on the clustered instance, and is incorrectly assuming that makes some operator less expensive than it actually does. If MAXDOP isn't set appropriately, that's very likely the case, as SQL Server tends to make poor choices with so many cores, because its parallel operators tend to scale poorly.

    If you use a query hint to force that query to run with the same MAXDOP on the 20-cores-exposed-to-SQL cluster as it does on the 12-core standalone, does the difference go away?

    I hope some of this information helps!

  • Thank you for the links. I actually found both of those links yesterday, which caused me to think it could be related to some type of imbalance, or possibly SQL Server choosing a bad plan. Good reads by the way. Didn't realize Jonathan wrote the one until I opened it again just now. Anyway, that's what led me to try running msconfig and reducing the procs, first to 20, then to 12, then to 1. I had MAXDOP at 8 when there were 20 procs. Then set it to zero for my other tests. I saw no improvements. When I ran Jonathan's code I was able to confirm that SQL was seeing the number of procs I had set in msconfig. Current_Task_Count was low (2 or 3 I believe), which is probably normal for what I was running.

    Could an imbalance still exist even when I told the server there was only one visible proc? I'm guessing no, but I don't know for certain. What I didn't do was look at SQL Server properties to see how many procs appeared in the GUI. I may give that a try now.

    Thanks again,

    Dave

  • Ah, I was hoping it would be more helpful. C'est la vie.

    At one visible core I wouldn't expect there to be any balancing issues.

    If you run the query with OPTION (MAXDOP 1) on both the standalone and the cluster, does the difference persist?

  • Same results, but it was worth a try 🙂

    I just can't reason why it runs fine everywhere else and not here. I'm very tempted to reinstall SQL Server using the per-core exe just to see if that makes a difference.

    Thanks again

  • Alas...well, as you say, it was worth a try.

    That is indeed quite puzzling. At the moment I don't really have any more ideas.

    I'm actually beginning to think that my whole parallelism idea may have been a red herring, as I don't think that query will ever run in parallel. At least, neither 8649 nor make_parallel suffice (see here for those who haven't seen those before).

    I wish I could have been more help!

    If you or MS ever get to the bottom of it, please let me know. My curiosity has been irreversibly piqued 🙂

  • Quick suggestion have a look at Glen Berry's article Balancing Your Available SQL Server Core Licenses Evenly Across NUMA Nodes[/url]

    😎

  • I thought I would provide an update. Microsoft has turned the case over to product support, as I believe this is a bug. I performed multiple installations involving SQL 2012, SQL 2012 R2, Window Server 2012, Windows Server 2012 R2, clustering, etc... and determined that with SQL Server 2012 and Windows Server 2012, the following query would take 25 - 30 seconds to complete "IF" MAX SERVER MEMORY on my 256GB cluster was set to 230GB. Lowering MAX SERVER MEMORY to 99GB and it completes in 5 seconds. Lowering MAX SERVER MEMORY to 2GB and it ran in under 1 second. So why would this happen??? The answer appears to lie with sys.dm_tran_locks and possibly Lock Manager.

    Select DISTINCT

    SP.SPID,

    LoginId = LEFT(SP.Loginame,50),

    HostName=LEFT(SP.HostName,20),

    DBName=DB_NAME(DB.database_id),

    SP.Program_Name,

    SP.Last_Batch

    From master.sys.sysprocesses SP

    Join master.sys.databases DB

    On SP.dbid = DB.database_id

    Left Join master.sys.dm_tran_locks L

    On SP.spid = L.request_session_id

    Where DB_NAME(L.resource_database_id) = 'DBA_HOME'

    When I replaced sms.dm_tran_locks with sys.syslockinfo I saw a similar performance delay based upon MAX SERVER MEMORY settings.

    Various tests were performed to exclude NUMA node, and to reduce the number of processors available to SQL Server via balanced affinity masks. This had no effect on the performance of the query. After further research I found that SQL 2012 made some enhancements to Lock Manager, which dm_tran_locks and syslockinfo have in common. I then reproduced the problem on a second cluster of identical hardware, a third blade install (non-clustered), and a fourth non-blade (non-clustered) install. Microsoft's product support team has the case and will hopefully reproduce the issue in their labs. The question now becomes is this an issue with "Lock Manager", or something else dm_tran_locks and syslockinfo have in common? Hopefully the problem is limited to DMVs and system views, and will not impact our user defined objects.

    Sorry for the long post. I'll provide another update once I hear from product support.

    Thanks, Dave

  • *** UPDATE ***

    In June Microsoft filed the above problem as a bug with MEMORY MANAGER, as they were able to reproduce the issue in their lab. Their development team is working on a fix. The only solution at this time is to reduce MAX MEMORY.

    I am opening another case with Microsoft related to a similar issue. On our SQL 2012 EE servers with 256GB of memory, DBCC CHECKDB operations perform considerably slower with MAX MEMORY set to 230GB vs MAX MEMORY set to a lower number such as 46GB. We also see performance hits with UPDATE STATISTICS and are in the process of checking REINDEXING, which on the surface appears to experience similar delays due to large amounts of memory. I'll create a different post for these issues because I'm curious if anyone else has noticed the same behavior.

Viewing 12 posts - 1 through 11 (of 11 total)

You must be logged in to reply to this topic. Login to reply