Performance Problem that I can't pin down

  • I have an unpredictable performance problem.

Symptoms are increased CPU and LATCH_EX waits overall.

On average, CPU time per query increases 10x.

    Loose correlation between re-compilations/sec and the CPU spikes.

System has 48 cores with 128 GB RAM.

Typical CPU load is less than 5%; during issues it will rise to 10%.

SQL Server waits normally hover around 2,000 ms/sec, but during these issue periods they will exceed 30,000 ms/sec.

During the issue, CPU is 50%.

In the wait breakdown, SOS waits (normally 0.5%) and signal waits (normally 2%) climb until, combined, they equal the total of the other two categories.

Latch waits are 23% of the total (normally 3.5%), and nearly all of that is LATCH_EX.

    All other stats are negligible.

What makes this confusing is that launching of threads on the server was also an issue (non-SQL processes were affected too), so we reduced max server memory to 96 GB, and after memory dropped to that value the system settled down. Thinking we had found the problem, we just left it there.

Now, 3 weeks later, we had the problem again. Memory was still at 96 GB. As a test, we reduced it to 64 GB and that fixed the issue.

    I have started capturing historical data from sys.dm_os_performance_counters for the following counters.

    Buffer cache hit ratio

    Buffer cache hit ratio base

    Cache Hit Ratio

    Cache Hit Ratio Base

    Lock Blocks

    Lock Blocks Allocated

    Lock Memory (KB)

    Lock Owner Blocks

    Lock Owner Blocks Allocated

    Lock Requests/sec

    Lock Timeouts (timeout > 0)/sec

    Lock Timeouts/sec

    Lock Wait Time (ms)

    Lock waits

    Lock Waits/sec

    SQL Compilations/sec

    SQL Re-Compilations/sec

    SQL Trace IO Provider Lock Waits

    Table Lock Escalations/sec
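For reference, the historical capture itself can be as simple as snapshotting the DMV into a table on a schedule. A minimal sketch (the table name and the sample counter filter are illustrative, not from the original post); note that the *_/sec counters are cumulative, so rates come from diffing consecutive snapshots:

```sql
-- Hypothetical history table for sys.dm_os_performance_counters snapshots
CREATE TABLE dbo.PerfCounterHistory (
    capture_time  DATETIME    NOT NULL DEFAULT GETDATE(),
    object_name   NCHAR(128)  NOT NULL,
    counter_name  NCHAR(128)  NOT NULL,
    instance_name NCHAR(128)  NULL,
    cntr_value    BIGINT      NOT NULL,
    cntr_type     INT         NOT NULL
);
GO

-- Run from a SQL Agent job step every few minutes
INSERT INTO dbo.PerfCounterHistory
        (object_name, counter_name, instance_name, cntr_value, cntr_type)
SELECT  object_name, counter_name, instance_name, cntr_value, cntr_type
FROM    sys.dm_os_performance_counters
WHERE   counter_name IN (N'Buffer cache hit ratio', N'SQL Re-Compilations/sec',
                         N'Lock Waits/sec', N'Table Lock Escalations/sec'); -- etc.
```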

Any ideas or suggestions on where to look next?

  • Latch_Ex on what kind of latch?

Latches (as opposed to PageLatch and PageIOLatch) are memory-related waits. Hence the need to know what kind of latch waits you're seeing.

    http://sqlcat.com/sqlcat/b/whitepapers/archive/2011/07/05/diagnosing-and-resolving-latch-contention-on-sql-server.aspx

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
Unfortunately I am not tracking that data historically, but looking at sys.dm_os_latch_stats:

ACCESS_METHODS_DATASET_PARENT: max of 8331, average of 9

BUFFER: max of 9922, average of 0

    All others of note are "Internal use only" DBCC* and NESTING*

    I'll start gathering values over time. (at least for those two)

    PAGELATCH and PAGEIOLATCH have total wait time of 1/1000th of LATCH
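That ratio can be checked directly from sys.dm_os_wait_stats by bucketing the wait types; a sketch of the comparison (the bucket names are my own labels):

```sql
-- Compare non-page latch waits (LATCH_*) against PAGELATCH_* and PAGEIOLATCH_*
SELECT  CASE
            WHEN wait_type LIKE 'PAGEIOLATCH%' THEN 'PAGEIOLATCH'
            WHEN wait_type LIKE 'PAGELATCH%'   THEN 'PAGELATCH'
            ELSE 'LATCH'
        END AS latch_class,
        SUM(wait_time_ms) AS total_wait_ms
FROM    sys.dm_os_wait_stats
WHERE   wait_type LIKE 'LATCH%'
   OR   wait_type LIKE 'PAGELATCH%'
   OR   wait_type LIKE 'PAGEIOLATCH%'
GROUP BY CASE
            WHEN wait_type LIKE 'PAGEIOLATCH%' THEN 'PAGEIOLATCH'
            WHEN wait_type LIKE 'PAGELATCH%'   THEN 'PAGELATCH'
            ELSE 'LATCH'
         END;
```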

  • Reset counters

    DBCC SQLPERF ('sys.dm_os_latch_stats', CLEAR);

    GO

Set up a monitor to track the values every 10 minutes... And now off to do some reading. Just the way to start off a holiday weekend.
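An alternative to clearing the counters with DBCC SQLPERF is to snapshot sys.dm_os_latch_stats into a table on the same schedule and diff consecutive rows, which preserves the cumulative history. A sketch (the table name is my own invention):

```sql
-- Hypothetical snapshot table; deltas between capture_times give per-interval waits
CREATE TABLE dbo.LatchStatsHistory (
    capture_time           DATETIME     NOT NULL DEFAULT GETDATE(),
    latch_class            NVARCHAR(60) NOT NULL,
    waiting_requests_count BIGINT       NOT NULL,
    wait_time_ms           BIGINT       NOT NULL,
    max_wait_time_ms       BIGINT       NOT NULL
);
GO

-- Every 10 minutes:
INSERT INTO dbo.LatchStatsHistory
        (latch_class, waiting_requests_count, wait_time_ms, max_wait_time_ms)
SELECT  latch_class, waiting_requests_count, wait_time_ms, max_wait_time_ms
FROM    sys.dm_os_latch_stats;
```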

  • hmm.... @@version?

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Microsoft SQL Server 2008 (SP2) - 10.0.4266.0 (X64) Nov 5 2010 16:11:39 Copyright (c) 1988-2008 Microsoft Corporation Enterprise Edition (64-bit) on Windows NT 6.1 <X64> (Build 7600: )

I was just reading about monitoring latch waits over on Paul Randal's blog. He suggested this white paper to understand what's happening with latch waits.

    ----------------------------------------------------
    The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood... Theodore Roosevelt
    The Scary DBA
    Author of: SQL Server 2017 Query Performance Tuning, 5th Edition and SQL Server Execution Plans, 3rd Edition
    Product Evangelist for Red Gate Software

  • Grant Fritchey (9/2/2011)


    I was just reading about monitoring latch waits over on Paul Randal's blog. He suggested this white paper to understand what's happening with latch waits..

    There seems to be an echo in here... 😉

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • GilaMonster (9/2/2011)


    Grant Fritchey (9/2/2011)


    I was just reading about monitoring latch waits over on Paul Randal's blog. He suggested this white paper to understand what's happening with latch waits..

    There seems to be an echo in here... 😉

    HA! I didn't even click through on that one. Probably should have. If I'd noticed the head of the URL I would have known.

    ----------------------------------------------------
    The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood... Theodore Roosevelt
    The Scary DBA
    Author of: SQL Server 2017 Query Performance Tuning, 5th Edition and SQL Server Execution Plans, 3rd Edition
    Product Evangelist for Red Gate Software

  • Update.

SELECT * FROM sys.dm_os_wait_stats ORDER BY wait_time_ms DESC

    showed that CXPACKET was in the top 10 (actually 4th highest) waits.

    So we concentrated on parallel queries.

First thing I found out is that, in practice, breaking a query up to a degree of parallelism beyond 8 can reduce performance, so the current rule of thumb for MAXDOP is to set the value to the number of cores in a single socket. In our case that is 6. In addition, some people have seen a benefit from disabling Hyper-Threading. We are not currently considering that, but might in the future.
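For reference, applying that rule of thumb at the server level looks something like this (a sketch; the value 6 reflects the cores-per-socket guideline above, not a universal setting):

```sql
-- Cap parallelism at the number of cores in one socket (6 here)
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'max degree of parallelism', 6;
RECONFIGURE;
```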

    I found a great article on parallel queries http://sqlserverpedia.com/blog/sql-server-bloggers/find-query-plans-that-may-utilize-parallelism/

    This query is great

--
-- Find query plans that may run in parallel
--
SELECT  DB_NAME(p.dbid) AS DBName,
        OBJECT_NAME(p.objectid, p.dbid) AS OBJECT_NAME,
        cp.usecounts,
        p.query_plan,
        q.text,
        p.dbid,
        p.objectid,
        p.number,
        p.encrypted,
        cp.plan_handle
FROM    sys.dm_exec_cached_plans cp
        CROSS APPLY sys.dm_exec_query_plan(cp.plan_handle) p
        CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) AS q
WHERE   cp.cacheobjtype = 'Compiled Plan'
        AND p.query_plan.value('declare namespace p="http://schemas.microsoft.com/sqlserver/2004/07/showplan"; max(//p:RelOp/@Parallel)', 'float') > 0
ORDER BY p.dbid, p.objectid

  • How long's the peak at 50%?

Sounds to me like you just have an OLAP query taking a little more resources than your other "normal" queries.

The events usually last between 5 and 10 minutes.

However, I should have clarified that 4th number. The other three were ignorable waits. In fact, everything in the top 10 except CXPACKET was ignorable.

    LAZYWRITER_SLEEP 716375247

    DISPATCHER_QUEUE_SEMAPHORE 188715613

    XE_DISPATCHER_WAIT 182711826

    CXPACKET 179763111

    BROKER_TASK_STOP 179128691

We found a table that was in need of indexing. Our best guess at this point is that we went into parallel paralysis when a certain class of reports was run (reports which depended heavily on this table).

    We suspect the table was the primary issue. The MAXDOP change is in place to hopefully protect us from future queries crippling the system. At this point, I am not comfortable going as far as setting it to 1.

    At

  • Bob Fazio (9/12/2011)


The events usually last between 5 and 10 minutes.

However, I should have clarified that 4th number. The other three were ignorable waits. In fact, everything in the top 10 except CXPACKET was ignorable.

    LAZYWRITER_SLEEP 716375247

    DISPATCHER_QUEUE_SEMAPHORE 188715613

    XE_DISPATCHER_WAIT 182711826

    CXPACKET 179763111

    BROKER_TASK_STOP 179128691

We found a table that was in need of indexing. Our best guess at this point is that we went into parallel paralysis when a certain class of reports was run (reports which depended heavily on this table).

    We suspect the table was the primary issue. The MAXDOP change is in place to hopefully protect us from future queries crippling the system. At this point, I am not comfortable going as far as setting it to 1.

    At

I'm really no expert with that big a server, but good sense tells me that there should probably be a MAXDOP setting at the server level (maybe 8, 12, or 16).

    That way a rogue query shouldn't kill the whole server.
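Alongside a server-level MAXDOP cap, raising 'cost threshold for parallelism' (default 5) keeps cheap queries serial so only genuinely expensive plans go parallel. A sketch; the value 25 is purely an illustrative starting point, not a recommendation from this thread:

```sql
-- Only plans with an estimated cost above the threshold are considered for parallelism
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'cost threshold for parallelism', 25;  -- default is 5
RECONFIGURE;
```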

    That being said, what does this return?

/*
Cost threshold for parallelism (CXPACKET): http://sqlblog.com/blogs/jonathan_kehayias/archive/2010/01/19/tuning-cost-threshold-of-parallelism-from-the-plan-cache.aspx
Paul White: Understanding parallelism: http://www.simple-talk.com/sql/learn-sql-server/understanding-and-using-parallelism-in-sql-server/
Microsoft white paper on waits: http://download.microsoft.com/download/4/7/a/47a548b9-249e-484c-abd7-29f31282b04d/Performance_Tuning_Waits_Queues.doc
Next query by Paul Randal: http://www.sqlskills.com/BLOGS/PAUL/post/Wait-statistics-or-please-tell-me-where-it-hurts.aspx
*/
WITH Waits AS
(SELECT wait_type,
        wait_time_ms / 1000.0 AS WaitS,
        (wait_time_ms - signal_wait_time_ms) / 1000.0 AS ResourceS,
        signal_wait_time_ms / 1000.0 AS SignalS,
        waiting_tasks_count AS WaitCount,
        100.0 * wait_time_ms / SUM (wait_time_ms) OVER() AS Percentage,
        ROW_NUMBER() OVER(ORDER BY wait_time_ms DESC) AS RowNum
 FROM sys.dm_os_wait_stats
 WHERE wait_type NOT IN (
        'CLR_SEMAPHORE', 'LAZYWRITER_SLEEP', 'RESOURCE_QUEUE', 'SLEEP_TASK',
        'SLEEP_SYSTEMTASK', 'SQLTRACE_BUFFER_FLUSH', 'WAITFOR', 'LOGMGR_QUEUE',
        'CHECKPOINT_QUEUE', 'REQUEST_FOR_DEADLOCK_SEARCH', 'XE_TIMER_EVENT', 'BROKER_TO_FLUSH',
        'BROKER_TASK_STOP', 'CLR_MANUAL_EVENT', 'CLR_AUTO_EVENT', 'DISPATCHER_QUEUE_SEMAPHORE',
        'FT_IFTS_SCHEDULER_IDLE_WAIT', 'XE_DISPATCHER_WAIT', 'XE_DISPATCHER_JOIN', 'BROKER_EVENTHANDLER',
        'TRACEWRITE', 'FT_IFTSHC_MUTEX', 'SQLTRACE_INCREMENTAL_FLUSH_SLEEP', 'BROKER_RECEIVE_WAITFOR')
)
SELECT  W1.wait_type AS WaitType,
        CAST (W1.WaitS AS DECIMAL(14, 2)) AS Wait_S,
        CAST (W1.WaitS * 1000 / W1.WaitCount AS DECIMAL(14, 2)) AS AvgWait_MS,
        CAST (W1.ResourceS AS DECIMAL(14, 2)) AS Resource_S,
        CAST (W1.SignalS AS DECIMAL(14, 2)) AS Signal_S,
        W1.WaitCount AS WaitCount,
        CAST (W1.Percentage AS DECIMAL(4, 2)) AS Percentage
FROM Waits AS W1
INNER JOIN Waits AS W2 ON W2.RowNum <= W1.RowNum
GROUP BY W1.RowNum, W1.wait_type, W1.WaitS, W1.ResourceS, W1.SignalS, W1.WaitCount, W1.Percentage
HAVING SUM (W2.Percentage) - W1.Percentage < 97 -- percentage threshold
   AND W1.WaitCount > 0;
GO

WaitType                          Wait_S    AvgWait_MS  Resource_S  Signal_S  WaitCount   Percentage
CXPACKET                          25955.36  3.82        22734.46    3220.90   6787719     38.08
OLEDB                             17137.44  0.14        17137.44    0.00      125703719   25.14
PAGEIOLATCH_SH                    8344.74   26.25       8298.34     46.40     317913      12.24
SOS_SCHEDULER_YIELD               4127.01   0.18        51.29       4075.72   22326984    6.05
BACKUPIO                          2285.14   0.82        2273.73     11.41     2776238     3.35
WRITELOG                          1565.00   0.74        1444.10     120.90    2101287     2.30
LATCH_EX                          1354.82   0.16        1092.97     261.85    8306790     1.99
BACKUPTHREAD                      1235.28   135.54      1235.01     0.27      9114        1.81
ASYNC_IO_COMPLETION               1017.53   4086.45     1017.52     0.01      249         1.49
BACKUPBUFFER                      980.15    0.16        900.92      79.22     6306888     1.44
ASYNC_NETWORK_IO                  936.73    0.30        899.38      37.36     3118273     1.37
SQLTRACE_LOCK                     487.04    5.49        465.67      21.37     88736       0.71
PREEMPTIVE_OS_AUTHENTICATIONOPS   389.93    0.17        389.93      0.00      2284736     0.57
SLEEP_BPOOL_FLUSH                 328.39    2.93        324.91      3.48      111984      0.48
