Performance Problem that I can't pin down

  • I have an unpredictable performance problem.

Symptoms are increased CPU and LATCH_EX waits overall.

On average, CPU time per query increases 10x.

    Loose correlation between re-compilations/sec and the CPU spikes.

System has 48 cores with 128 GB RAM.

Typical CPU load is less than 5%; during issues it will rise to 10%.

SQL Server waits normally hover around 2,000 ms/sec, but during these issue periods they will exceed 30,000 ms/sec.

During the issue, CPU is 50%.

In the wait breakdown, SOS waits (normally 0.5%) and signal waits (normally 2%) climb until, combined, they equal the total of the other two categories.

Latch waits are 23% of the total (normally 3.5%), and nearly all of that is LATCH_EX.

    All other stats are negligible.

What makes this confusing is that launching of threads on the server was also an issue (non-SQL processes were affected too), so we reduced max server memory to 96 GB, and after memory dropped to that value the system settled down. Thinking we had found the problem, we just left it there.

Now, 3 weeks later, we had the problem again. Memory was still at 96 GB. As a test, we reduced it to 64 GB and that fixed the issue.

    I have started capturing historical data from sys.dm_os_performance_counters for the following counters.

    Buffer cache hit ratio

    Buffer cache hit ratio base

    Cache Hit Ratio

    Cache Hit Ratio Base

    Lock Blocks

    Lock Blocks Allocated

    Lock Memory (KB)

    Lock Owner Blocks

    Lock Owner Blocks Allocated

    Lock Requests/sec

    Lock Timeouts (timeout > 0)/sec

    Lock Timeouts/sec

    Lock Wait Time (ms)

    Lock waits

    Lock Waits/sec

    SQL Compilations/sec

    SQL Re-Compilations/sec

    SQL Trace IO Provider Lock Waits

    Table Lock Escalations/sec
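For reference, the historical capture itself can be as simple as snapshotting the DMV into a table on a schedule. A minimal sketch (the table name and the sample counter filter are illustrative, not from the original post); note that the *_/sec counters are cumulative, so rates come from diffing consecutive snapshots:

```sql
-- Hypothetical history table for sys.dm_os_performance_counters snapshots
CREATE TABLE dbo.PerfCounterHistory (
    capture_time  DATETIME    NOT NULL DEFAULT GETDATE(),
    object_name   NCHAR(128)  NOT NULL,
    counter_name  NCHAR(128)  NOT NULL,
    instance_name NCHAR(128)  NULL,
    cntr_value    BIGINT      NOT NULL,
    cntr_type     INT         NOT NULL
);
GO

-- Run from a SQL Agent job step every few minutes
INSERT INTO dbo.PerfCounterHistory
        (object_name, counter_name, instance_name, cntr_value, cntr_type)
SELECT  object_name, counter_name, instance_name, cntr_value, cntr_type
FROM    sys.dm_os_performance_counters
WHERE   counter_name IN (N'Buffer cache hit ratio', N'SQL Re-Compilations/sec',
                         N'Lock Waits/sec', N'Table Lock Escalations/sec'); -- etc.
```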

Any ideas or suggestions on where to look next?

  • Latch_Ex on what kind of latch?

Latches (as opposed to PageLatch and PageIOLatch) are memory-related waits. Hence the need to know what kind of latch waits you're seeing.

    http://sqlcat.com/sqlcat/b/whitepapers/archive/2011/07/05/diagnosing-and-resolving-latch-contention-on-sql-server.aspx

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
Unfortunately I am not tracking that data historically, but looking at sys.dm_os_latch_stats:

ACCESS_METHODS_DATASET_PARENT: max of 8331, average of 9

BUFFER: max of 9922, average of 0

    All others of note are "Internal use only" DBCC* and NESTING*

    I'll start gathering values over time. (at least for those two)

    PAGELATCH and PAGEIOLATCH have total wait time of 1/1000th of LATCH
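That ratio can be checked directly from sys.dm_os_wait_stats by bucketing the wait types; a sketch of the comparison (the bucket names are my own labels):

```sql
-- Compare non-page latch waits (LATCH_*) against PAGELATCH_* and PAGEIOLATCH_*
SELECT  CASE
            WHEN wait_type LIKE 'PAGEIOLATCH%' THEN 'PAGEIOLATCH'
            WHEN wait_type LIKE 'PAGELATCH%'   THEN 'PAGELATCH'
            ELSE 'LATCH'
        END AS latch_class,
        SUM(wait_time_ms) AS total_wait_ms
FROM    sys.dm_os_wait_stats
WHERE   wait_type LIKE 'LATCH%'
   OR   wait_type LIKE 'PAGELATCH%'
   OR   wait_type LIKE 'PAGEIOLATCH%'
GROUP BY CASE
            WHEN wait_type LIKE 'PAGEIOLATCH%' THEN 'PAGEIOLATCH'
            WHEN wait_type LIKE 'PAGELATCH%'   THEN 'PAGELATCH'
            ELSE 'LATCH'
         END;
```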

  • Reset counters

    DBCC SQLPERF ('sys.dm_os_latch_stats', CLEAR);

    GO

Set up a monitor to track the values every 10 minutes... And now off to do some reading. Just the way to start off a holiday weekend.
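An alternative to clearing the counters with DBCC SQLPERF is to snapshot sys.dm_os_latch_stats into a table on the same schedule and diff consecutive rows, which preserves the cumulative history. A sketch (the table name is my own invention):

```sql
-- Hypothetical snapshot table; deltas between capture_times give per-interval waits
CREATE TABLE dbo.LatchStatsHistory (
    capture_time           DATETIME     NOT NULL DEFAULT GETDATE(),
    latch_class            NVARCHAR(60) NOT NULL,
    waiting_requests_count BIGINT       NOT NULL,
    wait_time_ms           BIGINT       NOT NULL,
    max_wait_time_ms       BIGINT       NOT NULL
);
GO

-- Every 10 minutes:
INSERT INTO dbo.LatchStatsHistory
        (latch_class, waiting_requests_count, wait_time_ms, max_wait_time_ms)
SELECT  latch_class, waiting_requests_count, wait_time_ms, max_wait_time_ms
FROM    sys.dm_os_latch_stats;
```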

  • hmm.... @@version?

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Microsoft SQL Server 2008 (SP2) - 10.0.4266.0 (X64) Nov 5 2010 16:11:39 Copyright (c) 1988-2008 Microsoft Corporation Enterprise Edition (64-bit) on Windows NT 6.1 <X64> (Build 7600: )

I was just reading about monitoring latch waits over on Paul Randal's blog. He suggested this white paper to understand what's happening with latch waits.

    ----------------------------------------------------
    The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood... Theodore Roosevelt
    The Scary DBA
    Author of: SQL Server 2017 Query Performance Tuning, 5th Edition and SQL Server Execution Plans, 3rd Edition
    Product Evangelist for Red Gate Software

  • Grant Fritchey (9/2/2011)


    I was just reading about monitoring latch waits over on Paul Randal's blog. He suggested this white paper to understand what's happening with latch waits..

    There seems to be an echo in here... 😉

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • GilaMonster (9/2/2011)


    Grant Fritchey (9/2/2011)


    I was just reading about monitoring latch waits over on Paul Randal's blog. He suggested this white paper to understand what's happening with latch waits..

    There seems to be an echo in here... 😉

    HA! I didn't even click through on that one. Probably should have. If I'd noticed the head of the URL I would have known.

    ----------------------------------------------------
    The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood... Theodore Roosevelt
    The Scary DBA
    Author of: SQL Server 2017 Query Performance Tuning, 5th Edition and SQL Server Execution Plans, 3rd Edition
    Product Evangelist for Red Gate Software

  • Update.

SELECT * FROM sys.dm_os_wait_stats ORDER BY wait_time_ms DESC

    showed that CXPACKET was in the top 10 (actually 4th highest) waits.

    So we concentrated on parallel queries.

First thing I found out is that, in practice, breaking a query up to a degree of parallelism beyond 8 can reduce performance, so the current rule of thumb for MAXDOP is to set the value to the number of cores in a single socket. In our case that is 6. In addition, some people have seen a benefit from disabling Hyper-Threading. We are not currently considering that, but might in the future.
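For reference, applying that rule of thumb at the server level looks something like this (a sketch; the value 6 reflects the cores-per-socket guideline above, not a universal setting):

```sql
-- Cap parallelism at the number of cores in one socket (6 here)
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'max degree of parallelism', 6;
RECONFIGURE;
```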

    I found a great article on parallel queries http://sqlserverpedia.com/blog/sql-server-bloggers/find-query-plans-that-may-utilize-parallelism/

    This query is great

--
-- Find query plans that may run in parallel
--
SELECT  DB_NAME(p.dbid) AS DBName,
        OBJECT_NAME(p.objectid, p.dbid) AS OBJECT_NAME,
        cp.usecounts,
        p.query_plan,
        q.text,
        p.dbid,
        p.objectid,
        p.number,
        p.encrypted,
        cp.plan_handle
FROM    sys.dm_exec_cached_plans cp
        CROSS APPLY sys.dm_exec_query_plan(cp.plan_handle) p
        CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) AS q
WHERE   cp.cacheobjtype = 'Compiled Plan'
        AND p.query_plan.value('declare namespace p="http://schemas.microsoft.com/sqlserver/2004/07/showplan"; max(//p:RelOp/@Parallel)', 'float') > 0
ORDER BY p.dbid, p.objectid

  • How long's the peak at 50%?

Sounds to me like you just have an OLAP query taking a little more resources than your other "normal" queries.

The events usually last between 5 and 10 minutes.

However, I should have clarified that 4th number. The other three were ignorable waits. In fact, everything in the top 10 except CXPACKET was ignorable.

    LAZYWRITER_SLEEP 716375247

    DISPATCHER_QUEUE_SEMAPHORE 188715613

    XE_DISPATCHER_WAIT 182711826

    CXPACKET 179763111

    BROKER_TASK_STOP 179128691

We found a table that was in need of indexing. Our best guess at this point is that we went into parallel paralysis when a certain class of reports was run (reports which depended heavily on this table).

    We suspect the table was the primary issue. The MAXDOP change is in place to hopefully protect us from future queries crippling the system. At this point, I am not comfortable going as far as setting it to 1.

    At

  • Bob Fazio (9/12/2011)


The events usually last between 5 and 10 minutes.

However, I should have clarified that 4th number. The other three were ignorable waits. In fact, everything in the top 10 except CXPACKET was ignorable.

    LAZYWRITER_SLEEP 716375247

    DISPATCHER_QUEUE_SEMAPHORE 188715613

    XE_DISPATCHER_WAIT 182711826

    CXPACKET 179763111

    BROKER_TASK_STOP 179128691

We found a table that was in need of indexing. Our best guess at this point is that we went into parallel paralysis when a certain class of reports was run (reports which depended heavily on this table).

    We suspect the table was the primary issue. The MAXDOP change is in place to hopefully protect us from future queries crippling the system. At this point, I am not comfortable going as far as setting it to 1.

    At

I'm really no expert with that big a server, but good sense tells me that there should probably be a MAXDOP setting at the server level (maybe 8, 12, or 16).

    That way a rogue query shouldn't kill the whole server.
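Alongside a server-level MAXDOP cap, raising 'cost threshold for parallelism' (default 5) keeps cheap queries serial so only genuinely expensive plans go parallel. A sketch; the value 25 is purely an illustrative starting point, not a recommendation from this thread:

```sql
-- Only plans with an estimated cost above the threshold are considered for parallelism
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'cost threshold for parallelism', 25;  -- default is 5
RECONFIGURE;
```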

    That being said, what does this return?

/*
Cost threshold for parallelism (CXPACKET): http://sqlblog.com/blogs/jonathan_kehayias/archive/2010/01/19/tuning-cost-threshold-of-parallelism-from-the-plan-cache.aspx
Paul White: Understanding parallelism: http://www.simple-talk.com/sql/learn-sql-server/understanding-and-using-parallelism-in-sql-server/
Microsoft white paper on waits: http://download.microsoft.com/download/4/7/a/47a548b9-249e-484c-abd7-29f31282b04d/Performance_Tuning_Waits_Queues.doc
Next query by Paul Randal: http://www.sqlskills.com/BLOGS/PAUL/post/Wait-statistics-or-please-tell-me-where-it-hurts.aspx
*/
WITH Waits AS
(SELECT wait_type,
        wait_time_ms / 1000.0 AS WaitS,
        (wait_time_ms - signal_wait_time_ms) / 1000.0 AS ResourceS,
        signal_wait_time_ms / 1000.0 AS SignalS,
        waiting_tasks_count AS WaitCount,
        100.0 * wait_time_ms / SUM (wait_time_ms) OVER() AS Percentage,
        ROW_NUMBER() OVER(ORDER BY wait_time_ms DESC) AS RowNum
 FROM sys.dm_os_wait_stats
 WHERE wait_type NOT IN (
        'CLR_SEMAPHORE', 'LAZYWRITER_SLEEP', 'RESOURCE_QUEUE', 'SLEEP_TASK',
        'SLEEP_SYSTEMTASK', 'SQLTRACE_BUFFER_FLUSH', 'WAITFOR', 'LOGMGR_QUEUE',
        'CHECKPOINT_QUEUE', 'REQUEST_FOR_DEADLOCK_SEARCH', 'XE_TIMER_EVENT', 'BROKER_TO_FLUSH',
        'BROKER_TASK_STOP', 'CLR_MANUAL_EVENT', 'CLR_AUTO_EVENT', 'DISPATCHER_QUEUE_SEMAPHORE',
        'FT_IFTS_SCHEDULER_IDLE_WAIT', 'XE_DISPATCHER_WAIT', 'XE_DISPATCHER_JOIN', 'BROKER_EVENTHANDLER',
        'TRACEWRITE', 'FT_IFTSHC_MUTEX', 'SQLTRACE_INCREMENTAL_FLUSH_SLEEP', 'BROKER_RECEIVE_WAITFOR')
)
SELECT  W1.wait_type AS WaitType,
        CAST (W1.WaitS AS DECIMAL(14, 2)) AS Wait_S,
        CAST (W1.WaitS * 1000 / W1.WaitCount AS DECIMAL(14, 2)) AS AvgWait_MS,
        CAST (W1.ResourceS AS DECIMAL(14, 2)) AS Resource_S,
        CAST (W1.SignalS AS DECIMAL(14, 2)) AS Signal_S,
        W1.WaitCount AS WaitCount,
        CAST (W1.Percentage AS DECIMAL(4, 2)) AS Percentage
FROM Waits AS W1
INNER JOIN Waits AS W2 ON W2.RowNum <= W1.RowNum
GROUP BY W1.RowNum, W1.wait_type, W1.WaitS, W1.ResourceS, W1.SignalS, W1.WaitCount, W1.Percentage
HAVING SUM (W2.Percentage) - W1.Percentage < 97 -- percentage threshold
   AND W1.WaitCount > 0;
GO

WaitType                          Wait_S    AvgWait_MS  Resource_S  Signal_S  WaitCount   Percentage
CXPACKET                          25955.36  3.82        22734.46    3220.90   6787719     38.08
OLEDB                             17137.44  0.14        17137.44    0.00      125703719   25.14
PAGEIOLATCH_SH                    8344.74   26.25       8298.34     46.40     317913      12.24
SOS_SCHEDULER_YIELD               4127.01   0.18        51.29       4075.72   22326984    6.05
BACKUPIO                          2285.14   0.82        2273.73     11.41     2776238     3.35
WRITELOG                          1565.00   0.74        1444.10     120.90    2101287     2.30
LATCH_EX                          1354.82   0.16        1092.97     261.85    8306790     1.99
BACKUPTHREAD                      1235.28   135.54      1235.01     0.27      9114        1.81
ASYNC_IO_COMPLETION               1017.53   4086.45     1017.52     0.01      249         1.49
BACKUPBUFFER                      980.15    0.16        900.92      79.22     6306888     1.44
ASYNC_NETWORK_IO                  936.73    0.30        899.38      37.36     3118273     1.37
SQLTRACE_LOCK                     487.04    5.49        465.67      21.37     88736       0.71
PREEMPTIVE_OS_AUTHENTICATIONOPS   389.93    0.17        389.93      0.00      2284736     0.57
SLEEP_BPOOL_FLUSH                 328.39    2.93        324.91      3.48      111984      0.48
