Performance Problem that I can't pin down

  • Latch_Ex on what kind of latch?

    Latches (as opposed to PageLatch and PageIOLatch) are memory-related waits. Hence need to know what kind of latchwaits you're seeing.

    http://sqlcat.com/sqlcat/b/whitepapers/archive/2011/07/05/diagnosing-and-resolving-latch-contention-on-sql-server.aspx

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Unfortunately I am not tracking that data historically, but looking at sys.dm_os_latch_Stats

    ACCESS_METHODS_DATASET_PARENT Max of 8331 average of 9

    BUFFER Max of 9922 average of 0

    All others of note are "Internal use only" DBCC* and NESTING*

    I'll start gathering values over time. (at least for those two)

    PAGELATCH and PAGEIOLATCH have total wait time of 1/1000th of LATCH

  • Reset counters

    DBCC SQLPERF ('sys.dm_os_latch_stats', CLEAR);

    GO

    Setup monitor to track value ever 10 minutes... And Now off to do some reading. Just the way to start off a holiday weekend.

  • hmm.... @@version?

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Microsoft SQL Server 2008 (SP2) - 10.0.4266.0 (X64) Nov 5 2010 16:11:39 Copyright (c) 1988-2008 Microsoft Corporation Enterprise Edition (64-bit) on Windows NT 6.1 <X64> (Build 7600: )

  • I was just reading about monitoring latch waits over on Paul Randal's blog. He suggested this white paper to understand what's happening with latch waits..

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Grant Fritchey (9/2/2011)


    I was just reading about monitoring latch waits over on Paul Randal's blog. He suggested this white paper to understand what's happening with latch waits..

    There seems to be an echo in here... 😉

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • GilaMonster (9/2/2011)


    Grant Fritchey (9/2/2011)


    I was just reading about monitoring latch waits over on Paul Randal's blog. He suggested this white paper to understand what's happening with latch waits..

    There seems to be an echo in here... 😉

    HA! I didn't even click through on that one. Probably should have. If I'd noticed the head of the URL I would have known.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Update.

    SELECT * FROM sys.dm_os_wait_stats ORDER BY wait_time_ms desc

    showed that CXPACKET was in the top 10 (actually 4th highest) waits.

    So we concentrated on parallel queries.

    First thing I found out is that apparently in practice breaking up a query more than 8 degree's will reduce performance, so the current rule of thumb for MAXDOP is to set the value to the number of cores in a single socket. So in our case it was 6. In addition, some people have seen a benefit in disabling Hyper threading. We are not currently considering that, but might in the future.

    I found a great article on parallel queries http://sqlserverpedia.com/blog/sql-server-bloggers/find-query-plans-that-may-utilize-parallelism/

    This query is great

    --

    -- Find query plans that may run in parallel

    --

    SELECT DB_NAME(p.dbid) AS DBName ,

    OBJECT_NAME(p.objectid, p.dbid) AS OBJECT_NAME ,

    cp.usecounts ,

    p.query_plan ,

    q.text ,

    p.dbid ,

    p.objectid ,

    p.number ,

    p.encrypted ,

    cp.plan_handle

    FROM sys.dm_exec_cached_plans cp

    CROSS APPLY sys.dm_exec_query_plan(cp.plan_handle) p

    CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) AS q

    WHERE cp.cacheobjtype = 'Compiled Plan'

    AND p.query_plan.value('declare namespace p="http://schemas.microsoft.com/sqlserver/2004/07/showplan"; max(//p:RelOp/@Parallel)', 'float') > 0

    ORDER BY p.dbid, p.objectID

  • How long's the peak at 50%?

    Sounds to me like you just have a olap query taking a little more ressources the your other "normal" queries.

  • The event's usually last between 5-10 minutes.

    However, I should have clarified the 4th number. The other three were ignorable waits. In fact, all in the top 10 except CXPACKET where ignorable.

    LAZYWRITER_SLEEP 716375247

    DISPATCHER_QUEUE_SEMAPHORE 188715613

    XE_DISPATCHER_WAIT 182711826

    CXPACKET 179763111

    BROKER_TASK_STOP 179128691

    We found a table that was in need of indexing. Our best guess at this point, is we went into parallel paralysis, when a certain class of reports were run (which depended heavily on this table).

    We suspect the table was the primary issue. The MAXDOP change is in place to hopefully protect us from future queries crippling the system. At this point, I am not comfortable going as far as setting it to 1.

    At

  • Bob Fazio (9/12/2011)


    The event's usually last between 5-10 minutes.

    However, I should have clarified the 4th number. The other three were ignorable waits. In fact, all in the top 10 except CXPACKET where ignorable.

    LAZYWRITER_SLEEP 716375247

    DISPATCHER_QUEUE_SEMAPHORE 188715613

    XE_DISPATCHER_WAIT 182711826

    CXPACKET 179763111

    BROKER_TASK_STOP 179128691

    We found a table that was in need of indexing. Our best guess at this point, is we went into parallel paralysis, when a certain class of reports were run (which depended heavily on this table).

    We suspect the table was the primary issue. The MAXDOP change is in place to hopefully protect us from future queries crippling the system. At this point, I am not comfortable going as far as setting it to 1.

    At

    I'm really no expert with that big of a server, but good sense tells me that there probably should be a maxdop setting at server level (maybe 8, 12 or 16).

    That way a rogue query shouldn't kill the whole server.

    That being said, what does this return?

    /*

    Cost threshold for parallelism (CXPACKET) http://sqlblog.com/blogs/jonathan_kehayias/archive/2010/01/19/tuning-cost-threshold-of-parallelism-from-the-plan-cache.aspx

    Paul White: Understanding parallelism http://www.simple-talk.com/sql/learn-sql-server/understanding-and-using-parallelism-in-sql-server/

    Microsoft White Paper on waits http://download.microsoft.com/download/4/7/a/47a548b9-249e-484c-abd7-29f31282b04d/Performance_Tuning_Waits_Queues.doc

    Next query by Paul Randal http://www.sqlskills.com/BLOGS/PAUL/post/Wait-statistics-or-please-tell-me-where-it-hurts.aspx

    */

    WITH Waits AS

    (SELECT

    wait_type,

    wait_time_ms / 1000.0 AS WaitS,

    (wait_time_ms - signal_wait_time_ms) / 1000.0 AS ResourceS,

    signal_wait_time_ms / 1000.0 AS SignalS,

    waiting_tasks_count AS WaitCount,

    100.0 * wait_time_ms / SUM (wait_time_ms) OVER() AS Percentage,

    ROW_NUMBER() OVER(ORDER BY wait_time_ms DESC) AS RowNum

    FROM sys.dm_os_wait_stats

    WHERE wait_type NOT IN (

    'CLR_SEMAPHORE', 'LAZYWRITER_SLEEP', 'RESOURCE_QUEUE', 'SLEEP_TASK',

    'SLEEP_SYSTEMTASK', 'SQLTRACE_BUFFER_FLUSH', 'WAITFOR', 'LOGMGR_QUEUE',

    'CHECKPOINT_QUEUE', 'REQUEST_FOR_DEADLOCK_SEARCH', 'XE_TIMER_EVENT', 'BROKER_TO_FLUSH',

    'BROKER_TASK_STOP', 'CLR_MANUAL_EVENT', 'CLR_AUTO_EVENT', 'DISPATCHER_QUEUE_SEMAPHORE',

    'FT_IFTS_SCHEDULER_IDLE_WAIT', 'XE_DISPATCHER_WAIT', 'XE_DISPATCHER_JOIN', 'BROKER_EVENTHANDLER',

    'TRACEWRITE', 'FT_IFTSHC_MUTEX', 'SQLTRACE_INCREMENTAL_FLUSH_SLEEP', 'BROKER_RECEIVE_WAITFOR')

    )

    SELECT

    W1.wait_type AS WaitType,

    CAST (W1.WaitS AS DECIMAL(14, 2)) AS Wait_S,

    CAST (W1.WaitS * 1000 / W1.WaitCount AS DECIMAL(14, 2)) AS AvgWait_MS,

    CAST (W1.ResourceS AS DECIMAL(14, 2)) AS Resource_S,

    CAST (W1.SignalS AS DECIMAL(14, 2)) AS Signal_S,

    W1.WaitCount AS WaitCount,

    CAST (W1.Percentage AS DECIMAL(4, 2)) AS Percentage

    FROM Waits AS W1

    INNER JOIN Waits AS W2

    ON W2.RowNum <= W1.RowNum

    GROUP BY W1.RowNum, W1.wait_type, W1.WaitS, W1.ResourceS, W1.SignalS, W1.WaitCount, W1.Percentage

    HAVING SUM (W2.Percentage) - W1.Percentage < 97 -- percentage threshold

    AND W1.WaitCount > 0;

    GO

  • WaitTypeWait_SAvgWait_MSResource_SSignal_SWaitCountPercentage

    CXPACKET25955.363.8222734.463220.90678771938.08

    OLEDB17137.440.1417137.440.0012570371925.14

    PAGEIOLATCH_SH8344.7426.258298.3446.4031791312.24

    SOS_SCHEDULER_YIELD4127.010.1851.294075.72223269846.05

    BACKUPIO2285.140.822273.7311.4127762383.35

    WRITELOG1565.000.741444.10120.9021012872.30

    LATCH_EX1354.820.161092.97261.8583067901.99

    BACKUPTHREAD1235.28135.541235.010.2791141.81

    ASYNC_IO_COMPLETION1017.534086.451017.520.012491.49

    BACKUPBUFFER980.150.16900.9279.2263068881.44

    ASYNC_NETWORK_IO936.730.30899.3837.3631182731.37

    SQLTRACE_LOCK487.045.49465.6721.37887360.71

    PREEMPTIVE_OS_AUTHENTICATIONOPS389.930.17389.930.0022847360.57

    SLEEP_BPOOL_FLUSH328.392.93324.913.481119840.48

  • Tx, and now these 2?

    SELECT* FROM sys.dm_exec_query_optimizer_info

    Don't do any of the recommendations, it's just very rough guidelines for us.

    SELECT

    optz.counter

    , optz.occurrence

    , CONVERT(DECIMAL(18,2), optz.value) AS AvgValue

    , conf.name

    , conf.value

    , conf.value_in_use

    , Uptime.DaysUptime AS [Days Server UPTIME & Last Stats Reset]

    , CASE WHEN Uptime.DaysUptime < 45 THEN 'You may not have very meaningful stats because of a recent restart' ELSE NULL END AS [Stats Warning]

    , CASE WHEN optz.value < conf.value THEN 'Cost threshold for parallelism might be too low' ELSE NULL END AS [Cost Threshold Warning]

    , CASE WHEN conf.value_in_use <> conf.value THEN 'Server needs to be restarted for the setting to take effect' ELSE NULL END [Restart Warning]

    FROM

    sys.dm_exec_query_optimizer_info optz

    CROSS JOIN sys.configurations conf

    OUTER APPLY(

    SELECT

    CONVERT(DECIMAL(18 , 2) , DATEDIFF(hh , create_date , GETDATE()) / 24.0) AS DaysUptime

    FROM

    sys.databases

    WHERE

    name = 'tempdb'

    ) Uptime

    WHERE

    optz.counter = 'final cost'

    AND conf.name = 'cost threshold for parallelism'

    GO

Viewing 15 posts - 1 through 15 (of 34 total)

You must be logged in to reply to this topic. Login to reply