The Accidental DBA - Wait Stats

  • Hello All,

    Am after a little bit of advice.

    In my ever expanding career as a DBA with my current company, a role I fell in to I am trying to diagnose why a production SQL instance on occasion is reported as slowing down.

    This has been an ongoing battle with a consortium member who has their own application that interfaces with our system.

    One of the topics I am exploring is Waits, a previous DBD had gathered some stats as -

    WaitType NumWaits WaitSecs WaitPct
    CXPACKET 8862234 60996.36 87.76
    LATCH_EX 3348554 3249.08 4.68
    PREEMPTIVE_OLEDBOPS 16390 1114 1.61
    WRITELOG 249757 1004.57 1.45
    OLEDB 9269 879.97 1.27
    SOS_SCHEDULER_YIELD 599550 532.33 0.77
    LCK_M_IX 139 519.94 0.75
    LCK_M_S 16304 376.91 0.55
    PAGEIOLATCH_SH 18825 200.43 0.29
    PAGEIOLATCH_EX 18237 144.75 0.21

    And reported that -

    β€œThe large number of CXPACKET waits, LATCH_EX waits and ACCESS_METHODS_DATASET_PARENT latch waits indicate that parallel query is going wrong.
    The wrong way to fix this would be to switch parallelism off or to increase the cost threshold for parallelism further.
    Both of these approaches would prevent queries that would benefit from parallel query from being able to use it.”

    I have gathered some more up to date waits as 

    WaitType NumWaits WaitSecs WaitPct
    CXPACKET 1157053340 5625472.52 41.11
    WRITELOG 504317894 1346463.45 9.84
    OLEDB 26333852 1225484.54 8.95
    PREEMPTIVE_OLEDBOPS 28305166 1212049.53 8.86
    LCK_M_S 25643443 937286.31 6.85
    PAGEIOLATCH_SH 41028159 738076.13 5.39
    LCK_M_IX 63552 650726.77 4.76
    SOS_SCHEDULER_YIELD 679518705 512659.17 3.75
    PAGEIOLATCH_EX 24280253 327382.99 2.39
    LCK_M_U 24353974 248477.78 1.82
    ASYNC_IO_COMPLETION 42007 177588.45 1.3

    I am aware of the Wait types but could do with a little guidance.

    Thanks in advance.

    Stephen

  • Stephen

    Yes, the advice from the previous DBA is good.  What I would do is speak to users to find our excatly what they're doing when the application is running slow.  Find out which queries are the most expensive in terms of elapsed time and see if you can tune them.  That won't necessarily be the queries that take the longest to run, because you'll also want to take into account how often they are executed as well.  And read Paul Randal's pages on wait stats, especially with regard to the wait types in your list.

    Are you capturing wait stats regularly, or is what you posted the raw data from sys,dm_os_wait_stats, which is cumulative since wait stats were last cleared?

    John

  • I don't like looking at percentage waits, and imo, waits aggregated over more than an hour are useless, I can't tell what might have caused them.

    Create a job that writes wait stats to a table, run it every 30 minutes. Use Glenn Berry's latest DMV scripts for the list of waits to ignore. Track over a day or two and then look at the highest waits over periods that you're interested in.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • John Mitchell-245523 - Monday, March 6, 2017 3:25 AM

    Yes, the advice from the previous DBA is good.

    Yup, although since the latest waits have CXPacket but no Latch_EX, there's probably less of a problem with parallel queries now than there was when the initial waits were captured.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Hi Both,

    Thanks for thodse replies.

    Sadly this is a remote instance that I have no direct access to and have to do everything remotely.

    The first wait stats were taken in October 2015 and I made a site visit this February, I think they are all cumulative.

    I can't make changes like setting up a job, I had to jump through hoops to get them to put in some server side tracing so I could profile some specific RPC durations over 5 seconds.

    I understand than now that LATCH_EX is no longer present our issues with parallelism has reduced, but other things are now causing us issues which I am trying to identify.

    Shall take a look a Glenn's DMV to see if it's any later than the bits I've already pinched from the very useful armoury.

    Any further tips welcomed.

    Stephen

  • Stephen

    The only other thing I would say is that you need to explain to whoever owns the server that if they want you to manage it, you need proper access to it.  If they're stopping you from doing stuff like capturing wait stats then you're going to have no control when performance deteriorates even furher.

    John

  • I think they are all cumulative.

    They are. Cumulative since SQL started (most people don't clear wait stats).

    Can you get permission to add a job? Or ask whoever has access to run a query extracting wait stats every half an hour and send you the results?
    Because it's really hard to say anything useful about stats aggregated over days or weeks. They're probably skewed by maintenance jobs, and hence hard to interpret.

    If you want to use the waits to diagnose why the server is slow at particular times, then you need to see what waits were accumulated when the server was slow

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • GilaMonster - Monday, March 6, 2017 3:56 AM

    I think they are all cumulative.

    They are. Cumulative since SQL started (most people don't clear wait stats).

    Can you get permission to add a job? Or ask whoever has access to run a query extracting wait stats every half an hour and send you the results?
    Because it's really hard to say anything useful about stats aggregated over days or weeks. They're probably skewed by maintenance jobs, and hence hard to interpret.

    If you want to use the waits to diagnose why the server is slow at particular times, then you need to see what waits were accumulated when the server was slow

    Spot on as usual Gail.

    Differential wait stats analysis (and same for file IO stalls) are bread-and-butter tuning mechanisms. And you need shorter intervals to be helpful. Also, I very frequently run these two with just 3 or 5 minutes of delay for the interval, captured during a period that I know is experiencing performance problems.

    Have you also looked for top poorly performing queries? Or how many rows they are trying to spool back to the client? Or reasons OUTSIDE of SQL Server why the application could be slow?

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • Thanks for the replies, every day is a school day.

    I am trying to get some wait stats over a period of time starting with a 1 min delay so far for a starter.

    As mentioned, I have no access to this server me in the UK and the system in France with no outside country access so only get to hear about performance issues post incident.

    I am gathering some stats on queries this afternoon, also, the client application that hooks into our instance is Java based and uses our procs but exec sp_prepexec them?

    Thanks again

  • stephen.plant2 - Monday, March 6, 2017 7:12 AM

    Thanks for the replies, every day is a school day.

    I am trying to get some wait stats over a period of time starting with a 1 min delay so far for a starter.

    As mentioned, I have no access to this server me in the UK and the system in France with no outside country access so only get to hear about performance issues post incident.

    I am gathering some stats on queries this afternoon, also, the client application that hooks into our instance is Java based and uses our procs but exec sp_prepexec them?

    Thanks again

    All you mentioned is bad, sorry to say. I am EXCEPTIONALLY good at performance tuning SQL Server - finding and making problems go away. But often even I cannot fix things without some forensic information WHILE THE EVENT IS HAPPENING. It is just silly to be reactive and not be able to access the box you are supposed to be tuning. And Java apps are just horrible. I have never, in 20+ years of consulting on SQL Server, come across one that wasn't just awful. In all fairness I don't get called in for routine ops, and thus pretty much all applications I come across suck. πŸ˜Ž

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • Stephen,

    Can you provide queries to the application care takers and ask them to run them right after a problem is reported?  If so, you can offer them something like sp_BlitzCache from Brent Ozar or a similar query against the cache to get the most expensive queries.  If that data is gathered within a few minutes of the reported problem, then you can work on the problem.

  • lmarkum - Monday, March 6, 2017 10:30 PM

    Stephen,

    Can you provide queries to the application care takers and ask them to run them right after a problem is reported?  If so, you can offer them something like sp_BlitzCache from Brent Ozar or a similar query against the cache to get the most expensive queries.  If that data is gathered within a few minutes of the reported problem, then you can work on the problem.

    It is quite a difficult task to undertake to get our consortium partners to act on this. They definitely would not nor would we run a standard Blitz script on our prod instances.

    We are investigating a method to give our first line support based in country to access a more real time analysis method now.

    Thanks for all the replies.

  • Middle and senior IT management at your company needs to be taken out behind the wood shed and soundly thrashed. Their decisions and failure to act are a root cause of what is going on.

    In any case, good luck with it! πŸ™‚

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

Viewing 13 posts - 1 through 12 (of 12 total)

You must be logged in to reply to this topic. Login to reply