Very erratic performance in last few days.

  • jay-h

    SSCoach

    Points: 18808

    I apologize for such a general question, and there is not enough info for a detailed answer, but ANY general suggestions would be welcome.

    We're running an SQL2008R2 instance on a VM with 12 cores and 32G ram. For years a number of nightly TSQL jobs have been run, each one consisting of multiple steps, each step (generally relatively small select queries and update steps) normally runs from a few mS to maybe 2 minutes. It has been in place for years.

    This week, the jobs suddenly are not complete in the morning. What is happening is that only one or two of the many steps during the night unexpectedly take an extraordinarily long time to run (a step that might normally be a few seconds to a minute suddenly takes an hour or two).

    * It's affecting DIFFERENT steps at DIFFERENT times during the night. Does not seem related to time, or to what kind of step is running.

    * During the rest of the night all the other steps take normal times.

    * There are no errors in the SQL logs or the Windows logs.

    * Monitoring software (Idera diagnostic manager) shows normal levels of locks, wait states, disk access times, memory, CPU etc during these time periods--nothing that would appear to cause such extreme slowdown.

    * After an affected step ends, the following steps occur at normal speed like nothing was ever wrong. For example a step that normally takes 50 seconds might suddenly take an hour, but the step before, and the step after seem perfectly normal.

    * VM administrator has examined VM logs and sees no evidence of extended processor 'ready' states, memory or disk conflicts, or other VM issues. Since SQL and OS don't "see" such slowdowns, we had theorized that VM issues might be the case.

    * Rerunning the job will often complete at a normal speed.

    Not sure where to look next. Never saw an issue like this before.

    ...

    -- FORTRAN manual for Xerox Computers --

  • Sue_H

    SSC Guru

    Points: 90260

    Did you check other guests using that host to see if they have experienced any issues around the same time that job has issues?
    Did you check sysjobs to see if the job has been modified recently?
    If possible, try changing the time the job runs.
    Could you add a step at the beginning to dump wait stats to a table and then last step to capture them again so that you can see what the wait stats were during that time period.

    Sue

  • jay-h

    SSCoach

    Points: 18808

    Thank you for your input so far. 

    Yes I did check for job changes, and in each case it was a different query step that failed (even though this is regular, very consistent data). The monitor software provided logging of wait states and did not show anything odd.

    Over the weekend, however, the server support admin updated patches and rebooted it off hours. So far today we've not seen any problems.

    I'll hang on and watch.

    ...

    -- FORTRAN manual for Xerox Computers --

  • Arsh

    SSCertifiable

    Points: 6031

    Did you look at ASYNC_NETWORK_IO wait type trend at time . Check this in conjunction with inputs from network team .
  • psheperia

    SSC Rookie

    Points: 47

    It's worth looking at the index fragmentation. Bad indexes could be hampering query performance.

  • jay-h

    SSCoach

    Points: 18808

    Thanks for additional suggestions.

    We were monitoring wait states and saw no increase at the times of the slowdowns.

    Indexing can slow things, but the statements involved were different each time, and the next time they ran things were fine. The failure points were different statements and at different times at night.

    So far we've not seen this repeat since the full restart.

    ...

    -- FORTRAN manual for Xerox Computers --

  • mig28mx

    Ten Centuries

    Points: 1022

    Hi,
    For the symptoms described, can be a bad cardinality issue or parameter sniffing issue.
    Basically, you ended with a wrong execution plan that causes the unexpected delays in your jobs.
    When you restart the instance, all the execution plans are wiped out and when the jobs run again, it creates a new one.

    If there is a bad cardinallity issue, by updating the statistics will solve it.
    If there is a parameter sniffing issue, removing the plan from memory, can mitigate the issue, but more research will be required.

    Hope this helps.

    .

Viewing 7 posts - 1 through 7 (of 7 total)

You must be logged in to reply to this topic. Login to reply