SQL job taking more time

  • Hi All,

    Need some advice on troubleshooting sql jobs.

    We have an ongoing performance issue on one of our production servers. The application team says that jobs kicked off from an application are taking a long time to complete (~24 hours). Usually, these jobs complete in 6-8 hours.

    The app team says no changes were made on the application side. The database team says they don't see any issues on the database side; i.e., if they take the job's SQL statements and run them in SSMS, they complete in the expected time.

    Given a case like this, what would be the troubleshooting approach from the application side and from the database side?
    What kind of information do we need to collect? SQL Server version? The job script? The job steps and what they do? The SQL statements within the job steps? The tables and indexes involved? Row counts? Etc.

    Also, do the SQL Server version, edition, and service pack level play any role here? And what would be a proper, methodical troubleshooting approach for this problem?

    Please suggest.

    Thanks,
    Sam

  • Hi Sam,

    I'd like to see a proper, methodical approach to investigating this as well; I'll be watching this thread 🙂. Here's what I would do.

    I would log what's currently occurring at recurring intervals during the agent job's execution using sp_whoisactive and compare the output of that to the expected ("good") progress. An example of logging the output of that to a table at intervals is here. If I can't create the sp_whoisactive procedure, I would log the output of sys.dm_exec_requests and sys.dm_os_wait_stats to tables and review those.
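    A minimal sketch of that collection loop, assuming sp_WhoIsActive is installed and a DBA utility database exists (the table name is illustrative):

    ```sql
    -- One-time setup: let sp_WhoIsActive generate the destination table's DDL.
    DECLARE @ddl varchar(max);
    EXEC dbo.sp_WhoIsActive
         @get_plans = 1,
         @return_schema = 1,
         @schema = @ddl OUTPUT;
    SET @ddl = REPLACE(@ddl, '<table_name>', 'dbo.WhoIsActiveLog');
    EXEC (@ddl);

    -- Collection loop: run this (e.g. as a second Agent job) while the
    -- slow job executes, snapshotting activity every 30 seconds.
    WHILE 1 = 1
    BEGIN
        EXEC dbo.sp_WhoIsActive
             @get_plans = 1,
             @destination_table = 'dbo.WhoIsActiveLog';
        WAITFOR DELAY '00:00:30';
    END;
    ```

    Comparing the logged rows from a slow run against a normal run shows where the job spends its extra time and what it was waiting on.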

    This would tell me if the job is slower than expected at certain parts of the job, and if so, focus on working out what it's doing there - especially if there's blocking or waits going on that might be the cause. 

    I'd also try to find whether there are any session setting differences between the SQL Agent job and the sessions being used by the database team, however I haven't yet seen any of these make the Agent job slower.
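    One way to check those session settings while both sessions are connected is to compare them in sys.dm_exec_sessions; ARITHABORT in particular often differs between SSMS and other clients and can change plan choice:

    ```sql
    -- Sketch: compare SET options between the Agent job's session and the
    -- DBA's SSMS session. The program_name filters are illustrative.
    SELECT session_id,
           program_name,
           ansi_nulls,
           quoted_identifier,
           arithabort,
           transaction_isolation_level
    FROM sys.dm_exec_sessions
    WHERE program_name LIKE 'SQLAgent%'             -- Agent job step session
       OR program_name LIKE '%Management Studio%';  -- SSMS session
    ```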

    Best,
    Andrew

    Thanks, Andrew. We don't have wait data for a good run, but we can definitely collect waits using sp_whoisactive and review the output.

    Have you considered a trace for locks? A theory to test: when the DBAs run it (if they're doing so intraday), whatever is locking the process during the routine scheduled run isn't happening intraday.

  • You need to capture and analyse all of the differences between a good vs bad execution that you can to determine the root cause. You need to capture at least the following information:

    • Wait Statistics
    • Session Statistics (for the execution and for other sessions on the server).
    • Performance Counters
    Ideally, you'll also be capturing additional information such as:

    • Cache Usage details
    • Execution Plans
    • IO statistics
    • Index usage statistics

    A lot of this information can be captured using tools like sp_whoisactive and Brent Ozar's Blitz procedures.

    This article has a lot of great information on troubleshooting performance and getting the basics of your troubleshooting methodology correct (https://www.red-gate.com/simple-talk/sql/performance/a-performance-troubleshooting-methodology-for-sql-server/).

    You should start there and see what the collected information reveals about the differences between your executions. Collect as much data as possible as this will help pinpoint the differences and potential solutions.
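    For the wait statistics in particular, a simple before/after snapshot diff of sys.dm_os_wait_stats around a bad run shows which wait types grew during the execution (the table name here is illustrative):

    ```sql
    -- Snapshot the cumulative wait stats before the job starts.
    SELECT wait_type, waiting_tasks_count, wait_time_ms, signal_wait_time_ms,
           SYSDATETIME() AS captured_at
    INTO dbo.WaitStatsBefore
    FROM sys.dm_os_wait_stats;

    -- ...after the job finishes, diff the snapshot against current values:
    SELECT ws.wait_type,
           ws.wait_time_ms        - b.wait_time_ms        AS wait_time_ms_delta,
           ws.waiting_tasks_count - b.waiting_tasks_count AS tasks_delta
    FROM sys.dm_os_wait_stats AS ws
    JOIN dbo.WaitStatsBefore  AS b ON b.wait_type = ws.wait_type
    WHERE ws.wait_time_ms - b.wait_time_ms > 0
    ORDER BY wait_time_ms_delta DESC;
    ```

    Note these counters are server-wide, so other activity during the window also contributes; the comparison between a good run's deltas and a bad run's deltas is what matters.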

  • HandyD - Sunday, March 10, 2019 10:56 PM


    Thanks a lot. Good pointers.

  • Personally, I'd go back to the old ways of starting to troubleshoot.  First, you have a job that normally takes 6-8 hours to complete and, for me, that would be an issue to begin with.  The job is capable of writing output to a file.  Add some code to the procs in the job to measure time and use RAISERROR with 0,0 parameters to act as print statements to print the durations of each section of code to find out what the long runners are.
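    A sketch of that instrumentation (the step names are illustrative): severity-0 RAISERROR acts as a PRINT, and WITH NOWAIT flushes the message immediately instead of buffering it, so the durations show up in the job's output file as the proc runs.

    ```sql
    DECLARE @start datetime2 = SYSDATETIME(),
            @msg   varchar(200);

    RAISERROR('Step 1: loading staging table...', 0, 1) WITH NOWAIT;
    -- <section 1 of the proc's real work goes here>

    SET @msg = CONCAT('Step 1 done in ',
                      DATEDIFF(SECOND, @start, SYSDATETIME()), ' seconds');
    RAISERROR(@msg, 0, 1) WITH NOWAIT;
    SET @start = SYSDATETIME();

    RAISERROR('Step 2: building aggregates...', 0, 1) WITH NOWAIT;
    -- <section 2 goes here>
    ```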

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden - Monday, March 11, 2019 6:39 AM

    Personally, I'd go back to the old ways of starting to troubleshoot.  First, you have a job that normally takes 6-8 hours to complete and, for me, that would be an issue to begin with.

    It's hard to say though, without knowing what the job does or the data set it is dealing with, that 6-8 hours is a problem. If that is an acceptable runtime for this job, then that's the baseline to aim for when troubleshooting inconsistent performance. Improving the job performance and getting it back to baseline are different challenges and, while there might be some overlap, there are different strategies involved.

    If we want to improve performance below 6-8 hours, as you said, we definitely need to approach the investigation by identifying the worst-performing areas, getting to the root of those long runtimes, and tuning to improve. In this case, however, we need to get consistent runtimes between two separate executions, which means identifying the differences between them and getting the known bad execution to the same state as the known good execution.

    Think of it as a single query that has a wide variation in runtime over the course of the day - sure I can potentially tune the query to get all executions to run quicker, but I might still end up with this wide variation in execution times. Although bad executions are now quicker, they're still slower than good executions, and that is something that also needs to be resolved. Identifying the cause and solution requires a different methodology than actually tuning the query itself.

    Check what else is running when the job runs, and check for job schedule conflicts; there might be other jobs or processes running at the same time that affect the job's duration.
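    One way to check for overlapping Agent jobs is the job history in msdb (a sketch; agent_datetime is an undocumented but widely used helper function, and run_duration is stored as HHMMSS):

    ```sql
    -- List job executions around the window of interest, with start time
    -- and duration, to spot jobs that overlap the slow job's run.
    SELECT j.name,
           msdb.dbo.agent_datetime(h.run_date, h.run_time) AS start_time,
           (h.run_duration / 10000) * 3600
         + (h.run_duration / 100 % 100) * 60
         + (h.run_duration % 100)                          AS duration_seconds
    FROM msdb.dbo.sysjobhistory AS h
    JOIN msdb.dbo.sysjobs       AS j ON j.job_id = h.job_id
    WHERE h.step_id = 0                  -- job outcome rows only
      AND h.run_date >= 20190310         -- adjust to the window of interest
    ORDER BY start_time;
    ```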

  • HandyD - Monday, March 11, 2019 5:55 PM


    I agree that, without knowing what the job does, that 6-8 hour run times may not be a problem.  I've also found that them not actually being a problem is the exception rather than the rule. 😀  Of course, a lot of people will sometimes say such a job isn't a problem because it runs overnight in a non-24 hour shop... that's also usually when all the heavy lifting is done in such shops and you don't need many 6-8 hour jobs to screw that up.

    In other words, a 6-8 hour job MUST be examined and, if there's someone that's good at such things, there's usually a way to knock it down to an hour or less.  My experience has been that getting such things down to 10 or 15 minutes isn't that difficult.

    --Jeff Moden


