Strange Job Slowdown 30 secs to 30 minutes

  • I have a job that runs 22 times a day.

    Four of those times, it is kicked off by another job.

    We recently moved to a SAN (EMC SAN is a VNX5300). That seemed great, all jobs sped up.

    Then we virtualized the machine on VMware 5.1.

    Now all jobs are fine except the one that runs 22 times a day. That one runs fine too except when it is kicked off by another job at 3:30 AM. It runs fine the other three times it is kicked off by the other job, it only slows down at 3:30 AM.

    Normally, the job takes between 30 seconds and 5 minutes depending on the volume of data.

    Before virtualization, the run at 3:30 AM was taking between 30 seconds and 1 minute.

    Now it is taking 28 minutes.

    Nothing else is running on the server at this time. Everything else has finished around 3:00.

    Backups (EMC, third party solution) start at 12:30 AM, finish at 3:28 AM, and don't slow down any of the other jobs that run between 12:30 and 3:30.

    The step in the job that is taking longer is an SSIS package.

    The job moves data from one database to another on the same server, then runs a query. The data from the query is fed to an SSIS loop which creates files.

    All the files that are created have timestamps within a minute of each other (e.g., all are 3:55).

    This leads me to believe the slowdown is in one of the first two queries.

    The first, larger query, which inserts data into another table, typically handles 500-600 records at this hour. The indexes are all under 5% fragmentation.

    I plan on enabling logging on the SSIS package tomorrow (today is a business-critical day, no changes allowed).

    Any ideas?

    --------------------------------------
    When you encounter a problem, if the solution isn't readily evident go back to the start and check your assumptions.
    --------------------------------------
    It’s unpleasantly like being drunk.
    What’s so unpleasant about being drunk?
    You ask a glass of water. -- Douglas Adams
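
The timing reasoning in the opening post can be checked with simple arithmetic. This is only a minimal sketch: it assumes the 3:30 run starts at exactly 3:30:00 and lasts the reported 28 minutes, and the calendar date is arbitrary.

```python
from datetime import datetime, timedelta

# Figures from the post; the exact start second and the date are assumptions.
job_start = datetime(2013, 8, 8, 3, 30)
total_runtime = timedelta(minutes=28)
file_stamp = datetime(2013, 8, 8, 3, 55)  # every output file shows 3:55

job_end = job_start + total_runtime
before_files = file_stamp - job_start      # time spent before any file exists
after_first_file = job_end - file_stamp    # time left once files start appearing

print(f"Elapsed before any file is written: {before_files}")
print(f"Remaining after the first file:     {after_first_file}")
```

Under these assumptions roughly 25 of the 28 minutes pass before the first file appears, which is consistent with the slowdown sitting in the insert or the query rather than in the file loop.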

  • Have you checked with the VMware/Server Team? Maybe they're running snapshot backups of the VM, or doing some other maintenance on the host which is causing the VM layer to slow down.

    It's very difficult to monitor or see slowdowns that are in the virtualisation layer from the guest, so definitely work alongside the server team to diagnose this.

  • Another idea, to at least try to narrow down the scope, can you re-schedule the 3:30am job? Maybe move it to 4:00am so it's going after the backup runs?

    If it stops happening, then you start working with the SAN and VMware teams to see what is happening around that time that might be causing the slowdown...

    If it keeps happening, then you keep digging...

  • HowardW (8/8/2013)


    Have you checked with the VMware/Server Team? Maybe they're running snapshot backups of the VM, or doing some other maintenance on the host which is causing the VM layer to slow down.

    It's very difficult to monitor or see slowdowns that are in the virtualisation layer from the guest, so definitely work alongside the server team to diagnose this.

    We have been. They tell us nothing is happening on their side.

    They've provided graphs of RAM, IO and CPU and there's no spike at all for this time in any of them. In fact, it lies in something of a trough. I don't think it breaks 20% usage in any of the graphs, while jobs before and after hit 40% to 60%.

  • jasona.work (8/8/2013)


    Another idea, to at least try to narrow down the scope, can you re-schedule the 3:30am job? Maybe move it to 4:00am so it's going after the backup runs?

    If it stops happening, then you start working with the SAN and VMware teams to see what is happening around that time that might be causing the slowdown...

    If it keeps happening, then you keep digging...

    We're considering moving it to 3:00. We'll see if that helps.

  • How many other VMs are on this box? If there are others, are they doing something around that time frame that the VM team is unaware of?

    Brandie Tarvin, MCITP Database Administrator
    LiveJournal Blog: http://brandietarvin.livejournal.com/
    On LinkedIn!, Google+, and Twitter.
    Freelance Writer: Shadowrun
    Latchkeys: Nevermore, Latchkeys: The Bootleg War, and Latchkeys: Roscoes in the Night are now available on Nook and Kindle.

  • Brandie Tarvin (8/8/2013)


    How many other VMs are on this box? If there are others, are they doing something around that time frame that the VM team is unaware of?

    According to the VM team, this is the only one.

    We initially had problems with slowdown on all jobs because they didn't maintain the RAM from pre-VM, but they fixed that quickly and everything except this one instance of this one job sped back up.

  • Hrm. Have you asked them to double-check the page file?

    Even though this job works well the other times, consider what data it might be picking up at the 3:30 run that isn't available at other run times. Data generated by other jobs running right before that window. So, the job at 3:30 might actually require a bigger page file than the other run times and if they didn't maintain this at its original or bigger size (like the RAM issue you mentioned) then this could also be an issue.

  • Brandie Tarvin (8/8/2013)


    Hrm. Have you asked them to double-check the page file?

    Even though this job works well the other times, consider what data it might be picking up at the 3:30 run that isn't available at other run times. Data generated by other jobs running right before that window. So, the job at 3:30 might actually require a bigger page file than the other run times and if they didn't maintain this at its original or bigger size (like the RAM issue you mentioned) then this could also be an issue.

    The data this job has at this time is much smaller than most of the times it runs. However, just before this, starting at 2 and ending between 2:45 and 3:10 is the most intensive job of the day. Could it be leaving something behind or hanging on to resources? The 4:00 job (also more intensive) runs just fine.

    And I'm not sure what you mean when you say "the page file". I'd like to be able to explain if they ask.

  • Stefan Krzywicki (8/8/2013)


    Brandie Tarvin (8/8/2013)


    Hrm. Have you asked them to double-check the page file?

    Even though this job works well the other times, consider what data it might be picking up at the 3:30 run that isn't available at other run times. Data generated by other jobs running right before that window. So, the job at 3:30 might actually require a bigger page file than the other run times and if they didn't maintain this at its original or bigger size (like the RAM issue you mentioned) then this could also be an issue.

    The data this job has at this time is much smaller than most of the times it runs. However, just before this, starting at 2 and ending between 2:45 and 3:10 is the most intensive job of the day. Could it be leaving something behind or hanging on to resources? The 4:00 job (also more intensive) runs just fine.

    And I'm not sure what you mean when you say "the page file". I'd like to be able to explain if they ask.

    The official name is paging file or swap file.

    Every Windows machine has one unless it's been deliberately disabled. It's a hidden file on a hard drive, usually set at about 1 1/2 times the machine's RAM, that holds parts of programs and data files that don't fit into memory. The RAM plus this file is what makes up the total virtual memory of a machine.

    Go to your PC or laptop, right click on My Computer, pick Properties and then Advanced. Click the Settings button in the Performance box. The Advanced tab of that has a Virtual Memory box down at the bottom. This is where the paging file properties are set.

    While advice has differed in recent years about the size of the paging file, I have always found that the older standard of 1 1/2 times physical RAM (rather than smaller) is the best setting for OS performance. In fact, when I first started my current job, I was able to trace a SQL Server problem back to the fact that our server guys had set the paging file too small. When I resized it to the size recommended above, it resolved tons of performance issues.

    RE: that other job? Yes, I think it's probably hanging on to resources. So moving this job back to 3:00 is probably not going to resolve the issue. You might want to switch the time to 3:45 instead (if you can) as a better test.
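
The sizing rule of thumb in the post above (paging file at about 1 1/2 times RAM, with RAM plus paging file making up total virtual memory) works out as follows. This is a rough sketch of that rule only; the 64 GB RAM figure is a made-up example, not a value from this thread, and the function names are hypothetical.

```python
def paging_file_mb(ram_mb: int, factor: float = 1.5) -> int:
    """Paging file size under the '1 1/2 times RAM' rule of thumb."""
    return int(ram_mb * factor)


def total_virtual_memory_mb(ram_mb: int, page_mb: int) -> int:
    """Total virtual memory as described in the post: RAM plus paging file."""
    return ram_mb + page_mb


ram_mb = 64 * 1024  # hypothetical server with 64 GB of RAM
page_mb = paging_file_mb(ram_mb)
print(f"Paging file:          {page_mb} MB")
print(f"Total virtual memory: {total_virtual_memory_mb(ram_mb, page_mb)} MB")
```

If a machine has several paging files, the combined size of all of them is the figure to compare against this estimate.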

  • Brandie Tarvin (8/8/2013)


    Stefan Krzywicki (8/8/2013)


    Brandie Tarvin (8/8/2013)


    Hrm. Have you asked them to double-check the page file?

    Even though this job works well the other times, consider what data it might be picking up at the 3:30 run that isn't available at other run times. Data generated by other jobs running right before that window. So, the job at 3:30 might actually require a bigger page file than the other run times and if they didn't maintain this at its original or bigger size (like the RAM issue you mentioned) then this could also be an issue.

    The data this job has at this time is much smaller than most of the times it runs. However, just before this, starting at 2 and ending between 2:45 and 3:10 is the most intensive job of the day. Could it be leaving something behind or hanging on to resources? The 4:00 job (also more intensive) runs just fine.

    And I'm not sure what you mean when you say "the page file". I'd like to be able to explain if they ask.

    The official name is paging file or swap file.

    Every Windows machine has one unless it's been deliberately disabled. It's a hidden file on a hard drive, usually set at about 1 1/2 times the machine's RAM, that holds parts of programs and data files that don't fit into memory. The RAM plus this file is what makes up the total virtual memory of a machine.

    Go to your PC or laptop, right click on My Computer, pick Properties and then Advanced. Click the Settings button in the Performance box. The Advanced tab of that has a Virtual Memory box down at the bottom. This is where the paging file properties are set.

    While advice has differed in recent years about the size of the paging file, I have always found that the older standard of 1 1/2 times physical RAM (rather than smaller) is the best setting for OS performance. In fact, when I first started my current job, I was able to trace a SQL Server problem back to the fact that our server guys had set the paging file too small. When I resized it to the size recommended above, it resolved tons of performance issues.

    RE: that other job? Yes, I think it's probably hanging on to resources. So setting this job back to 3:00 is probably not going to resolve the issue. You might want to switch the time to 4:00 instead (if you can) as a better test.

    I mentioned "Page file" to him and he said he knew what that was. I'll send him the rest of this though! Thanks!

    Moving it to 4 is rough since we have another job at 4 and it needs to be done by 6 and sometimes takes over an hour. I'll see though.

  • You may want to schedule a trace to see exactly where the pause occurs.

    Virtual host with 1 virtual machine? Are you sure - this does not seem right.

    The page file - is this 2008 or R2?

    And is it purely SQL, or does SSAS, RS, etc. run on the machine?

    If for some reason these fight for memory, everything can page out and look fine, but things can almost become comatose.

    If the service can be restarted before this run, and it works fine, it might be a clue.
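
Once a trace or the SSIS logging is in place, finding where the pause occurs comes down to looking for the largest gap between consecutive logged events. The sketch below illustrates that idea; the event labels and timestamps are invented for illustration and are not taken from the actual job, and `largest_gap` is a hypothetical helper.

```python
from datetime import datetime, timedelta


def largest_gap(events):
    """Return (from_label, to_label, gap) for the longest pause.

    events: list of (label, datetime) tuples sorted by time.
    """
    pairs = zip(events, events[1:])
    (a, ta), (b, tb) = max(pairs, key=lambda p: p[1][1] - p[0][1])
    return a, b, tb - ta


# Hypothetical log entries; real ones would come from the trace or SSIS log.
events = [
    ("job start", datetime(2013, 8, 9, 3, 30, 0)),
    ("insert done", datetime(2013, 8, 9, 3, 30, 20)),
    ("query done", datetime(2013, 8, 9, 3, 54, 50)),
    ("first file", datetime(2013, 8, 9, 3, 55, 0)),
    ("last file", datetime(2013, 8, 9, 3, 55, 40)),
]

start, end, gap = largest_gap(events)
print(f"Longest pause: {gap} between '{start}' and '{end}'")
```

With real log data, the pair of events returned points at the step worth examining first.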

  • Yeah, I caught that "other job at 4" comment after I posted.

    Try 3:45 instead.

  • Greg Edwards-268690 (8/8/2013)


    You may want to schedule a trace to see exactly where the pause occurs.

    Virtual host with 1 virtual machine? Are you sure - this does not seem right.

    The page file - is this 2008 or R2?

    And is it purely SQL, or does SSAS, RS, etc. run on the machine?

    If for some reason these fight for memory, everything can page out and look fine, but things can almost become comatose.

    If the service can be restarted before this run, and it works fine, it might be a clue.

    Virtual host with 1 virtual machine. It is our warehouse machine and it needs the whole host. There are other VMs, but they're on other boxes. This setup was discussed at a local SQL Server meeting not too long ago; it can be a good idea, especially for restores, moves, etc.

    SQL Server 2008 R2

    SSIS, SSAS & SSRS are all on the machine, I believe.

  • Brandie Tarvin (8/8/2013)


    Yeah, I caught that "other job at 4" comment after I posted.

    Try 3:45 instead.

    If the page file examination doesn't reveal anything, we will.

    I talked to the Server/VM guy. He said it isn't just one page file any more, so he's going to go through them all and add up their space.
