15 seconds pending execution

  • Been a little while since I posted anything, but ran into some interesting problems after installing the latest windows updates on my SSIS server.  Installed both windows and SQL updates so it is fully patched (except optional updates).

    Ran into 2 problems, one which I resolved.  I did test this on my test system before updating, but apparently not well enough.
    The resolved issue was that if I executed an SSIS package from an SSIS catalog either directly or through a SQL job, it was running a lot slower than before updates.  The problem was I had set up environments for the packages and was using those and for some reason if I used the "ServerName\InstanceName" as the variable, the package would take exceptionally long to complete, but if I replaced that with the SQL Alias, it completed faster.  Still not as fast as prior to the updates, but faster.  One example:
    prior to any updates - 5 seconds for the job to complete
    after updates - 3 minutes 20 seconds to complete
    after changing environment variable to use the SQL alias - 16 seconds to complete

    Problem 2, which is what I am on here today for, is that now every SSIS package I start, either through a job or directly, takes exactly 15 seconds to get through the "pending execution" phase.  After that completes, the package executes in the normal time.  

    It is on SQL server Standard 2012 version 11.0.7001 and this performance issue seems to be unrelated to what package we run.  A simple data-copy one (no transforms, just moving data from database A to database B) has a 15 second pending execution phase then runs.  Or a more complex one with a lot of transforms takes 15 seconds to do the pending execution.
    I have looked around online to the best of my ability and have yet to come across a solution.  I did find that adding "127.0.0.1 crl.microsoft.com" to the hosts file may help, but I didn't notice any improvement after adding this.  The Windows server is 2012 R2 Standard version 6.3 build 9600.

    Has anyone seen this behaviuor before?  When double checking the test and live servers I have, they both exhibit this behaviour.

    The above is all just my opinion on what you should do. 
    As with all advice you find on a random internet forum - you shouldn't blindly follow it.  Always test on a test server to see if there is negative side effects before making changes to live!
    I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.

  • I'd start with having your network team analyze the traffic and see exactly what's going on.   While awaiting that response, I'd search online for any issues with the various KB articles that are part of all your updates.

    Steve (aka sgmunson) 🙂 🙂 🙂
    Rent Servers for Income (picks and shovels strategy)

  • sgmunson - Thursday, March 8, 2018 9:49 AM

    I'd start with having your network team analyze the traffic and see exactly what's going on.   While awaiting that response, I'd search online for any issues with the various KB articles that are part of all your updates.

    Thanks.  I'll shoot some stuff off to my IT guys and see what they can see.
    I did find some KB's but they were all about the certificate revocation list which takes roughly 15 seconds to time out so that was a big red flag, but they suggested blocking it at the firewall (which IT didn't want to do) or blocking it in the hosts file (which is what I ended up doing) but it made no difference.
    Normally I wouldn't be too concerned about a 15 second delay, but we have some SQL jobs that have 30+ steps in them that are all smaller SSIS packages for moving and transforming data.  That adds a lot of extra time onto a data load.

    Thanks for the tip though; I'll poke the IT guys to check out the network traffic and see what is happening.  If that doesn't help, I may need to look at increasing the complexity of our SSIS packages, but I don't like that solution.

    The above is all just my opinion on what you should do. 
    As with all advice you find on a random internet forum - you shouldn't blindly follow it.  Always test on a test server to see if there is negative side effects before making changes to live!
    I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.

  • bmg002 - Thursday, March 8, 2018 10:21 AM

    sgmunson - Thursday, March 8, 2018 9:49 AM

    I'd start with having your network team analyze the traffic and see exactly what's going on.   While awaiting that response, I'd search online for any issues with the various KB articles that are part of all your updates.

    Thanks.  I'll shoot some stuff off to my IT guys and see what they can see.
    I did find some KB's but they were all about the certificate revocation list which takes roughly 15 seconds to time out so that was a big red flag, but they suggested blocking it at the firewall (which IT didn't want to do) or blocking it in the hosts file (which is what I ended up doing) but it made no difference.
    Normally I wouldn't be too concerned about a 15 second delay, but we have some SQL jobs that have 30+ steps in them that are all smaller SSIS packages for moving and transforming data.  That adds a lot of extra time onto a data load.

    Thanks for the tip though; I'll poke the IT guys to check out the network traffic and see what is happening.  If that doesn't help, I may need to look at increasing the complexity of our SSIS packages, but I don't like that solution.

    Certificate revocation list?  Sounds like something that you'd have to prevent from occurring at all, because I suspect that just blocking it doesn't do anything except keeping it from succeeding at all, and thus it times out.  Not sure if that's something you can configure, or it's some security feature that is effectively blocked by some existing firewall rule on protocol or odd port number or some such.

    Steve (aka sgmunson) 🙂 🙂 🙂
    Rent Servers for Income (picks and shovels strategy)

  • sgmunson - Tuesday, March 13, 2018 1:24 PM

    bmg002 - Thursday, March 8, 2018 10:21 AM

    sgmunson - Thursday, March 8, 2018 9:49 AM

    I'd start with having your network team analyze the traffic and see exactly what's going on.   While awaiting that response, I'd search online for any issues with the various KB articles that are part of all your updates.

    Thanks.  I'll shoot some stuff off to my IT guys and see what they can see.
    I did find some KB's but they were all about the certificate revocation list which takes roughly 15 seconds to time out so that was a big red flag, but they suggested blocking it at the firewall (which IT didn't want to do) or blocking it in the hosts file (which is what I ended up doing) but it made no difference.
    Normally I wouldn't be too concerned about a 15 second delay, but we have some SQL jobs that have 30+ steps in them that are all smaller SSIS packages for moving and transforming data.  That adds a lot of extra time onto a data load.

    Thanks for the tip though; I'll poke the IT guys to check out the network traffic and see what is happening.  If that doesn't help, I may need to look at increasing the complexity of our SSIS packages, but I don't like that solution.

    Certificate revocation list?  Sounds like something that you'd have to prevent from occurring at all, because I suspect that just blocking it doesn't do anything except keeping it from succeeding at all, and thus it times out.  Not sure if that's something you can configure, or it's some security feature that is effectively blocked by some existing firewall rule on protocol or odd port number or some such.

    yeah, from what I've read online on various places (including technet), they indicate adding 127.0.0.1 crl.microsoft.com to your hosts file should result in a quick-fail response to the site and give much better performance, but it seemed to make no difference on my server.  I checked the webpage in IE and it comes back rather promptly (with a 400 error mind you, but it is a fast response), so I am thinking my issues are unrelated to the CRL.  Blocking it in the hosts file seemed to make no difference.
    The other interesting thing I noticed is that a 1 step SQL job (the step just calls a simple SSIS package) runs slowly.  The job step completes in 20 seconds, the job itself completes in 33 seconds.  Prior to the updates, the job step would complete in 5 seconds tops.

    when I look at the sysjobactivity table, prior to the update, I see data like:
    Run_Reuqested_Date - 2018-03-02 20:05:00.000
    Start_execution_date - 2018-03-02 20:05:00.000
    Last_executed_step_date - 2018-03-02 20:05:00.000
    Stop_execution_date - 2018-03-02 20:05:05.000
    After the update, it looks more like:
    Run_requested_date - 2018-03-13 15:05:00.000
    Start_execution_date - 2018-03-13 15:05:00.000
    last_executed_step_date - 2018-03-13 15:05:04.000
    stop_execution_date - 2018-03-13 15:05:32.000

    If I go into the overview of that same SSIS package, I an see the duration of the individual package steps (using today's date) it shows the 3 steps (the whole package, the "clear existing data" step in the package and the "move data" step in the package) is only taking a little over 1 second to complete.  BUT the duration of the whole package is 16.067 seconds.  The duration of the job though is 32 seconds for the above example.
    I can't find anything useful in either the SQL logs or the windows logs either.  I did try doing some stuff with netstat, but it didn't tell me anything interesting.  I tried turning off the firewall on the server (short term) but it made no difference either.

    This one is confusing me as we didn't change the job at all.  Just windows updates (including SQL updates).  It did result in a .NET update... I wonder if that is being problematic...

    As for blocking the CRL, best I could find, it is not a configurable thing as it is baked into .NET.

    Forgot to mention - my IT department told me they see nothing bad happening network wise.  I imagine this is going to be something simple that I am just not thinking to check (like some service got switched from automatic to manual with the update), but I am just running out of ideas on what to look at or for.

    Last edit for this post - I tried running the 64-bit dtexec and it took 17.969 seconds to run the package.  The 32-bit dtexec took 20.453 seconds.  running it in visual studio takes 5 seconds but that is not a realistic scenario.

    The above is all just my opinion on what you should do. 
    As with all advice you find on a random internet forum - you shouldn't blindly follow it.  Always test on a test server to see if there is negative side effects before making changes to live!
    I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.

  • Hi,

    Did you ever get an answer to this ? I have a similar problem with a number of packages on a number of different servers (various versions of SQL 2012 - 2017) and I can see that the package is trying to connect to the original server that one of my connection strings was pointing to. I have overwritten it in the SSIS configuration on the new, deployed to server and also in the SQL job that calls it. Both the overall SSIS job and the connection in SSIS have DelayValidation set to False. In some cases I have up to 2 minutes  of 'Pending' state before the job actually commences and I can see in the messages that there is a 15s connection timeout for every time the SSIS package tries to connect using the connection in question (ie: Every time a stored proc or piece of SQL is run), hence if it happens 8 times I have a 2 minute waiting period. I can't find anything online about this and I am completely baffled.

    Regards

  • My solution was a "dirty" one and not one that I'd recommend, but it was required to get the performance back that I needed: modify the hosts file.

    Basically, if I am adding the IP and hostname of the machine to the hosts file.  For example, if the machine hosting my SQL instance was called SQLLIVE1 and had an IP if 192.168.0.12, I would add the following line to the end of the file C:\Windows\System32\Drivers\etc\HOSTS:

    192.168.0.12 SQLLIVE1

    After making this change, I would test to see if there was a performance change.  The performance change for everything that connected to SQLLIVE1 was quite apparent (15 seconds dropped to 0), so I repeated the process with all the SQL Server host machines.  You WILL need:

    1. to know the hostname of the machine(s) hosting the SQL instance
    2. to know the IP address of the machine(s) hosting the SQL instance
    3. to have admin access to the machine(s) hosting SSIS

    Is this a good solution?  No.  Did it solve the problem?  No.

    Now this is not a good solution because in the event you have failover setup or you need to spin up a new VM/Server for your SQL instances, you have to remember to update the HOSTS file.  If you have your SSIS set up for failover, you need to make sure the HOSTS file matches on all the machines.

    Now why do I say this didn't solve the problem you may ask.  The reason I say that is what was the problem?  I know the SYMPTOM is that there is a delay before an SSIS package would actually start.  But the actual problem, I never figured that out.  My guess is something in the network stack is causing a delayed lookup of the DNS name to IP, but I was not able to get good help from the internal network where I work to figure out where the problem was.  I found a band-aid fix, ran it past the IT team who told me "that won't make a difference" and then when I showed them it fixed the symptoms, they told me to make that live and we could investigate it in the future if/when it acted up again.

    So my advice is to use the above "fix" with caution.  It may fix the slow start of an SSIS package, but it is really just hiding an underlying issue.  We implemented this on October 16th, 2018 after trying a LOT of other things and hoping a future patch would fix things and it has been working well for us.  I am still concerned that this fix will become problematic in the future, but thankfully our failover software fails over the IP and hostname when failing over to the secondary or tertiary machines.  So as long as we keep the hostname and IP (which I can't see us changing), we shouldn't have any additional performance issues.

    After implementing the "fix" we changed some of the environment variables back to use the server\instance format instead of the SQL alias.  It seems to make no performance difference if we use the alias or server\instance once we added the lines to the hosts file.

    The above is all just my opinion on what you should do. 
    As with all advice you find on a random internet forum - you shouldn't blindly follow it.  Always test on a test server to see if there is negative side effects before making changes to live!
    I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.

Viewing 7 posts - 1 through 6 (of 6 total)

You must be logged in to reply to this topic. Login to reply