SSIS Package started failing for no obvious reason

  • I've been troubleshooting this for hours now and have run out of ideas.

    I have a package which has been executed thousands (literally) of times since it was last modified, back in 2019.

    It last executed successfully on 16 August. Since then, it has failed 100% of the time.

    This failure happens in both QA and Production servers, which are separate machines (in Azure).

    No Windows updates have been installed since 16 August. The only change we can find was an update to the A/V software, so we tried deinstalling the A/V software & testing, which had no effect on the failure.

    This is what the All Executions report looks like:

    2021-08-20_13-19-57

    There is nothing there except 'Unexpected Termination'. It looks to me as though the package invocation fails catastrophically.

    I checked Event Viewer on one of the servers. Nothing useful there, beyond 'package failed'.

    The package continues to run exactly as it should in my local development environment.

    If anyone can suggest any additional troubleshooting tactics, I'd appreciate it.

     

     

     

    • This topic was modified 2 years, 8 months ago by  Phil Parkin. Reason: Typo

    If you haven't even tried to resolve your issue, please don't expect the hard-working volunteers here to waste their time providing links to answers which you could easily have found yourself.

  • What I've done a few times in the past with this is to completely remove the projects from SSISDB and re-deploy and also setup a new job. The weird part is it failing in 2 environments.

    I'm sure there's an explanation but with no info to work with that it's just a guessing game.

    I don't know if that error shows up when there is an issue with the security on the folder in which the package is stored in the catalog, that rings a bell. Or if the proxy account has an issue.

    Generally with things like this I tend to look at security as the culprit.

    Also our stuff is all on prem.

  • Like TangoVictor, all of my stuff is on-prem as well, so my tips and advice may be useless.

    Things I'd check (if possible) are logs on the server.  I'm thinking that it would be good to check the windows logs to see if anything strange is happening and might not hurt to check if the account the job is running as is still active.  I've had the service get locked before and it causes all sorts of headaches.  Also, if your AD admin changed the password, new instances of things (like a new package starting) may fail even though the service is still running. But if this is only happening with 1 SSIS package, then it is likely not account related.

    One other case where I saw that happen was when SSIS service had stopped.  But it was causing all of the packages to fail with the "Unexpected Termination" error.  If it is happening on only 1 job, that probably isn't the problem.

    If possible (again, not sure what Azure lets you look into) - do you have enough free memory to start the job on that server and enough free space in the database and disk for logging everything?

     

    The above is all just my opinion on what you should do. 
    As with all advice you find on a random internet forum - you shouldn't blindly follow it.  Always test on a test server to see if there is negative side effects before making changes to live!
    I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.

  • I have managed to resolve this issue (it was keeping me up at night!)

    I can hardly believe the solution and do not understand why it works, but it does. It required a change to the design of the package.

    The original (problematic) design of the package looks like this:

    2021-08-22_12-25-02

    In summary: check the log tables in databases A and B and if one or more errors are found, move a file to an error folder and then fail the package. None of this was being executed, because the failure was happening on invocation of the package.

    The modified (and working) version of the package looks like this:

    2021-08-22_12-28-09

    It differs from the original in a couple of ways:

    • The two SQL database checks run in series rather than parallel
    • As a consequence, the precedence constraint logic (the arrows) is simpler:

      Old version: IF Errors1 > 0 or Errors2 > 0 then proceed, otherwise exit package

      New version: IF Errors1 + Errors2 > 0 then proceed, otherwise exit package

    As I have already mentioned, the package has been running unchanged for years without issue, and both versions continue to run as expected in my local development environment.

    When I was troubleshooting, I found that if I disabled either one of the SQL Check tasks, the package ran OK. This is what led me to the working version – the problem appears related to having > 1 tasks running in parallel at the beginning of the package.

    The question remains: why did it suddenly start failing, on two servers?

    If you haven't even tried to resolve your issue, please don't expect the hard-working volunteers here to waste their time providing links to answers which you could easily have found yourself.

  • Phil, by chance does the package use checkpoints? I've seen some odd behavior like this when checkpoints were enabled and the required checkpoint file was moved or deleted.

    Tim Mitchell, Microsoft Data Platform MVP
    Data Warehouse and ETL Consultant
    TimMitchell.net | @Tim_Mitchell | Tyleris.com
    ETL Best Practices

  • Tim Mitchell wrote:

    Phil, by chance does the package use checkpoints? I've seen some odd behavior like this when checkpoints were enabled and the required checkpoint file was moved or deleted.

    Hi Tim – no, it doesn't use checkpoints.

    If you haven't even tried to resolve your issue, please don't expect the hard-working volunteers here to waste their time providing links to answers which you could easily have found yourself.

  • I wonder if it was because of the flow change you made in the package or it's because you "re-saved" the package (and I might be using the wrong term term there because, as you know, I'm not an SSIS kinda guy).  I say that because I used to work with a couple of folks that said it seemed like a package would sometimes "lose its mind" and they told me that a resave would do the trick for them.

    I've run into a similar issue with some linked servers that pointed at AS-400 boxes.  Just manually testing the connection from the Object Explorer would fix the problem.  I ended up writing some code that would do the same thing and would call the proc that did the test to the beginning of other code.

    It IS odd that this happened at the same time in two different environments.  There must have been a "disturbance in the force". 😀

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Before I HF'd Prod today, there had been no SSIS deployments since June 9th.

    (If you'll pardon the expression) I played around with the package quite a bit, as I was trying to understand exactly what was making it crash. That included creating a copy and deploying (failed with same error) and then carefully enabling and disabling different components inside the package to find out what was doing the damage.

    It was certainly the two SQL tasks running in parallel which caused it – I was able to turn the error on or off consistently by enabling or disabling either one of these tasks.

    At some point, I may try some further R&D, including creating a package from scratch and gradually adding and removing components, looking for a pattern. But right now I'm behind with other things (you know how it is when something like this comes up and consumes hours of time unexpectedly). Behaviour being different in QA from my personal dev environment is another irritant!

    Yet another nail in the coffin of SSIS as far as you are concerned, Jeff!

     

    If you haven't even tried to resolve your issue, please don't expect the hard-working volunteers here to waste their time providing links to answers which you could easily have found yourself.

  • I am running into similar issues which did not start until I added Balance Data Distributor into the mix and something you wrote about your redesign which resembled my BDD was the impact on parallel tasks have on buffer memory and rows committed. In short, I believe the similarity is we are starving the application on resources which ends up making the package fail ungracefully.

Viewing 9 posts - 1 through 8 (of 8 total)

You must be logged in to reply to this topic. Login to reply