Interesting Case of SSIS Failures
We had an interesting case of SSIS failures recently where the cause ended up being improved performance. Yes, improved performance caused the package to fail almost every run. We migrated the application to newer, better hardware resulting in overall improved performance of all ETL activities. Prior to the move to new hardware, one particular SSIS package was successful every run. Not so after the move.
How to troubleshoot?
The cause of the failures were easily apparent. The package pulls data from 3 different environments (Staging, PPE, and Production). For PPE and Production, it was pulling data from replica servers, but there was no replica server in Staging. In Staging, it pulls data from the log shipping server that is in stand-by mode. The log shipping restores are scheduled for every 15 minutes, and they knew that they had to keep any queries hitting the server to no more than 10 minutes. They had done that successfully. On the new hardware, the package was failing because it was querying the Staging server when the next log shipping restore would start thus disconnecting the package.
Immediately, the thoughts were that the package was running longer causing the queries to take too long on staging. However, the few times that it succeeded (3 times in 4 days) indicated that this was not the case. When it was successful, it completed in about 12 minutes. Still that didn’t make sense. Even at 12 minutes, it still ran over into the next log restore and so did the successful attempts on the old server. I verified that the log restores were all occurring at the scheduled time and were taking about 2 minutes each time. There was no deviation in the log restores whether the package failed or succeeded.
So What Next?
The next step was to take a close look at the SSIS package and determine exactly what was happening. The developer of the package came up with a theory on the cause of the failure based on the timeline of the tasks in the package. There were a lot of steps to the package, but to simplify the explanation, it does 3 basic things. First, it pulls data from the Staging environment. Then it pulls data from the PPE and Production environments. And lastly, it pulls some additional data from the Staging environment. The theory was that when the package was running longer, the log restore on the Staging server would occur during step 2 of the package when it was querying data from PPE and Production. On the newer hardware, the package was running faster, and it would already be on step 3 by the time the log restore occurred. On the old server, we were receiving the benefit of lucky timing, and on the new server, we were not. Subsequent testing proved this theory correct.
The following timeline diagram demonstrates the behavior we were seeing.
Click to enlarge
The Mitigation and the Fix
For short-term mitigation, I changed the restore schedule on the Staging server to every 30 minutes. The backups and file copies still occur every 15 minutes, only the restores are delayed. This gives a bigger window in which to complete the data pull. This is merely a short-term mitigation. The developers filed a bug for the package and expect to have a new package ready for deploy in their next release that can complete within the 15 minute window at which time the log shipping restore will resume its original schedule.