Is Rolling Back The Same as Failing?

  • Comments posted to this topic are about the item Is Rolling Back The Same as Failing?

  • A rollback in a non-production environment should be celebrated as preventing a rollback scenario in the production environment. It means that your pre-production environments have justified their existence!

    Both deployments and rollbacks should be rehearsed.

    Yes, we always kick ourselves when things don't work perfectly the first time. It's a matter of professional pride, after all.

    If it is people feeling sheepish and embarrassed over having to do a rollback, that is a good thing. You are working with people who have a good attitude.

    If it is people feeling "oh well, it doesn't matter, it's not production", then that is a bad attitude.

  • No situation - especially this one - should be allowed to drag a team down in this way. Technically it was the right thing to do, and it ensured the migration went well.

    In my view the problem here is motivation. It is as much the manager's or team leader's responsibility to maintain morale as it is to manage timescales, cost and quality, and it sounds like they were sleeping on the job 🙂 All it needed was perspective, encouragement and a positive "can do" attitude.

  • When you fail more often you'll get used to it 😉

    "In theory, there is no difference between theory and practice. In practice, there is."

    -Yogi Berra

  • In a test environment? This is exactly what they are for! In hindsight, you may kick yourself for not seeing things that should have been incorporated. But you can celebrate the fact that it was caught at a noncritical stage and can be corrected.

    I manage a team that does nothing but one-offs (custom one-time data conversions and miscellaneous custom database scripts). It is the nature of our work that when we run the final result in the production environment, it is typically the first time that code is run in that exact environment. That, in and of itself, carries risk. We do back up the databases prior to execution (a rough sketch of that backup-and-restore step is at the end of this post). If something goes wrong, we evaluate to see if we can correct now or correct later. In a worst-case scenario, we'll have to restore the database, reschedule, and reevaluate.

    Prior to running in production, we perform many tests and many rollbacks in a test environment. If we can get the process to run perfectly in the test environment, then we've greatly reduced (but not eliminated) the risk as we move this to the production environment.

    A significant aspect of our job is to protect the data. If a process executes against the data in such a way that the data is damaged and cannot be corrected back to its original state, then a rollback is the right thing to do (if possible).

    Long ago in my career (back when it took two hours to back up a database to multiple tapes, and you did double backups because you didn't trust the tapes) I had an experience where a rollback was not possible. It involved a conversion within a financial system. Some data was damaged and could not be corrected. There was no time to restore a database and try again, because this would have brought the system down for 24 hours or more. I spent the next four weeks rebuilding the data as close to original as possible. It wasn't perfect, but the accountants (who were aware of the situation) accepted the result. That was a very expensive lesson.
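
    A minimal T-SQL sketch of that backup-first, restore-if-needed step (not from the original post - the database name and backup path are made up, so adjust for your own environment):

        -- Full, copy-only backup taken immediately before the one-off script runs
        BACKUP DATABASE SalesDB
        TO DISK = N'D:\Backups\SalesDB_PreConversion.bak'
        WITH COPY_ONLY, INIT, CHECKSUM, STATS = 10;

        -- Worst case: the data is damaged beyond repair, so put the database back
        -- the way it was. You may need to clear connections first, e.g.
        -- ALTER DATABASE SalesDB SET SINGLE_USER WITH ROLLBACK IMMEDIATE;
        RESTORE DATABASE SalesDB
        FROM DISK = N'D:\Backups\SalesDB_PreConversion.bak'
        WITH REPLACE, RECOVERY, STATS = 10;

    Using COPY_ONLY keeps the one-off backup from interfering with the regular backup chain.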

  • Then there are the rollbacks caused by stuff that didn't happen in the test environment.

    We had a software package for which we upgraded the DB and the clients in a test environment with no problem.

    Then, on production Monday, we upgraded the DB, and as the users logged on in the morning, it would install the updated client. We had pretty much automated the whole thing, so about the only manual client install was on the terminal servers.

    Well, we started getting calls that the client install was failing. We started looking at the clients, and it was failing even with manual installs. Our company was one of the "first" installers of the upgraded SW, so the vendor had no idea what was going on either.

    Luckily we had two weeks of leeway before regulations required the SW to be upgraded, and only about five clients needed to be rolled back, along with a DB restore.

    Then it was a matter of troubleshooting. It turned out the prior version had made the desktop shortcut icon read-only, so when the installer went to replace it, it was choking. It took us two days to find the issue.



    ----------------
    Jim P.

    A little bit of this and a little byte of that can cause bloatware.

  • I read a blog post on this topic recently.

    I like the approach depicted in the article - try to resolve, and then roll back after having done enough troubleshooting to hopefully identify the problem and the shortcomings. In addition, I would say there needs to be a go/no-go point in the rollout where the rollback needs to be considered.

    http://tjaybelt.blogspot.com/2014/02/be-tenacious.html

    Jason...AKA CirqueDeSQLeil
    _______________________________________________
    I have given a name to my pain...MCM SQL Server, MVP
    SQL RNNR
    Posting Performance Based Questions - Gail Shaw
    Learn Extended Events

  • I think it's kind of odd that the team felt bad about this. The team should feel satisfied that their procedures are correct.

    Are any of us so perfect that everything we do works right the first time? That's why we have test environments. You were able to roll back, discuss the issue, find a resolution, and test again.

  • OCTom (3/4/2014)


    I think it's kind of odd that the team felt bad about this. The team should feel satisfied that their procedures are correct.

    Are any of us so perfect that everything we do works right the first time? That's why we have test environments. You were able to roll back, discuss the issue, find a resolution, and test again.

    +1. See my comment above about motivation/morale.

  • I'm going to disagree. I think in this situation it was a failure. Sure, you can tell yourself that it should be caught now (and it should), but don't consider a rollback successful after you had reviews and meetings and still missed something. That is a failure. Of course, you won't do the same thing again and you'll learn, but a 'good' team will feel bad that they missed something. You (the team leader) can celebrate that the team cared enough to feel bad and cares about the work, but don't make an excuse for the failure.

  • don c-309367 (3/4/2014)


    I'm going to disagree. ... don't make an excuse for the failure.

    Understandable vs acceptable.

  • I recently went through this. We have a somewhat older ASP.NET website that doesn't want to work with Internet Explorer 11. But more and more of our external users are getting new PCs that come with IE 11 installed. I had made changes to the web site so that it would work. I tested it as thoroughly as I could and then pushed it out to production, but that turned out to be a disaster. Not only did it not work for IE 11, it didn't work for any browser. I spent an hour or so trying to get it to work, and finally decided to roll back. Better that users on IE 11 couldn't use the site than that no one could use it at all.

    It wasn't a satisfactory choice, but I felt it was the only one open to me. I'm still puzzled as to why it didn't work, because it worked perfectly in testing. And unfortunately for me, as often happens in my job, other more urgent priorities have come up requiring my attention, so this website/IE11 issue has been pushed to the back burner. However, I intend to put up a staging site on a server that's not visible to the outside world, where we can test it again before putting it out for the rest of the world to use - one that will mirror the production web server as closely as possible.

    When I have the time to get back to it.

    Kindest Regards, Rod
    Connect with me on LinkedIn.

  • To use Monty Python's Holy Grail as a metaphor, every major effort in a database requires a 'run away' option.

  • I had a similar experience last year, except it was a production install we had to roll back. And yes, the team was somewhat depressed initially about that - we had planned and tested and done "everything right". Well, almost, as it turned out.

    Testing of a specific feature component had been missed in the QA environment, but was performed in the production validation step, by accident so to speak, which tells me our test planning needs to be improved.

    Any set of professionals will be upset when something goes wrong. That's part of what makes us professionals, and is also what drives us to improve. It's OKAY to be depressed, bummed out, even angry, as long as that gets translated into the energy to fix the issue and improve the process.

    How you react, as a manager, team leader, scrum master, teammate, whatever, becomes very important at that time (IMHO). I do think it's important to remind the team that the rollback itself is a success - we prevented buggy code from existing in production. And that is our job! We prefer to do it earlier in the process. That's what I told my team, and depression gave way to thoughtfulness, then confidence.

    That was the important take-away for me. I suppose I could have beaten the development team up for not enough unit testing, or the QA testers for not having the scenario in their test plan. But what would that have accomplished? I don't need team members with an inferiority complex or who point fingers at other team members ("it was their fault, I'm off the hook") - I need competent, confident team members who work together. By pointing out the positive aspects of the rollback, that's what I got. The whole team spent the next week correcting the problem, and at the next implementation window we successfully installed a major upgrade that's been working smoothly ever since.

    So was the rollback indicative of success or failure? It depends on which goal you're talking about. The TEAM was successful; a small (but important) part of our PROCESS failed.

    - we successfully prevented non-working code from existing in production

    - all team members were aware of which part of our PROCESS failed

    - all team members worked enthusiastically to correct the issue

    - we successfully implemented a week later

    - we successfully corrected our process and all team members are now more aware of what is needed process-wise to increase the likelihood of a successful implementation

    That's a lot of success, and it's only a failure if you fail to learn from it. That's my philosophy, anyway.


    Here there be dragons...,

    Steph Brown

  • Part of a good migration plan is the Roll Back or Back Out plan, and the fact that you had one was commendable in and of itself. You would be surprised at how many times I have seen migrations without a Roll Back plan. Unforeseeable things happen all the time, especially in an environment where there are many people and even teams working a migration, and that is what the Back Out plan is for.

    Now, the fact that this was caught in a Test environment should, to me, be considered a Success. The success is that we uncovered an issue that hadn't been thought of or covered in the Development phase, and that we prevented it from happening in the Production environment. Whenever doing these things we need to have these "Dry Runs" in the Test environment, which should mimic the Production environment as closely as possible, and as many Dry Runs as it takes before we have something that we are 99.99% sure is going to work in Production, because we worked out ALL the kinks in the Test environment. (A rough sketch of building a back-out path into the migration script itself follows at the end of this post.)

    Great Job to you and your team(s) for having the Back Out plan and for executing it when it was necessary.

    Hopefully your next Dry Run will be the one that tells you that you are now ready for a push to Production.
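
    As promised above, a minimal sketch of a migration step with its back-out scripted in - the table and column names are hypothetical, and THROW assumes SQL Server 2012 or later:

        SET XACT_ABORT ON;  -- any error dooms the transaction

        BEGIN TRY
            BEGIN TRANSACTION;

            -- example migration work (hypothetical objects and values)
            UPDATE dbo.Customers
            SET    Region = 'EMEA'
            WHERE  Region = 'Europe';

            DELETE FROM dbo.StagingImport
            WHERE  LoadDate < '20140101';

            COMMIT TRANSACTION;
        END TRY
        BEGIN CATCH
            IF @@TRANCOUNT > 0
                ROLLBACK TRANSACTION;  -- the scripted back-out

            THROW;  -- re-raise so the failure is obvious to whoever is running it
        END CATCH;

    Of course, a real Back Out plan covers far more than the script itself (client rollbacks, restores, scheduling), but even this much keeps a failed step from leaving the database half-migrated.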

    Merrill
