Recovering from Bad Deployments

  • Comments posted to this topic are about the item Recovering from Bad Deployments

  • Some companies are OK with a DBA doing a few tests, keeping scripts, having a rollback plan, and even planning a release.

    Others seem to take the view that it's a waste of time, you can't code for what could go wrong, other companies obviously don't need to, you shouldn't be writing cr@p scripts, etcetera, etcetera.

    Guess which kind of company offers the most rewarding job satisfaction, the least stress, and the least firefighting.

  • Fully auditing the data changes through to the end of the cooling-off period is one of the more robust ways of fixing the data. It can also give you certainty that you really have fixed the data if a release does screw it up. Of course, it does mean that you have to have the spare capacity, the scripts, the application knowledge, and the ability to create and run the fix.
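
    For illustration, a minimal trigger-based audit sketch (the dbo.Orders table and its columns here are hypothetical):

    -- Audit table capturing who changed what and when
    CREATE TABLE dbo.Orders_Audit
    (
        AuditId   int IDENTITY(1,1) PRIMARY KEY,
        OrderId   int         NOT NULL,
        Status    varchar(20) NULL,
        Operation char(1)     NOT NULL,  -- 'I'nsert, 'U'pdate or 'D'elete
        ChangedAt datetime2   NOT NULL DEFAULT SYSUTCDATETIME(),
        ChangedBy sysname     NOT NULL DEFAULT SUSER_SNAME()
    );
    GO

    CREATE TRIGGER dbo.trg_Orders_Audit ON dbo.Orders
    AFTER INSERT, UPDATE, DELETE
    AS
    BEGIN
        SET NOCOUNT ON;

        -- New or updated values
        INSERT dbo.Orders_Audit (OrderId, Status, Operation)
        SELECT i.OrderId, i.Status,
               CASE WHEN EXISTS (SELECT 1 FROM deleted d WHERE d.OrderId = i.OrderId)
                    THEN 'U' ELSE 'I' END
        FROM inserted AS i;

        -- Deleted rows (no matching row in inserted)
        INSERT dbo.Orders_Audit (OrderId, Status, Operation)
        SELECT d.OrderId, d.Status, 'D'
        FROM deleted AS d
        WHERE NOT EXISTS (SELECT 1 FROM inserted AS i WHERE i.OrderId = d.OrderId);
    END;
    GO

    Keeping those audit rows until the end of the cooling-off period gives you the before-and-after values needed to write a targeted fix.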

  • We deploy to production about once a month. The changes are tested well before deployment and somewhat afterwards, including generic tests throughout the primary systems just to be sure that we didn't somehow introduce any 8-year-old errors. Those are errors even an 8-year-old would notice. We are prepared to go back to before the deployment but we don't have to contend with a 24X7 environment.

  • I am always surprised to hear a company say that a lot of planning/prep isn't a good idea. I've worked at places that said "huh, that's a good idea" and then let me prepare. No one has ever said no; the worst response is "Is this really necessary?", to which I reply yep and do it. So far I've been very lucky... knock on wood!

  • The worst thing you can do is to assume that even the most simple deployment will go perfectly. Each deployment should be treated as if it could completely destroy the system in a heartbeat, and as much of the process as you can afford to automate up front should be automated and tested in a low- or no-risk environment. Note I use the word 'should', not 'must'. Some believe they cannot afford to do these things, but learn that they cannot afford not to do them. It should take only one catastrophe to learn this.

    Even that little data fix and a no-brainer tweak to a script can become problematic.

    It is wise and prudent to back up and take care, and efficient to automate what you can.

    M.

    Not all gray hairs are Dinosaurs!

  • I like a simple and robust process for live deploys:

    -> lock down interfaces (the architecture may allow queuing/a broker to reduce the impact here; a lockdown sketch follows below)

    -> back up (full/transaction log/diff, depending on how complex you want your recovery to be; verify the backup)

    -> deploy

    -> stakeholder approval -> decide whether to roll back -> otherwise commit

    It is the only method I am happy with for getting a reliable deploy experience and for getting back to a known point safely and reliably under the pressure of a failed deployment. Either things are so bad that you cannot leave the system with the issue, so you must get it back to the state it was in ASAP, or you decide you can live with the issue introduced and ship a tested hotfix in the hours/days ahead to resolve it.
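
    For the lockdown step, a minimal sketch at the database level (the database name is hypothetical, and this assumes the application logins are not members of db_owner or sysadmin):

    -- Block application connections for the duration of the deploy;
    -- in-flight transactions are rolled back immediately.
    ALTER DATABASE MyAppDb SET RESTRICTED_USER WITH ROLLBACK IMMEDIATE;

    -- ... back up, deploy, verify, get stakeholder approval ...

    -- Reopen the database once the deploy is committed (or rolled back).
    ALTER DATABASE MyAppDb SET MULTI_USER;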

    After the commit, there is no going back. This is made clear to the stakeholder as part of the planning. They are the ones who want the new features/bug fixes in the deploy anyway, so they are the right ones to decide which path to take.

    Whatever the chosen strategy, there should *always* be a verified backup, taken just before the deploy, as your last-chance option (a backup-and-verify sketch follows the list below).

    - A transaction is not sufficient (not everything supports transactions)

    - A snapshot is not sufficient (I have experienced snapshots that would not restore); you can use one for faster recovery on a large database, but you can't rely on it

    - A rollback script is not sufficient, and I'd argue not worth writing (100% certainty of bug-free code cannot be achieved, and the extra work involved discourages refactoring, which results in a build-up of technical debt or high development costs)
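
    As a minimal sketch of that pre-deploy backup and verification, plus an optional snapshot for faster recovery on a large database (database, logical file, and path names are all hypothetical):

    -- Take a copy-only full backup just before the deploy and verify it is readable
    BACKUP DATABASE MyAppDb
        TO DISK = N'\\backupshare\MyAppDb_predeploy.bak'
        WITH COPY_ONLY, CHECKSUM, INIT;

    RESTORE VERIFYONLY
        FROM DISK = N'\\backupshare\MyAppDb_predeploy.bak'
        WITH CHECKSUM;

    -- Optional: a snapshot can make reverting a large database faster,
    -- but treat it as a convenience, not as the last-chance option.
    CREATE DATABASE MyAppDb_PreDeploy
    ON (NAME = MyAppDb_Data, FILENAME = N'D:\Snapshots\MyAppDb_PreDeploy.ss')
    AS SNAPSHOT OF MyAppDb;

    -- Reverting only works if this is the sole snapshot on the database:
    -- RESTORE DATABASE MyAppDb FROM DATABASE_SNAPSHOT = 'MyAppDb_PreDeploy';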

    With a strong (and preferably automated) lifecycle process, where the deploy package has already been applied against very recent restored copies of live at least twice BEFORE the actual live deploy (once on the actual day) and fully functionally tested by both unit tests and humans, the only case where you should need a rollback is something going down (hardware/network/connectivity) in the middle of the deploy, or really unusual data introduced since the most recent rehearsal breaking a change to a constraint, etc. Anything else would smell of non-representative test systems being used for the earlier deploys, or of a different release package being deployed than was tested. I would argue the investment is better placed in fixing those problems, or setting up automated insurance against them, rather than adding work for expensive and fallible developers/DBAs.
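
    Setting up one of those rehearsal runs is mostly a restore of a recent copy of live; a minimal sketch (all names and paths hypothetical):

    -- Restore last night's live backup over the rehearsal database
    RESTORE DATABASE MyAppDb_Rehearsal
        FROM DISK = N'\\backupshare\MyAppDb_latest.bak'
        WITH MOVE N'MyAppDb_Data' TO N'D:\Data\MyAppDb_Rehearsal.mdf',
             MOVE N'MyAppDb_Log'  TO N'L:\Log\MyAppDb_Rehearsal.ldf',
             REPLACE, STATS = 10;

    -- Then apply the exact release package to MyAppDb_Rehearsal and run the
    -- unit tests and manual checks against it before touching live.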

    So far, over more than 100 deploys with this process, I have needed one rollback. It was due to a pre-deploy code issue inside a condition that was only fulfilled when run on live! (Oops.) Deploy control code that runs conditionally based on the current environment is a bad idea for obvious reasons. Obvious in hindsight, of course!

    I do have a second, "online" deployment process as well, which doesn't lock down the system. It is allowed only for small programmatic changes that can run entirely inside a transaction and complete in seconds. No table or data changes are allowed via this method, and here a scripted rollback can be used fairly safely. However, I always take a full backup before these too. If you really do need a rollback script, restore the backup (or use a snapshot) and compare against it to generate a safe rollback script, rather than spending the time writing one up front.
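
    A minimal sketch of that online path, with the programmatic change wrapped in a transaction (the procedure and table here are hypothetical):

    SET XACT_ABORT ON;   -- any error rolls back the whole batch
    BEGIN TRANSACTION;

    -- Programmatic change only: no table or data changes
    EXEC sys.sp_executesql N'
    ALTER PROCEDURE dbo.GetOrderStatus @OrderId int
    AS
        SELECT Status FROM dbo.Orders WHERE OrderId = @OrderId;
    ';

    COMMIT TRANSACTION;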

    A big topic though... and one that usually seems to be furthest from the minds of most developers I have come across.

  • Steve,

    What is the SQL Compare switch that automatically creates a rollback script for a deployment script? I can't find it anywhere. Thanks.

  • It's the "Switch" button that's below the databases in the data sources dialog.

  • I have used a number of different methods to secure a nice, safe recovery process before deployments, but there have been times when I have been handed 5,000+ lines of T-SQL only five minutes before a deployment. At times like those I have just taken whatever backup is feasible in the window, made sure all of the approvals are well documented, and expressed vehemently that more time would be great. That way, if something breaks, I can say that I didn't like the plan and that I followed the required chain o' command.

  • I think rollback plans are an essential part of any release. Anything involving data is more complicated, because the rollback plan usually needs to account for the moving target that data is. It is far more complicated than swapping a few assemblies in and out 😉

    Gaz

    -- Stop your grinnin' and drop your linen...they're everywhere!!!
