Product articles Flyway Rollback strategies
Failed Database Deployments: Roll Back…

Failed Database Deployments: Roll Back or Fix Forward?

This article explains the different ways to reverse or accelerate your way out of the problems caused by a failed database deployment, where a technique such as feature flags fits into your strategy, and how Flyway is designed to help you ensure both that you're fully prepared to take swift remedial measures, and that you minimize the need to use them.

Guest post

This is a guest post from Phil Factor. Phil Factor (real name withheld to protect the guilty), aka Database Mole, has 30 years of experience with database-intensive applications.

Despite having once been shouted at by a furious Bill Gates at an exhibition in the early 1980s, he has remained resolutely anonymous throughout his career.

He is a regular contributor to Simple Talk and SQLServerCentral.

If you don’t feel anxious during the deployment of a database update, then you’re doing something wrong. Your nerves should be tingling because, if this database is supporting the activities of an organization, the entire organization can suffer if things go wrong. Even an apparently insignificant problem in a minor release has been known to have disastrous consequences.

What is the best way of dealing with a failed database release? Do you roll back to the previous version of fix forward to a new, stable version? Or can you avoid the need for either, by using feature toggles that allow you disable the feature that is causing problems? As so often with databases, it depends.

A defensive approach to database releases

There is no magical way to ensure that the database release process is easy and stress-free. If problems crop up, the right course of action varies according to the nature of the problem, when it was detected, the scale of the release, and the design of both the database and application. Some advice is obvious:

  • Always take backups – before performing any process on a live database that can potentially go wrong
  • Do the migration within a transaction – so it will safely rollback if the deployment fails part way through. Flyway does this automatically (if supported by the RDBMS)
  • Release often and in small doses – Flyway is designed to support branch-based development, testing, merging and releasing small changes, regularly
  • Have a tested idempotent rollback to match every migration script – Flyway undo scripts are a useful starting point
  • Shake out as many issues as possible in Staging – perform sanity checks on the important business processes using test scripts run by ‘phantom users’
  • Design the database with a defensive approach – to releases, and to preserving data.

Other advice ‘depends’, mainly on the design of the application and database. Strategies such as feature flagging can help deal with issues limited to a particular feature and are useful if used judiciously.

In any event, the way you deal with failure of the deployment process itself will likely be different to the way you deal with a problem discovered after the deployment completes. The difficulty involved increases if there have been subsequent changes to the data.

Recovering from failure of the deployment process

When a problem, such as a runtime error from the RDBMS or a connection-termination, occurs during the execution of a database migration, then the process of updating the database is halted during its execution. The desired outcome is that all the changes are rolled back, returning the target database to its previous version, with all existing data preserved.

Ideally your database migrations tool will take care of this for you, automatically, provided your RDBMS supports doing the change, and therefore rolling it back, within an explicit transaction. If it doesn’t, or there is some other reason you can’t use a transaction, then a deployment failure can leave the database in an indeterminate state. The only possible courses of action then are either to ‘rollback’ all the changes manually, using a script, or to revert the database to a backup, taken before the update started. This may be fine in development but is a ‘last resort’ in any deployment to production.

When using a transaction to update the database

The rollback of a transaction is well-understood in SQL, and it is one of the more valuable qualities of a relational database. When the process of updating the database occurs within an explicit (multi-statement) transaction, all changes either succeed completely and are committed or, if some error occurs during the transaction, are rolled back completely. Flyway will always attempt to run migrations within a transaction so that, if an error occurs, it can undo any changes.

This transaction rollback is essential to maintaining database integrity because, if an error happens, all the changes performed within the transaction are played back in reverse order, undoing the changes, and so the connection is restored to a consistent state at the point when the transaction started. Before the introduction of DDL transactions, all production databases had to be taken offline for updates.

With all enterprise-quality relational database systems, it is not just the changes in data that can be rolled back but, with one or two exceptions, also the changes to the metadata. In other words, the DDL that alters the database itself, rather than just the data within it, can be rolled back. In some cases, this relies on the correct isolation level being set and appropriate connection settings.

When running a script to deploy changes to a target database, any competent tool such as Flyway or SQL Compare will execute the changes within a transaction. With SQL Server, for example, the transaction isolation level for the migration will be set to SERIALIZABLE, or occasionally READ COMMITTED (in tandem with other checks, such as on @@TRANCOUNT). In the former case, it prevents all read phenomena, meaning that no other user sessions can see any effects of the changes to the data, nor make any changes to it, until the transaction commits or rolls back. These tools will also help deal with error handling to ensure a clean roll back.

So, if Flyway or SQL Compare modifies a table in a production database, and then the changes are rolled back, what happens to the data modifications of other users who are accessing that table? In fact, there will be no other users accessing the table, because the schema modifications that are being executed require acquisition of a table-level lock in most RDBMSs, which can only be acquired if no other sessions are accessing the table. It is a resilient and well-used system that underlies the practice of making updates to the live production system rather than having to take the system offline for the release.

When updating a target database without an explicit transaction

Some RDBMSs don’t support executing DDL code within a transaction (MySQL, MariaDB, or SQLite don’t, for example). In such cases, if a migration that alters DDL code fails partway through, Flyway can’t roll it back. It will report an error like the one shown in Dealing with Failed SQL Migrations in MariaDB or MySQL:

Flyway : ERROR: Migration of schema zzzz to version “xxxx- yyyy yyyyy yyyy” failed! Please restore backups' and roll back database and code!'

Flyway info will list the migration as FAILED and it won’t report a version until you clean up the mess. You can encounter the same problem if you’ve specified that Flyway should not execute within a transaction

Occasionally, even SQL Server doesn’t allow you to use an explicit transaction when modifying a database. For example, you can’t issue CREATE/ALTER/DROP DATABASE, or CREATE/ALTER/DROP FULLTEXT INDEX commands, within an explicit transaction. It is difficult to imagine circumstances when you’d want to do this to a live production system, but if so, then you have some rather more painful alternatives.

The extent of the complications involved in returning the database to a known state depends on whether the database release is being done with the database offline, or with the database and applications online and being used. The former is relatively simple to deal with and can be an effective second-best to a transaction rollback. The second isn’t necessarily simple at all, especially if the release made significant changes to the base tables and live data changes happened subsequently. It is, in fact, only possible if you’ve designed the database to be able to do it.

Using a rollback script

If you cannot use a SQL transaction when deploying a new version of a database, you can use a ‘rollback’ script, so-called because, if it is effective, it mimics the result of a ROLLBACK TRANSACTION statement. Its goal is to return the database safely from an ‘unknown’ version, produced by a migration that part-failed, to the starting version, with all data in-tact. Unlike a Flyway undo script, which is designed to revert a database from a specific version to a previous version, a rollback script must check what needs to be reverted before doing so. See Rollbacks, Undos and Undonts for details.

There is nothing inherently complex about a rollback script. If releases are frequent and the metadata changes are minor, a rollback script can be reasonably simple. Problems only really arise with major ‘online’ releases that make substantial changes to base tables, especially if some time has elapsed before the decision is taken to roll back the deployment. In the worst cases, it is quite possible to lose any changes to the data made after the deployment.

If you have an RDBMS such as MySQL, MariaDB or SQLite that cannot rollback within a transaction, it is essential to have a well-tested rollback script prepared alongside every migration script. If this is applied immediately, the ‘rollback’ script will check all the individual changes in the corresponding migration script and remove them, if they’ve been made, to revert the database to the previous release, while preserving the existing data.

If your RDBMS is supported by SQL Compare, you can use it to get a first cut of the rollback script, by comparing the release version of the database, as source, with the current production version as target. You will then need to test and amend the script, as required, to ensure that both the schema and data are returned to their original states even when a failed migration leaves the database in an indeterminate state. The insertion of a column in a table, for example, will need a corresponding rollback script that removes the column, if it exists, after moving any data changes in the column into an appropriate place in the previous version. Like a migration or synchronization script, it moves the database from one version to another, but in this case the database is in an indeterminate state between two versions, which is why rollback scripts must be idempotent; they must check what has and hasn’t been done.

At the level of the base tables, user changes could be deletes, updates, insertions. If the release has made alterations to base tables, then ‘reversing’ the data, including the subsequently changed data, back into the old table structures could be complicated. The extra complication is in having to deal with data in the released version for which there is no table structure to preserve it, in the previous version. After all, if Adventureworks decided to sell Surfboards, or if Pubs wanted to cope with Audiobooks, and released a version to enabled them to do it, where would the subsequent data on the new products be stored on rollback? It can be done, but it is a lengthy and distracting discussion involving XML and/or JSON.

If you have a pre-prepared and tested rollback script, you can run it immediately. You’ll need to prevent other users making changes while the rollback takes place. For example, you could follow the format that SQL Compare uses for encapsulating the script within a serializable transaction. Alternatively, you’ll need to ensure that the system is offline before the rollback starts.

It is not unknown for a rollback script to be created in an emergency, after an error happens, but it is bad practice as it is untested, and could extend the time that the database must be kept offline. It is much better to have this script already prepared even if, like a fire extinguisher or parachute, it is almost never used.

Restoring from a backup

If, shortly before the deployment starts, you take a full database backup, and immediately prior to starting the deployment you take a final log backup, then a small database can be restored quite quickly to the state immediately before a failed release. With increasing size of a database, this restoration process can take a long time. Again, any changes made to the database after the final log backup was taken will be lost unless you captured them some other way.

Recovering from problems that arise after a deployment

The sort of problem that happens at this point is generally different in character to an immediate error during a release. This time the release succeeds, and so the database is at a known version, but then something ‘breaks’ later, after normal operation is resumed. It could have several causes, such as a required database component that is missing, or a problem caused because someone has made uncontrolled changes to the current release. A failure after release, however, is generally an error in database functionality, or unexpected condition, which wasn’t detected during testing. It can manifest itself in a number in ways such as SQL errors, faulty processing of data, performance problems, or outright database narcolepsy (sudden sleep).

The correct course of action is a matter of judgement. It all depends on the nature of the problem. You can still restore a backup if all else fails, but you would lose all the data changes that have happened since the release, and it would mean significant downtime.

Fixing forward (aka rolling forward)

The correct term here is ‘fixing forward’ and it refers to the scripted techniques that are used to create a new and stable version of the database, in cases where a release succeeds but causes a significant problem subsequently. One requires DBA heroics and is very bad and the other can be good, if it is part of a well-controlled release process.

First, the very bad…

Uncontrolled fixing forward

There was a time when a release failed would often be fixed with ‘DBA heroics’. After all, with the confident belief of knowing exactly what is causing the failure, what could go wrong if one strode towards the terminal with a confident chuckle and keyed the necessary DDL changes into the terminal.

Unfortunately, I have witnessed occasions where this has gone woefully wrong. The problem is that, although you can do this within a transaction and revert it, if necessary, you cannot test the change before you commit it. Even if you’re lucky and it works, if you end up making several changes to the live system then it no longer represents the current release. The code changes that you make must then be retrofitted to the code for the release and added to source control as a new release. This will, of course, invalidate the deployment process with all its tests and checks. This is high-risk: if your changes aren’t signed-off by the security team, for example, could you be culpable of a subsequent security breach?

Basically, this is an uncontrolled process which has no place in any functioning development department. We haven’t described what happens if the ‘heroic’ roll-forward fails, because the potential consequences re too painful to relate, but it will involve production down-time and perhaps loss of data.

Now for the better approach…

Controlled fixing forward

A more satisfactory version of fixing forward is the hotfix, which is usually created as a migration script in Development as soon as the issue in Production is detected. After automated testing, the hotfix is tested thoroughly in Staging before being applied, as a version increment, to Production. In Flyway Teams or Enterprise, the whole process can be automated to the point that the process is relatively safe.

With Flyway this ‘roll forward’ technique results in the next database released version being dedicated to fixing the problem. Flyway, when used properly and within an effective branching and merging strategy, can become a very rapid database releasing system. The changes made to fix a ‘broken’ release are made within a dedicated ‘hotfix’ branch, with as much of the automated test series as possible. A subsequent pull request is subject to all the necessary signoffs by the various teams before it is accepted for release, to create a new version. Because the hotfix is tested, hopefully with improved tests to cover the unexpected condition that caused the problem in the first place, then it is low risk.

The code is in source control. With a fully scripted and automated process, this can be done in a matter of hours. If the tests focus on the root of the problem, then it could be much quicker. A team that is good at the rapid and reliable release of small discrete chunks of database functionality will manage this type of ‘forward fixing’ release with relative east. However, the more wide-ranging a release, the slower a subsequent ‘fixing’ release will be, because more must be tested, checked and signed off.

Using feature switching or feature toggling

When a problem happens after a release and is isolated to a new feature that isn’t fundamental to the working of the database, you may have an alternative to rolling forward or back. You don’t necessarily have to do either, if you’ve designed the database and application to be able to switch new features on and off, gracefully.  These switches, usually called feature toggles, are designed to allow teams to disable or enable a new feature or piece of functionality quickly. They are most effective when coupled with the ability to restrict future releases to a subset of the users. This is usually implemented using role-based security. For example, we can assign permissions to access the new feature to only a small set of “power users”, or a particular set of customers, meaning that toggling ON makes the feature available only to a small pool of users, rather than everyone. Via a toggle, new features introduced in the release can also be ‘disappeared’, causing the user interface to change likewise ‘in concert’ (e.g. hiding a field from the UI). This can allow teams to do more regular, controlled deployments, and can reduce the ‘mean time to recovery’ in the event of a deployment problem.

However, there are implementation challenges too. There must be a method of managing any data associated with a new feature, so that this data can be preserved if the feature must be re-engineered to fix a problem. Toggling a feature ON or OFF must not affect the version of the database, so must not require any database schema (DDL) changes, just data changes. Usually, toggles can be done simply by refencing ‘switch values’ in a table or view. I prefer to use a view, because a toggle can then be changed, via a Flyway migration that alters the view, and therefore the database version. This allows you to be certain about what features were turned on or off at any point in time, which is good for auditing purposes.

I’ve had problems with feature toggles where someone has failed to understand the reason for them, made changes to them or even removed them. The consequences of a well-used feature on which the business relies being accidentally switched off by a bored Ops guy is enough to induce a cold sweat. The biggest challenge of the feature toggle approach is that it must be engineered into the database and application and requires additional test resources. If not managed well, or overused, it can complicate the database release process.

Designing database releases defensively

There is no technique that entirely removes the anxieties of the release of a production system. All that one can do is to take all possible precautions. I’m no stranger to having a well-known captain of industry striding up and down the server room, scowling and shouting, “every minute this **!#* system is down you lot are losing me $xxxxxxx of revenue!“. You can’t argue with that, and so it’s up to the developer to reduce the risk as far as possible. Think lifebelts. In my databases, for example, all migration scripts from source version to target version are accompanied by an idempotent reverse-migration script from target to source that likewise preserves the data.

This requires taking steps to make sure that it is easy to preserve data in a rollback. If, for example, I wish to make changes to a table, I like to preserve a column and its now-redundant data, for one release, after moving or reorganizing the data elsewhere in the database. Only when the change proves itself in production is the redundant column removed in the subsequent release. I have phantom users on the live system directly after a release. These users, from a script, exercise all the vital processes just to double-check that everything is in working order. For example, users armed with special credit card numbers will perform a series of customer actions on a shopping website. After the initial release of a database system, I release little and often.

Where I have a say in the design of a new system, and it happens occasionally, I insist that all interaction between the applications and the database must happen via a defined interface within the database. This helps to decouple the evolution of the database from the development of the application. It also allows a ‘playback’ log to be kept of all user interaction that affects the data. This includes the date it happened as well as the SPID, user identification, procedure, and parameters, so that business transactions can be replayed. If done well, this can, with suitable scripting, allow the restoration of data, even if the database must be restored from backup or reverted from snapshot. If it can be anonymized, it is also gold-dust for subsequently testing database processes.

Finally, how do we revert a database to a previous version when that previous version has no suitable schematic structure (tables) which could assimilate the data? For this, I add, before the first release, an ‘accommodation’ or ‘limbo’ table. It might just have one date column, and another column for storing the data, in the form of JSON arrays, or XML. The release has no knowledge of this ‘limbo’ table, and it provides a way of preserving the data for subsequent restoration.

Conclusion

There is no magical way to ensure that the database release process is easy and stress-free. If problems crop up, the best remedial actions vary according to the nature of the problem, the scale of the release, and the nature of both the database and application.

Some advice is obvious. Always, where possible, do a migration within a transaction. Shake out as many issues as possible in Staging. Design the database with a defensive approach to releases, and to preserving data. Where possible, perform ‘sanity-checks’ on the basic processes within the production system with test scripts performed by a ‘phantom user’ before the new release goes live. Have available the tested idempotent ‘rollback’ scripts that match every migration script. Other advice ‘depends’. Always take backups before performing any process on a live database that can potentially go wrong.

The best approach is to get into the habit of releasing often, cautiously, and in small doses of new functionality. This allows you a controlled way of updating the database to a new release to fix a flaw in a database release. Flyway is specifically designed to support branch-based development, testing and merging small changes, regularly, and with automation.

If you can avoid the need for ‘big bang’ releases that affect hundreds of objects, then you’ll likely also minimize the need to “cherry pick” what can or can’t be released, or to disable features using a toggle or flag. Database feature flags can be useful, but they can also complicate a release.

By doing all this, it is possible to reduce the anxiety that surrounds the database deployment process.

 

Tools in this post

Flyway

DevOps for the Database

Find out more

SQL Compare

Compare and synchronize SQL Server database schemas

Find out more