SQLServerCentral Article

The Perils of Running Database Repair


In a perfect world, everyone has the right backups to recover within their downtime and data-loss service level agreements when accidental data loss or corruption occurs. Unfortunately, we don’t live in a perfect world, and many people find that they don’t have the backups they need to recover when faced with corruption. Maybe there just aren’t any backups, or the backups are damaged, or recovering using the existing backups would take too long.

In these cases, people opt for using SQL Server’s repair functionality, but I think few of these people understand the perils of doing so, especially on system databases or when system tables are involved. It’s perilous because people don’t know what repair is doing under the covers.

By ‘repair’ I mean running DBCC CHECKDB with the REPAIR_ALLOW_DATA_LOSS option. The name of this option was chosen carefully so that users understand they are allowing SQL Server to potentially delete data to fix the corruption. The job of repair is not to save all the data – instead it is to make the database structurally sound so that SQL Server can access and process the database without encountering corruption.
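A minimal sketch of what such a repair run looks like follows; the database name is illustrative, and note that REPAIR_ALLOW_DATA_LOSS requires the database to be in single-user mode:

```sql
-- Repair requires exclusive access to the database.
ALTER DATABASE MyDatabase SET SINGLE_USER WITH ROLLBACK IMMEDIATE;

-- The destructive repair option: may delete data to remove corruption.
DBCC CHECKDB (N'MyDatabase', REPAIR_ALLOW_DATA_LOSS) WITH NO_INFOMSGS, ALL_ERRORMSGS;

-- Return the database to normal use.
ALTER DATABASE MyDatabase SET MULTI_USER;
```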

There is a non-destructive repair option – REPAIR_REBUILD – which will only rebuild nonclustered indexes to remove corruption in them. For corruptions that require REPAIR_ALLOW_DATA_LOSS, DBCC CHECKDB will recommend that option be used.
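The non-destructive option is run the same way; this is a sketch with an illustrative database name (REPAIR_REBUILD also requires single-user mode):

```sql
ALTER DATABASE MyDatabase SET SINGLE_USER WITH ROLLBACK IMMEDIATE;

-- Non-destructive: only rebuilds corrupt nonclustered indexes.
DBCC CHECKDB (N'MyDatabase', REPAIR_REBUILD) WITH NO_INFOMSGS;

ALTER DATABASE MyDatabase SET MULTI_USER;
```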

The majority of repairs perform what I call “delete what’s broken and fix up all the links”. This means, for example, that a corrupt clustered index data page will get de-allocated from the index, and then all the index linkages will be fixed up, any off-row columns will be deleted, and then all nonclustered indexes on the table will be rebuilt.

You might think this is a pretty drastic step to take, but it’s usually the only way to do the repair that works in 100% of cases. Repairs are difficult to get to work correctly in all cases and so they need to be as simple as possible. During the development of SQL Server 2005 we created a test system that took a known database, corrupted it randomly and then ran repair to see if it fixed everything correctly. If it didn’t then it would automatically file a bug for me to fix. I called this system The Corrupter. I fixed several hundred bugs in the repair code that The Corrupter system found, thus making repair pretty robust.

However, saying that, it’s not infallible. I have several demos involving databases with various corruptions in the system tables that store the storage metadata for all the tables in the database. I can show cases where attempting to run repair on a database with corrupt system tables either makes the corruptions worse, or causes a crash dump to be generated, or both. There are an infinite number of combinations of corruptions that could happen to a database and repair can’t possibly be tested with all of them.

Another thing to think about is that speed is of paramount importance when performing disaster recovery so repair has to work as fast as possible. This also argues for simple repair operations rather than very complicated algorithms that would take a long time to run with only limited possibility of being able to retain some of the corrupt data. There are a few complicated algorithms in the repair code, for instance to stitch broken IAM chains back together, but mostly the repairs are relatively simple.

Apart from the danger of having data loss, or worse if system catalogs are involved, there are some other problems you may run into if you choose (or have) to run repair.

Firstly, repair operations are specifically marked so they are not replicated. This is because they are direct physical modifications to specific database pages, and subscription databases are not exact physical copies of publication databases, so repairs on a publication database do not translate directly into the same operation on a subscription database. It’s also not possible to effect a logical change to the subscription database because the repair code doesn’t know which table/index records are being deleted. This means that after a repair operation where one or more articles in a publication database are affected, the replication subscribers need to be fully re-initialized.
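For a transactional publication, the re-initialization can be done with the sp_reinitsubscription procedure; this is a sketch run in the publication database, and the publication name is an assumption for illustration:

```sql
-- Mark all articles for all subscribers of this publication
-- for re-initialization; a new snapshot will be applied.
EXEC sp_reinitsubscription
    @publication = N'MyPublication',  -- hypothetical publication name
    @article     = N'all',
    @subscriber  = N'all';
```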

Secondly, repair does not consider foreign-key constraints that exist in the database. This is because these are higher-level relational relationships, rather than structural storage relationships – and so are not relevant to making the database structurally consistent. It’s perfectly possible for repair to de-allocate a page in a table thus breaking a foreign-key constraint with another table. This restriction is documented in Books Online. As a consequence of this, if repair affects a table involved in a foreign-key constraint then you must validate it using the DBCC CHECKCONSTRAINTS command.
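Validating constraints after repair is straightforward; a sketch, with an illustrative table name:

```sql
-- Check just the table that repair affected...
DBCC CHECKCONSTRAINTS (N'dbo.MyTable');

-- ...or validate every constraint in the database,
-- including any that are currently disabled.
DBCC CHECKCONSTRAINTS WITH ALL_CONSTRAINTS;
```

Any rows returned identify constraint violations that need to be fixed manually.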

Lastly, repair cannot know anything about inherent business logic and relationships that you have coded into your database, and so some of this may be broken after running repair. This is especially applicable if you have to run emergency-mode repair to recover from a damaged transaction log file without using backups. That option is very much a last-ditch effort and should be used very rarely.
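For completeness, this is roughly what an emergency-mode repair looks like; a sketch only, with an illustrative database name:

```sql
-- Last resort when the transaction log is damaged and no usable backups exist.
ALTER DATABASE MyDatabase SET EMERGENCY;
ALTER DATABASE MyDatabase SET SINGLE_USER WITH ROLLBACK IMMEDIATE;

-- In emergency mode this will rebuild the transaction log as part of repair.
DBCC CHECKDB (N'MyDatabase', REPAIR_ALLOW_DATA_LOSS) WITH NO_INFOMSGS;

ALTER DATABASE MyDatabase SET MULTI_USER;
```

Because the log is rebuilt, transactional consistency is no longer guaranteed, which is exactly why business logic and relationships may end up broken.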

The bottom line is that if you have to use REPAIR_ALLOW_DATA_LOSS then you are going to lose some of your data. I recommend taking a backup of the database pre-repair so that you can restore it somewhere else and try to work out what data was lost, using the repair output, which contains descriptions of what repair did – for instance, de-allocating certain data pages. You can use DBCC PAGE to try to examine the pre-repair pages in the restored copy of the corrupt database.
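DBCC PAGE is an undocumented command, but it is widely used for exactly this kind of investigation. A sketch, where the database name, file ID, and page ID are illustrative (the real file:page IDs come from the repair output):

```sql
-- Send DBCC PAGE output to the client connection instead of the error log.
DBCC TRACEON (3604);

-- Arguments: database, file ID, page ID, print option (3 = full record dump).
DBCC PAGE (N'MyDatabase_PreRepair', 1, 12345, 3);
```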

The purpose of this short article has been to show you why running repair can be dangerous and should not be relied on to allow you to recover from corruption without losing data. It’s always better to have a good backup strategy, where you’re regularly testing backups and you have backups that allow you to recover within your downtime service-level agreement. Is your backup strategy good enough?
