My previous article described the fallibility of page checksums in the face of memory corruption, and therefore the need to run regular
DBCC CHECKDB consistency checks on your databases. In the ensuing forum questions, I was asked how often corruption really occurs in the real world. My answer was, "hundreds to thousands of times every week across the world, in the tens of millions of SQL Server databases…". Every single week I receive multiple emails asking for some advice about corruption recovery. When I'm teaching about corruption in our High Availability and Disaster Recovery Immersion Event, I always tell the class that I expect every DBA to see database corruption at some point during their career.
What I didn't offer in that article was any general advice on what to do if
DBCC CHECKDB reports corruptions. The best response to discovering corruption in a database is to follow the guidance in your company's disaster recovery handbook. In there, you should find instructions for the recovery process, based on the company's expectations of acceptable levels of downtime, and maximum permissible data loss, for the corrupt database. Most important of all, whoever devised these guidelines will have done so free from the adrenaline-fueled panic that sometimes happens when we discover corruption.
I'm sure many of you are now thinking, "what if we don't have a disaster-recovery handbook?" That's what I'm going to cover in this article.
When corruption appears, you need to do two things:
- Recover from the corruption
- Perform root-cause analysis to prevent the corruption occurring again.
I would much prefer that root-cause analysis came first, but in most companies getting back online again is the top priority, so my ordering reflects that.
Recovering from Corruption
At this point, I'm assuming you've already run
DBCC CHECKDB and so you know which objects and indexes in the database the corruption has affected (nothing else can tell you that). Next, there are a number of choices for recovering from corruption, depending on where the corruption exists and what resources are available to you.
It is beyond the scope of this article to enumerate all the different possible errors, what they mean and how best to recover from them. Instead, I offer some general "recovery options" with links to further reading.
Failover to a Secondary
Firstly, you could fail over to a secondary copy of the database, if you have one available, as the corruption is unlikely to exist there. Corruptions are usually in the data files and so are not shipped to secondary copies, as SQL Server high-availability technologies ship log records, not data file pages.
This failover will involve some application downtime, while the failover occurs. In addition, some (maybe only temporary) data loss is possible, if you're not maintaining the secondary copy synchronously.
The amount of downtime depends on how much transaction log still needs to replay on the secondary copy to bring it up-to-date. Depending on the high-availability technology in use, this replay could be from the database mirroring or availability group redo queue, or log backups that we apply to a log shipping secondary.
A second option, if you have transaction log backups, is to perform a restore operation, either to return the database, as a whole, to a point in time before the corruption occurred, or to restore only the corrupted files and pages.
If only one or a small number of data pages are corrupt, you could perform a single-page piecemeal restore. In Enterprise Edition, you can run this as an online operation and so minimize both the recovery time, and the amount of data that's inaccessible during the restore.
If the damage is limited to one or more files from a secondary filegroup, then you can perform a piecemeal restore of just the affected filegroup, but you have to set it offline, which in turn requires the database to me offline for a brief time. If the damage is to primary filegroup, the whole database will be offline while you restore the filegroup.
If corruption has affected the entire database, a full restore will be necessary. Whatever the exact restore sequence, make sure you've practiced it, thoroughly, many times before. My TechNet Magazine article, Recovering from Disasters Using Backups, offers a lot more advice on recovery using backups, and in this blog post, I offer some neat tricks around backing up the tail of the log, to ensure you capture the most recent data.
Lastly, if you don't have another copy of the database and you don't have usable backups, you can consider running repair. We (as I was on the SQL Server team at the time) named the
REPAIR_ALLOW_DATA_LOSS option carefully, as using it usually means that
DBCC CHECKDB needs to delete the damaged portion of the database to effect the repair.
Never run repair if there is another option, as data loss and transactional inconsistency will be the result. However, if you really have no other choice, or if extended downtime is more damaging to the business than data loss and it is faster to run repair than restore from backups, then it is a last resort option.
Running repair requires the database to be in
SINGLE_USER mode. After running any repair, you must run
DBCC CHECKDB again, as one of the repaired corruptions could have been masking other corruptions, or
DBCC CHECKDB may not have been able to repair all the corruptions.
In especially dire circumstances, when the transaction log is damaged, you may need to set the database to
EMERGENCY mode to access the transactionally-inconsistent data, and potentially run
EMERGENCY mode repair. I discuss this in a blog post here.
In the event that you're going to run a restore or repair operation, try to restore or repair the smallest possible amount of the database, to limit the downtime required. I have a lot more guidance on this in the Corruption and Disaster Recovery categories on my blog.
In my experience, three things cause corruption and they are, in increasing order of likelihood: SQL Server bugs, memory problems, and I/O subsystems.
In every release of SQL Server have been bugs that cause corruption. That's just a part of running a large, complex software system with many millions of lines of code. If you see a corruption message, I would do a Bing or Google search on the message text to see if anything turns up in a Knowledge Base article. If so, you may have found the root cause and need to install the latest Service Pack or Cumulative Update.
I described memory problems in the previous article so I won't repeat myself here. I would run memory diagnostics from the BIOS of the server to check for bad memory chips, and I'd look to see if the SQL Server error log reports any 832 errors. If there are no memory chip problems, but the memory-related corruption problems persist, then you will likely need to involve Product Support to help diagnose the issue; if the problem is a faulty driver, it may be very hard to track down on your own.
I/O subsystems are by far the largest causers of corruptions, accounting for more than 99.9% of all corruptions I've seen (and I've seen many thousands of cases). The "I/O subsystem" includes all the disks, RAID controllers, SAN controllers, HBAs, and the network (if applicable) hardware, drivers, firmware, and cabling, plus file-system filter drivers. That's a lot of potential for software bugs and faulty electronics.
In the face of such problems, try the following:
- Look in the SQL Server error log and Windows event logs for I/O subsystem messages
- Run your vendor's I/O subsystem diagnostics
- Check for firmware and driver updates
- Run the free utility SQLIOSim, a Microsoft-provided tool to stress an I/O subsystem and expose any problems. See the second reply to this MSDN forum thread for several useful links.
Whatever the problem is, you need to correct it before it happens again, and corrupts some part of the data that means disaster recovery is even more problematic and takes even longer.
About the worst thing you can do when faced with corruption is turn immediately to the
REPAIR_ALLOW_DATA_LOSS option of
DBCC CHECKDB, and then just continue running your workload. Always take the time to evaluate your options and pick the one that is most appropriate for your business. Ideally, your organization has already devised and documented the appropriate procedures for each database in its disaster recovery handbook.
After recovering from the corruption, figure out why the corruption occurred in the first place and then rectify the problem so the corruption doesn't happen again. There's no excuse for experiencing corruption multiple times from the same source and doing nothing to identify the cause.
I'll end by saying this: if you think that corruption won't happen to you at some point in your career, I think you'll be surprised. Be prepared!