Blog Post

TechEd - Data Corruption

,

I went to this one, mostly because I don't understand coruption that well and Paul Randal is a great speaker. Worth going to see him if you get the chance. 

Run CheckDB. It's important.

You need to know how long checkdb takes to run. Checkdb is optimized to run for the times where there are no issues. So it runs fast. If there are issues, then it does a deep dive and runs longer.

In general, keep in mind that only the first 200 messages are reported. Run all_msgs to get all data and use no_infomsgs since there are lots of things reported that aren't needed in a DR situation.

Make sure checkdb completes. Story here about a stock firm.

Lots of messages once it's complete, so there are over 100 errors checkdb can report. Many have a variety of states.

The error results aren't too readable, but there are some things you should be aware of.

First determine if checkdb completed itself. Make sure that it has completed. Common errors with the checkdb process itself.

7984-988 - corruption in the critical system tables.

8967 - invalid states within CHECKDB. 

8930 - Metadata corruption, not critical, but Checkdb cannot make sense of the metadata about a table(s)

Many of these are BOL or online documented. 

Demo with pre-corrupted databases. A critical error shown on a system table, 7995, so checkdb cannot run, including repair. This means a restore situation.

Corruption in a system table, checkdb will not run, but you can perhaps run checktable against individual tables to determine which one is corrupt.

If you have corruptions in nonclustered indexes only, then you can run repair_rebuild if it's recommended from the output. Look through the errors and if the errors are all index IDs > 1, then you can manually repair the non-clustered indexes.

The flipside here is that if you must run repair, you must be in single user mode. If the index is large, than you must have 2x space to rebuild as well, which can be a problem. If you repair with checkdb, it might be able to correct some records.

You can also run an online rebuild in 2005 (Enterprise edition), to try and repair. However since this reads the old index, you need to do an offline rebuild.

Unrepairable error

Was one found? A PFS error? (89090, 8938, 8939) - No way to fix these. You can try to extrat data, but the pages where the corruption exists will not extract. Thereforre a backup is needed.

Was it 8970, invalid data in the column. Example, more than 1439 minutes after midnight. If you get this, repair cannot fix this since it does not know what values should be there. You can repair these manually.

8992 - metadata mismatch, checkcatalog error. Cannot be repaired, but depending on what this is, you might be able to hack the system tables.

Demo

DBCC Page is unsupported/undocumented, but it's safe for production use. Only reads data. Does not change anything. It's heavily used by the SQL Server team.

One trick in binding to system tables is that you can do it with the Dedicated Admin Connection. It doesn't work with regular connections.

System tables are cached. If you start in single-user mode, and use the DAC, you can change metadata. Paul had a demo that showed hwo to fix some corruption doing this, but he got an error that the metadata cache wasn't consistent. SQL Server maintains this and if you make the change to system tables, the cache isn't updated. So you need to restart SQL Server.

Look for blog posts on this stuff from Paul in the next few weeks at SQLSkills

Recovery using backups

Not necessarily the best way to deal with corruption.With large databases, it can be quicker to run DBCC to repair things. However most corruptions require ALLOW_DATA_LOSS, which means data gets deleted.

Full backups are a good starting point, necessary for a base.

Log backups give you good recovery to point in time. 

Backups need to be valid, so you should test them. You can restore bad backups with the "Continue after restore" option. This will restore the database, which is corrupt, but you can possibly retrieve data. Even with a corrupt database, you want to still get a backup first in case things get worse. Use CONTINUEAFTERERROR with the backup as well.

Restore or repair?

Do you have a database? If no, you need to restore.

Working backups? If no, repair. Or you can restore a damaged backup.

Log damaged? Restore, if that fails, use emergency mode and extract information.

CheckDB failed? If it's a critical error, restore or extraction.

If you have nonclustered index errors - restore or repair, choose which makes sense.

If you have repairable errors, choose either repair or restore.

If you have a choice, think about SLAs (downtime and data loss). Choose the one that limits the loss.

One thing to always do in a DR situation is ensure you have the tail log backup. This are all the changes since the last log backup. This is a just a current log backup.

Paul showed a single page restore, which is fairly cool.I haven't done that, but it can help with one (or relatively few) page corruptions.

One trick I've seen over and over is that you should ALWAYS restore with NORECOVERY. That's the default, do that, it allows you to continue with restores. When you're sure you're done, do a restore with recovery by itself to brings things online.

REPAIR_ALLOW_DATA_LOSS

Be careful. You're allowing things to be repaired, which often means just deleting data. 

Repair goes the fastest, doing the most provably correct thing. It doesn't try to save data, just get done quickly. Doesn't look at FKs, constraints, replications, or any types of relationships between objects.

CHECKBD repairs aren't replicated, so you need to quiese the system and then re-init the publications.

Damaged Log

If you don't have a backup, use EMERGENCY mode to extract the data. There is an emergency mode repair in 2005 that  you can run. This will rebuild the log (after extracting what it can) and then run checkdb as one operation.

What people do? 

Restart SQL Server - Doesn't really do anything. If there's corruption, it's corrupt and you're wasting time.

Jump to a last resort -  Without determining the error, they often run repair_allow_data_loss.

Detach a suspect database - Doesn't help.

XVI32, freeware hex editor. Paul showed how to corrupt a t-log. Interesting, not a useful skill if you're not a presenter/tester.

One thing about rebuilding the log on the sly with a detach/attach, Service Broker has been given a new GUID, so it's disabled. Watch out for this if you use SOA stuff.

No backup, no database? Worst state. Kimberly says URLT (update resume, leave town).

Summary

Know the signs

Run checkDB, let it complete,

Always take a backup before restore or repair

Try to limit downtime and data loss, but think about it.

 Test this, go through this using demo databases. And using your systems.

 

Rate

You rated this post out of 5. Change rating

Share

Share

Rate

You rated this post out of 5. Change rating