Blog Post

Quest Webcast: Dos and Don’ts of Database corruption

,

I was guest presenter on last week’s Quest Pain of the Week presentation. The presentation is available for download from Quest. This is a roundup of the questions from the presentation. (I have edited some of the questions for spelling, grammar and readability)

Q: Why do you have to wait for a maintenance window to take a backup? you can backup during the time the database is being used.

You don’t have to wait, and with a 24×7 system you obviously can’t. But full backups do have an impact even though database can be in use at the time. They’re IO intensive. If you’re going to do backups while the system is in use, make sure that the additional IO load doesn’t adversely effect user performance, or look into things like SAN-level snapshot backups.

Q: If you partition your data, can you only backup the latest partition?

Yes, however be mindful of the requirements for restoring from file/filegroup backups. In full or bulk-logged recovery if the older filegroups are not read-only, you’ll need log backups covering the entire span from the oldest of the file/filegroup backups you use to the newest.

If the partitions with older data are read-only, then this is very easy to do.

For more information see http://msdn.microsoft.com/en-us/library/ms177425.aspx

Q: What if you inherit a database that fails the integrity check every day and the backups have the same issue so you don’t know when the integrity check originally failed?

That just means that restore from backup (full, file/filegroup or page) is not an option for fixing. Probably means that there will be no avoiding data loss.

Q: The 825 errors – these are reported in the logs too aren’t they? I can review to see if these are occurring, even if the alert is not set?

Yes they are (and I don’t know why I thought during the presentation that they aren’t, I’ve written a blog post on them myself). http://sqlinthewild.co.za/index.php/2008/12/06/when-is-a-critical-io-error-not-a-critical-io-error/

Msg 825, Level 10, State 2, Line 1.

A read of the file ‘D:\Data\SomeDB.mdf’ at offset 0x00000020e24000 succeeded after failing 1 time(s) with error: incorrect checksum (expected: 0x7532c420; actual: 0x320e4240). Additional messages in the SQL Server error log and system event log may provide more detail. This error condition threatens database integrity and must be corrected. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.

Q: Can you use Resource Governor for a large DB backup to keep the DB’s performance at an acceptable level?

Resource governor only limits CPU and memory, not IO.

Q: If you have data loss, shouldn’t you be doing a restore?

Maybe. It could be (due to time to restore or other constraints) that the fastest and acceptable solution is to repair (CheckTable or CheckFileGroup) and then sync in missing data later.

Q: I just had this happen on an archive database that I can play with. The problem occurred with lob’s. Since allow data loss removes the entire page that the corruption is on how does this work with lob’s that are not stored on the same page as the record?

The entire LOB chain (all LOB pages for the row) will be deallocated and the data/index record which is the parent will be deleted. (Ref: Chapter 11 of SQL Server 2008 Internals by Kalen Delaney et al. Chapter 11 written by Paul Randal)

Q: Is it possible to create a corrupted database in a test lab to get an 825 error?

Error 825 is a hard one to generate. To get that you need for an IO to fail or the retrieved page to be damaged and then on a retry (which occurs immediately) for the IO to succeed and the page to be correct. That’s hard to fake. You’ll need an IO filter driver that is coded to intermittently mess up IOs.

For 824, either use a hex editor to edit pages of a test database, or download and restore Paul Randal’s pre-corrupted databases.

For 823, create a database with a secondary file on a flash drive. Create a large table in the filegroup on the flash drive, checkpoint, DBCC dropcleanbuffers, then pull the flash drive out and query the table. If that doesn’t throw the error (meaning the pages are still in memory), update a row and checkpoint.

It goes without saying that neither of these should ever be done on a server!

Q: In case of db suspect (not recoverable) can we update the system table sysdatabases and change the status to make it online (this solution used to work in SQL 2000)

That is a SQL 2000 solution. In SQL 2005 and above, sysdatabases isn’t a table and the system tables cannot be updated as was possible in SQL 2000.

Fortunately there’s a fully documented and supported alternative – ALTER DATABASE <Database name> SET EMERGENCY. Once in emergency mode you can run CheckDB with repair_allow_data_loss and then bring the database online (assuming the repair succeeds), or extract data if the repair fails.

Q: Is Check DB the only tool available?

To check database integrity, yes.

Q: If you got a foreign key, you can restore a table but will integrity not be broken?

Relational and possibly transactional integrity may be broken. I’d guess that’s one reason table-level restores aren’t available natively in the product.

Q: Checksum on database backup – this is not an option in maintenance plan – true? To enable this, I have to script all database backups?

Correct, it’s not an option on maintenance plans. Personally I’m not a fan of maintenance plans for anything other than trivial environments.

Q: We use some SQL Express, do all these items work in Express? I had one that the hard drive died and I was able to get it back from suspect at which point I did a full backup and restored it elsewhere on a full SQL Server to fix it

The only limitation for CheckDB on Express is no parallelism.

File/filegroup restores can be done on Express, though they are offline restores.

Page restores can be done on Express, though they are offline restores


Thanks to Paul Randal and Robert Davis for assistance with a couple of these answers.

Rate

You rated this post out of 5. Change rating

Share

Share

Rate

You rated this post out of 5. Change rating