CHECKDB fails, then succeeds after restore to another instance

Question

CHECKDB fails, then succeeds after restore to another instance

Plucky

SSC Veteran

Points: 266
More actions
October 7, 2014 at 3:44 am

#297188

Hi
Occasionally (once every 3 months or so) CHECKDB fails on our busiest database on our VMware management SQL Server instance (this is a VM itself 10.0.5775). No other VMs give problems.
To troubleshoot recovery options, I then restore the database from a backup taken after the CHECKDB failed (using Symantec NetBackup) to another instance (physical 10.0.5828), but when I run CHECKDB on the restored database, it returns no errors whatsoever.
Can anyone perhaps shed any light on why CHECKDB might succeed on the restored copy, after the (corrupt) DB is restored to another instance?
Thank you!
Msg 8909, Level 16, State 1, Line 1
Table error: Object ID 0, index ID -1, partition ID 0, alloc unit ID 0 (type Unknown), page ID (3:89710) contains an incorrect page ID in its page header. The PageId in the page header = (0:0).
Msg 8909, Level 16, State 1, Line 1
Table error: Object ID 0, index ID -1, partition ID 0, alloc unit ID 0 (type Unknown), page ID (3:89711) contains an incorrect page ID in its page header. The PageId in the page header = (0:0).
CHECKDB found 0 allocation errors and 2 consistency errors not associated with any single object.
Msg 8928, Level 16, State 1, Line 1
Object ID 1401772051, index ID 1, partition ID 72057603137601536, alloc unit ID 72057616695885824 (type In-row data): Page (3:89709) could not be processed. See other errors for details.
Msg 8939, Level 16, State 98, Line 1
Table error: Object ID 1401772051, index ID 1, partition ID 72057603137601536, alloc unit ID 72057616695885824 (type In-row data), page (3:89709). Test (IS_OFF (BUF_IOERR, pBUF->bstat)) failed. Values are 29493257 and -4.
Msg 8928, Level 16, State 1, Line 1
Object ID 1401772051, index ID 1, partition ID 72057603137601536, alloc unit ID 72057616695885824 (type In-row data): Page (3:89710) could not be processed. See other errors for details.
Msg 8928, Level 16, State 1, Line 1
Object ID 1401772051, index ID 1, partition ID 72057603137601536, alloc unit ID 72057616695885824 (type In-row data): Page (3:89711) could not be processed. See other errors for details.
CHECKDB found 0 allocation errors and 4 consistency errors in table 'VPX_HIST_STAT2_8' (object ID 1401772051).
CHECKDB found 0 allocation errors and 6 consistency errors in database 'VCDB'.
repair_allow_data_loss is the minimum repair level for the errors found by DBCC CHECKDB (VCDB).
DBCC execution completed. If DBCC printed error messages, contact your system administrator.

Viewing 8 posts - 1 through 8 (of 8 total)

You must be logged in to reply to this topic. Login to reply

kDBAjan Ten Centuries Points: 1262 More actions · Answer 1

The restored database to another instance. Is the instance on different physical hardware than the one it came from? Has fragmentation across the drives been checked? Not exactly an expert but things I'd think about checking.

sql_handle Say Hey Kid Points: 693 More actions · Answer 2

http://sqlserverbuilds.blogspot.com/

The source database instance version is 10.00.5775 or SQL Server 2008 SP3 CU4.

The "second run" database instance version is 10.00.5828 or SQL Server 2008 SP3 CU8.

Definitely want to consider possible role of any CU5, CU6, CU7, or CU8 fixes that effect checkdb behavior.

This fix is in SQL Server 2008 SP3 CU6.

FIX: A database page is copied into a database snapshot even though the page is not updated when you perform a read operation in SQL Server 2008 or in SQL Server 2008 R2

http://support2.microsoft.com/kb/2625768

That's the only documented fix between the two versions that immediately stands out - doesn't exclude the possibility of another fix being involved.

Alan Burstein SSC Guru Points: 61152 More actions · Answer 3

Plucky (10/7/2014)
Hi
Occasionally (once every 3 months or so) CHECKDB fails on our busiest database on our VMware management SQL Server instance (this is a VM itself 10.0.5775). No other VMs give problems.
To troubleshoot recovery options, I then restore the database from a backup taken after the CHECKDB failed (using Symantec NetBackup) to another instance (physical 10.0.5828), but when I run CHECKDB on the restored database, it returns no errors whatsoever.
Can anyone perhaps shed any light on why CHECKDB might succeed on the restored copy, after the (corrupt) DB is restored to another instance?
Thank you!
Msg 8909, Level 16, State 1, Line 1
Table error: Object ID 0, index ID -1, partition ID 0, alloc unit ID 0 (type Unknown), page ID (3:89710) contains an incorrect page ID in its page header. The PageId in the page header = (0:0).
Msg 8909, Level 16, State 1, Line 1
Table error: Object ID 0, index ID -1, partition ID 0, alloc unit ID 0 (type Unknown), page ID (3:89711) contains an incorrect page ID in its page header. The PageId in the page header = (0:0).
CHECKDB found 0 allocation errors and 2 consistency errors not associated with any single object.
Msg 8928, Level 16, State 1, Line 1
Object ID 1401772051, index ID 1, partition ID 72057603137601536, alloc unit ID 72057616695885824 (type In-row data): Page (3:89709) could not be processed. See other errors for details.
Msg 8939, Level 16, State 98, Line 1
Table error: Object ID 1401772051, index ID 1, partition ID 72057603137601536, alloc unit ID 72057616695885824 (type In-row data), page (3:89709). Test (IS_OFF (BUF_IOERR, pBUF->bstat)) failed. Values are 29493257 and -4.
Msg 8928, Level 16, State 1, Line 1
Object ID 1401772051, index ID 1, partition ID 72057603137601536, alloc unit ID 72057616695885824 (type In-row data): Page (3:89710) could not be processed. See other errors for details.
Msg 8928, Level 16, State 1, Line 1
Object ID 1401772051, index ID 1, partition ID 72057603137601536, alloc unit ID 72057616695885824 (type In-row data): Page (3:89711) could not be processed. See other errors for details.
CHECKDB found 0 allocation errors and 4 consistency errors in table 'VPX_HIST_STAT2_8' (object ID 1401772051).
CHECKDB found 0 allocation errors and 6 consistency errors in database 'VCDB'.
repair_allow_data_loss is the minimum repair level for the errors found by DBCC CHECKDB (VCDB).
DBCC execution completed. If DBCC printed error messages, contact your system administrator.

One thing you may want to try is running the DBCC WITH TABLOCK. When running DBCC CHECKDB the default behavior is to make an internal snapshot of the database and run the DBCC against it; this to circumvent concurrency issues. The problem I have experienced in the past was, for reasons I cannot articulate, the snapshot would become slightly corrupted during this process which would result in the DBCC CHECKDB saying the DB had issues when it was really fine. As a general rule, when DBCC CHECKDB says there's a problem I will re-run the DB check using the WITH TABLOCK command just to make sure.

"I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

-- Itzik Ben-Gan 2001

Paul Randal One Orange Chip Points: 29544 More actions · Answer 4

Could be the problem from the fix mentioned above. If not, likely a transient corruption problem from the I/O subsystem where it returns bogus data once in a blue moon but not when you run your backup. Try the fix first. If the problem goes away, voila. Otherwise it's your I/O subsystem.

Paul Randal
CEO, SQLskills.com: Check out SQLskills online training!
Blog:www.SQLskills.com/blogs/paul Twitter: @PaulRandal
SQL MVP, Microsoft RD, Contributing Editor of TechNet Magazine
Author of DBCC CHECKDB/repair (and other Storage Engine) code of SQL Server 2005

Plucky SSC Veteran Points: 266 More actions · Answer 5

Thank you all for the replies.

I will investigate changes that may be relevant in CUs that are different between the instances.

Regarding the TABLOCK recommendation, thanks, but it doesn't apply here unfortunately. The databases was most definitely corrupt. 1x PK had to be rebuild by a repair_with_data_loss, and then a second run of the same was required to fix a few broken pages. Errors were returned when querying the 'broken' table, so not snapshot-related I'm sure.

I would also suspect the IO subsystem with this in mind, but consider that our SAN infrastructure provides storage for over 1000 databases on over 100 instances, but only this one instance has ever had DB corruption in the 4 years I've worked as a DBA here. It's also only this one database...

The last time it happened, a dump file was produced by SQL, and MS said they could find no reason within it for a corruption. We are now barking up the VMware tree.

I'll update this if I find the answer.

Thanks again all

Plucky SSC Veteran Points: 266 More actions · Answer 6

Just to add to this, it appears to be a Symantec NetBackup thing. We experienced the same issue again today, and when I restored the NetBackup backup (corrupted database - to another server with the same version of SQL), DBCC CHECKDB returned no errors.

I then took a copy_only backup and restored it (same server as above) and CHECKDB returned the errors as expected.

Plucky SSC Veteran Points: 266 More actions · Answer 7

Update:

After running a bunch of tests, it seems that the corruption likely occurs *after* the last full backup, but *before* the next diff backup.

I restored the full backup (norecovery) then the diff (recovery), which CHECKDB thinks is perfectly fine. The same occurs when restoring subsequent log backups - no errors reported on the restored database (even when restored well past the first CheckDB on the original database reported issues).

If I restore the full backup only, it reports *no* corruption. If I take another full backup (whether copy_only or not), after the corruption, and restore it, the restored database *does* report the corruption.

It baffles me that if I restore a diff (taken after the corruption was reported), with a preceding full that is probably healthy, checkdb says it's fine!

This brings up a few questions:

1) Is it possible that SQL Server somehow skips the corruption when taking a diff backup?

2) Is it possible that the corruption is somehow not restored in the scenario above?

3) How can one find out when a corruption occurred (we can only practically run CHECKDB every 12 hours), if restoring log backups with standby always returns *ok* when running checkdb against each log restore, until well past the point of reported corruption?

Surely this isn't right - restoring a corrupted database gives it a full bill of health as long as it wasn't corrupt when the FULL backup was taken - but then what is happening?