Database Corruption Problem with CheckDB results - How bad is it?

  • I'm quickly reading to determine the extent of the problem, but if I could draw on the experiences of the masses it would be greatly appreciated. We do have full backups and transactions logs available for restore. We are currently restoring our database.

    I've placed the database in Emergency Mode so others have access to data if neeed. So what's the prognosis?

    Msg 8909, Level 16, State 1, Line 1

    Table error: Object ID 0, index ID -1, partition ID 0, alloc unit ID 0 (type Unknown), page ID (1:37026864) contains an incorrect page ID in its page header. The PageId in the page header = (0:0).

    Msg 8998, Level 16, State 2, Line 1

    Page errors on the GAM, SGAM, or PFS pages prevent allocation integrity checks in database ID 11 pages from (1:37026864) to (1:37034951). See other errors for cause.

    Msg 8966, Level 16, State 2, Line 1

    Unable to read and latch page (1:37034952) with latch type SH. 38(Reached the end of the file.) failed.

    Msg 8998, Level 16, State 2, Line 1

    Page errors on the GAM, SGAM, or PFS pages prevent allocation integrity checks in database ID 11 pages from (1:37034952) to (1:37043039). See other errors for cause.

    Msg 8966, Level 16, State 2, Line 1

    Unable to read and latch page (1:37043040) with latch type SH. 38(Reached the end of the file.) failed.

    Msg 8998, Level 16, State 2, Line 1

    Page errors on the GAM, SGAM, or PFS pages prevent allocation integrity checks in database ID 11 pages from (1:37043040) to (1:37051127). See other errors for cause.

    CHECKDB found 6 allocation errors and 0 consistency errors not associated with any single object.

  • Fatal.

    You have damage to a couple of your allocation pages (specifically PFS pages). This kind of corruption is irreparable.

    Take a tail-log backup (if you can), then restore your full backup, all log backups followed by the tail-log backup that you took. You should be able to restore without data loss since you have a good set of backups.

    You need to do some root cause analysis. Look in the windows event log, SAN logs, RAID logs, etc. This is typically an IO subsystem problem, you need to identify the cause or it may happen again.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • If you want some reading on corruption, you could start with this. http://www.sqlservercentral.com/articles/Corruption/65804/

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Thanks Gila. I was hoping to do a page restore, but I guess we're not looking good. I did read on BOL that we wouldn't be able to restore pages that dealt with PFS; I was just hoping I could get by.

    This is the blog I was looking at for restoring corrupt pages.

    http://blogs.msdn.com/sqlserverstorageengine/archive/2007/01/18/fixing-damaged-pages-using-page-restore-or-manual-inserts.aspx

    Performing Page Restores

    http://msdn.microsoft.com/en-us/library/ms175168.aspx

  • A page restore will not work here. Besides, it looks like there's more than just three pages damaged. The comment about not being able to latch a page with reason 'end of file' seems to indicate that SQL and windows are in disagreement over the size of the file.

    Take the tail log backup and then restore from scratch over the damaged database.

    p.s. Why did you post this in the SQL 2000 forum? It's either a 2005 or 2008 database.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Whoops, I suppose I wasn't reading the best in the heat of moment. You are correct, this is an SQL Server 2005 instance.

  • For the page latch message - when the final PFS page at the end of the file is damaged (which could be the case here), I couldn't tell how many pages from the final extent have been allocated or not - hence CHECKDB tries to read the pages and may fail some of them.

    You need to do a full restore here.

    A page being overwritten with zeroes (as is the case here) is the I/O subsystem - nearly 100% of the times I've seen.

    Thanks

    Paul Randal
    CEO, SQLskills.com: Check out SQLskills online training!
    Blog:www.SQLskills.com/blogs/paul Twitter: @PaulRandal
    SQL MVP, Microsoft RD, Contributing Editor of TechNet Magazine
    Author of DBCC CHECKDB/repair (and other Storage Engine) code of SQL Server 2005

  • Paul Randal (2/25/2010)


    For the page latch message - when the final PFS page at the end of the file is damaged (which could be the case here), I couldn't tell how many pages from the final extent have been allocated or not - hence CHECKDB tries to read the pages and may fail some of them.

    Ah, thanks for the explanation.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • No problem. I really need to get around to writing a really detailed, comprehensive guide to corruption - nothing out there right now.

    Paul Randal
    CEO, SQLskills.com: Check out SQLskills online training!
    Blog:www.SQLskills.com/blogs/paul Twitter: @PaulRandal
    SQL MVP, Microsoft RD, Contributing Editor of TechNet Magazine
    Author of DBCC CHECKDB/repair (and other Storage Engine) code of SQL Server 2005

  • Paul Randal (2/25/2010)


    I really need to get around to writing a really detailed, comprehensive guide to corruption - nothing out there right now.

    I wait with anticipation... 😀

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • So, in addition to doing a restore, you need to check your hardware. This could happen again.

  • Agreed, we've looked over our SAN logs, but haven't found the "smoking gun" yet. We're in the process of looking over our fiber channel logs. Just an FYI, this corruption occured during our nightly reindexing job, that's been running for about a year now.

    Thanks, again everybody your help.

  • Kindly let us know if you could find the exact cause of the issue and how could you solve it. It will be helpfull to others including myself.

  • Unfortunately, I don't have an answer yet for the cause. We've restored the backup and run the reindexing job on another server (same SAN and disks though), but it didn't reproduce the error (which is a good thing). When we have an answer I'll be sure to post it.

  • Could be just about anything anything in the IO stack

    Misbehaving filter driver (eg anti-virus)

    Buggy HBA drivers

    Faulty fibre switch

    Problematic SAN controller

    Glitch in SAN's write cache

    Failing drives

    etc

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass

Viewing 15 posts - 1 through 14 (of 14 total)

You must be logged in to reply to this topic. Login to reply