Corruption in a page of m_type 17, any options left?

  • I got an email this morning asking for my assistance (I'm not really part of the SQL team and I'm practically still on holiday... but I'm the guy here who knows the most about SQL Server...). Unfortunately I quickly determined that the corruption has already been present for a week...

    The question, given the circumstances below: is there any hope of recovery, or should we restore the database from a backup of a week ago? (Microsoft Dynamics AX 2009: the heart of our finance department...)

    The SQL Errorlog reports:

    SQL Server detected a logical consistency-based I/O error: incorrect checksum (expected: 0x9f00cdb8; actual: 0x45d28a20). It occurred during a read of page (1:3479284) in database ID 6 at offset 0x000006a2de8000 in file 'D:\Sqldata\HAN_BV_PROD...

    DBCC CHECKDB (HAN_BV_PROD) WITH no_infomsgs executed by STAFF\sa_bv_sql_p terminated abnormally due to error state 1. Elapsed time: 0 hours 25 minutes 12 seconds.

    Running DBCC Checkdb exits with:

    Msg 8921, Level 16, State 1, Line 1

    Check terminated. A failure was detected while collecting facts. Possibly tempdb out of space or a system table is inconsistent. Check previous errors.

    The errorlog also shows:

    DBCC encountered a page with an LSN greater than the current end of log LSN (114131:0:1) for its internal database snapshot. Could not read page (40306:-923434249), database 'HAN_BV_PROD' (database ID 10), LSN = (-372778688:1359392204:2), type = 17, isInSparseFile = 1. Please re-run this DBCC command.

    I also found the possible cause:

    SQL Server has encountered 1 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [D:\Sqldata\HAN_BV_PROD\HAN_BV_PROD.mdf:MSSQL_DBCC10] in database [HAN_BV_PROD] (10). The OS file handle is 0x0000000000000378. The offset of the latest long I/O is: 0x0000000e646000

    SQL Server has encountered 3 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [E:\LogData\HAN_BV_PROD\HAN_BV_PROD.ldf] in database [HAN_BV_PROD] (6). The OS file handle is 0x00000000000004DC. The offset of the latest long I/O is: 0x00000108139800

    dbcc page (HAN_BV_PROD,1,3479284,1) and dbcc page (HAN_BV_PROD,1,3479284,3) exit with errors:

    With print option 1:

    Msg 0, Level 11, State 0, Line 0

    A severe error occurred on the current command. The results, if any, should be discarded.

    With print option 3:

    Msg 5180, Level 22, State 1, Line 1

    Could not open File Control Bank (FCB) for invalid file ID 40306 in database 'HAN_BV_PROD'. Verify the file location. Execute DBCC CHECKDB.

    dbcc page (HAN_BV_PROD,1,3479284) reports:

    PAGE: (40306:-923434249)

    BUFFER:

    BUF @0x00000001A4FCDA80

    bpage = 0x00000001A436A000 bhash = 0x0000000000000000 bpageno = (1:3479284)

    bdbid = 6 breferences = 0 bUse1 = 63513

    bstat = 0xc00809 blog = 0x159a2159 bnext = 0x0000000000000000

    PAGE HEADER:

    Page @0x00000001A436A000

    m_pageId = (40306:-923434249) m_headerVersion = 232 m_type = 17

    m_typeFlagBits = 0xe7 m_level = 0 m_flagBits = 0x9e9a

    m_objId (AllocUnitId.idObj) = 7471207 m_indexId (AllocUnitId.idInd) = 0

    Metadata: AllocUnitId = 489633021952 Metadata: PartitionId = 0 Metadata: IndexId = -1

    Metadata: ObjectId = 0 m_prevPage = (59:3342338) m_nextPage = (48:3145778)

    pminlen = 112 m_slotCnt = 107 m_freeCnt = 106

    m_freeData = 31014 m_reservedCnt = 36929 m_lsn = (-372778688:1359392204:2)

    m_xactReserved = 43008 m_xdesId = (0:22020186) m_ghostRecCnt = 4096

    m_tornBits = -1627337288

    Allocation Status

    GAM (1:3067392) = ALLOCATED SGAM (1:3067393) = NOT ALLOCATED

    PFS (1:3477840) = 0x60 MIXED_EXT ALLOCATED 0_PCT_FULL DIFF (1:3067398) = CHANGED

    ML (1:3067399) = NOT MIN_LOGGED

    DBCC execution completed. If DBCC printed error messages, contact your system administrator.

  • Restore from backup. That's not repairable.

    I would hope, that for the heart of your accounting, you have log backups. Assuming that you do, restore the clean full backup WITH NORECOVERY, then restore all log backups since that point up until the latest you have.
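
    In T-SQL that sequence looks roughly like this (the backup file names are placeholders, not from this thread):

    RESTORE DATABASE HAN_BV_PROD
        FROM DISK = N'<full_backup>.bak'
        WITH NORECOVERY, REPLACE;

    -- then each log backup, in order, still WITH NORECOVERY
    RESTORE LOG HAN_BV_PROD
        FROM DISK = N'<log_backup_n>.trn'
        WITH NORECOVERY;

    -- finally bring the database online
    RESTORE DATABASE HAN_BV_PROD WITH RECOVERY;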

    I suggest you also have a look at your IO subsystem. Some of those errors look rather nasty.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • GilaMonster (8/26/2014)


    Restore from backup. That's not repairable.

    I would hope, that for the heart of your accounting, you have log backups. Assuming that you do, restore the clean full backup WITH NORECOVERY, then restore all log backups since that point up until the latest you have.

    I suggest you also have a look at your IO subsystem. Some of those errors look rather nasty.

    Thanks for your clarity. Is a single-page restore an option? After an accounting firm came by, 24 hours of data loss was deemed acceptable, and apparently log backups were switched off 🙁

  • jharting (8/26/2014)


    Thanks for your clarity, is a single page restore an option ?

    Without log backups, no.

    To do a single page restore, there must be log backups covering the time between the full backup used to restore and the time the restore was done. Without those, you won't be able to finish the restore and the page will not be accessible.
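
    For reference, a single-page restore (when the log chain does exist) looks roughly like this; the page ID is the one from this thread, the backup file name is a placeholder:

    RESTORE DATABASE HAN_BV_PROD
        PAGE = '1:3479284'
        FROM DISK = N'<full_backup>.bak'
        WITH NORECOVERY;

    -- restore all later log backups WITH NORECOVERY,
    -- take and restore a final tail-log backup, then:
    RESTORE DATABASE HAN_BV_PROD WITH RECOVERY;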

    Maybe time to push back on those log backups...

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • What you may be able to do is:

    Rename the damaged DB

    Restore a clean backup as the original name (this lets users work)

    Carefully sync over the data missing from the restored DB (because of it being older)

    Since it's accounting data, you have to be soooo careful though.
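
    The first two steps are straightforward; a rough sketch (logical file names and the restored file paths are placeholders, and the damaged database's files stay where they are):

    -- rename the damaged database out of the way
    ALTER DATABASE HAN_BV_PROD MODIFY NAME = HAN_BV_PROD_damaged;

    -- restore the clean backup under the original name, to new files
    RESTORE DATABASE HAN_BV_PROD
        FROM DISK = N'<clean_full_backup>.bak'
        WITH MOVE N'<logical_data_name>' TO N'D:\Sqldata\HAN_BV_PROD\HAN_BV_PROD_restored.mdf',
             MOVE N'<logical_log_name>' TO N'E:\LogData\HAN_BV_PROD\HAN_BV_PROD_restored.ldf',
             RECOVERY;

    -- syncing over the newer data is schema-specific; there is no generic script for it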

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Thanks,

    Since we only make daily full backups, a page restore from a backup of about a week ago would prove difficult; we'll discuss this with the business 🙂

    Thanks again 🙂

  • CheckDB with tablock resulted in:

    Msg 8939, Level 16, State 98, Line 1

    Table error: Object ID 0, index ID -1, partition ID 0, alloc unit ID 281964609732608 (type Unknown), page (40306:-923434249). Test (IS_OFF (BUF_IOERR, pBUF->bstat)) failed. Values are 12584969 and -4.

    CHECKDB found 0 allocation errors and 1 consistency errors not associated with any single object.

    Msg 8978, Level 16, State 1, Line 1

    Table error: Object ID 1994646349, index ID 26, partition ID 72057607648051200, alloc unit ID 72057607670267904 (type In-row data). Page (1:69264) is missing a reference from previous page (1:3479284). Possible chain linkage problem.

    Msg 8928, Level 16, State 2, Line 1

    Object ID 1994646349, index ID 26, partition ID 72057607648051200, alloc unit ID 72057607670267904 (type In-row data): Page (1:3479284) could not be processed. See other errors for details.

    Msg 8976, Level 16, State 1, Line 1

    Table error: Object ID 1994646349, index ID 26, partition ID 72057607648051200, alloc unit ID 72057607670267904 (type In-row data). Page (1:3479284) was not seen in the scan although its parent (1:6747137) and previous (1:6297536) refer to it. Check any previous errors.

    CHECKDB found 0 allocation errors and 3 consistency errors in table 'PROJINVOICEJOUR' (object ID 1994646349).

    CHECKDB found 0 allocation errors and 4 consistency errors in database 'HAN_BV_PROD'.

    repair_allow_data_loss is the minimum repair level for the errors found by DBCC CHECKDB (HAN_BV_PROD).

    I've located the index; a rebuild on it fails. I've dropped and recreated the index and am now rerunning CHECKDB.
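
    For the record, the index can be identified from the object and index IDs in the CHECKDB output with something like this (the index name and key columns come back from the catalog and are shown here as placeholders):

    -- find the nonclustered index flagged by CHECKDB
    SELECT o.name AS table_name, i.name AS index_name, i.type_desc
    FROM sys.indexes AS i
    JOIN sys.objects AS o ON o.object_id = i.object_id
    WHERE i.object_id = 1994646349  -- PROJINVOICEJOUR
      AND i.index_id = 26;

    -- then script it out, drop and recreate it, e.g.:
    -- DROP INDEX <index_name> ON dbo.PROJINVOICEJOUR;
    -- CREATE INDEX <index_name> ON dbo.PROJINVOICEJOUR (<key_columns>);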

  • Interesting....

    Let's hope that works. I would suggest a long, hard look at that IO subsystem as well. Sooner rather than later.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • The SAN guys have already reported some connectivity issues on the Fibre Channel links around that time 🙂

  • GilaMonster (8/27/2014)


    Interesting....

    Let's hope that works. I would suggest a long, hard look at that IO subsystem as well. Sooner rather than later.

    Command(s) completed successfully.

    Now I'll put the DB back into multi-user mode and rerun CHECKDB without the TABLOCK hint.
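
    Roughly:

    ALTER DATABASE HAN_BV_PROD SET MULTI_USER;
    DBCC CHECKDB (HAN_BV_PROD) WITH NO_INFOMSGS;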

  • jharting (8/27/2014)


    GilaMonster (8/27/2014)


    Interesting....

    Let's hope that works. I would suggest a long, hard look at that IO subsystem as well. Sooner rather than later.

    Command(s) completed successfully.

    Now I'll put the DB back into multi-user mode and rerun CHECKDB without the TABLOCK hint.

    Command(s) completed successfully.

    We're going back live now 🙂

  • double 🙂

  • Looks like an I/O error, so your storage system has problems that left one of your pages unreadable. What was the result of putting the DB back into multi-user mode?

    What I would suggest is to clone the whole disk image that holds the database files, then use DBCC CHECKDB with the REPAIR_REBUILD or REPAIR_ALLOW_DATA_LOSS options to correct the errors.
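
    For reference, a repair run has to be done with the database in single-user mode; a rough sketch (and, as suggested above, only against a copy):

    ALTER DATABASE HAN_BV_PROD SET SINGLE_USER WITH ROLLBACK IMMEDIATE;
    DBCC CHECKDB (HAN_BV_PROD, REPAIR_ALLOW_DATA_LOSS);
    ALTER DATABASE HAN_BV_PROD SET MULTI_USER;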

    More info about this error: http://www.datanumen.com/sql-recovery/problems/logical-consistency-based-io-error-incorrect-checksum.htm

  • Thanks to insights gained from http://www.pluralsight.com/training/Courses/TableOfContents/sqlserver-database-corruption and http://www.pluralsight.com/training/Courses/TableOfContents/sqlserver-advanced-corruption-recovery-techniques I was able to deal with the corruption. Although at first it did appear to be an unrepairable error, the corruption turned out to be in a non-clustered index, which I was able to recreate 🙂

    The cause is most likely to be found on our SAN; our SAN guys have reported some issues and, last weekend, rolled back a change, which solved issues in our Exchange 2013 environment (which over the last week had constantly been failing over some servers in the same network segment as the database server involved).

    Now, a few days later, I'm able to get a better picture of what happened, although the specific cause and timing of the root cause remain unclear to me 😉

    I suspect that sometime on August 21st the corruption occurred:

    SQL Server has encountered 1 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file... -> mdf and ndf of tempdb and the affected Db

    The first sign of problems arose on August 22nd at 0:30:

    DBCC CHECKDB (HAN_BV_PROD) WITH no_infomsgs executed by STAFF\sa_bv_sql_p terminated abnormally due to error state 1. Elapsed time: 0 hours 25 minutes 12 seconds.

    DBCC encountered a page with an LSN greater than the current end of log LSN (114120:0:1) for its internal database snapshot. Could not read page (40306:-923434249), database 'HAN_BV_PROD' (database ID 10), LSN = (-372778688:1359392204:2), type = 17, isInSparseFile = 1. Please re-run this DBCC command

    I now know that this is most likely a problem with the snapshot CHECKDB creates...

    During the day, multiple instances of the following error appeared:

    SQL Server detected a logical consistency-based I/O error: incorrect checksum (expected: 0x9f00cdb8; actual: 0x45d28a20). It occurred during a read of page (1:3479284) in database ID 6 at offset 0x000006a2de8000 in file 'D:\Sqldata\HAN_BV_PROD\HAN_BV_PROD.mdf'. Additional messages in the SQL Server error log or system event log may provide more detail. This is a severe error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.

    Error: 824, Severity: 24, State: 2.

    This really should have raised all alarms, which unfortunately did not happen 🙂

    A colleague did actually run CHECKDB twice (separated by a SQL Server restart). The last CHECKDB he ran reported:

    DBCC results for 'HAN_BV_PROD'.

    Msg 8921, Level 16, State 1, Line 1

    Check terminated. A failure was detected while collecting facts. Possibly tempdb out of space or a system table is inconsistent. Check previous errors.

    DBCC results for 'sys.sysrscols'.

    There are 38016 rows in 925 pages for object "sys.sysrscols".

    DBCC results for 'sys.sysrowsets'.

    There are 3918 rows in 379 pages for object "sys.sysrowsets".

    DBCC results for 'sys.sysallocunits'.

    There are 4184 rows in 441 pages for object "sys.sysallocunits".....

    ....

    DBCC results for 'sys.sysfiles1'.

    There are 75701 rows in 3944 pages for object "TSTIMESHEETLINE".

    CHECKDB found 0 allocation errors and 0 consistency errors in database 'HAN_BV_PROD'.

    Unfortunately he read "CHECKDB found 0 allocation errors and 0 consistency errors in database 'HAN_BV_PROD'" and concluded 'problem solved', overlooking the 'Check terminated' message (Msg 8921) above it.

    [Weekend]

    On August 22nd I got an email asking for assistance; with DBCC PAGE I checked the contents of the corrupt page, which turned out to be of m_type 17, and started this topic here. I wasn't able to run CHECKDB properly (due to the snapshot error).
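
    (For reference, DBCC PAGE needs trace flag 3604 to send its output to the client session; the check was roughly:)

    DBCC TRACEON (3604);
    DBCC PAGE ('HAN_BV_PROD', 1, 3479284, 3);  -- print option 3: page header plus row details
    DBCC TRACEOFF (3604);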

    On August 23rd I could run CHECKDB with the TABLOCK hint, with the database in exclusive mode:

    Msg 8939, Level 16, State 98, Line 1

    Table error: Object ID 0, index ID -1, partition ID 0, alloc unit ID 281964609732608 (type Unknown), page (40306:-923434249). Test (IS_OFF (BUF_IOERR, pBUF->bstat)) failed. Values are 12584969 and -4.

    CHECKDB found 0 allocation errors and 1 consistency errors not associated with any single object.

    Msg 8978, Level 16, State 1, Line 1

    Table error: Object ID 1994646349, index ID 26, partition ID 72057607648051200, alloc unit ID 72057607670267904 (type In-row data). Page (1:69264) is missing a reference from previous page (1:3479284). Possible chain linkage problem.

    Msg 8928, Level 16, State 2, Line 1

    Object ID 1994646349, index ID 26, partition ID 72057607648051200, alloc unit ID 72057607670267904 (type In-row data): Page (1:3479284) could not be processed. See other errors for details.

    Msg 8976, Level 16, State 1, Line 1

    Table error: Object ID 1994646349, index ID 26, partition ID 72057607648051200, alloc unit ID 72057607670267904 (type In-row data). Page (1:3479284) was not seen in the scan although its parent (1:6747137) and previous (1:6297536) refer to it. Check any previous errors.

    CHECKDB found 0 allocation errors and 3 consistency errors in table 'PROJINVOICEJOUR' (object ID 1994646349).

    CHECKDB found 0 allocation errors and 4 consistency errors in database 'HAN_BV_PROD'.

    repair_allow_data_loss is the minimum repair level for the errors found by DBCC CHECKDB (HAN_BV_PROD).

    I located the index with ID 26 on table PROJINVOICEJOUR and recreated it. CHECKDB with TABLOCK then reported 0 errors; after that I put the DB back into multi-user mode and ran CHECKDB without the TABLOCK hint again (because of the earlier snapshot error).

    So I'm glad I'd seen Paul Randal's videos on the subject and did not resort to repair_allow_data_loss or other options 😉
