Thanks to insights gained from http://www.pluralsight.com/training/Courses/TableOfContents/sqlserver-database-corruption
I was able to deal with the corruption. Although at first it díd apperar te be a unrepairable error the corruption turned out to be in a Non-Clustered index which I was able to recreate
The cause is most likeley to be found on our SAN; our SAN guy's have reported some issues and, last weekend, rolled back a change which solved issues on our Exchange 2013 environment (which over the last week constantly failed over some servers in the same network segment as the databaseserver involved)
Now, a few day's later I'm able to get a better picture of what happened, althoug the specific cause and time of the root-cause remain unclear to me
I suspect that sometime on august 21th the corruption occured:
SQL Server has encountered 1 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file... -> mdf and ndf of tempdb and the affected Db
The first sign of problems arose august 22th at 0:30
DBCC CHECKDB (HAN_BV_PROD) WITH no_infomsgs executed by STAFF\sa_bv_sql_p terminated abnormally due to error state 1. Elapsed time: 0 hours 25 minutes 12 seconds.
DBCC encountered a page with an LSN greater than the current end of log LSN (114120:0:1) for its internal database snapshot. Could not read page (40306:-923434249), database 'HAN_BV_PROD' (database ID 10), LSN = (-372778688:1359392204:2), type = 17, isInSparseFile = 1. Please re-run this DBCC command
I now know that this is most likely to be a problem with the snapshot checkdb creates...
During the day multiple instances of
SQL Server detected a logical consistency-based I/O error: incorrect checksum (expected: 0x9f00cdb8; actual: 0x45d28a20). It occurred during a read of page (1:3479284) in database ID 6 at offset 0x000006a2de8000 in file 'D:\Sqldata\HAN_BV_PROD\HAN_BV_PROD.mdf'. Additional messages in the SQL Server error log or system event log may provide more detail. This is a severe error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online. rai
Error: 824, Severity: 24, State: 2.
This really should have raised all alarms, which unfortunately did not happen
A collegue did actually run checkdb twice (seperated by a SQL Server recycle). The last checkdb he did reported:
DBCC results for 'HAN_BV_PROD'.
Msg 8921, Level 16, State 1, Line 1
Check terminated. A failure was detected while collecting facts. Possibly tempdb out of space or a system table is inconsistent. Check previous errors.
DBCC results for 'sys.sysrscols'.
There are 38016 rows in 925 pages for object "sys.sysrscols".
DBCC results for 'sys.sysrowsets'.
There are 3918 rows in 379 pages for object "sys.sysrowsets".
DBCC results for 'sys.sysallocunits'.
There are 4184 rows in 441 pages for object "sys.sysallocunits".....
DBCC results for 'sys.sysfiles1'.
There are 75701 rows in 3944 pages for object "TSTIMESHEETLINE".
CHECKDB found 0 allocation errors and 0 consistency errors in database 'HAN_BV_PROD'.
Unfortunately he read "CHECKDB found 0 allocation errors and 0 consistency errors in database 'HAN_BV_PROD'" and concluded 'problem solved'
August 22nd I got an email asking for assistance; with dbcc page I checked the content of the corrupt page which turned out to be of m_type 17 and started this topic here. And I wasn't able to decently run checkdb (due to the snapshot error)
August 23rd I could run checkdb with the tablock hint with the Database in exclusive mode:
Msg 8939, Level 16, State 98, Line 1
Table error: Object ID 0, index ID -1, partition ID 0, alloc unit ID 281964609732608 (type Unknown), page (40306:-923434249). Test (IS_OFF (BUF_IOERR, pBUF->bstat)) failed. Values are 12584969 and -4.
CHECKDB found 0 allocation errors and 1 consistency errors not associated with any single object.
Msg 8978, Level 16, State 1, Line 1
Table error: Object ID 1994646349, index ID 26, partition ID 72057607648051200, alloc unit ID 72057607670267904 (type In-row data). Page (1:69264) is missing a reference from previous page (1:3479284). Possible chain linkage problem.
Msg 8928, Level 16, State 2, Line 1
Object ID 1994646349, index ID 26, partition ID 72057607648051200, alloc unit ID 72057607670267904 (type In-row data): Page (1:3479284) could not be processed. See other errors for details.
Msg 8976, Level 16, State 1, Line 1
Table error: Object ID 1994646349, index ID 26, partition ID 72057607648051200, alloc unit ID 72057607670267904 (type In-row data). Page (1:3479284) was not seen in the scan although its parent (1:6747137) and previous (1:6297536) refer to it. Check any previous errors.
CHECKDB found 0 allocation errors and 3 consistency errors in table 'PROJINVOICEJOUR' (object ID 1994646349).
CHECKDB found 0 allocation errors and 4 consistency errors in database 'HAN_BV_PROD'.
repair_allow_data_loss is the minimum repair level for the errors found by DBCC CHECKDB (HAN_BV_PROD).
I've located the index with id 26 in table PROJINVOICEJOUR and recreated it. Checkdb with tablock resulted in 0 errors, after that i put the Db back in multi_user mode and ran checkdb without the tablockhint again (for the snapshot error)
So I'm glad I've seen Paul Randall's video's on the subject and did not resort to repair_allow_dataloss or other options