
Errors from CHECKDB
Posted Saturday, August 31, 2013 10:04 AM


Hall of Fame


Group: General Forum Members
Last Login: 2 days ago @ 10:38 PM
Points: 3,931, Visits: 7,160
Last night I got paged with the 2 errors below:

First
SQL Server Alert System: 'Severity 017 - Insufficient Resources' occurred on ServerA
During undoing of a logged operation in database 'X', an error occurred at log record ID (561239:5810208:1). Typically, the specific failure is logged previously as an error in the Windows Event Log service. Restore the database or file from a backup, or repair the database.

Second
SQL Server Alert System: 'Error Number 823 - IO/Hardware/System Issue detected' occurred on ServerA
DESCRIPTION: The operating system returned error 1450(Insufficient system resources exist to complete the requested service.) to SQL Server during a write at offset 0x0000dc1e14c000 in file 'H:\X.mdf:MSSQL_DBCC25'. Additional messages in the SQL Server error log and system event log may provide more detail. This is a severe system-level error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.

CHECK THE SERVER ASAP - http://support.microsoft.com/kb/2015755 - IO Hardware/System Issue detected - The DBA's have been notified about the cause of this error and can assure you that the public flogging of the developer responsible for this error will be severe.

When I got home and started to do some digging, I saw that our subscriber was millions of replication commands behind for a specific article. I didn't think much of this, because about 3,000,000 rows had been deleted from this table at the publisher earlier in the day (so falling behind for a few hours is expected).
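(As an aside, the size of that backlog can be measured on the distributor with sp_replmonitorsubscriptionpendingcmds; the publication and subscriber names below are placeholders, not the real ones:)

```sql
-- Run on the distributor. Publication/subscriber names are placeholders.
-- Returns the count of commands not yet delivered to the subscriber and
-- an estimated time to apply them.
EXEC distribution.dbo.sp_replmonitorsubscriptionpendingcmds
    @publisher         = N'ServerA',
    @publisher_db      = N'X',
    @publication       = N'MyPublication',    -- placeholder
    @subscriber        = N'SubscriberServer', -- placeholder
    @subscriber_db     = N'X',
    @subscription_type = 0;                   -- 0 = push, 1 = pull
```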

Around 4:30 PM our automated DBCC CHECKDB process kicked off on the subscriber databases - in particular the one for database X - which ended up kicking out the following errors:
The operating system returned error 1450(Insufficient system resources exist to complete the requested service.) to SQL Server during a write at offset 0x0000dc1e14c000 in file 'H:\COREISSUE.mdf:MSSQL_DBCC25'. Additional messages in the SQL Server error log and system event log may provide more detail. This is a severe system-level error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online. [SQLSTATE 01000]
Mail queued. [SQLSTATE 01000]

This appears to have caused an issue with replication on the subscriber. I now noticed that the commands from the distributor were in a ROLLBACK TRANSACTION state with wait types of 1) WRITE_COMPLETION and 2) FCB_REPLICA_WRITE. Assuming we had some bad data pages at the subscriber, in the end I dropped the article from replication altogether, then added it back in, took a new snapshot, and applied it to the subscriber.

Time for another CHECKDB - then I got this:
Msg 2533, Level 16, State 1, Line 1
Table error: page (1:21996873) allocated to object ID 1813659046, index ID 1, partition ID 72057600346619904, alloc unit ID 72057600433913856 (type In-row data) was not seen. The page may be invalid or may have an incorrect alloc unit ID in its header.
Msg 8976, Level 16, State 1, Line 1
Table error: Object ID 1813659046, index ID 1, partition ID 72057600346619904, alloc unit ID 72057600433913856 (type In-row data). Page (1:21996873) was not seen in the scan although its parent (1:21996917) and previous (1:21996872) refer to it. Check any previous errors.
Msg 2533, Level 16, State 1, Line 1
Table error: page (1:21996874) allocated to object ID 1813659046, index ID 1, partition ID 72057600346619904, alloc unit ID 72057600433913856 (type In-row data) was not seen. The page may be invalid or may have an incorrect alloc unit ID in its header.
Msg 8980, Level 16, State 1, Line 1
Table error: Object ID 1813659046, index ID 1, partition ID 72057600346619904, alloc unit ID 72057600433913856 (type In-row data). Index node page (1:21996917), slot 166 refers to child page (1:21996874) and previous child (1:21996873), but they were not encountered.
Msg 2533, Level 16, State 1, Line 1
Table error: page (1:21996875) allocated to object ID 1813659046, index ID 1, partition ID 72057600346619904, alloc unit ID 72057600433913856 (type In-row data) was not seen. The page may be invalid or may have an incorrect alloc unit ID in its header.
Msg 8980, Level 16, State 1, Line 1
Table error: Object ID 1813659046, index ID 1, partition ID 72057600346619904, alloc unit ID 72057600433913856 (type In-row data). Index node page (1:21996917), slot 167 refers to child page (1:21996875) and previous child (1:21996874), but they were not encountered.
Msg 2533, Level 16, State 1, Line 1
Table error: page (1:21996876) allocated to object ID 1813659046, index ID 1, partition ID 72057600346619904, alloc unit ID 72057600433913856 (type In-row data) was not seen. The page may be invalid or may have an incorrect alloc unit ID in its header.
Msg 8980, Level 16, State 1, Line 1
Table error: Object ID 1813659046, index ID 1, partition ID 72057600346619904, alloc unit ID 72057600433913856 (type In-row data). Index node page (1:21996917), slot 168 refers to child page (1:21996876) and previous child (1:21996875), but they were not encountered.
Msg 2533, Level 16, State 1, Line 1
Table error: page (1:21996877) allocated to object ID 1813659046, index ID 1, partition ID 72057600346619904, alloc unit ID 72057600433913856 (type In-row data) was not seen. The page may be invalid or may have an incorrect alloc unit ID in its header.
Msg 8980, Level 16, State 1, Line 1
Table error: Object ID 1813659046, index ID 1, partition ID 72057600346619904, alloc unit ID 72057600433913856 (type In-row data). Index node page (1:21996917), slot 169 refers to child page (1:21996877) and previous child (1:21996876), but they were not encountered.
Msg 2533, Level 16, State 1, Line 1
Table error: page (1:21996878) allocated to object ID 1813659046, index ID 1, partition ID 72057600346619904, alloc unit ID 72057600433913856 (type In-row data) was not seen. The page may be invalid or may have an incorrect alloc unit ID in its header.
Msg 8980, Level 16, State 1, Line 1
Table error: Object ID 1813659046, index ID 1, partition ID 72057600346619904, alloc unit ID 72057600433913856 (type In-row data). Index node page (1:21996917), slot 170 refers to child page (1:21996878) and previous child (1:21996877), but they were not encountered.
Msg 2533, Level 16, State 1, Line 1
Table error: page (1:21996879) allocated to object ID 1813659046, index ID 1, partition ID 72057600346619904, alloc unit ID 72057600433913856 (type In-row data) was not seen. The page may be invalid or may have an incorrect alloc unit ID in its header.
Msg 8980, Level 16, State 1, Line 1
Table error: Object ID 1813659046, index ID 1, partition ID 72057600346619904, alloc unit ID 72057600433913856 (type In-row data). Index node page (1:21996917), slot 171 refers to child page (1:21996879) and previous child (1:21996878), but they were not encountered.
Msg 8978, Level 16, State 1, Line 1
Table error: Object ID 1813659046, index ID 1, partition ID 72057600346619904, alloc unit ID 72057600433913856 (type In-row data). Page (1:21996880) is missing a reference from previous page (1:21996879). Possible chain linkage problem.
Msg 2533, Level 16, State 1, Line 1
Table error: page (1:21996956) allocated to object ID 1813659046, index ID 1, partition ID 72057600346619904, alloc unit ID 72057600433913856 (type In-row data) was not seen. The page may be invalid or may have an incorrect alloc unit ID in its header.
Msg 8976, Level 16, State 1, Line 1
Table error: Object ID 1813659046, index ID 1, partition ID 72057600346619904, alloc unit ID 72057600433913856 (type In-row data). Page (1:21996956) was not seen in the scan although its parent (1:21997094) and previous (1:21996955) refer to it. Check any previous errors.
Msg 2533, Level 16, State 1, Line 1
Table error: page (1:21996957) allocated to object ID 1813659046, index ID 1, partition ID 72057600346619904, alloc unit ID 72057600433913856 (type In-row data) was not seen. The page may be invalid or may have an incorrect alloc unit ID in its header.
Msg 8980, Level 16, State 1, Line 1
Table error: Object ID 1813659046, index ID 1, partition ID 72057600346619904, alloc unit ID 72057600433913856 (type In-row data). Index node page (1:21997094), slot 72 refers to child page (1:21996957) and previous child (1:21996956), but they were not encountered.
Msg 2533, Level 16, State 1, Line 1
Table error: page (1:21996958) allocated to object ID 1813659046, index ID 1, partition ID 72057600346619904, alloc unit ID 72057600433913856 (type In-row data) was not seen. The page may be invalid or may have an incorrect alloc unit ID in its header.
Msg 8980, Level 16, State 1, Line 1
Table error: Object ID 1813659046, index ID 1, partition ID 72057600346619904, alloc unit ID 72057600433913856 (type In-row data). Index node page (1:21997094), slot 73 refers to child page (1:21996958) and previous child (1:21996957), but they were not encountered.
Msg 2533, Level 16, State 1, Line 1
Table error: page (1:21996959) allocated to object ID 1813659046, index ID 1, partition ID 72057600346619904, alloc unit ID 72057600433913856 (type In-row data) was not seen. The page may be invalid or may have an incorrect alloc unit ID in its header.
Msg 8980, Level 16, State 1, Line 1
Table error: Object ID 1813659046, index ID 1, partition ID 72057600346619904, alloc unit ID 72057600433913856 (type In-row data). Index node page (1:21997094), slot 74 refers to child page (1:21996959) and previous child (1:21996958), but they were not encountered.
Msg 8978, Level 16, State 1, Line 1
Table error: Object ID 1813659046, index ID 1, partition ID 72057600346619904, alloc unit ID 72057600433913856 (type In-row data). Page (1:21996960) is missing a reference from previous page (1:21996959). Possible chain linkage problem.
CHECKDB found 0 allocation errors and 24 consistency errors in table 'StatementHeader' (object ID 1813659046).
CHECKDB found 0 allocation errors and 24 consistency errors in database 'COREISSUE'.
repair_allow_data_loss is the minimum repair level for the errors found by DBCC CHECKDB (COREISSUE).
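(Side note: CHECKDB already names the table here, but in general the IDs in these messages can be mapped back to the affected table and index through the metadata views; index ID 1 means the corruption is in the clustered index, i.e. the table data itself:)

```sql
-- Run in the flagged database (COREISSUE here). Maps the object/index IDs
-- from the CHECKDB output back to names.
SELECT OBJECT_NAME(p.object_id) AS table_name,
       i.name                   AS index_name,
       p.partition_id,
       au.allocation_unit_id
FROM sys.partitions p
JOIN sys.indexes i
  ON i.object_id = p.object_id AND i.index_id = p.index_id
JOIN sys.allocation_units au
  ON au.container_id = p.hobt_id          -- in-row data units map via hobt_id
WHERE p.object_id = 1813659046            -- object ID from the errors above
  AND p.index_id  = 1;
```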
This made me think that the issue was actually at the publisher (our PRODUCTION system), so I ran a CHECKTABLE on the publisher...however, that came back without errors (thank God):
Command(s) completed successfully.

I've never encountered this particular CHECKDB error before, and since I dropped the article and re-snapped it, I was under the impression that all of the data in the subscriber's 'X' table was dropped and re-populated from the new snapshot - so why would I get this error again?

Any ideas?


______________________________________________________________________________
"Never argue with an idiot; They'll drag you down to their level and beat you with experience"
Post #1490400
Posted Saturday, August 31, 2013 10:31 AM


SSC-Forever


Group: General Forum Members
Last Login: Yesterday @ 3:43 PM
Points: 43,047, Visits: 36,206
You've got IO subsystem problems on ServerA. Do some diagnostics, check storage logs and consider moving the DB to alternate storage.

Corruption can't be replicated (or mirrored), so bad pages on subscriber cannot be caused by problems at the publisher.
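(Two in-server places to corroborate IO-level damage - a sketch; note xp_readerrorlog is undocumented but widely used:)

```sql
-- Pages that failed with 823/824/checksum errors are recorded here by SQL Server.
SELECT database_id, file_id, page_id, event_type, error_count, last_update_date
FROM msdb.dbo.suspect_pages
ORDER BY last_update_date DESC;

-- Search the current SQL Server error log for OS error 1450 entries.
EXEC sys.xp_readerrorlog 0, 1, N'1450';
```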



Gail Shaw
Microsoft Certified Master: SQL Server 2008, MVP
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

We walk in the dark places no others will enter
We stand on the bridge and no one may pass

Post #1490403
Posted Saturday, August 31, 2013 10:37 AM


Thanks Gail, I will coordinate with our Tech Services team and see what they can find at the hardware level; perhaps we've had a drive go bad in one of our RAID arrays.

Post #1490405
Posted Saturday, August 31, 2013 12:49 PM


Found out there is a drive failing in one of the RAID arrays, so IT is looking into rebuilding it/adding a new one in.

When that's resolved, my plan is to drop the article from replication, delete the table from the subscriber, add the article back to the publisher, and re-snap/re-apply to the subscriber DB. When that's complete, run another DBCC CHECKTABLE - that should resolve it?

If not, I will follow similar steps as above, but will take the DB offline and move it to an alternate storage location before applying anything to the subscriber database.

Sound OK, or am I missing something important?


Post #1490418
Posted Saturday, August 31, 2013 1:07 PM


Run a full CHECKDB; if there are any errors, I'd recommend dropping the database and reinitialising the subscription.



Post #1490419
Posted Saturday, August 31, 2013 1:23 PM


Will do, and thanks for the quick responses.

Heard back from IT: a new drive should be installed within the next 2-4 hours, then another few hours to get the data striped back onto it.

Once that's complete I'll start my process:
1. Drop the article from replication
2. Delete the table from the subscriber
3. Add the article back to the publisher and re-snap/re-apply to the subscriber DB
4. Run DBCC CHECKDB('X') WITH NO_INFOMSGS, ALL_ERRORMSGS

If it's not clean, I'll drop the subscriber database altogether, recreate the subscription, then reinitialize it.
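(If it does come to reinitializing, the marking step can be done from the publisher; publication and subscriber names here are placeholders:)

```sql
-- Run at the publisher, in the published database. Placeholder names.
-- Marks the subscription for reinitialization; the next Snapshot Agent run
-- creates a fresh snapshot and the Distribution Agent reapplies it.
EXEC sp_reinitsubscription
    @publication = N'MyPublication',     -- placeholder
    @article     = N'all',
    @subscriber  = N'SubscriberServer';  -- placeholder
```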


Post #1490422
Posted Saturday, August 31, 2013 2:15 PM


A single drive failure in a RAID 1, 5, or 10 should not have caused this. Those RAID levels have redundancy, so you can lose an entire drive and not have any problems (other than perhaps speed).

If losing a single drive in a RAID 1, 5 or 10 causes data loss or IO errors, then there's a bigger problem with that RAID array than a single failed drive. I think you might want to investigate further.




Post #1490424
Posted Saturday, August 31, 2013 2:40 PM


I have also found that the battery on the RAID controller is having issues, but I wouldn't think that would cause this?

Post #1490426
Posted Saturday, August 31, 2013 2:49 PM


Combined with a hard shutdown it can cause problems (pages SQL assumes written to disk never got there).
If the RAID array's battery is faulty, you must disable the write cache until it can be fixed.

Seriously, get someone to take a top-to-bottom, end-to-end look at that SAN; it really doesn't sound healthy.




Post #1490427
Posted Saturday, August 31, 2013 9:58 PM


Issues resolved. Gail, I sincerely appreciate the guidance.

Steps taken (I know I got lucky):
1. IT added a new drive to the array and rebuilt it
2. Dropped the article from the publisher
3. Deleted the table from the subscriber
4. Added the article back to the publisher and re-snapped/re-applied to the subscriber DB
5. Ran DBCC CHECKTABLE('X') WITH NO_INFOMSGS, ALL_ERRORMSGS first - successful
6. Then ran DBCC CHECKDB('X') WITH NO_INFOMSGS, ALL_ERRORMSGS - successful

I was ready to move the entire DB to a different storage location and, worst case, re-initialize...but got lucky...

P.S. Write cache was already disabled (I'm guessing it should normally be enabled once a working battery is in place?) - the replacement battery should be installed in the next few days as well


Post #1490436