
Errors from CHECKDB
Posted Saturday, August 31, 2013 10:04 AM


Hall of Fame


Group: General Forum Members
Last Login: 2 days ago @ 10:38 PM
Points: 3,931, Visits: 7,160
Last night I got paged with the 2 errors below:

First
SQL Server Alert System: 'Severity 017 - Insufficient Resources' occurred on ServerA
During undoing of a logged operation in database 'X', an error occurred at log record ID (561239:5810208:1). Typically, the specific failure is logged previously as an error in the Windows Event Log service. Restore the database or file from a backup, or repair the database.

Second
SQL Server Alert System: 'Error Number 823 - IO/Hardware/System Issue detected' occurred on ServerA
DESCRIPTION: The operating system returned error 1450(Insufficient system resources exist to complete the requested service.) to SQL Server during a write at offset 0x0000dc1e14c000 in file 'H:\X.mdf:MSSQL_DBCC25'. Additional messages in the SQL Server error log and system event log may provide more detail. This is a severe system-level error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.

CHECK THE SERVER ASAP - http://support.microsoft.com/kb/2015755 - IO Hardware/System Issue detected - The DBA's have been notified about the cause of this error and can assure you that the public flogging of the developer responsible for this error will be severe.

When I got home and started to do some digging, I saw that our subscriber was millions of replication commands behind for a specific article. I didn't think much of this, because about 3,000,000 rows had been deleted from this table at the publisher earlier in the day (so falling behind for a few hours is expected).
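(As an aside, the size of that backlog can be measured on the distributor with sp_replmonitorsubscriptionpendingcmds; the publication and subscriber names below are placeholders, not the real ones:)

```sql
-- Run on the distributor. Publication/subscriber names are placeholders.
-- Returns the count of commands not yet delivered to the subscriber and
-- an estimated time to apply them.
EXEC distribution.dbo.sp_replmonitorsubscriptionpendingcmds
    @publisher         = N'ServerA',
    @publisher_db      = N'X',
    @publication       = N'MyPublication',    -- placeholder
    @subscriber        = N'SubscriberServer', -- placeholder
    @subscriber_db     = N'X',
    @subscription_type = 0;                   -- 0 = push, 1 = pull
```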

Around 4:30 PM our automated DBCC CHECKDB process kicked off on the subscriber databases - in particular the one for database X - which ended up kicking out the following errors:
The operating system returned error 1450(Insufficient system resources exist to complete the requested service.) to SQL Server during a write at offset 0x0000dc1e14c000 in file 'H:\COREISSUE.mdf:MSSQL_DBCC25'. Additional messages in the SQL Server error log and system event log may provide more detail. This is a severe system-level error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online. [SQLSTATE 01000]
Mail queued. [SQLSTATE 01000]

This appears to have caused an issue with replication on the subscriber. I now noticed that the commands from the distributor were in a ROLLBACK TRANSACTION state with wait types of 1) WRITE_COMPLETION and 2) FCB_REPLICA_WRITE. Assuming we had some bad data pages at the subscriber, in the end I dropped the article from replication altogether, then added it back in, took a new snapshot, and applied it to the subscriber.

Time for another CHECKDB - then I got this:
Msg 2533, Level 16, State 1, Line 1
Table error: page (1:21996873) allocated to object ID 1813659046, index ID 1, partition ID 72057600346619904, alloc unit ID 72057600433913856 (type In-row data) was not seen. The page may be invalid or may have an incorrect alloc unit ID in its header.
Msg 8976, Level 16, State 1, Line 1
Table error: Object ID 1813659046, index ID 1, partition ID 72057600346619904, alloc unit ID 72057600433913856 (type In-row data). Page (1:21996873) was not seen in the scan although its parent (1:21996917) and previous (1:21996872) refer to it. Check any previous errors.
Msg 2533, Level 16, State 1, Line 1
Table error: page (1:21996874) allocated to object ID 1813659046, index ID 1, partition ID 72057600346619904, alloc unit ID 72057600433913856 (type In-row data) was not seen. The page may be invalid or may have an incorrect alloc unit ID in its header.
Msg 8980, Level 16, State 1, Line 1
Table error: Object ID 1813659046, index ID 1, partition ID 72057600346619904, alloc unit ID 72057600433913856 (type In-row data). Index node page (1:21996917), slot 166 refers to child page (1:21996874) and previous child (1:21996873), but they were not encountered.
Msg 2533, Level 16, State 1, Line 1
Table error: page (1:21996875) allocated to object ID 1813659046, index ID 1, partition ID 72057600346619904, alloc unit ID 72057600433913856 (type In-row data) was not seen. The page may be invalid or may have an incorrect alloc unit ID in its header.
Msg 8980, Level 16, State 1, Line 1
Table error: Object ID 1813659046, index ID 1, partition ID 72057600346619904, alloc unit ID 72057600433913856 (type In-row data). Index node page (1:21996917), slot 167 refers to child page (1:21996875) and previous child (1:21996874), but they were not encountered.
Msg 2533, Level 16, State 1, Line 1
Table error: page (1:21996876) allocated to object ID 1813659046, index ID 1, partition ID 72057600346619904, alloc unit ID 72057600433913856 (type In-row data) was not seen. The page may be invalid or may have an incorrect alloc unit ID in its header.
Msg 8980, Level 16, State 1, Line 1
Table error: Object ID 1813659046, index ID 1, partition ID 72057600346619904, alloc unit ID 72057600433913856 (type In-row data). Index node page (1:21996917), slot 168 refers to child page (1:21996876) and previous child (1:21996875), but they were not encountered.
Msg 2533, Level 16, State 1, Line 1
Table error: page (1:21996877) allocated to object ID 1813659046, index ID 1, partition ID 72057600346619904, alloc unit ID 72057600433913856 (type In-row data) was not seen. The page may be invalid or may have an incorrect alloc unit ID in its header.
Msg 8980, Level 16, State 1, Line 1
Table error: Object ID 1813659046, index ID 1, partition ID 72057600346619904, alloc unit ID 72057600433913856 (type In-row data). Index node page (1:21996917), slot 169 refers to child page (1:21996877) and previous child (1:21996876), but they were not encountered.
Msg 2533, Level 16, State 1, Line 1
Table error: page (1:21996878) allocated to object ID 1813659046, index ID 1, partition ID 72057600346619904, alloc unit ID 72057600433913856 (type In-row data) was not seen. The page may be invalid or may have an incorrect alloc unit ID in its header.
Msg 8980, Level 16, State 1, Line 1
Table error: Object ID 1813659046, index ID 1, partition ID 72057600346619904, alloc unit ID 72057600433913856 (type In-row data). Index node page (1:21996917), slot 170 refers to child page (1:21996878) and previous child (1:21996877), but they were not encountered.
Msg 2533, Level 16, State 1, Line 1
Table error: page (1:21996879) allocated to object ID 1813659046, index ID 1, partition ID 72057600346619904, alloc unit ID 72057600433913856 (type In-row data) was not seen. The page may be invalid or may have an incorrect alloc unit ID in its header.
Msg 8980, Level 16, State 1, Line 1
Table error: Object ID 1813659046, index ID 1, partition ID 72057600346619904, alloc unit ID 72057600433913856 (type In-row data). Index node page (1:21996917), slot 171 refers to child page (1:21996879) and previous child (1:21996878), but they were not encountered.
Msg 8978, Level 16, State 1, Line 1
Table error: Object ID 1813659046, index ID 1, partition ID 72057600346619904, alloc unit ID 72057600433913856 (type In-row data). Page (1:21996880) is missing a reference from previous page (1:21996879). Possible chain linkage problem.
Msg 2533, Level 16, State 1, Line 1
Table error: page (1:21996956) allocated to object ID 1813659046, index ID 1, partition ID 72057600346619904, alloc unit ID 72057600433913856 (type In-row data) was not seen. The page may be invalid or may have an incorrect alloc unit ID in its header.
Msg 8976, Level 16, State 1, Line 1
Table error: Object ID 1813659046, index ID 1, partition ID 72057600346619904, alloc unit ID 72057600433913856 (type In-row data). Page (1:21996956) was not seen in the scan although its parent (1:21997094) and previous (1:21996955) refer to it. Check any previous errors.
Msg 2533, Level 16, State 1, Line 1
Table error: page (1:21996957) allocated to object ID 1813659046, index ID 1, partition ID 72057600346619904, alloc unit ID 72057600433913856 (type In-row data) was not seen. The page may be invalid or may have an incorrect alloc unit ID in its header.
Msg 8980, Level 16, State 1, Line 1
Table error: Object ID 1813659046, index ID 1, partition ID 72057600346619904, alloc unit ID 72057600433913856 (type In-row data). Index node page (1:21997094), slot 72 refers to child page (1:21996957) and previous child (1:21996956), but they were not encountered.
Msg 2533, Level 16, State 1, Line 1
Table error: page (1:21996958) allocated to object ID 1813659046, index ID 1, partition ID 72057600346619904, alloc unit ID 72057600433913856 (type In-row data) was not seen. The page may be invalid or may have an incorrect alloc unit ID in its header.
Msg 8980, Level 16, State 1, Line 1
Table error: Object ID 1813659046, index ID 1, partition ID 72057600346619904, alloc unit ID 72057600433913856 (type In-row data). Index node page (1:21997094), slot 73 refers to child page (1:21996958) and previous child (1:21996957), but they were not encountered.
Msg 2533, Level 16, State 1, Line 1
Table error: page (1:21996959) allocated to object ID 1813659046, index ID 1, partition ID 72057600346619904, alloc unit ID 72057600433913856 (type In-row data) was not seen. The page may be invalid or may have an incorrect alloc unit ID in its header.
Msg 8980, Level 16, State 1, Line 1
Table error: Object ID 1813659046, index ID 1, partition ID 72057600346619904, alloc unit ID 72057600433913856 (type In-row data). Index node page (1:21997094), slot 74 refers to child page (1:21996959) and previous child (1:21996958), but they were not encountered.
Msg 8978, Level 16, State 1, Line 1
Table error: Object ID 1813659046, index ID 1, partition ID 72057600346619904, alloc unit ID 72057600433913856 (type In-row data). Page (1:21996960) is missing a reference from previous page (1:21996959). Possible chain linkage problem.
CHECKDB found 0 allocation errors and 24 consistency errors in table 'StatementHeader' (object ID 1813659046).
CHECKDB found 0 allocation errors and 24 consistency errors in database 'COREISSUE'.
repair_allow_data_loss is the minimum repair level for the errors found by DBCC CHECKDB (COREISSUE).
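(Side note: CHECKDB already names the table here, but in general the IDs in these messages can be mapped back to the affected table and index through the metadata views; index ID 1 means the corruption is in the clustered index, i.e. the table data itself:)

```sql
-- Run in the flagged database (COREISSUE here). Maps the object/index IDs
-- from the CHECKDB output back to names.
SELECT OBJECT_NAME(p.object_id) AS table_name,
       i.name                   AS index_name,
       p.partition_id,
       au.allocation_unit_id
FROM sys.partitions p
JOIN sys.indexes i
  ON i.object_id = p.object_id AND i.index_id = p.index_id
JOIN sys.allocation_units au
  ON au.container_id = p.hobt_id          -- in-row data units map via hobt_id
WHERE p.object_id = 1813659046            -- object ID from the errors above
  AND p.index_id  = 1;
```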
This made me think that the issue was actually at the publisher (our PRODUCTION system), so I ran a CHECKTABLE on the publisher...however, that came back without errors (thank God):
Command(s) completed successfully.

I've never encountered this particular CHECKDB error before, and since I dropped the article and re-snapped it, I was under the impression that all of the data in the subscriber's 'X' table was dropped and re-populated from the new snapshot - so why would I get this error again?

Any ideas?


______________________________________________________________________________
"Never argue with an idiot; They'll drag you down to their level and beat you with experience"
Post #1490400
Posted Saturday, August 31, 2013 10:31 AM


SSC-Forever


Group: General Forum Members
Last Login: Yesterday @ 3:43 PM
Points: 43,047, Visits: 36,206
You've got IO subsystem problems on ServerA. Do some diagnostics, check storage logs and consider moving the DB to alternate storage.

Corruption can't be replicated (or mirrored), so bad pages on subscriber cannot be caused by problems at the publisher.
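(Two in-server places to corroborate IO-level damage - a sketch; note xp_readerrorlog is undocumented but widely used:)

```sql
-- Pages that failed with 823/824/checksum errors are recorded here by SQL Server.
SELECT database_id, file_id, page_id, event_type, error_count, last_update_date
FROM msdb.dbo.suspect_pages
ORDER BY last_update_date DESC;

-- Search the current SQL Server error log for OS error 1450 entries.
EXEC sys.xp_readerrorlog 0, 1, N'1450';
```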



Gail Shaw
Microsoft Certified Master: SQL Server 2008, MVP
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

We walk in the dark places no others will enter
We stand on the bridge and no one may pass

Post #1490403
Posted Saturday, August 31, 2013 10:37 AM


Thanks Gail, I will coordinate with our Tech Services team and see what they can find at the hardware level; perhaps we've had a drive go bad in one of our RAID arrays.

Post #1490405
Posted Saturday, August 31, 2013 12:49 PM


Found out there is a drive failing in one of the RAID arrays, so IT is looking into rebuilding it/adding a new one in.

When that's resolved, my plan is to drop the article from replication, delete the table from the subscriber, add the article back to the publisher, and re-snap/re-apply to the subscriber DB. When that's complete, run another DBCC CHECKTABLE - that should resolve it?

If not, I will follow similar steps as above, but will take the DB offline and move it to an alternate storage location before applying anything to the subscriber database.

Sound OK, or am I missing something important?


Post #1490418
Posted Saturday, August 31, 2013 1:07 PM


Run a full CHECKDB; if there are any errors, I'd recommend dropping the database and reinitialising the subscription.



Post #1490419
Posted Saturday, August 31, 2013 1:23 PM


Will do, and thanks for the quick responses.

Heard back from IT: a new drive should be installed within the next 2-4 hours, then another few hours to get the data striped back onto it.

Once that's complete I'll start my process:
1. Drop the article from replication
2. Delete the table from the subscriber
3. Add the article back to the publisher and re-snap/re-apply to the subscriber DB
4. Run DBCC CHECKDB('X') WITH NO_INFOMSGS, ALL_ERRORMSGS

If it's not clean, I'll drop the subscriber database altogether, recreate the subscription, then reinitialize it.
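(If it does come to reinitializing, the marking step can be done from the publisher; publication and subscriber names here are placeholders:)

```sql
-- Run at the publisher, in the published database. Placeholder names.
-- Marks the subscription for reinitialization; the next Snapshot Agent run
-- creates a fresh snapshot and the Distribution Agent reapplies it.
EXEC sp_reinitsubscription
    @publication = N'MyPublication',     -- placeholder
    @article     = N'all',
    @subscriber  = N'SubscriberServer';  -- placeholder
```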


Post #1490422
Posted Saturday, August 31, 2013 2:15 PM


A single drive failure in a RAID 1, 5, or 10 should not have caused this. Those RAID levels have redundancy, so you can lose an entire drive and not have any problems (other than perhaps speed).

If losing a single drive in a RAID 1, 5 or 10 causes data loss or IO errors, then there's a bigger problem with that RAID array than a single failed drive. I think you might want to investigate further.




Post #1490424
Posted Saturday, August 31, 2013 2:40 PM


I have also found that the battery on the RAID controller is having issues, but I wouldn't think that would cause this?

Post #1490426
Posted Saturday, August 31, 2013 2:49 PM


Combined with a hard shutdown it can cause problems (pages SQL assumes written to disk never got there).
If the RAID array's battery is faulty, you must disable the write cache until it can be fixed.

Seriously, get someone to take a top-to-bottom, end-to-end look at that SAN; it really doesn't sound healthy.




Post #1490427
Posted Saturday, August 31, 2013 9:58 PM


Issues resolved. Gail, I sincerely appreciate the guidance.

Steps taken (I know I got lucky):
1. IT added a new drive to the array and rebuilt it
2. Dropped the article from the publisher
3. Deleted the table from the subscriber
4. Added the article back to the publisher and re-snapped/re-applied to the subscriber DB
5. Ran DBCC CHECKTABLE('X') WITH NO_INFOMSGS, ALL_ERRORMSGS first - successful
6. Then ran DBCC CHECKDB('X') WITH NO_INFOMSGS, ALL_ERRORMSGS - successful

I was ready to move the entire DB to a different storage location and, worst case, re-initialize...but got lucky...

P.S. Write cache was already disabled (I'm guessing it should normally be enabled once a working battery is in place?) - the replacement battery should be installed in the next few days as well


Post #1490436