Click here to monitor SSC
SQLServerCentral is supported by Red Gate Software Ltd.
 
Log in  ::  Register  ::  Not logged in
 
 
 

Day 10 of 31 Days of Disaster Recovery: Monitoring for Corruption Errors

31 Days of Disaster Recovery

31 Days of Disaster Recovery

It’s day 10 of my 31 Days of Disaster Recovery series, and I want to talk about monitoring for corruption errors. There are four errors related to corruption for which everyone should raise alerts and send notifications. The four alerts are 823, 824, 825, and 829. The sooner you identify and address corruption, the greater the chance that it can be resolved without data loss and with minimal downtime.

If you missed any of the earlier posts in my DR series, you can check them out here:

    31 Days of disaster Recovery

  1. Does DBCC Automatically Use Existing Snapshot?
  2. Protection From Restoring a Backup of a Contained Database
  3. Determining Files to Restore Database
  4. Back That Thang Up
  5. Dealing With Corruption in a Nonclustered Index
  6. Dealing With Corruption in Allocation Pages
  7. Writing SLAs for Disaster Recover
  8. Resolutions for All DBAs
  9. Use All the Checksums

823 Errors

An 823 error is raised when a read of a page fails at the OS layer. The OS returns the failure to SQL Server who re-throws the error to the user. The attempt to read the page is retried 3 time and on the fourth failure, the error is raised. This means that when you see this error, it actually failed four times. If you see it frequently, then it has failed four times as many errors. This error is generally a sign that of problems with the storage system. It could be a disk going bad, faulty drivers, incorrectly configured disk subsystem, etc. If you see me at an event some day, ask me to share the funny story about a misconfigured SAN causing rampant file corruption on a production SQL Server (discovered before we put it into active use, fortunately).

When you receive this error, the message text will indicate the underlying cause of the error. It is important to note that not all 823 errors are an indication of corruption. For example, error 21 (Device is not ready) is often the result of the drive the file is on going offline or the service account losing permissions to access the drive. If the OS error returned is error 23 (Data error – cyclic redundancy check), then you definitely have a corruption of the file. In fact, OS error 23 is the only 823 error that will trigger automatic page repair in database mirroring and Availability Groups.

If you are receiving 823 errors, I highly advise having the disk system checked as well as it could be an indicator of pending hardware/disk failure.

824 Errors

Error 824 means that the read of the page from the file system was successful, but SQL Server detecting corruption. Just like error 823, the read is retried 3 times, and the error is raised only after the final failure. If you receive an 824 error, it has actually occurred 4 times. This error is a clear indicator of corruption and you should take corrective steps.

825 Errors

As I explained above, reads encountering 823 and 824 errors are retried 3 times and only raised if they fail all retries. An 825 error indicates that one of the retries was successful. Unlike 823 and 824 errors, 825 is NOT a fatal error and does NOT get bubbled up to the user. 825 errors can occur undetected by anyone. The error is logged in the SQL log and in the event log, so there are many ways you can detect these, but most people don’t think to do so. If you are getting many 825 errors, it can be a sign of impending disk failure, and the disks should be checked as soon as possible.

829 Errors

829 errors are lesser known errors, but they are potentially a sign of corruption caused by the disk subsystem. The error means that a page has been marked RestorePending, The only way to fix this problem is to restore the page (or file/database). It can’t be repaired using DBCC. This error, in addition to 823 (CRC failure) and 824, can trigger automatic page repair in database mirroring or Availability Groups. Ironically, when automatic page repair is triggered, it sets the page to a status of RestorePending to ensure that no other transactions can be run against it while the page repair is being performed. If you have this error occurring and it is not the result of automatic page repair, check the disk system .

Creating Alerts

Alerts can be easily created by using the GUI (under the SQL Server Agent node in Object Explorer) or via T-SQL. First, create an Operator to receive notifications and then create the alerts. Below is an example of how to create an Operator to receive emails and alerts for the four error codes assigned to email the Operator.

USE msdb;

-- Create operator
Exec msdb.dbo.sp_add_operator
        @name=N'DBAs', 
        @enabled=1,
        @email_address=N'';
Go

-- Create alert
Exec msdb.dbo.sp_add_alert
        @name=N'Corruption (823) detected', 
        @message_id=823,
        @enabled=1;

-- Assign operator to alert
Exec msdb.dbo.sp_add_notification
        @alert_name=N'Corruption (823) detected',
        @operator_name=N'DBAs',
        @notification_method = 1;
Go

-- Create alert
Exec msdb.dbo.sp_add_alert
        @name=N'Corruption (824) detected', 
        @message_id=824,
        @enabled=1;
        
-- Assign operator to alert
Exec msdb.dbo.sp_add_notification
        @alert_name=N'Corruption (824) detected',
        @operator_name=N'DBAs',
        @notification_method = 1;
Go

-- Create alert
Exec msdb.dbo.sp_add_alert
        @name=N'Error 825 occurred', 
        @message_id=825,
        @enabled=1;
        
-- Assign operator to alert
Exec msdb.dbo.sp_add_notification
        @alert_name=N'Error 825 occurred',
        @operator_name=N'DBAs',
        @notification_method = 1;
Go

-- Create alert
Exec msdb.dbo.sp_add_alert
        @name=N'Page RestorePending (829) detected', 
        @message_id=829,
        @enabled=1;
        
-- Assign operator to alert
Exec msdb.dbo.sp_add_notification
        @alert_name=N'Page RestorePending (829) detected',
        @operator_name=N'DBAs',
        @notification_method = 1;
Go

Summary

Hopefully I was successful in stressing the importance of raising alerts and sending notifications for these errors. It is critical that we identify and deal with corruption as soon as possible to reduce the chance that we will lose data. So create these alerts and learn how to handle them.

Comments

Leave a comment on the original post [www.sqlsoldier.com, opens in a new window]

Loading comments...