Trying to determine whether database corruption was caused at a disk level

  • Folks I am a sysadmin with a few SQL servers on my network. We had an issue with a production database corrupting. Our DBA encountered the following

    "SQL Server detected a logical consistency-based I/O error: torn page"

    Our DBA told me this is was probably caused by a disk error. The database is stored on a SAN disk and after checking the event log, alerts on the HBAs and even checking with the storage group, they found no problems with the disks.

    I am worried as to how this occurred, and would like to know if there are any other reasons for this type of corruption. Any thoughts would be appreciated. Thank you.

  • Hello Terry,

    I'm not sure of what the actual scenario you have encountered but thought that these links might give you some lead in finding out the problem.

    http://www.microsoft.com/technet/prodtechnol/sql/2005/physdbstor.mspx

    http

    ://download.microsoft.com/download/4/7/a/47a548b9-249e-484c-abd7-29f31282b04d/SQLIOBasicsCh2.doc

    Please reply back with any information you have dealt for this problem, so that others who encountered can tackle in the same way or give them some workout.

    Hope this information helps you.

    Thanks

     


    Lucky

  • Thank you for the info Lucky (I'll review when I get a chance -- always good to be able find information through peers).

    In essence this was a strange issue, our server had rebooted around 5:30am yesterday but no information was generated from the Windows 2003 eventlogs. Since that time the database that the application uses was not updating. Our DBA discovered the following;

    2007-02-15 05:34:06.17 spid15s Error: 824, Severity: 24, State: 2.

    2007-02-15 05:34:06.17 spid15s SQL Server detected a logical consistency-based I/O error: torn page (expected signature: 0xaaaaaaaa; actual signature: 0x5556aaaa). It occurred during a read of page (1:11433824) in database ID 5 at offset 0x000015ceec0000 in file 'E:\Microsoft SQL Server\UserDBs\db.mdf (not the real db name)'. Additional messages in the SQL Server error log or system event log may provide more detail. This is a severe error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.

    From there our DBA performed the following recovery ops;

    1. Execute the DBCC Checkdb on db and found there are 19 consistency errors in 2 tables (table1 and table2). Only indexes were corrupted.

    2. Tried to recover the index but the data pages were also corrupted. I got the following error message when I tried to recover the indexes.

    The statement has been terminated.

    Msg 829, Level 21, State 1, Line 1

    Database ID 5, Page (1:11439284) is marked RestorePending, which may indicate disk corruption.

    To recover from this state, perform a restore.

    3. Database restore is the only option to recover the database. Backup size is 98GB

    4. Restore process tool 4 hours.

    5. Run the DBCC Checkdb again and found no errors.

    I tend to agree with our DBA that it was prob a disk issue, but I need evidence of this. Our SAN team did not find anything out of the ordinary. I might ask him to run the SQLIOSim tool, but my question is, can we run it on the same SAN disk where the server resides without impacting the existing database (I know this would have to be done after hours).

    I am just worried that this might occur again.

  • Generally this error happens when there is incomplete I/O operation...

    Restoring from the backup is the best option in this case...

    Is write cache enabled on disk/san controller?

    Read the following article too...

    SQL Server 7.0, SQL Server 2000, and SQL Server 2005 logging and data storage algorithms extend data reliability

    http://support.microsoft.com/kb/230785

     

     

     

    MohammedU
    Microsoft SQL Server MVP

  • SQLIOSim is the best tool from Microsoft to tackle read and write IO Errors pertaining to databases.

    SQLIOSim tackles the Lost Write and Stolen Read problem

    For full discussion check :

    How to use the SQLIOSim utility to simulate SQL Server activity on a disk subsystem

    http://support.microsoft.com/kb/231619/en-us

    However running SQLIOSim would require a downtime. Since SQLIOSIM would create a new MDF And the LDF Files and perform stress tests on these MDF and LDF Files. To ascertain if there is a problem on the disks the SQLIOSIM needs to be run on the same drive and the path where the Actual Database Resides.

    This creates a huge performance degradation on the disks and hence running along side the Production System is not mandatory.

    So in effect you need to run this on a downtime. But running these would be a good idea. You can also run Multiple copies of the SQLIOSIM to increase the probability of the errors with respect to disk access.

    Also Check for the following

    1> Any Filter Drivers which are out of date. Check using fltmc command on Windows 2003. Update them with the latest drivers from the vendors.

    2> Check for Read Caching and Write Caching on the Disk Controller.

    3> Make sure you have applied Windows 2003 SP1 since this leads to Memory Corruption on SQL Server Computers with /PAE Enabled.

    Also we need to find out that appart from DISK. Are the Disk Controllers , Firmwares, Motherboard , Disk Drivers are healthy. SQLIOSIM would help but performing an healthy diagnostic as against the whole system rather than the DISK would help as well.

    As regards to SQLIOSIM. Running for 32 iterations may or may not give us the errors. You need to run them more exhaustively since the Probability of the IO Errors coming up are less.

    There are certain trace flags which help as well.

    818 : Helps detecting stolen reads and lost writes

    815 : Helps detecting Memory corruption.

    Note : The above trace flags would result in degraded performance for the extra functionality for checking integrity of IO Issues.

    Last but not the least if SQLIOSIM does not help and System Diagnostics does not help. contact MS PSS  and the Hardware Vendor.

    Thanks

     

     

     

Viewing 5 posts - 1 through 4 (of 4 total)

You must be logged in to reply to this topic. Login to reply