Torn Page Error - Error: 823, Severity: 24, State: 2

  • Hello

    Yesterday I got a bit of a fright when I received the following Alert:

    DATE/TIME: 19/07/2004 14:07:27

    DESCRIPTION: Error: 823, Severity: 24, State: 2

    I/O error 1450(error not found) detected during read at offset 0x000001d3b42000 in file 'R:\Data\dbName.mdf'.

    COMMENT: (None)

    JOB RUN: (None)

    Looking this up on Google revealed a solution that involved restoring from the last good backup.

    We run our live system on a two-node cluster, with the disk subsystem being an IBM FAStT SAN behind a two-node SVC cluster.  We checked the FAStT SAN, which reported no errors.  We had a bit of fun getting into the SVC console, but eventually we found nothing wrong there either.  I received the alert only twice, one minute apart, and accessing the table now appears fine.  I ran a suite of DBCC checks four times, specifically:

    --Low-impact check of the physical consistency of the database, including torn pages

    DBCC CHECKDB('DBName') WITH PHYSICAL_ONLY

    --Checks the allocation and structural integrity of all the objects in the specified database.

    DBCC CHECKDB('DBName') WITH NO_INFOMSGS

    --Checks for consistency in and between system tables in the specified database.

    DBCC CHECKCATALOG('DBName') WITH NO_INFOMSGS

    --Checks the consistency of disk space allocation structures for a specified database.

    DBCC CHECKALLOC('DBName') WITH NO_INFOMSGS

    All appears well, with no errors reported.  Has anyone seen this before?  Should I be worried about my SAN, or is it a one-off caused by something other than disk corruption (network?) that I can stop fretting about?

    Dave J


    http://glossopian.co.uk/
    "I don't know what I don't know."

  • I don't know whether this is good advice or not, but I know one DB administrator who found that, on a fairly regular basis, he'd come in in the morning to find a whole bunch of databases marked as "suspect" and therefore unusable, torn pages being the problem.  Of course, his inbox was full of complaints from customers who couldn't use their web applications because the DBs were unavailable.

    A lengthy problem to correct, and he never found a real error with the databases, so using Enterprise Manager he removed the option to check for torn pages.
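
    For what it's worth, the same option can be set in T-SQL rather than through Enterprise Manager (SQL Server 2000 syntax; dbName here is just a placeholder):

    --Stops SQL Server flagging torn pages; it does not fix whatever is tearing them
    ALTER DATABASE [dbName] SET TORN_PAGE_DETECTION OFF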

    He's been running the system in that state for over 2 years now with no problem.

    Whether this is deemed a risky practice I really don't know.

  • I seem to remember that Steve or Andy had a problem where accessing the first 9000 rows of a table was OK, but anything after that produced errors.

    The solution was to drop and recreate the clustered index which rebuilt the relationship between the pages.
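
    In SQL Server 2000 you can get the same effect without a literal drop and recreate; both of these are standard commands, though the index, table and column names here are just placeholders:

    --Rebuilds the clustered index in place
    DBCC DBREINDEX ('dbo.TableName', 'IX_ClusteredName')

    --Or, equivalently, recreate it over the existing one
    CREATE CLUSTERED INDEX [IX_ClusteredName] ON [dbo].[TableName] ([KeyColumn]) WITH DROP_EXISTING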

    I don't know if that helps.

  • Thanks for the feedback.  I have done some further research & come up with:

    From this article, http://support.microsoft.com/default.aspx?scid=kb;en-us;828339

    Note You may or may not see errors from the DBCC CHECKDB statement on the database that is associated with the file in the error message. You can run the DBCC CHECKDB statement when you see an 823 error. If the DBCC CHECKDB statement does not report any errors, you probably have an intermittent system problem or a disk problem.

    There is more info at http://support.microsoft.com/default.aspx?kbid=826433

    It describes an extended trace flag that traps unreported I/O errors:

    Trace flag 818 enables an in-memory ring buffer that is used for tracking the last 2,048 successful write operations that are performed by the computer running SQL Server, not including sort and workfile I/Os. When errors such as Error 605, 823, or 3448 occur, the incoming buffer's log sequence number (LSN) value is compared to the recent write list. If the LSN that is retrieved during the read operation is older than the one specified during the write operation, a new error message is logged in the SQL Server error log. Most SQL Server write operations occur as checkpoints or as lazy writes. A lazy write is a background task that uses asynchronous I/O. The implementation of the ring buffer is lightweight, thereby making the performance effect on the system negligible.

    The following message indicates that SQL Server did not receive an error from the WriteFile API call or the ReadFile API call. However, when the LSN was reviewed, the value was not correct:

    SQL Server has detected an unreported OS/hardware level read or write problem on Page (1:75007) of database 12

    LSN returned (63361:16876:181), LSN expected (63361:16876:500)

    Contact the hardware vendor and consider disabling caching mechanisms to correct the problem

    At this point, either the read cache contains an older version of the page, or the data was not correctly written to the physical disk. In either case (a Lost Write or a Stale Read), SQL Server reports an external problem with the operating system, the driver, or the hardware layers.
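
    Per the KB article, the flag can be turned on for all connections until the next restart with DBCC TRACEON (or added as the -T818 startup parameter if you want it to survive restarts):

    --Enables trace flag 818 server-wide until the next restart
    DBCC TRACEON (818, -1)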

    Which is really scary.  I'm awaiting authorisation to turn the Trace Flag on...

     

    Dave



  • Hi Dave,

    You do have an interesting issue, and I would certainly be concerned about the integrity of your data. The first thing is to preserve as much of your data as possible; I would recommend BCPing all your data out to text files and storing these in a safe place.

    Next you should have a look at the page which contains the data where the write failed. You can use the following commands to display the page information. The offset you see is the number of bytes from the start of the file; divide this by 8192 (the page size) and you get page number 957857.
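
    That arithmetic can be checked in T-SQL (the CAST from the hex offset to bigint is my own shortcut, so treat it as an assumption):

    --Converts the byte offset from the 823 message into a page number
    SELECT CAST(0x000001d3b42000 AS bigint) / 8192 --957857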

    dbcc traceon(3604) --instructs DBCC commands to display to the screen

    dbcc page (dbname, 1,957857,3)

    This should show the contents of the page; check to see if these are correct. Try the previous page and the next page: if these are correct, then you are in luck and your data may be intact.
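
    For the record, those would be:

    --The pages either side of the one in the error
    dbcc page (dbname, 1, 957856, 3)

    dbcc page (dbname, 1, 957858, 3)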

    You need to consult with your SAN vendor to see if they can disable the cache for your server and start investigating to see if there are any disk corruptions within the SAN on your spindles.

    We do have a product that can read your backup file using the page information, so if you continue to get issues we can assist you there, and it will certainly report back if there are any corruptions.

    Regards

    Douglas Chrystall

    Imceda Software, Inc

    http://www.imceda.com

  • Douglas

    Thanks for your response; I have just run the DBCC commands you suggest.  I must be honest & say I haven't seen that one before.  Is it undocumented?

    The output is, I'm afraid, not very helpful to my eyes.  It outputs a header (below) and then what appear to be records from one of my tables, which at first glance look OK.  What would I look for to indicate an error?  Is it m_tornBits?  Is 21 good or bad?

    PAGE: (1:957857)

    ----------------

    BUFFER:

    -------

    BUF @0x01E0BA40

    ---------------

    bpage = 0x60EE2000        bhash = 0x00000000        bpageno = (1:957857)

    bdbid = 7                 breferences = 0           bstat = 0x9

    bspin = 0                 bnext = 0x00000000       

    PAGE HEADER:

    ------------

    Page @0x60EE2000

    ----------------

    m_pageId = (1:957857)     m_headerVersion = 1       m_type = 1

    m_typeFlagBits = 0x0      m_level = 0               m_flagBits = 0x4

    m_objId = 345820344       m_indexId = 0             m_prevPage = (1:957856)

    m_nextPage = (1:957858)   pminlen = 88              m_slotCnt = 86

    m_freeCnt = 12            m_freeData = 8008         m_reservedCnt = 0

    m_lsn = (279975:3787:6)   m_xactReserved = 0        m_xdesId = (0:0)

    m_ghostRecCnt = 0         m_tornBits = 21          

    Allocation Status

    -----------------

    GAM (1:511232) = NOT ALLOCATED                     

    SGAM (1:511233) = NOT ALLOCATED                    

    PFS (1:954384) = 0x0   0_PCT_FULL                  

    DIFF (1:511238) = NOT CHANGED                      

    ML (1:511239) = NOT MIN_LOGGED                     

    Slot 0 Offset 0x60

    ------------------

    Record Type = PRIMARY_RECORD                       

    Record Attributes =  NULL_BITMAP                   

    60EE2060:  00580010  012db245  00093b2b  000005ea ..X.E.-.+;......

    60EE2070:  00000000  01232340  00009467  0000003b ....@##.g...;...

    60EE2080:  0000002d  00000000  00000000  00005ad9 -............Z..

    60EE2090:  000f42e2  00000000  01713532  0000946f .B......25q.o...

    60EE20A0:  00000000  00000000  00000000  00000000 ................

    60EE20B0:  00000000  00000000  e6880010           ............

    ProvisionHistoryID               = 19771973        

    ProvisionID                      = 604971          

    ReturnStatus                     = 1514            

    ReturnCode                       = [NULL]          

    ResponseDate                     = Jan  7 2004  5:40PM

    RequestTypeID                    = 59              

    CallTypeID                       = 45              

    TargetDate                       = [NULL]          

    FileID                           = 23257           

    Reference                        = [NULL]          

    OrderNo                          = [NULL]          

    CreatedBy                        = 1000162         

    DateCreated                      = Jan 15 2004 10:24PM

    UpdatedBy                        = [NULL]          

    DateUpdated                      = [NULL]          

    DateArchived                     = [NULL]      

    [snip]

    Dave J    



  • Hi Dave,

    There can be many torn-bits values in a database; this does not in itself indicate a problem. You need to look at the data and see if any of the values are incorrect. It could be a case of no corruption, but of your data just not being written. I know that's 86 rows, but it is going to be the only way to tell. Try looking at the previous page and the next page, and see if they display correctly. The tornBits value is just used with an XOR to determine where the stolen bits have gone, so the page can be reconstructed.

     
