Trying to track down cause of Event ID 824 error in TempDB

  • Hi,

    We have been trying to rack down the cause of an error on one of our servers, and I thought I'd post some info to the forum in-case someone had suggestions.

    One of our sites is hosting our application on a server running SuSE 10.1 and VMWare Server 1.0.1, and several months ago when we started deploying our application to the server, we started seeing numerous Event ID 11 and 15 errors - "The driver detected a controller error on \Device\Harddisk1." and "The description for Event ID ( 15 ) in Source ( symmpi ) cannot be found" - in the System event log when we tried to do database backups or any large write to the file system along with various SQL errors in the Application Logs.

    The 'solution' from the site was to copy the virtual machine to another server with identical hardware and configuration, defrag the filesystem, then setup a routine to reboot the server once a week. This seemed to alleviate the errors, though my gut told me the problem was still present.

    As more testing continued I found we were getting Event ID 824 errors... below is an example:

    Event Type: Error

    Event Source: MSSQLSERVER

    Event Category: Server

    Event ID: 824

    Date: 3/15/2010

    Time: 8:40:50 PM

    User: --

    Computer: --

    Description:

    SQL Server detected a logical consistency-based I/O error: incorrect checksum (expected: 0x39bf8c3a; actual: 0xfbe98071). It occurred during a read of page (1:38760) in database ID 2 at offset 0x00000012ed0000 in file 'e:\MSSQL\Data\tempdb.mdf'. Additional messages in the SQL Server error log or system event log may provide more detail. This is a severe error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.

    The site is saying it's a SQL problem due to this, but given it's the tempdb database that's causing the error I'm not sure how since it's recreated each time SQL recycles. Both before and after getting the error [dbcc checkdb with data_purity] runs with no errors on any of our databases, and I've even gone as far as stopping the SQL Server, deleting the TempDB log and data file, and restarting SQL, plus I've moved the tempdb data and log files from E: to C: but still the same results.

    Since the 824 error is a checksum error on the page being retrieved by SQL from the file system I've been running some tests. I can recreate the error by running a query that creates two large temp tables then selects from a table which joins to the temp tables. This process runs fine on all of our test servers, but only errors out on the site's server.

    One temp table is rather large with 45002 rows in 45002 pages and the other has 4421340 rows in 23519 pages if that helps.

    Since TempDB is not corrupt given it's recreated each time the server starts nor is checkdb finding anything before or after the 824 error, I'm pointing more towards the server configuration being the problem. Given the versions of the Host OS (SuSE 10.1) and VMWare (1.0.1) I assume the hardware is old as well, and I'm not sure what security or bug patches have been setup on the system since it's not in our data center. Plus neither version of SuSE or VMWare are supported any longer.

    Thanks for any suggestions that can help us either identify the problem if it is SQL or prove that it isn't. Unfortunately since this is a database server, MS SQL is really the only app running on it so it's seeing the grunt of this problem. I tend to think the Event 824 error is more of the effect of something larger, but I'm not seeing that being the consensus opinion.

    Thanks again and take care --

    Sam

Viewing 0 posts

You must be logged in to reply to this topic. Login to reply