Recurring corruption on VM server; VM admin says nothing wrong

  • Full disclosure: I am not a DBA, I am a .NET developer, but the government agency I work for doesn't have a DBA so that means it falls on me and my co-worker, who is also not a DBA. We have so far been fortunate enough to be able to troubleshoot and resolve SQL Server issues through Googling and dumb luck, but we have now encountered a problem that is well over our heads. So please forgive me if any of this sounds ignorant.

    We have two production SQL Servers. One is a formerly physical box that we virtualized several months ago (server A). The other was virtual from the beginning (server B). Server A is Windows 2008 R2 with SQL Server 2005 SP2. Server B is Windows 2012 with SQL Server 2014. The problem we're having is with server A, the 2005 box.

    Last Monday afternoon we started getting unusual errors from the server, similar to this:

    DESCRIPTION: SQL Server Assertion: File: <alloc.cpp>, line = 8401 Failed Assertion = '(savePageAllocatedInPFS && saveIAMPageIsAllocated && (savePageInMixedSlot || savePageInIAMBitMap))' Save page: mixed slot = 0, IAM bitmap = 0, PFS = 0x40; Save IAM: PFS = 0x70. This error may be timing-related. If the error persists after rerunning the statement, use DBCC CHECKDB to check the database for structural integrity, or restart the server to ensure in-memory data structures are not corrupted.

    As it suggested I ran CHECKDB on all databases and there were errors in three databases. Also as suggested I tried rebooting the server after business hours. When it came back online, a fourth database was reported as "suspect". Bear in mind that I have NO IDEA what to do at this point. I Googled a bit and found a "fix" for the suspect database issue:

    EXEC sp_resetstatus [YourDatabase];

    ALTER DATABASE [YourDatabase] SET EMERGENCY

    DBCC checkdb([YourDatabase])

    ALTER DATABASE [YourDatabase] SET SINGLE_USER WITH ROLLBACK IMMEDIATE

    DBCC CheckDB ([YourDatabase], REPAIR_ALLOW_DATA_LOSS)

    ALTER DATABASE [YourDatabase] SET MULTI_USER

    This actually worked, although I was a bit concerned about the "allow_data_loss" option it suggested.

    Then on Friday morning we noticed several scheduled SQL jobs were failing. Again the messages indicated that there was database corruption. This time there were two databases corrupted: one of them was the same one that was "suspect" on Monday night. Initially I could not perform the repair that had previously worked. But on a subsequent attempt later in the morning it finally worked, and the databases were restored. CHECKDB no longer reported any errors.

    As I Google this issue, I see repeated comments that there is likely a hardware issue on the server. With this being a VM server managed by another department, I do not have direct access to the hardware. I asked the VMWare admin to run diagnostics on the VMWare environment where this server resides. He stated that he cannot find any indication of a problem (and incidentally seemed pretty dismissive of the notion that there is something wrong with the hardware). But clearly there is something wrong, and I and my co-worker are completely at a loss as to what to do next to address this issue. My next option is to attempt to locate a SQL Server expert to come in and help with troubleshooting.

    I should point out that both servers have been behaving normally for several months, and nothing has changed in the Windows or SQL Server installations on either server coinciding with the onset of this issue.

    Can anyone suggest additional troubleshooting ideas that we can try on our own -- bearing in mind that we are pretty clueless when it comes to complicated DBA issues like this? If you need more specifics, please let me know what other information I can provide.

    Thanks for reading and for any help you can provide.

    Edit: Now it appears we may be having the same issue with the 2014 box. I'm running a CHECKDB now.

  • 1. Database corruption is almost always caused by storage hardware issues. You should be talking with your storage admins in addition to your VM admin to find out what disks your databases are actually on and have the storage admins check those disks and paths to those disks.

    2. REPAIR_ALLOW_DATA_LOSS does exactly what it says. It will get rid of the corruption, but doesn't care if you lose data in the repair process. This option should only ever be used as a last resort.

    3. The only way to fully and safely recover from corruption is to restore from a known good backup.

  • Jack - thank you for the response. I am glad to get confirmation that this is likely an issue with the storage medium. Incidentally, since I posted that message, I did finally get acknowledgement from the VMWare admin that this issue is not isolated to my servers (they host servers for various other departments as well), so they are now investigating the source of the problem. (Insert grumbling over how they wouldn't believe me until other departments chimed in.)

  • Repeated corruption is almost always an IO subsystem problem. Not necessarily the disks, could be anything in the IO path, controllers, cache, switches, fiber cable, disks, etc, etc.

    You may also want to get out your backups from before the first corruption and restore them somewhere and see if you can figure out how much data you've lost due to running checkDB with repair_allow_data_loss multiple times (and you almost certainly have lost some)

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass

Viewing 4 posts - 1 through 3 (of 3 total)

You must be logged in to reply to this topic. Login to reply