Zeroing out bad block?

  • I have a SQL Server 2005 SP4, 64-bit, Enterprise Edition server. Our NetApp crashed suddenly, and when it was brought back up DBCC CHECKDB found corruption in a table. After dropping and rebuilding an index the CRC errors stopped. Problem solved, except that backups for this database now fail (with the same CRC errors) when the process gets to the bad block in the data file. NetApp is proposing that they zero out that block (overwrite it with all zeros) to correct this.

    Will this work and will it affect the database in any way? My initial thought is that the database should be fine as this block is not currently in use by any objects, but my boss is afraid of further corrupting the data file.

    Your thoughts?

  • bad idea. bad bad.

    first, don't do anything drastic, like taking the database offline, without expert advice; Gail Shaw monitors this forum frequently.

    run DBCC CHECKDB('db name') WITH NO_INFOMSGS, ALL_ERRORMSGS and post the results here.

    with that, we can tell you a lot more.

    you might be able to do anything from a page-level restore to get the data back, to migrating the data out to a new table, among other options; there are just too many possibilities to narrow down without that output.
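
    for illustration, a page-level restore on 2005 Enterprise would look roughly like this (a sketch only; the database name, file:page ID, and backup path are placeholders you'd replace with values from the CheckDB output):

    RESTORE DATABASE YourDB
        PAGE = '1:12345'                            -- fileID:pageID reported by CheckDB (placeholder)
        FROM DISK = N'U:\Backups\YourDB_full.bak'   -- your last good full backup (placeholder path)
    WITH NORECOVERY;
    -- then restore the subsequent log backups, ending with a tail-log backup, to bring the page current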

    letting ANYTHING other than SQL Server touch the bits and bytes of the data will guarantee corruption and loss of data.

    Lowell


    --help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

  • If the block belongs to an allocated page, that will cause further corruption.
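
    One way to check whether the block actually belongs to an allocated page is the undocumented DBCC PAGE command (a sketch; the database name, file ID, and page ID are placeholders you'd take from the I/O error):

    DBCC TRACEON (3604);                 -- route DBCC output to the client instead of the error log
    DBCC PAGE ('YourDB', 1, 12345, 0);   -- database, file ID, page ID, print option
    -- the "Metadata: ObjectId" line in the page header shows which object, if any, owns the page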

    What's the error on the backups and what does the following return?

    DBCC CHECKDB (<Database Name>) WITH NO_INFOMSGS, ALL_ERRORMSGS

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Here's the error from the backup:

    BackupIoRequest::WaitForIoCompletion: read failure on backup device 'T:\Data\T2KAdmin.mdf'. Operating system error 23(Data error (cyclic redundancy check).).

    We're going to run a DBCC CHECKDB, but it takes 8 hours to complete (305 GB database).

    We had another idea: since this page is no longer used by any object after rebuilding the index, if we shrink the data file, that disk block could be released back to the operating system and would no longer be part of the data file at all. Then our backups should work, right?
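
    For reference, the shrink being considered would be something along these lines (a sketch; the logical file name is a placeholder, check sys.database_files for the real one):

    USE T2KAdmin;
    DBCC SHRINKFILE (N'T2KAdmin_Data', 300000);          -- target size in MB; moves pages from the end of the file toward the front
    -- DBCC SHRINKFILE (N'T2KAdmin_Data', TRUNCATEONLY); -- alternative: releases only trailing free space, moves nothing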

  • It depends on where the block is. If you have this:

    [data] [data] [data] [data] [data] [bad block] [free space] [free space]

    you could be fine. If you have this:

    [data] [data] [bad block] [free space] [data] [data] [free space]

    you might end up moving another object onto the bad block. SQL Server isn't intelligent about the shrink: it reduces the size of the file by moving extents from the end of the file toward the beginning.

    Ideally, I'd think about copying the mdf file over, but I'd be worried about shutting down the server right now. At the least, I'd let DBCC finish, and I'd also look at bcp'ing out all the data, and I mean all of it, into files. Drop each table into a separate file and check the files for integrity. Worst case, you have to bcp back in, but that's easily scriptable.
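
    A rough sketch of generating one bcp export command per table (the server name and export folder are placeholders):

    -- run this in the affected database, then paste the output into a command prompt
    SELECT 'bcp ' + QUOTENAME(DB_NAME()) + '.' + QUOTENAME(s.name) + '.' + QUOTENAME(t.name)
         + ' out "D:\Export\' + s.name + '.' + t.name + '.dat" -n -S MyServer -T'
    FROM sys.tables AS t
    JOIN sys.schemas AS s ON s.schema_id = t.schema_id
    ORDER BY s.name, t.name;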

    I'd also script out all schema. Something like Red Gate's SQL Compare makes this really easy and you could transfer stuff to a shell db if you have the tool.

    disclosure: I work for Red Gate.

  • Maybe, but I wouldn't want to take any chances with this. You don't know for sure that the page is unallocated. If it weren't allocated, the backup shouldn't be breaking, as backups only read allocated extents; and if the only corruption was in that index, dropping the index would have deallocated the extent.

    P.S. CheckDB taking 8 hours on a 300 GB database? That sounds like a serious IO contention problem in and of itself. I've done CheckDB on TB-sized databases faster than that.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • GilaMonster (5/31/2012)

    CheckDB taking 8 hours on a 300 GB database? That sounds like a serious IO contention problem in and of itself. I've done CheckDB on TB-sized databases faster than that.

    I suspect that there are still NetApp issues to be resolved. Since the storage crash and recovery, the server has been running at approximately 75% CPU, while before the crash it sat at the normal 20% to 40%.

  • May I suggest alternate storage if at all possible...

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Are you using SnapManager for SQL? If so, I'd suggest trying a native backup to see whether it's NetApp that's misinformed about the state of that block (it did crash, after all) or whether there's truly something wrong from SQL's perspective. If the native backup works, try restoring it somewhere else (the same instance would be fine, but I'd suggest a different LUN, on different drives if possible). If *that* works, I would try for a cutover to that new db (after restoring again and applying logs as necessary).
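
    A sketch of that test (the paths are placeholders, and the logical file names in the MOVE clauses are assumptions; check the real ones with RESTORE FILELISTONLY):

    BACKUP DATABASE T2KAdmin
        TO DISK = N'U:\Backups\T2KAdmin_test.bak'
        WITH INIT, STATS = 5;

    RESTORE DATABASE T2KAdmin_Test
        FROM DISK = N'U:\Backups\T2KAdmin_test.bak'
        WITH MOVE N'T2KAdmin'     TO N'V:\Data\T2KAdmin_Test.mdf',
             MOVE N'T2KAdmin_log' TO N'V:\Logs\T2KAdmin_Test.ldf',
             STATS = 5;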

    Good luck. I've dealt with NetApp and SQL Server and will say that it has a bit of a Pinocchio syndrome: when it's good, it's very good. But when it's bad...

  • Unfortunately, it is the native SQL backup that is failing.

  • Do you have a SnapManager backup? If so, you could restore it to another location and see whether the copy is free of the error by running DBCC against it. I'm guessing that it will still have the problem, but you never know.

    As to the performance, they are most likely running a re-striping process, which really hammers the disks; that would explain the poor performance you are/were seeing.

    Please post back with what you find out.

    David

    @SQLTentmaker

    “He is no fool who gives what he cannot keep to gain that which he cannot lose” - Jim Elliot

  • I have some suggestions, but I would really like to see the CheckDB output before saying anything.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • We did make a snapshot so that we could test it without risking production. The CHECKDB on it came back in 30 minutes, completely clean: no errors or corruption. However, the backup still fails with the same CRC error. We're trying to shrink the DB now, and then we'll retry the backup.

  • No. Don't. You run the risk of writing a good page over the bad block. Do nothing to that database.

    Try a backup WITH CONTINUE_AFTER_ERROR.
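
    Something along these lines (the backup path is a placeholder):

    BACKUP DATABASE T2KAdmin
        TO DISK = N'U:\Backups\T2KAdmin_after_error.bak'
        WITH CONTINUE_AFTER_ERROR, INIT, STATS = 5;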

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Too late, the boss had already fired off the SHRINK command. It completed successfully, so we tried another backup. It still fails at the same spot with the same error. We then tried the backup with the CONTINUE_AFTER_ERROR option, and it still fails with the CRC error. Not sure what to try now.
