Yeah! my first real corrupt DB

  • I don't have a publicly available ftp upload space - if you can post it somewhere, I'll download it.

    Paul Randal
    CEO, SQLskills.com: Check out SQLskills online training!
    Blog:www.SQLskills.com/blogs/paul Twitter: @PaulRandal
    SQL MVP, Microsoft RD, Contributing Editor of TechNet Magazine
    Author of DBCC CHECKDB/repair (and other Storage Engine) code of SQL Server 2005

  • @paul-2, in case you need the exact error for the backup command :

    Msg 926, Level 14, State 1, Line 1

    Database 'Test Upgrade Dynamics' cannot be opened. It has been marked SUSPECT by recovery. See the SQL Server errorlog for more information.

    Msg 3013, Level 16, State 1, Line 1

    BACKUP DATABASE is terminating abnormally.

  • Copy is done. Expecting 3 hours for compression...

  • Ran this code after bringing the server back online and now the db is available. I now wish I had used the WITH NO_INFOMSGS, ALL_ERRORMSGS arguments. As far as I can see DBCC didn't return any error nor deleted any data but I didn't spend 10 hours reading the output...

    ALTER DATABASE [Test Upgrade Dynamics] SET SINGLE_USER

    DBCC CHECKDB('Test Upgrade Dynamics', REPAIR_ALLOW_DATA_LOSS)

    ALTER DATABASE [Test Upgrade Dynamics] SET MULTI_USER

    ALTER DATABASE [Test Upgrade Dynamics] SET ONLINE

    I now have access to the db.

    I forgot the exact state & rowcount of the table I was working on so it's hard to say if I lost anything. My guess would be not, but I'll rebuild the DB from our prod just to be safe.

    So bottom line, ALWAYS BACKUP, even 24 hour test dbs ;-).

    I'll ask for san checkups tomorrow (everybody's out but me). I'll keep posting the details as I get more info.

    Thank you all for the support.

  • @paul-2 : How many cases do you have to contend for?

    Now I'd like to explore the scenario where I do have a full backup + full recovery.

    Would I have been able to take the tail log backup? I guess it would only be possible with CONTINUE_ON_ERROR.

    Ok so I restore the full backup and available GOOD tlogs. Now what?

    There is a continue after error for the tlog restore. If I use this, does it put the db pretty much in a similar state as checkdb() allow data loss? I'd love a little more than it depends if possible :-).

    How can you figure out the max PIT recovery you can go to without hitting the error so that you get a transaction consistant DB? I'm seeing the STOP_ON_ERROR option but I'm not getting much details neither in BOL nor google.

    TIA

  • Paul Randal (8/9/2011)


    There are hundreds of cases I didn't talk about in that presentation - but glad you liked it!

    Sounds like memory corruption of a log block changing the slot number in a log record.

    What are the backup errors you're getting?

    How big is the database? (wondering if you can take if offline, zip it up and send it to me to play with)

    P.S. The zipped version is ± 8GB. I might be able to find a place to store this if I can get permission to. It's kind of our prod ERP data... 😉

  • Ninja's_RGR'us (8/10/2011)


    Paul Randal (8/9/2011)


    There are hundreds of cases I didn't talk about in that presentation - but glad you liked it!

    Sounds like memory corruption of a log block changing the slot number in a log record.

    What are the backup errors you're getting?

    How big is the database? (wondering if you can take if offline, zip it up and send it to me to play with)

    P.S. The zipped version is ± 8GB. I might be able to find a place to store this if I can get permission to. It's kind of our prod ERP data... 😉

    I was going to ask for a copy myself, at that size, maybe not...

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Ah - if it's business data from your company I won't download it to our systems unless you're a client covered by an NDA and our business liability insurance. Sorry.

    So sounds like it was log corruption and not database corruption, as I surmised. Without analyzing the log at the point that is being complained about it's hard to confirm.

    As far as corruptions - you have to be able to recovery from all of them - which is where backups and redundant database copies come in. Memory corruption of log records is very insidious as it will break backups, restores, log shipping, mirroring, replication. It's pretty rare - yours is the first case I've suspected for a year or so. What can you do about it when it happens? Nothing really as far as SQL Server is concerned. I'd run full memory diagnostics on your server.

    Thanks

    Paul Randal
    CEO, SQLskills.com: Check out SQLskills online training!
    Blog:www.SQLskills.com/blogs/paul Twitter: @PaulRandal
    SQL MVP, Microsoft RD, Contributing Editor of TechNet Magazine
    Author of DBCC CHECKDB/repair (and other Storage Engine) code of SQL Server 2005

  • Paul Randal (8/10/2011)


    Ah - if it's business data from your company I won't download it to our systems unless you're a client covered by an NDA and our business liability insurance. Sorry.

    So sounds like it was log corruption and not database corruption, as I surmised. Without analyzing the log at the point that is being complained about it's hard to confirm.

    As far as corruptions - you have to be able to recovery from all of them - which is where backups and redundant database copies come in. Memory corruption of log records is very insidious as it will break backups, restores, log shipping, mirroring, replication. It's pretty rare - yours is the first case I've suspected for a year or so. What can you do about it when it happens? Nothing really as far as SQL Server is concerned. I'd run full memory diagnostics on your server.

    Thanks

    From the error message and lack of corruption found with checdb I don't see how it could be anything else than log corruption... but what do I know :-D.

    We're bringing in a expert to run those diagnostics. Don't have that expertise in-house. We're had our electrical departement shut down curent with 2 minutes notice (yes they got an earful!).

    This is a prod-copied vm, on 2 sets of sans. Our juniors don't have that kind of knowledge to track the issue.

    The bottom line is that the curent was constantly fluctuating for a good 15 minutes before a complete shutdown. We have a good generator but it doesn't cope with that well. So shutting down the san 2-100 times / minute like that is not good :-D. The senior admin is pretty pissed atm as you can imagine. The best news about this is a test copy of a test db on a test server & that now we can put our foot down and this will never happen again without serious consequence <if at all>.

    Could it be an option to only send over the logs with an emptied db (except for the table in question). I'm sure our bosses could go for that one.

    I could also possibly sell them the NDA option. If / How could that work?

  • @paul-2, do you have other presentation like teched that cover other cases? I'd love to learn more on the subject.

    I'll also try to repro the error. But considered I have been using this technic for years to develop I don't have high hopes. The only change is maybe 100+ queries run within transactions and then a PK error, maybe the rollback harder to do. That one was new to my dev "errors".

    Of course the big elephant in the room is the electrical failure less than 24 hours before the event ;-).

    So to sumrise this one. There was not hope to get back a consistant db with the tail log (assuming full back + tlog backups before the event)?

    What's the behavior of restore with stop on error, is it the equivalent of STOP_AT MAX(consitant transaction)?

  • You need a UPS so the current fluctuations don't mess with your systems.

    Don't worry about it - too much hassle going through client agreements etc - it would need the whole DB to make sense of the log and that's going to be a lot more than a 5-minute exercise - don't have time for that I'm afraid. Just chalk it up to log corruption.

    Paul Randal
    CEO, SQLskills.com: Check out SQLskills online training!
    Blog:www.SQLskills.com/blogs/paul Twitter: @PaulRandal
    SQL MVP, Microsoft RD, Contributing Editor of TechNet Magazine
    Author of DBCC CHECKDB/repair (and other Storage Engine) code of SQL Server 2005

  • More presentations - yes, but not recorded.

    Restore behavior - yes, as you've described.

    Getting things back without data loss - no, of tail-of-log is corrupt.

    Paul Randal
    CEO, SQLskills.com: Check out SQLskills online training!
    Blog:www.SQLskills.com/blogs/paul Twitter: @PaulRandal
    SQL MVP, Microsoft RD, Contributing Editor of TechNet Magazine
    Author of DBCC CHECKDB/repair (and other Storage Engine) code of SQL Server 2005

  • Paul Randal (8/10/2011)


    You need a UPS so the current fluctuations don't mess with your systems.

    Don't worry about it - too much hassle going through client agreements etc - it would need the whole DB to make sense of the log and that's going to be a lot more than a 5-minute exercise - don't have time for that I'm afraid. Just chalk it up to log corruption.

    What have UPS too. This was just freaky!

    Our senior admin is really genius at setting redundancy and safe guard against well, anything!

    We've had the same issue with the electric company a few months ago and the system didn't fail that time. We had to shut down because the ac wasn't keeping the room cool enough (and we had double the normal amount of machines for a few days).

    All that considering we only need to be up 12 hours / day 5 days / week. Week-ends too if possible but not 100% required.

  • Paul Randal (8/10/2011)


    More presentations - yes, but not recorded.

    Restore behavior - yes, as you've described.

    Getting things back without data loss - no, of tail-of-log is corrupt.

    Thanks for all the help, both of you.

    Paul do you have anything else recorded? Google's getting me 2-3 interviews but that's about it.

  • http://www.sqlskills.com/T_MCMVideos.asp

    Paul Randal
    CEO, SQLskills.com: Check out SQLskills online training!
    Blog:www.SQLskills.com/blogs/paul Twitter: @PaulRandal
    SQL MVP, Microsoft RD, Contributing Editor of TechNet Magazine
    Author of DBCC CHECKDB/repair (and other Storage Engine) code of SQL Server 2005

Viewing 15 posts - 16 through 30 (of 78 total)

You must be logged in to reply to this topic. Login to reply