'Best' HA/DR option to protect against data corruption

  • Hi All,

    I know this may seem like an open-ended question but last week we encountered data corruption (eventid 824) and it took us 9-10 hours to restore our 1TB database using a previous full backup and rolling forward 10 days worth of hourly transaction log backups.

    Now we are looking at ways in which to minimize the downtime due to data corruption so i'm considering using log shipping to a standby server. In this way, we can keep the standby server in read-only mode and delay the transaction log restore by say 6-12 hours. That way if we get another Eventid 824, severity level 24, i can do a quick dbcc checktable on the affected table, or even a dbcc checkdb against the standby database and if its good, apply the tlogs, bring it to the production server and attach it.

    That being said, i know there's database mirroring option (though i read it will be deprecated in future versions of SQL so that might not be promising) and also clustering (but this will only prevent /protect against the OS being corrupted, not the disks).

    We also have such things turned on as:

    - Alerts for severity levels 19-25

    - page_verify_option set to CHECKSUM on all databases

    Thoughts/comments on how you are mitigating data corruption in your environment and your plan(s) to bring back a VLDB in this event?

  • Hi, you didn't mention anything abt diff backups? That would have save a lot time as i you mentioned you had to roll forward 10 days.

    What is your current backup plan?

    Mirroring is the best option..since its real time.

    Which version of SQL you are using.... I believe mirroring is deprecated for 2014 version... due to always on I believe.

    Regards,
    SQLisAwe5oMe.

  • Weekly full, daily diff, hourly tlog backups is our strategy

    We tried the diff but since it was data corruption, the diff backups contained the corrupted pages

    We using sql 2008 ent edition and it supports a third party application

  • Mirroring is deprecated as most/all of it's features are superceded by AlwaysOn Availability Groups in SQL 2012+

    It's still present in SQL 2012 & 2014. Haven't done it myself but should be a reasonably easy conversion if you ever need to.

    It's worth noting Mirroring has a nifty feature called Automatic Page Repair, which fixes certain types of corruption by recovering the intact version of the corrupt page(s) from the mirror. http://msdn.microsoft.com/en-us/library/bb677167(v=sql.105).aspx

    Edit: for mirroring you'd need the ability to update any app's connection strings to include the FailoverPartner option. Might not be an option if it's a third party app.

  • 1tb isn't quite into the VLDB range, but, once you get past around 3-5tb, you need to start looking at things like SAN snapshots. But, even that could be problematic because the SAN can replicate the corruption.

    Prior to that, mirroring or AlwaysOn would be your best bets for dealing with corruption issues. While Availability Groups help with some types of HA/DR, they won't help with corruption due to the shared storage aspect.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Grant Fritchey (7/29/2014)


    1tb isn't quite into the VLDB range, but, once you get past around 3-5tb, you need to start looking at things like SAN snapshots. But, even that could be problematic because the SAN can replicate the corruption.

    Prior to that, mirroring or AlwaysOn would be your best bets for dealing with corruption issues. While Availability Groups help with some types of HA/DR, they won't help with corruption due to the shared storage aspect.

    Thanks all, I will look into database mirroring as well. There are basically 4 instances in 4 global regions I am considering (1 instance has SQL 2008 Enterprise edition, the other 3 have SQL 2008 Standard edition). I don't need automatic failover at this point since, if I encounter a data corruption error on the primary server, I would like to run a checktable or checkdb against the mirror database before switching it to production.

    Does database mirroring require a shared storage aspect? I don't think so, and since we're a virtual environment on top of a SAN, I would think the primary, witness and mirror server are all on separate LUNs, thus separate storage. I didn't know availability groups required shared storage, which as you say, won't help if I had data corruption issues.

    Also, to the connection string being updated to include automatic failover partner, at this point I don't think we would have automatic failover since I would like to be able to check the mirror before bringing it online.

    Just like if I had log shipping (which I read is cumbersome to configure and maintain and I would like to keep it simple), I would ship the logs say 4-6 hours delay and if a corruption occurred, I could check the standby database for corruption before bringing it online.

  • msandico 57892 (7/29/2014)


    Grant Fritchey (7/29/2014)


    1tb isn't quite into the VLDB range, but, once you get past around 3-5tb, you need to start looking at things like SAN snapshots. But, even that could be problematic because the SAN can replicate the corruption.

    Prior to that, mirroring or AlwaysOn would be your best bets for dealing with corruption issues. While Availability Groups help with some types of HA/DR, they won't help with corruption due to the shared storage aspect.

    Thanks all, I will look into database mirroring as well. There are basically 4 instances in 4 global regions I am considering (1 instance has SQL 2008 Enterprise edition, the other 3 have SQL 2008 Standard edition). I don't need automatic failover at this point since, if I encounter a data corruption error on the primary server, I would like to run a checktable or checkdb against the mirror database before switching it to production.

    Does database mirroring require a shared storage aspect? I don't think so, and since we're a virtual environment on top of a SAN, I would think the primary, witness and mirror server are all on separate LUNs, thus separate storage. I didn't know availability groups required shared storage, which as you say, won't help if I had data corruption issues.

    Also, to the connection string being updated to include automatic failover partner, at this point I don't think we would have automatic failover since I would like to be able to check the mirror before bringing it online.

    Just like if I had log shipping (which I read is cumbersome to configure and maintain and I would like to keep it simple), I would ship the logs say 4-6 hours delay and if a corruption occurred, I could check the standby database for corruption before bringing it online.

    Mirroring doesn't allow for shared storage, like AlwaysOn. So that aspect is likely to deal well with what you're looking for. Now, setting up the logs to only migrate every 4-6 hours could be problematic. That means you'll be maintaining an active log that size, which may add significantly to your maintenance and storage issues. Also, corruption can't be passed between servers when using mirroring (or AlwaysOn). While I agree that ensuring the other server is not corrupt is a nice thing to do, I'm not sure it should be driving your decision.

    Just a question, how often are you running into corruption issues. In over 20 years in the business, I've only run into a very few that directly impacted production, and most of them were within non-clustered indexes, easily recovered. If you're hitting lots and lots of corruption issues, you may have some serious hardware problems. You may be focusing in the wrong area. Better to prevent the corruption than have to deal with it over and over.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Grant Fritchey (7/29/2014)


    msandico 57892 (7/29/2014)


    Grant Fritchey (7/29/2014)


    1tb isn't quite into the VLDB range, but, once you get past around 3-5tb, you need to start looking at things like SAN snapshots. But, even that could be problematic because the SAN can replicate the corruption.

    Prior to that, mirroring or AlwaysOn would be your best bets for dealing with corruption issues. While Availability Groups help with some types of HA/DR, they won't help with corruption due to the shared storage aspect.

    Thanks all, I will look into database mirroring as well. There are basically 4 instances in 4 global regions I am considering (1 instance has SQL 2008 Enterprise edition, the other 3 have SQL 2008 Standard edition). I don't need automatic failover at this point since, if I encounter a data corruption error on the primary server, I would like to run a checktable or checkdb against the mirror database before switching it to production.

    Does database mirroring require a shared storage aspect? I don't think so, and since we're a virtual environment on top of a SAN, I would think the primary, witness and mirror server are all on separate LUNs, thus separate storage. I didn't know availability groups required shared storage, which as you say, won't help if I had data corruption issues.

    Also, to the connection string being updated to include automatic failover partner, at this point I don't think we would have automatic failover since I would like to be able to check the mirror before bringing it online.

    Just like if I had log shipping (which I read is cumbersome to configure and maintain and I would like to keep it simple), I would ship the logs say 4-6 hours delay and if a corruption occurred, I could check the standby database for corruption before bringing it online.

    Mirroring doesn't allow for shared storage, like AlwaysOn. So that aspect is likely to deal well with what you're looking for. Now, setting up the logs to only migrate every 4-6 hours could be problematic. That means you'll be maintaining an active log that size, which may add significantly to your maintenance and storage issues. Also, corruption can't be passed between servers when using mirroring (or AlwaysOn). While I agree that ensuring the other server is not corrupt is a nice thing to do, I'm not sure it should be driving your decision.

    Just a question, how often are you running into corruption issues. In over 20 years in the business, I've only run into a very few that directly impacted production, and most of them were within non-clustered indexes, easily recovered. If you're hitting lots and lots of corruption issues, you may have some serious hardware problems. You may be focusing in the wrong area. Better to prevent the corruption than have to deal with it over and over.

    Hi Grant,

    Thanks for that and yes, in my 10+ years of being a DBA I've only seen data corruption like this (event 824) twice - more often 10 years ago when SANs were coming out and there were a bunch of driver issues, and just now - so there was a gap of 10 or so years between seeing this.

    We think we've narrowed it down to a new backup agent that was deployed to all our hosts about a month back. This SQL server was not the only VM (Hyper-V) affected by disk corruption, I had about 5 others experiencing similar issues and non-SQL servers were being impacted as well. Our team has found some articles that other customers have had when using this particular backup agent version and our version of Hyper-V.....so they've removed it as of last week off all the hosts.

    So yes I agree the root cause should be the main focus, which we've hopefully addressed. Now from the SQL Server perspective, I am looking at options that don't necessarily prevent disk corruption (which is impossible from SQL server point of view right? Once SQL writes the page, and passes it to OS/storage subsystem, it's out of our control until we read it back and recalculate the page header checksum), but help us recover faster if a disaster happens again, whether it be corruption or not.

    Database mirroring sounds like a very promising solution and I will start to investigate it, break it and develop use cases for/against it...and then a cost analysis of course !

  • msandico 57892 (7/29/2014)


    Grant Fritchey (7/29/2014)


    msandico 57892 (7/29/2014)


    Grant Fritchey (7/29/2014)


    1tb isn't quite into the VLDB range, but, once you get past around 3-5tb, you need to start looking at things like SAN snapshots. But, even that could be problematic because the SAN can replicate the corruption.

    Prior to that, mirroring or AlwaysOn would be your best bets for dealing with corruption issues. While Availability Groups help with some types of HA/DR, they won't help with corruption due to the shared storage aspect.

    Thanks all, I will look into database mirroring as well. There are basically 4 instances in 4 global regions I am considering (1 instance has SQL 2008 Enterprise edition, the other 3 have SQL 2008 Standard edition). I don't need automatic failover at this point since, if I encounter a data corruption error on the primary server, I would like to run a checktable or checkdb against the mirror database before switching it to production.

    Does database mirroring require a shared storage aspect? I don't think so, and since we're a virtual environment on top of a SAN, I would think the primary, witness and mirror server are all on separate LUNs, thus separate storage. I didn't know availability groups required shared storage, which as you say, won't help if I had data corruption issues.

    Also, to the connection string being updated to include automatic failover partner, at this point I don't think we would have automatic failover since I would like to be able to check the mirror before bringing it online.

    Just like if I had log shipping (which I read is cumbersome to configure and maintain and I would like to keep it simple), I would ship the logs say 4-6 hours delay and if a corruption occurred, I could check the standby database for corruption before bringing it online.

    Mirroring doesn't allow for shared storage, like AlwaysOn. So that aspect is likely to deal well with what you're looking for. Now, setting up the logs to only migrate every 4-6 hours could be problematic. That means you'll be maintaining an active log that size, which may add significantly to your maintenance and storage issues. Also, corruption can't be passed between servers when using mirroring (or AlwaysOn). While I agree that ensuring the other server is not corrupt is a nice thing to do, I'm not sure it should be driving your decision.

    Just a question, how often are you running into corruption issues. In over 20 years in the business, I've only run into a very few that directly impacted production, and most of them were within non-clustered indexes, easily recovered. If you're hitting lots and lots of corruption issues, you may have some serious hardware problems. You may be focusing in the wrong area. Better to prevent the corruption than have to deal with it over and over.

    Hi Grant,

    Thanks for that and yes, in my 10+ years of being a DBA I've only seen data corruption like this (event 824) twice - more often 10 years ago when SANs were coming out and there were a bunch of driver issues, and just now - so there was a gap of 10 or so years between seeing this.

    We think we've narrowed it down to a new backup agent that was deployed to all our hosts about a month back. This SQL server was not the only VM (Hyper-V) affected by disk corruption, I had about 5 others experiencing similar issues and non-SQL servers were being impacted as well. Our team has found some articles that other customers have had when using this particular backup agent version and our version of Hyper-V.....so they've removed it as of last week off all the hosts.

    So yes I agree the root cause should be the main focus, which we've hopefully addressed. Now from the SQL Server perspective, I am looking at options that don't necessarily prevent disk corruption (which is impossible from SQL server point of view right? Once SQL writes the page, and passes it to OS/storage subsystem, it's out of our control until we read it back and recalculate the page header checksum), but help us recover faster if a disaster happens again, whether it be corruption or not.

    Database mirroring sounds like a very promising solution and I will start to investigate it, break it and develop use cases for/against it...and then a cost analysis of course !

    Sounds like a plan. Just know that mirroring is deprecated so it'll get removed the product at some point in the future. No word on when that is.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Start with regularly tested backups and frequent database integrity checks. You should be running integrity checks often enough that it's caught before ending up in the backups. Or, if there isn't a large enough maintenance window, restore the backups to another server and CheckDB that.

    Second, brush up on restore options. A 1TB database is probably in multiple files/filegroups. If the corruption was isolated, you should have been able to restore just the affected file/filegroup. Usually a lot faster than restoring the entire DB and if you have SQL enterprise edition, the restore is 'online' (rest of the database is online and usable). Depending on the extent of the damage, you may have been able to do page restores, restoring just the corrupted pages.

    Once that's in place, then start looking at HA/DR that support your SLAs.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • GilaMonster (7/29/2014)


    Start with regularly tested backups and frequent database integrity checks. You should be running integrity checks often enough that it's caught before ending up in the backups. Or, if there isn't a large enough maintenance window, restore the backups to another server and CheckDB that.

    Second, brush up on restore options. A 1TB database is probably in multiple files/filegroups. If the corruption was isolated, you should have been able to restore just the affected file/filegroup. Usually a lot faster than restoring the entire DB and if you have SQL enterprise edition, the restore is 'online' (rest of the database is online and usable). Depending on the extent of the damage, you may have been able to do page restores, restoring just the corrupted pages.

    Once that's in place, then start looking at HA/DR that support your SLAs.

    Thanks Gila, as this is a third-party application there is only 1 filegroup (PRIMARY), and it contains 10 years worth of data, therefore it's size. I will brush up on restore options however.

    Also, as you mentioned, I also have asked our team to investigate the feasibility of increasing our checkdb from weekly to daily (this one takes ~4hrs to run), but we have 100+ other SQL servers that are smaller in size and could run in an hour or less.

    Also, the extent of the damage was pretty bad - over 3500+ consistency errors spread amongst 10 tables. When we finally recovered , I compared my checktable pagecount to the checkdb pagecount initially for those 10 tables and the difference was about 36k pages difference in the restored db versus the corrupted database (so I assume if I did repair_allow_data_loss option I may have lost 36k pages!).

  • I said nothing implying you should have repaired...

    See if you can get some automated restore tests set up. Restore backups on another server, run CheckDB. Off the prod server means it doesn't matter how long they take, does mean you will know if your backups are good.

    See if there's any way you can archive older data into tables on another filegroup or another database entirely. Generally with a DB like that, the active portion is tiny compared to the entire DB.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • GilaMonster (7/29/2014)


    I said nothing implying you should have repaired...

    See if you can get some automated restore tests set up. Restore backups on another server, run CheckDB. Off the prod server means it doesn't matter how long they take, does mean you will know if your backups are good.

    See if there's any way you can archive older data into tables on another filegroup or another database entirely. Generally with a DB like that, the active portion is tiny compared to the entire DB.

    Ah I should have mentioned we had multiple 'solutions' occurring in parallel and repair_allow_data_loss was one of them as an absolute last resort.

    And yes, I managed to archive one table (yippee) into another database last year but their biggest table, an audit table, taking 300GB out of the 1TB, cannot be touched according to the business, they want all history in there..and accessible if needed..ugh.

    And yes (again), doing more frequent checkdbs is definitely being looked at..thanks again everyone.

Viewing 13 posts - 1 through 12 (of 12 total)

You must be logged in to reply to this topic. Login to reply