RE: 'Best' HA/DR option to protect against data corruption

Ten Centuries

Points: 1063

July 29, 2014 at 8:56 am

Grant Fritchey (7/29/2014)
msandico 57892 (7/29/2014)
Grant Fritchey (7/29/2014)
1tb isn't quite into the VLDB range, but, once you get past around 3-5tb, you need to start looking at things like SAN snapshots. But, even that could be problematic because the SAN can replicate the corruption.
Prior to that, mirroring or AlwaysOn would be your best bets for dealing with corruption issues. While Availability Groups help with some types of HA/DR, they won't help with corruption due to the shared storage aspect.
Thanks all, I will look into database mirroring as well. There are basically 4 instances in 4 global regions I am considering (1 instance has SQL 2008 Enterprise edition, the other 3 have SQL 2008 Standard edition). I don't need automatic failover at this point since, if I encounter a data corruption error on the primary server, I would like to run a checktable or checkdb against the mirror database before switching it to production.
Does database mirroring require a shared storage aspect? I don't think so, and since we're a virtual environment on top of a SAN, I would think the primary, witness and mirror server are all on separate LUNs, thus separate storage. I didn't know availability groups required shared storage, which as you say, won't help if I had data corruption issues.
Also, to the connection string being updated to include automatic failover partner, at this point I don't think we would have automatic failover since I would like to be able to check the mirror before bringing it online.
Just like if I had log shipping (which I read is cumbersome to configure and maintain and I would like to keep it simple), I would ship the logs say 4-6 hours delay and if a corruption occurred, I could check the standby database for corruption before bringing it online.
Mirroring doesn't allow for shared storage, like AlwaysOn. So that aspect is likely to deal well with what you're looking for. Now, setting up the logs to only migrate every 4-6 hours could be problematic. That means you'll be maintaining an active log that size, which may add significantly to your maintenance and storage issues. Also, corruption can't be passed between servers when using mirroring (or AlwaysOn). While I agree that ensuring the other server is not corrupt is a nice thing to do, I'm not sure it should be driving your decision.
Just a question, how often are you running into corruption issues. In over 20 years in the business, I've only run into a very few that directly impacted production, and most of them were within non-clustered indexes, easily recovered. If you're hitting lots and lots of corruption issues, you may have some serious hardware problems. You may be focusing in the wrong area. Better to prevent the corruption than have to deal with it over and over.

Hi Grant,

Thanks for that and yes, in my 10+ years of being a DBA I've only seen data corruption like this (event 824) twice - more often 10 years ago when SANs were coming out and there were a bunch of driver issues, and just now - so there was a gap of 10 or so years between seeing this.

We think we've narrowed it down to a new backup agent that was deployed to all our hosts about a month back. This SQL server was not the only VM (Hyper-V) affected by disk corruption, I had about 5 others experiencing similar issues and non-SQL servers were being impacted as well. Our team has found some articles that other customers have had when using this particular backup agent version and our version of Hyper-V.....so they've removed it as of last week off all the hosts.

So yes I agree the root cause should be the main focus, which we've hopefully addressed. Now from the SQL Server perspective, I am looking at options that don't necessarily prevent disk corruption (which is impossible from SQL server point of view right? Once SQL writes the page, and passes it to OS/storage subsystem, it's out of our control until we read it back and recalculate the page header checksum), but help us recover faster if a disaster happens again, whether it be corruption or not.

Database mirroring sounds like a very promising solution and I will start to investigate it, break it and develop use cases for/against it...and then a cost analysis of course !