Click here to monitor SSC
SQLServerCentral is supported by Red Gate Software Ltd.
 
Log in  ::  Register  ::  Not logged in
 
 
 
        
Home       Members    Calendar    Who's On


Add to briefcase 12»»

'Best' HA/DR option to protect against data corruption Expand / Collapse
Author
Message
Posted Monday, July 28, 2014 3:49 PM
SSC Rookie

SSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC Rookie

Group: General Forum Members
Last Login: Friday, December 19, 2014 7:22 AM
Points: 25, Visits: 167
Hi All,

I know this may seem like an open-ended question but last week we encountered data corruption (eventid 824) and it took us 9-10 hours to restore our 1TB database using a previous full backup and rolling forward 10 days worth of hourly transaction log backups.

Now we are looking at ways in which to minimize the downtime due to data corruption so i'm considering using log shipping to a standby server. In this way, we can keep the standby server in read-only mode and delay the transaction log restore by say 6-12 hours. That way if we get another Eventid 824, severity level 24, i can do a quick dbcc checktable on the affected table, or even a dbcc checkdb against the standby database and if its good, apply the tlogs, bring it to the production server and attach it.

That being said, i know there's database mirroring option (though i read it will be deprecated in future versions of SQL so that might not be promising) and also clustering (but this will only prevent /protect against the OS being corrupted, not the disks).

We also have such things turned on as:
- Alerts for severity levels 19-25
- page_verify_option set to CHECKSUM on all databases

Thoughts/comments on how you are mitigating data corruption in your environment and your plan(s) to bring back a VLDB in this event?
Post #1597100
Posted Monday, July 28, 2014 8:15 PM
SSC-Addicted

SSC-AddictedSSC-AddictedSSC-AddictedSSC-AddictedSSC-AddictedSSC-AddictedSSC-AddictedSSC-Addicted

Group: General Forum Members
Last Login: Friday, December 5, 2014 8:02 PM
Points: 422, Visits: 2,662
Hi, you didn't mention anything abt diff backups? That would have save a lot time as i you mentioned you had to roll forward 10 days.

What is your current backup plan?

Mirroring is the best option..since its real time.

Which version of SQL you are using.... I believe mirroring is deprecated for 2014 version... due to always on I believe.


Regards,
SQLisAwe5oMe.
Post #1597141
Posted Monday, July 28, 2014 8:21 PM
SSC Rookie

SSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC Rookie

Group: General Forum Members
Last Login: Friday, December 19, 2014 7:22 AM
Points: 25, Visits: 167
Weekly full, daily diff, hourly tlog backups is our strategy

We tried the diff but since it was data corruption, the diff backups contained the corrupted pages

We using sql 2008 ent edition and it supports a third party application
Post #1597142
Posted Tuesday, July 29, 2014 3:59 AM
SSCrazy

SSCrazySSCrazySSCrazySSCrazySSCrazySSCrazySSCrazySSCrazy

Group: General Forum Members
Last Login: Wednesday, December 24, 2014 9:35 AM
Points: 2,114, Visits: 3,720
Mirroring is deprecated as most/all of it's features are superceded by AlwaysOn Availability Groups in SQL 2012+
It's still present in SQL 2012 & 2014. Haven't done it myself but should be a reasonably easy conversion if you ever need to.

It's worth noting Mirroring has a nifty feature called Automatic Page Repair, which fixes certain types of corruption by recovering the intact version of the corrupt page(s) from the mirror. http://msdn.microsoft.com/en-us/library/bb677167(v=sql.105).aspx

Edit: for mirroring you'd need the ability to update any app's connection strings to include the FailoverPartner option. Might not be an option if it's a third party app.
Post #1597206
Posted Tuesday, July 29, 2014 6:18 AM


SSChampion

SSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampion

Group: General Forum Members
Last Login: Today @ 4:10 AM
Points: 14,208, Visits: 28,539
1tb isn't quite into the VLDB range, but, once you get past around 3-5tb, you need to start looking at things like SAN snapshots. But, even that could be problematic because the SAN can replicate the corruption.

Prior to that, mirroring or AlwaysOn would be your best bets for dealing with corruption issues. While Availability Groups help with some types of HA/DR, they won't help with corruption due to the shared storage aspect.


----------------------------------------------------
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood..." Theodore Roosevelt
The Scary DBA
Author of:
SQL Server Query Performance Tuning
and
SQL Server Execution Plans

Product Evangelist for Red Gate Software
Post #1597243
Posted Tuesday, July 29, 2014 8:22 AM
SSC Rookie

SSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC Rookie

Group: General Forum Members
Last Login: Friday, December 19, 2014 7:22 AM
Points: 25, Visits: 167
Grant Fritchey (7/29/2014)
1tb isn't quite into the VLDB range, but, once you get past around 3-5tb, you need to start looking at things like SAN snapshots. But, even that could be problematic because the SAN can replicate the corruption.

Prior to that, mirroring or AlwaysOn would be your best bets for dealing with corruption issues. While Availability Groups help with some types of HA/DR, they won't help with corruption due to the shared storage aspect.


Thanks all, I will look into database mirroring as well. There are basically 4 instances in 4 global regions I am considering (1 instance has SQL 2008 Enterprise edition, the other 3 have SQL 2008 Standard edition). I don't need automatic failover at this point since, if I encounter a data corruption error on the primary server, I would like to run a checktable or checkdb against the mirror database before switching it to production.

Does database mirroring require a shared storage aspect? I don't think so, and since we're a virtual environment on top of a SAN, I would think the primary, witness and mirror server are all on separate LUNs, thus separate storage. I didn't know availability groups required shared storage, which as you say, won't help if I had data corruption issues.

Also, to the connection string being updated to include automatic failover partner, at this point I don't think we would have automatic failover since I would like to be able to check the mirror before bringing it online.

Just like if I had log shipping (which I read is cumbersome to configure and maintain and I would like to keep it simple), I would ship the logs say 4-6 hours delay and if a corruption occurred, I could check the standby database for corruption before bringing it online.
Post #1597315
Posted Tuesday, July 29, 2014 8:46 AM


SSChampion

SSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampion

Group: General Forum Members
Last Login: Today @ 4:10 AM
Points: 14,208, Visits: 28,539
msandico 57892 (7/29/2014)
Grant Fritchey (7/29/2014)
1tb isn't quite into the VLDB range, but, once you get past around 3-5tb, you need to start looking at things like SAN snapshots. But, even that could be problematic because the SAN can replicate the corruption.

Prior to that, mirroring or AlwaysOn would be your best bets for dealing with corruption issues. While Availability Groups help with some types of HA/DR, they won't help with corruption due to the shared storage aspect.


Thanks all, I will look into database mirroring as well. There are basically 4 instances in 4 global regions I am considering (1 instance has SQL 2008 Enterprise edition, the other 3 have SQL 2008 Standard edition). I don't need automatic failover at this point since, if I encounter a data corruption error on the primary server, I would like to run a checktable or checkdb against the mirror database before switching it to production.

Does database mirroring require a shared storage aspect? I don't think so, and since we're a virtual environment on top of a SAN, I would think the primary, witness and mirror server are all on separate LUNs, thus separate storage. I didn't know availability groups required shared storage, which as you say, won't help if I had data corruption issues.

Also, to the connection string being updated to include automatic failover partner, at this point I don't think we would have automatic failover since I would like to be able to check the mirror before bringing it online.

Just like if I had log shipping (which I read is cumbersome to configure and maintain and I would like to keep it simple), I would ship the logs say 4-6 hours delay and if a corruption occurred, I could check the standby database for corruption before bringing it online.


Mirroring doesn't allow for shared storage, like AlwaysOn. So that aspect is likely to deal well with what you're looking for. Now, setting up the logs to only migrate every 4-6 hours could be problematic. That means you'll be maintaining an active log that size, which may add significantly to your maintenance and storage issues. Also, corruption can't be passed between servers when using mirroring (or AlwaysOn). While I agree that ensuring the other server is not corrupt is a nice thing to do, I'm not sure it should be driving your decision.

Just a question, how often are you running into corruption issues. In over 20 years in the business, I've only run into a very few that directly impacted production, and most of them were within non-clustered indexes, easily recovered. If you're hitting lots and lots of corruption issues, you may have some serious hardware problems. You may be focusing in the wrong area. Better to prevent the corruption than have to deal with it over and over.


----------------------------------------------------
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood..." Theodore Roosevelt
The Scary DBA
Author of:
SQL Server Query Performance Tuning
and
SQL Server Execution Plans

Product Evangelist for Red Gate Software
Post #1597332
Posted Tuesday, July 29, 2014 8:56 AM
SSC Rookie

SSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC Rookie

Group: General Forum Members
Last Login: Friday, December 19, 2014 7:22 AM
Points: 25, Visits: 167
Grant Fritchey (7/29/2014)
msandico 57892 (7/29/2014)
Grant Fritchey (7/29/2014)
1tb isn't quite into the VLDB range, but, once you get past around 3-5tb, you need to start looking at things like SAN snapshots. But, even that could be problematic because the SAN can replicate the corruption.

Prior to that, mirroring or AlwaysOn would be your best bets for dealing with corruption issues. While Availability Groups help with some types of HA/DR, they won't help with corruption due to the shared storage aspect.


Thanks all, I will look into database mirroring as well. There are basically 4 instances in 4 global regions I am considering (1 instance has SQL 2008 Enterprise edition, the other 3 have SQL 2008 Standard edition). I don't need automatic failover at this point since, if I encounter a data corruption error on the primary server, I would like to run a checktable or checkdb against the mirror database before switching it to production.

Does database mirroring require a shared storage aspect? I don't think so, and since we're a virtual environment on top of a SAN, I would think the primary, witness and mirror server are all on separate LUNs, thus separate storage. I didn't know availability groups required shared storage, which as you say, won't help if I had data corruption issues.

Also, to the connection string being updated to include automatic failover partner, at this point I don't think we would have automatic failover since I would like to be able to check the mirror before bringing it online.

Just like if I had log shipping (which I read is cumbersome to configure and maintain and I would like to keep it simple), I would ship the logs say 4-6 hours delay and if a corruption occurred, I could check the standby database for corruption before bringing it online.


Mirroring doesn't allow for shared storage, like AlwaysOn. So that aspect is likely to deal well with what you're looking for. Now, setting up the logs to only migrate every 4-6 hours could be problematic. That means you'll be maintaining an active log that size, which may add significantly to your maintenance and storage issues. Also, corruption can't be passed between servers when using mirroring (or AlwaysOn). While I agree that ensuring the other server is not corrupt is a nice thing to do, I'm not sure it should be driving your decision.

Just a question, how often are you running into corruption issues. In over 20 years in the business, I've only run into a very few that directly impacted production, and most of them were within non-clustered indexes, easily recovered. If you're hitting lots and lots of corruption issues, you may have some serious hardware problems. You may be focusing in the wrong area. Better to prevent the corruption than have to deal with it over and over.


Hi Grant,
Thanks for that and yes, in my 10+ years of being a DBA I've only seen data corruption like this (event 824) twice - more often 10 years ago when SANs were coming out and there were a bunch of driver issues, and just now - so there was a gap of 10 or so years between seeing this.
We think we've narrowed it down to a new backup agent that was deployed to all our hosts about a month back. This SQL server was not the only VM (Hyper-V) affected by disk corruption, I had about 5 others experiencing similar issues and non-SQL servers were being impacted as well. Our team has found some articles that other customers have had when using this particular backup agent version and our version of Hyper-V.....so they've removed it as of last week off all the hosts.

So yes I agree the root cause should be the main focus, which we've hopefully addressed. Now from the SQL Server perspective, I am looking at options that don't necessarily prevent disk corruption (which is impossible from SQL server point of view right? Once SQL writes the page, and passes it to OS/storage subsystem, it's out of our control until we read it back and recalculate the page header checksum), but help us recover faster if a disaster happens again, whether it be corruption or not.

Database mirroring sounds like a very promising solution and I will start to investigate it, break it and develop use cases for/against it...and then a cost analysis of course !
Post #1597339
Posted Tuesday, July 29, 2014 9:21 AM


SSChampion

SSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampion

Group: General Forum Members
Last Login: Today @ 4:10 AM
Points: 14,208, Visits: 28,539
msandico 57892 (7/29/2014)
Grant Fritchey (7/29/2014)
msandico 57892 (7/29/2014)
Grant Fritchey (7/29/2014)
1tb isn't quite into the VLDB range, but, once you get past around 3-5tb, you need to start looking at things like SAN snapshots. But, even that could be problematic because the SAN can replicate the corruption.

Prior to that, mirroring or AlwaysOn would be your best bets for dealing with corruption issues. While Availability Groups help with some types of HA/DR, they won't help with corruption due to the shared storage aspect.


Thanks all, I will look into database mirroring as well. There are basically 4 instances in 4 global regions I am considering (1 instance has SQL 2008 Enterprise edition, the other 3 have SQL 2008 Standard edition). I don't need automatic failover at this point since, if I encounter a data corruption error on the primary server, I would like to run a checktable or checkdb against the mirror database before switching it to production.

Does database mirroring require a shared storage aspect? I don't think so, and since we're a virtual environment on top of a SAN, I would think the primary, witness and mirror server are all on separate LUNs, thus separate storage. I didn't know availability groups required shared storage, which as you say, won't help if I had data corruption issues.

Also, to the connection string being updated to include automatic failover partner, at this point I don't think we would have automatic failover since I would like to be able to check the mirror before bringing it online.

Just like if I had log shipping (which I read is cumbersome to configure and maintain and I would like to keep it simple), I would ship the logs say 4-6 hours delay and if a corruption occurred, I could check the standby database for corruption before bringing it online.


Mirroring doesn't allow for shared storage, like AlwaysOn. So that aspect is likely to deal well with what you're looking for. Now, setting up the logs to only migrate every 4-6 hours could be problematic. That means you'll be maintaining an active log that size, which may add significantly to your maintenance and storage issues. Also, corruption can't be passed between servers when using mirroring (or AlwaysOn). While I agree that ensuring the other server is not corrupt is a nice thing to do, I'm not sure it should be driving your decision.

Just a question, how often are you running into corruption issues. In over 20 years in the business, I've only run into a very few that directly impacted production, and most of them were within non-clustered indexes, easily recovered. If you're hitting lots and lots of corruption issues, you may have some serious hardware problems. You may be focusing in the wrong area. Better to prevent the corruption than have to deal with it over and over.


Hi Grant,
Thanks for that and yes, in my 10+ years of being a DBA I've only seen data corruption like this (event 824) twice - more often 10 years ago when SANs were coming out and there were a bunch of driver issues, and just now - so there was a gap of 10 or so years between seeing this.
We think we've narrowed it down to a new backup agent that was deployed to all our hosts about a month back. This SQL server was not the only VM (Hyper-V) affected by disk corruption, I had about 5 others experiencing similar issues and non-SQL servers were being impacted as well. Our team has found some articles that other customers have had when using this particular backup agent version and our version of Hyper-V.....so they've removed it as of last week off all the hosts.

So yes I agree the root cause should be the main focus, which we've hopefully addressed. Now from the SQL Server perspective, I am looking at options that don't necessarily prevent disk corruption (which is impossible from SQL server point of view right? Once SQL writes the page, and passes it to OS/storage subsystem, it's out of our control until we read it back and recalculate the page header checksum), but help us recover faster if a disaster happens again, whether it be corruption or not.

Database mirroring sounds like a very promising solution and I will start to investigate it, break it and develop use cases for/against it...and then a cost analysis of course !


Sounds like a plan. Just know that mirroring is deprecated so it'll get removed the product at some point in the future. No word on when that is.


----------------------------------------------------
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood..." Theodore Roosevelt
The Scary DBA
Author of:
SQL Server Query Performance Tuning
and
SQL Server Execution Plans

Product Evangelist for Red Gate Software
Post #1597351
Posted Tuesday, July 29, 2014 9:47 AM


SSC-Forever

SSC-ForeverSSC-ForeverSSC-ForeverSSC-ForeverSSC-ForeverSSC-ForeverSSC-ForeverSSC-ForeverSSC-ForeverSSC-ForeverSSC-ForeverSSC-ForeverSSC-ForeverSSC-ForeverSSC-Forever

Group: General Forum Members
Last Login: Today @ 5:24 AM
Points: 40,694, Visits: 37,147
Start with regularly tested backups and frequent database integrity checks. You should be running integrity checks often enough that it's caught before ending up in the backups. Or, if there isn't a large enough maintenance window, restore the backups to another server and CheckDB that.

Second, brush up on restore options. A 1TB database is probably in multiple files/filegroups. If the corruption was isolated, you should have been able to restore just the affected file/filegroup. Usually a lot faster than restoring the entire DB and if you have SQL enterprise edition, the restore is 'online' (rest of the database is online and usable). Depending on the extent of the damage, you may have been able to do page restores, restoring just the corrupted pages.

Once that's in place, then start looking at HA/DR that support your SLAs.



Gail Shaw
Microsoft Certified Master: SQL Server 2008, MVP
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

We walk in the dark places no others will enter
We stand on the bridge and no one may pass

Post #1597372
« Prev Topic | Next Topic »

Add to briefcase 12»»

Permissions Expand / Collapse