Data corruption aftermath -- future prevention

  • Last week, we discovered that our SQL Server installation had some significant data corruption in it (error 823 with some of the pages listed by checkdb having no segment ID to give a clue what object they belonged to). Moderate size DB (about half a terabyte).

    We take regular backups (and they were good), but after several restores, we found that the number of errors had gotten progressively higher, but that corruption had already existed for more than four days by the time we caught it (we were only running checkdb once per week because of its duration, and the corruption apparently happened shortly after the successful weekend check).

    Because of the amount of time we would lose, and because the pages that were identified by checkdb as problems belonged mostly to low-importance tables, we did try (on a copy) to run checkdb with allow data loss, and it failed.

    So, we restored from backup before corruption, and haven been trying to rebuild the missing data by extracting it from the corrupt database, which has obviously been a long and tedious process, and obviously people are concerned about the amount of time it is taking.

    So, the famous question is, of course, how do we keep this from ever happening again?

    Actions taken so far:

    1) Running nightly physical only checkdb (keeping the weekly full check)

    2) Automated monitoring of the suspect_pages table

    But I'm just a programmer pretending to be a DBA on this, so I'm looking for some advice on what else we ought to be doing.

    Some questions I'm getting asked:

    1) If this had happened on our Oracle database, we could have restored the database from before corruption, restored all the archive logs from then to present, and rolled forward to present, and quite possibly recovered everything with no problems. What do we have to do to allow this for SQL Server?

    2) Is there any additional redundancy that can be added to the system to prevent this type of situation or ease recovery?

  • If your databases were set to FULL and you had a good full backup to restore you could have restored transaction log backups to restore the data. Yes, you need to run checkdb as often as possible. The question becomes what caused the corruption. Typically it is the drives that does that not SQLServer. Have you identified the corruption? has the server team looked at the drives or SAN?

  • Have you identified the corruption? has the server team looked at the drives or SAN?

    Both our internal SAN team, and the vendor have looked at it, and assured me that nothing went wrong with the storage subsystem. It is just a wee bit suspicious that it happened within a couple of weeks to moving it to new storage, but regardless of that, I cannot systematically prove that it was an IO subsystem failure.

    If your databases were set to FULL

    They weren't. Our backup software vendor advised SIMPLE because their system relies on VSS to do point in time backups, which doesn't have a way to let SQL Server know a backup happened, so a FULL recovery log would grow indefinitely. Said vendor shall remain nameless, but suffice it to say that for a lot of reasons, we are looking for a replacement.

    But, if I did have several days worth of hourly FULL recovery logs backed up perfectly somehow, I would be able to roll forward through all the logs (and hopefully not miss any data), right?

    So, just from the hip, I could

    1. change to full recovery,

    2. set up a scheduled job to backup just the logs somewhere,

    then in time of failure

    3. use the existing snapshot backup to get back before corruption

    4. get retrievals of all the intervening log backups

    5. apply them iteratively

    This is the step I'm not sure will actually work; since the log file from #3 will also be (partially) one of the logs retrieved from #4, I'm not sure what will happen when I try to apply it (or others for that matter).

  • First, to prevent this from happening again, take steps to insure that you are not restoring good data on top of faulty hardware, so run a disk subsystem diagnostic.

    Going forward, consider keeping an eye on the systme table [suspect_pages].

    https://msdn.microsoft.com/en-us/library/ms191301.aspx

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • run a disk subsystem diagnostic

    Recommendation for a tool to use? Chkdsk shows no problems, and whatever tools the SAN guys used to check it showed no issues.

    consider keeping an eye on the systme table [suspect_pages]

    Agreed, and already done.

  • The real problem is, you can't really prevent corruption. You could just have a bit of solar radiation hit the disk at just the right angle flipping a 1 to a 0 and you have a corrupted database. What you can do is work around protecting the issue. Running DBCC checks more frequently, or, using an offline system to restore your database to and run DBCC checks there. It's still a good idea to do the PHYSICAL_ONLY checks on the main server, but you can run all the LOGICAL_ONLY checks (the longer part of the process) against the backup. Make sure you're testing the backups regularly too. You have to provide as much protection around recovery from corruption as possible.

    And yeah, I agree, getting the corruption immediately after setting up new hardware is extremely suspicious.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • So, to kind of re-iterate a question from higher in the thread:

    If I do just a log backup in a timed job, and then restore a "snapshot" of all the disks (SAN level backup), including all the mdf, ndf, and ldf files as they were at that instant, and then try to roll the database forward or back based on information in the log file backups, will it work?

    My concern is that the mdf / ndf / ldf that gets restored from SAN snapshot might have transactions 101-199 in it, but because of the schedule on which the timed backup is running, the log backup files (which could be many) have transaction 50-149, 150-249, 250-349, etc. in them. Will the restore process be smart enough to recognize what transactions belong where?

  • Just my opinion, but if the backup system you are using doesn't quite the database before doing the snapshot, you aren't guaranteed to get a reliable backup.

    Running SQL backups using the native backup/restore or using a third party tool from Dell, Idera, Redgate and then backing those up when completed would make more sense. Also, the larger the database the more you may want to look at piece meal backup/restores. I haven't done the later, but some people here have and could help point you in the right direction. Some may also suggest bringing in a consultant to help set it up as they are not trivial.

  • nhansen pcc (4/12/2016)


    So, to kind of re-iterate a question from higher in the thread:

    If I do just a log backup in a timed job, and then restore a "snapshot" of all the disks (SAN level backup), including all the mdf, ndf, and ldf files as they were at that instant, and then try to roll the database forward or back based on information in the log file backups, will it work?

    My concern is that the mdf / ndf / ldf that gets restored from SAN snapshot might have transactions 101-199 in it, but because of the schedule on which the timed backup is running, the log backup files (which could be many) have transaction 50-149, 150-249, 250-349, etc. in them. Will the restore process be smart enough to recognize what transactions belong where?

    I seriously doubt it. There are SAN systems that are transactionally aware such that they will allow you to successfully restore an uncorrupted database. However, they won't restore in a RECOVERING state so that you can then apply log backups. I've never heard of anything approaching a mix between SAN snapshots and log backups. In fact, the biggest push against the SAN backups as the primary means of backing up is the inability to do a point in time restore from them.

    I do know large scale systems (many tb, 10+), which have been forced to surrender to the inevitable and they don't have the ability to do point in time recovery. Instead, they rely on multiple layers of redundancy through the SAN and services like Availability Groups. If you're that large, toss the log backups and focus on having secondary systems as protection. If you're not that large, focus on traditional backup strategies, not the SAN.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • they won't restore in a RECOVERING state so that you can then apply log backups.

    That makes sense.

    the biggest push against the SAN backups as the primary means of backing up is the inability to do a point in time restore

    Our SAN backup is actually pretty nice; it got me within 15 minutes of the time I wanted. Obviously not quite "point in time" like the SQL restore could have supplied, but a lot better than what you might be thinking.

    So, just to make sure I understand everything properly, if I had available full, differential, and log backups before, during, and after the corruption period, could I have:

    1. Restored full backup (most recent one available before corruption)

    2. Ignored all other full and differential backups (since they could have possibly had the corruption in them as well; or could they?)

    3. Iteratively applied all log backups starting after #1 up to present

    Will the database ever say "Where's my differential/full backup? You need to apply that first (or should have applied it instead of all those log backups)."

    Would I need to do something special to create a tail log backup before every full/differential (not even sure that is the right terminology)? And would doing so require downtime (e.g.: would more stuff enter the log between backup commands otherwise)?

    Again, the goal here is to somehow get the database rolled back to before the corruption happened, yet still roll all the transactions forward to present without recreating the corruption again.

  • nhansen pcc (4/13/2016)

    Our SAN backup is actually pretty nice; it got me within 15 minutes of the time I wanted. Obviously not quite "point in time" like the SQL restore could have supplied, but a lot better than what you might be thinking.

    Hey, 15 minutes may meet the RPO. In which case, you're OK.

    So, just to make sure I understand everything properly, if I had available full, differential, and log backups before, during, and after the corruption period, could I have:

    1. Restored full backup (most recent one available before corruption)

    2. Ignored all other full and differential backups (since they could have possibly had the corruption in them as well; or could they?)

    They may or may not. It depends on when the corruption occurred and the type of corruption.

    3. Iteratively applied all log backups starting after #1 up to present

    Will the database ever say "Where's my differential/full backup?

    No. Differentials and logs are separate (but equal-ish). If you go down the differential path, the logs can pick up from there, but you only can restore a single differential on top of a full (meaning, you don't add your differentials to arrive at a point, you just apply the last one nearest the point you want), and the logs just don't care either way.

    You need to apply that first (or should have applied it instead of all those log backups)."

    Would I need to do something special to create a tail log backup before every full/differential (not even sure that is the right terminology)? And would doing so require downtime (e.g.: would more stuff enter the log between backup commands otherwise)?

    Again, the goal here is to somehow get the database rolled back to before the corruption happened, yet still roll all the transactions forward to present without recreating the corruption again.

    Not recreating the corruption again... well, depending on the corruption, that could still happen. No guarantees there. Depending on the corruption, you may only be able to restore to just prior to the corruption.

    You also may want to look at page level restores. Depending on the corruption, that's a great way to get around all this. Also, another thing that SAN snapshots can't help with (I think, can they restore just a page?).

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • OK, think I understand (or at least have enough information to set up a rehearsal disaster recovery on our dev environment).

    Thanks so much to everyone for their input.

  • Grant Fritchey (4/13/2016)


    nhansen pcc (4/12/2016)


    So, to kind of re-iterate a question from higher in the thread:

    If I do just a log backup in a timed job, and then restore a "snapshot" of all the disks (SAN level backup), including all the mdf, ndf, and ldf files as they were at that instant, and then try to roll the database forward or back based on information in the log file backups, will it work?

    My concern is that the mdf / ndf / ldf that gets restored from SAN snapshot might have transactions 101-199 in it, but because of the schedule on which the timed backup is running, the log backup files (which could be many) have transaction 50-149, 150-249, 250-349, etc. in them. Will the restore process be smart enough to recognize what transactions belong where?

    I seriously doubt it. There are SAN systems that are transactionally aware such that they will allow you to successfully restore an uncorrupted database. However, they won't restore in a RECOVERING state so that you can then apply log backups. I've never heard of anything approaching a mix between SAN snapshots and log backups. In fact, the biggest push against the SAN backups as the primary means of backing up is the inability to do a point in time restore from them.

    At the last place I worked (which I just left about two months ago), the DBA team was being required to convert all backups to a product called Actifio, which claims to be able to take VSS snapshots at the SAN/LUN level to do a SQL backup, get the MDF and LDF files (as opposed to making a native SQL BAK), and then shift the DB into recovering mode to restore transaction log backups. I have to admit that while it was a cumbersome process and not in the least user friendly, it did seem to do just that on a small DB. We never tried it on anything large, but just the idea that I could take a VSS snapshot and then convert it to a form that accepted transaction log restores blew my mind.

    To the OP -- another suggestion if it is possible is mirroring/availability groups. This will also require FULL recovery. But if a corrupt page is found on one DB, SQL will look up the good page on the mirroring partner to replace it. This won't prevent hardware-driven corruption but it will minimize the damage.

  • nhansen pcc (4/11/2016)


    Have you identified the corruption? has the server team looked at the drives or SAN?

    Both our internal SAN team, and the vendor have looked at it, and assured me that nothing went wrong with the storage subsystem. It is just a wee bit suspicious that it happened within a couple of weeks to moving it to new storage, but regardless of that, I cannot systematically prove that it was an IO subsystem failure.

    I never heard of a single real-life case when a SAN team, or a vendor, or a maintenance contractor have found an issue with IO subsystem.

    Even when an enclosure has been overheated for a week, or it's been a yellow light on a panel for a while, or throughput of the system dropped by, say, 5 times - it's still perfectly fine, according to them.

    So, do not trust their words too much.

    _____________
    Code for TallyGenerator

  • Excuse me? Are you saying that if a database is mirrored and disk-level corruption occurs the SQL will just pull the data from the mirror??

Viewing 15 posts - 1 through 15 (of 23 total)

You must be logged in to reply to this topic. Login to reply