Breaking SQL Server for disaster recovery practice

  • Hey all. I understand if this gets taken down due to the subject matter but I am at a point in my career where I am leading a team of 3 other DBA's and I want to get them some database DR practice. What I am hoping for are some ways to break SQL Server 2022 in ways that simulate real world issues that happen so that my team and I can practice fixing these things. I have been struggling for a bit on Googling all of this so I come to you.

    What are ways that I can break SQL server for practice?

  • I think it was Brandie Tarvin that wrote about this? TBH, I can't remember anymore.... Someone on here used to break SQL Server one setting at a time and writing down the error returned, and logged them all... interesting way to figure out what's wrong... refer to your notes/list/table.

  • This was removed by the editor as SPAM

  • This is exactly what I specialize in—breaking SQL Server environments in controlled ways to test disaster recovery readiness.

    help teams run:

    Structured DR simulations with escalating difficulty

    Multi-failure scenarios (where one issue hides another)

    Real-time incident drills with scoring and feedback

    Runbook validation and improvement

    If you want, I can help you design a tailored DR training program for your team or run a live simulation where you act as incident commander and your DBAs respond in real time.

  • This is exactly what I specialize in—breaking SQL Server environments in controlled ways to test disaster recovery readiness.

    help teams run:

    Structured DR simulations with escalating difficulty

    Multi-failure scenarios (where one issue hides another)

    Real-time incident drills with scoring and feedback

    Runbook validation and improvement

    If you want, I can help you design a tailored DR training program for your team or run a live simulation where you act as incident commander and your DBAs respond in real time.

  • I tried to do this with my team ages back, I simulated database corruption - instance is online but checkdb comes back dirty and can't recover it on it's own. I intentionally corrupted the DB in a way that was fixable if you know what you are doing but if you don't it's easy to break things badly.

    But it really depends on what you consider the level of disaster your team is responsible for. How high up are you responsible for? If the VM is accidentally deleted, is that on you? If so, that's a good one to start with. If not, are you responsible if the service stops and refuses to start? If so, another good thing to throw at the team. If not, how about database corruption?

    I mean, it's really hard to say where to start without knowing what you are responsible for. When I was responsible as a DBA (no longer my role), simulated database corruption is one thing I had done - created a database, put some data in, took it offline, corrupted it, brought it back online. Database still accepted inserts, but any select, update, or delete against the one table gave grief due to the index being corrupted. checkdb said it couldn't fix it automatically without data loss, so it feels like a good DR scenario. That one was recoverable without restoring from backup too!

    I also came across a weird one where a database that was upgraded from ancient times contained a table (I think it was?) with an invalid name so it couldn't be dropped or have any operations run against it. I tasked the team with removing it even if it meant having downtime (which was required for the solution that I came up with... nobody figured out how to solve that one).

    Another one - someone drops the DB and you restore from backup. OR a worse scenario someone drops a table or deletes wrong data from the table and you need to do a surgical restore of the bad data.

    It really falls down to what are you responsible for and what disaster scenarios are you going to be responsible for? Define those and go from there.

    The above is all just my opinion on what you should do. 
    As with all advice you find on a random internet forum - you shouldn't blindly follow it.  Always test on a test server to see if there is negative side effects before making changes to live!
    I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.

  • This was removed by the editor as SPAM

  • Hi chapwolff,

    A couple of things to clarify would be what kind of scenarios you're likely to encounter with your current setup and range of responsibilities, and what are the outcomes you're hoping for.

    Like others have said, cause some corruption on a (preferably VLDB) test database and leave them with nothing but an internet connection and guidance only if they need it. If you host any large DB's it's important to scale up any testing you're doing to account for those inevitable road-blocks. Obviously only do the following with test systems...

    Do you have AG's or FCI's? - Rip the primary site out without warning - Simulate loss of quorum - One of the more dangerous scenarios an inexperienced DBA can encounter in my opinion is a 'RESOLVING' state AG.

    Do you manage VLDB's and have disks to spare? - Simulate backup/restore scenarios to test and time RPO/RTO

    Do you have filegroups? - Practice piecemeal restores

    Do you have replication? - This is another important one that breaks hard when it does break and can sometimes be complex to re-establish in a pinch

    • This reply was modified 2 weeks, 5 days ago by mastershake. Reason: Corrected 'RESTORING' with 'RESOLVING'
  • This is the right direction here, and I’d echo what others have said about first defining what kinds of failures your team is actually responsible for handling because that should drive your DR practice more than just randomly “breaking” things. A good approach is to build scenarios in layers: start with common operational issues like service won’t start, disk/log full then move into data-level problems like accidental deletes or index corruption, and then into HA/DR situations like AG failures or restore exercises. One scenario that’s especially valuable to include is a corruption case where CHECKDB can’t repair without data loss, because it forces real decision-making under pressure restore vs repair vs partial recovery. In those edge cases where clean backups aren’t usable, some teams also evaluate third-party recovery tools like Stellar Repair for MS SQL, Systool for MS SQL or Aryson SQL Database Recovery as a last resort to extract critical data, but the key is to test that ahead of time rather than figuring it out during an incident. Ultimately, the biggest value comes from validating your runbooks, timing your response against RTO/RPO, and making sure the team can execute calmly and correctly not just fix the issue, but handle the situation end-to-end.

  • I will try to give couple of scenarios:

    1. Try disabling log backups. This will force the logs until the disk is full. Now Database cannot write transactions.
    2. Fill up the TempDB disk space. Queries will fail and blocking will be escalated.

    Deepesh Dhake
    Database Administrator

  • Hey all,

    Thank you so much for your responses. I will do my best to answer your questions to keep this conversation flowing forward.

    As far as what we cover in our role. We are responsible for the Database Server. Essentially we are a PaaS team. However, if a database gets corrupted it will mainly be my team working to fix the issue. We don't get into development of OLTP or OLAP queries (though that may change in the future as I am working on building relationships between our team and the Devs to work together on bad queries). We are responsible for setting up SQL server, but the VM (using VMware) is controlled by another team.

    We currently have to production Availability groups and one of those also has replication attached to it (long story about an OLD application that dev didn't want to change so we had to roll with it).

    As far as VLDB's, we have a few that are over 1 TB but most of our estate is 50-200 GB in size.

    As far as Filegroups go, our newer systems are set up to use file groups, but some of our legacy systems are all on primary.

    I hope this gets the conversation going more. If you have more questions let me know and I will do my best to answer.

Viewing 11 posts - 1 through 11 (of 11 total)

You must be logged in to reply to this topic. Login to reply