SQL Server – What To Do When Disaster Strikes (5-Point Survival Guide)

Chill, coffee first.

When your SQL Server goes down, alarms are blaring, and managers are hovering, that sentence might feel like a joke. But it’s exactly the right mindset: don’t panic, follow a plan.

Here’s a 5-point, practical guide for what to do in the moment when disaster hits – with battle-tested best practices and a bit of DBA humor.


1. Chill (Coffee First), Then Triage the Situation

The worst thing you can do in a disaster is start randomly “fixing” things.

This is not the time for “I’ll just try something in prod”.

Take 2 minutes to breathe, grab that coffee, and then:

Identify the blast radius

  • Is it one database or the whole instance?

  • Is it a single server, shared storage, or a whole data center?

  • Are we dealing with:

    • Storage failure?

    • SQL Server service/OS issues?

    • Corruption?

    • Accidental delete or bad deployment?

Check the basics first

  • SQL Server error log

  • Windows Event Log

  • Cluster / Availability Group dashboard

  • Storage / hypervisor alerts
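
If the instance still responds to queries, a quick T-SQL pass over the first two basics might look like this (it assumes you have permission to read the error log and server state):

-- Current SQL Server error log, filtered for lines containing "error"
EXEC sp_readerrorlog 0, 1, N'error';

-- Which databases are ONLINE vs. SUSPECT / RECOVERY_PENDING / OFFLINE?
SELECT name, state_desc, user_access_desc
FROM sys.databases
ORDER BY state_desc, name;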

Clarify what really matters right now

Even if nobody ever formally defined them, you still need to think in terms of:

  • RPO (Recovery Point Objective) – how much data loss can the business tolerate?

  • RTO (Recovery Time Objective) – how long can the system be down?

Also clarify:

  • Which applications/databases are tier 1 (must be up first)?

  • Which can wait until phase 2 or 3 of the recovery?

Your next steps (fix infrastructure, fail over, or restore) should be driven by these two numbers and the priority list – not by guesswork or whoever shouts the loudest.

Rule of thumb: if you can’t say the RPO/RTO and “what’s restored first” out loud, you’re not ready to touch the server.

Also: when you think “repair”, do not instinctively jump to DBCC CHECKDB ... REPAIR_ALLOW_DATA_LOSS. That option is a last resort only when good backups do not exist, not a standard disaster step.
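
Instead, a first assessment along these lines is usually safer (ProdDB is a placeholder database name):

-- Pages the engine has already flagged as damaged
SELECT database_id, file_id, page_id, event_type, error_count, last_update_date
FROM msdb.dbo.suspect_pages;

-- Assess the damage without repairing anything yet
DBCC CHECKDB (N'ProdDB') WITH NO_INFOMSGS, ALL_ERRORMSGS;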


2. Activate the DR Plan and Communicate Clearly

A disaster is not a solo hero mission – it’s a team sport.

Trigger the Disaster Recovery (DR) plan

A good DR plan defines:

  • Who is the incident lead

  • Who can approve data loss

  • Escalation paths and vendor contacts (e.g., Microsoft support)

  • Which playbook to follow for:

    • Server failure

    • Storage corruption

    • Network / data center outage

    • Logical disasters (accidental delete, bad release)

  • Priority order for systems

    • Which databases/applications are recovered first

    • Which can be temporarily read-only or offline

If your DR plan is “we’ll figure it out on the day”, congratulations – this is that day.

Get the right people in the loop early

  • Application owners – to validate functionality once you recover

  • Management / business owners – to make decisions about data loss and downtime

  • Infrastructure / Network / Storage teams – many “SQL disasters” start below SQL Server

Establish a communication rhythm

Short, regular updates (e.g., every 15 minutes):

  • What we know

  • What we’re doing

  • What’s next

  • Any decisions needed (data loss, failover, etc.)

This prevents chaos, random calls, and people hovering over your shoulder while you’re trying to type.

Pro tip: “We have a plan, we’re executing it, next update in 15 minutes” is magic for calming a room.


3. Decide: Fix, Fail Over, or Restore?

Once you understand the problem and the business constraints, you have three main options.

Option A: Fix the Primary

Best when:

  • No corruption

  • The issue is straightforward (full disk, downed service, OS glitch, misconfigured network)

  • Underlying infrastructure is basically healthy

Examples:

  • Fix full or offline disks

  • Restart the SQL Server service / VM (with change control awareness)

  • Bring a failing cluster/AG node back in cleanly

Option B: Fail Over to a Secondary

Use this when you have HA/DR in place:

  • Always On Availability Groups

  • Failover Cluster Instances (FCI)

  • Log shipping secondaries

  • Synchronous replica in another data center/region

Key points:

  • Confirm how far the secondary is behind – does it meet your RPO?

  • Perform the failover according to your runbook, not from memory.

  • Redirect application connections:

    • DNS aliases

    • Connection strings

    • Load balancers / app configuration
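
For Availability Groups, the lag check and the planned failover look roughly like this (the AG name [MyAG] is a placeholder, and your runbook remains the authority):

-- How far behind is each replica, and is it synchronized? (the AG dashboard shows the same data)
SELECT  ag.name AS ag_name,
        ar.replica_server_name,
        DB_NAME(drs.database_id) AS database_name,
        drs.synchronization_state_desc,
        drs.log_send_queue_size,   -- KB of log not yet sent to that replica
        drs.redo_queue_size,       -- KB of log received but not yet redone
        drs.last_commit_time
FROM sys.dm_hadr_database_replica_states AS drs
JOIN sys.availability_groups   AS ag ON ag.group_id   = drs.group_id
JOIN sys.availability_replicas AS ar ON ar.replica_id = drs.replica_id;

-- Planned failover: run on the secondary that will become primary (requires a SYNCHRONIZED replica)
ALTER AVAILABILITY GROUP [MyAG] FAILOVER;

-- Only if the primary is gone and the business has explicitly accepted data loss:
-- ALTER AVAILABILITY GROUP [MyAG] FORCE_FAILOVER_ALLOW_DATA_LOSS;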

Then validate:

  • Can users log in?

  • Are critical transactions and reports working end-to-end?

  • Are background jobs / SQL Agent jobs running as expected?

  • Are linked servers, endpoints, encryption and certificates working?

Also check for:

  • Orphaned users (the login exists on the instance, but the database user’s SID no longer maps to it). Fix these so applications don’t get mysterious login errors.
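
A minimal sketch for finding and re-mapping orphaned users, run in the affected database (AppUser is a placeholder name):

-- Database users whose SID no longer matches any server login
SELECT dp.name AS orphaned_user
FROM sys.database_principals AS dp
LEFT JOIN sys.server_principals AS sp ON sp.sid = dp.sid
WHERE dp.type = 'S'                              -- SQL users
  AND dp.authentication_type_desc = 'INSTANCE'   -- mapped to an instance-level login
  AND sp.sid IS NULL;

-- Re-map an orphaned user to the login of the same name
ALTER USER [AppUser] WITH LOGIN = [AppUser];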

Option C: Restore from Backups

Sometimes this is the only safe path, especially with:

  • Severe corruption

  • Storage loss without HA

  • Long-running logical damage (wrong data being written for hours/days)

High-level restore flow:

  1. Stop the bleeding

    • Disable offending jobs, stop the app, put the system in maintenance mode if needed.

  2. (If possible) Take a tail-log backup

    • If the database is in FULL or BULK_LOGGED recovery and still accessible, take a tail-log backup first.

    • This captures all transactions right up to the failure, so you can restore as close as possible to the crash/delete.

  3. Restore in the right order (a T-SQL sketch follows this list)

    • Full backup

    • Latest differential (if any)

    • All log backups (including the tail-log backup) up to the desired point-in-time

  4. Include system databases in your planning

    • master, msdb, model contain:

      • Logins

      • Jobs

      • Credentials / linked servers

      • Configuration

    • Without them, your “recovered” system may come up, but nobody can log in and scheduled tasks won’t run.

  5. Run DBCC CHECKDB on the restored database

    • To verify you didn’t just restore corruption onto new hardware.

    • And yes, run regular CHECKDB on system databases too – corruption there hurts just as much.

If your restore plan doesn’t include test restores and CHECKDB, it’s more of a wish than a strategy.
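
In T-SQL, steps 2 and 3 boil down to something like this (database name and file paths are placeholders; your real backup chain comes from msdb history or your backup tool):

-- Step 2: capture the tail of the log before touching anything else (FULL/BULK_LOGGED only)
BACKUP LOG ProdDB
    TO DISK = N'X:\Backups\ProdDB_tail.trn'
    WITH NORECOVERY, NO_TRUNCATE, CHECKSUM;

-- Step 3: restore the chain in order, keeping the database in RESTORING until the last piece
RESTORE DATABASE ProdDB FROM DISK = N'X:\Backups\ProdDB_full.bak' WITH NORECOVERY, REPLACE;
RESTORE DATABASE ProdDB FROM DISK = N'X:\Backups\ProdDB_diff.bak' WITH NORECOVERY;
RESTORE LOG ProdDB FROM DISK = N'X:\Backups\ProdDB_log_01.trn' WITH NORECOVERY;
RESTORE LOG ProdDB FROM DISK = N'X:\Backups\ProdDB_tail.trn' WITH NORECOVERY;

-- Bring it online once everything has been applied
RESTORE DATABASE ProdDB WITH RECOVERY;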


4. Execute Recovery Using Tested Backups and Checklists

Once you’ve chosen your strategy, execution should be boring and repeatable, not improvised.

If you fail over (AG / FCI / log shipping)

  • Confirm the replica meets your RPO (sync state, log shipping delay).

  • Fail over according to a tested runbook.

  • Update:

    • Application configuration / connection strings

    • Jobs that point to a specific server name

    • Monitoring / alerting targets

Validation checklist (minimum):

  • Can users log in?

  • Are critical transactions and reports working end-to-end?

  • Are SQL Agent jobs running as expected?

  • Are linked servers and external dependencies working?

Also check for:

  • Orphaned users and fix them so security mappings are correct.

If you restore from backup

  1. Tail-log backup first (for FULL/BULK_LOGGED and accessible DB)

    • This allows restore right up to the moment of failure.

  2. Restore chain in order

    • Full → differential → all log backups (ending with the tail-log backup, if taken) – sketched in T-SQL after this list

    • Use STOPAT for point-in-time restores in accidental delete scenarios.

  3. Validate backup integrity

    • Use BACKUP ... WITH CHECKSUM when taking backups.

    • Use RESTORE VERIFYONLY on backup files as a lightweight sanity check.

    • Combine this with regular test restores into non-prod for full confidence.

  4. Run DBCC CHECKDB

    • On the restored user databases

    • On system databases as part of your regular maintenance

  5. Re-map logins & jobs

    • Restore or recreate logins, and fix orphaned users.

    • Restore SQL Agent jobs from msdb or recreate them if needed.

    • Script and reapply:

      • Linked servers

      • Credentials

      • Endpoints

      • Important instance settings (sp_configure, trace flags, etc.)

  6. Use checklists

    • One for failover

    • One for restore

    • One for post-recovery validation

Disaster recovery is not the time to “YOLO into SSMS”. Future you will thank present you for having a checklist.
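
Here is a hedged sketch of the restore-from-backup path above for an accidental-delete scenario (names, paths and the STOPAT time are placeholders):

-- Lightweight sanity check on the backup file (most useful when backups were taken WITH CHECKSUM)
RESTORE VERIFYONLY FROM DISK = N'X:\Backups\ProdDB_full.bak' WITH CHECKSUM;

-- Point-in-time restore: stop just before the accidental delete
RESTORE DATABASE ProdDB FROM DISK = N'X:\Backups\ProdDB_full.bak' WITH NORECOVERY, REPLACE;
RESTORE LOG ProdDB FROM DISK = N'X:\Backups\ProdDB_log_01.trn'
    WITH NORECOVERY, STOPAT = '2025-12-03 11:02:00';
RESTORE LOG ProdDB FROM DISK = N'X:\Backups\ProdDB_tail.trn'
    WITH RECOVERY, STOPAT = '2025-12-03 11:02:00';

-- Verify you did not restore corruption onto new hardware
DBCC CHECKDB (N'ProdDB') WITH NO_INFOMSGS, ALL_ERRORMSGS;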

Recovery models & why they matter

It’s worth calling this out clearly:

  • FULL / BULK_LOGGED

    • Supports log backups and point-in-time restore

    • Required for scenarios like “recover to just before the accidental delete”

  • SIMPLE

    • No log backups, no point-in-time restore

    • You can only restore to the time of the last full/diff backup

If your RPO is “near zero” but your critical databases are in SIMPLE recovery with nightly full backups… that’s not a DR plan, that’s a cliff.
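
A quick way to spot that mismatch across user databases:

-- Which databases can actually support point-in-time restore?
SELECT name, recovery_model_desc
FROM sys.databases
WHERE database_id > 4          -- skip master, tempdb, model, msdb
ORDER BY name;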


5. Learn, Harden, and Rehearse

Once the fire is out and your users are back online, the most valuable work begins.

Run a blameless post-mortem

Focus on learning, not blaming:

  • What exactly happened?

  • Did we meet our RPO/RTO or miss them? Why?

  • What slowed us down?

    • Missing permissions?

    • Old documentation?

    • No DR runbooks?

    • Backup storage too slow?

Write down:

  • What worked well

  • What must change

  • Concrete action items with owners and deadlines

Upgrade your protection

Use what you learned to strengthen:

Backup strategy

  • Correct recovery model for each database (FULL/SIMPLE based on RPO)

  • Full + differential + log backups where needed

  • WITH CHECKSUM on backups and scheduled RESTORE VERIFYONLY

  • Backups stored off the main data/log disks

  • At least one offsite or immutable backup copy:

    • Cloud/object storage

    • WORM/immutable backups

    • “Air-gapped” location

Ransomware can encrypt servers and local backups – it has a much harder time encrypting a tape in a safe.
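
The core of that backup pattern, sketched in T-SQL (paths, schedules and compression settings are placeholders for whatever your environment uses):

-- Full backup, with page checksums verified as the backup is written
BACKUP DATABASE ProdDB
    TO DISK = N'X:\Backups\ProdDB_full.bak'
    WITH CHECKSUM, COMPRESSION, INIT;

-- Frequent log backups keep the chain (and your RPO) tight
BACKUP LOG ProdDB
    TO DISK = N'X:\Backups\ProdDB_log.trn'
    WITH CHECKSUM, COMPRESSION;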

HA/DR

  • Always On AGs, FCIs, log shipping, cloud replicas

  • Clear runbooks for failing over and failing back

  • Regular DR drills that use your real backups and scripts

Server-level configuration & scripting

  • Regularly script out:

    • Logins (including SIDs)

    • SQL Agent jobs

    • Linked servers, credentials, endpoints

    • Important instance-level configuration

  • Store these scripts in source control (Git, etc.).

  • This makes rebuilding or comparing environments after a disaster much faster and safer.
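
As one small example, a starting point for scripting logins is capturing their names and SIDs (the SID is what prevents orphaned users after a rebuild):

-- SQL and Windows logins with their SIDs – feed this into your login-scripting process
SELECT name, sid, type_desc, is_disabled
FROM sys.server_principals
WHERE type IN ('S', 'U', 'G')        -- SQL logins, Windows logins, Windows groups
  AND name NOT LIKE '##%'            -- skip internal certificate-based logins
ORDER BY name;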

Improve monitoring & alerts

A DR plan without monitoring is like an airbag that only works on Tuesdays.

  • Monitor:

    • Backup jobs (success/failure)

    • Backup currency (how old is the latest full/diff/log backup?)

    • AG replica health and synchronization state

    • Log shipping delay

  • Alert when:

    • Backups fail

    • There has been no log backup for longer than your RPO

    • DR replicas fall behind or go unhealthy
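
A simple backup-currency query against msdb history is a good starting point for the “how old is the latest backup?” alert:

-- Age of the most recent full / differential / log backup per database
SELECT  d.name AS database_name,
        MAX(CASE WHEN b.type = 'D' THEN b.backup_finish_date END) AS last_full,
        MAX(CASE WHEN b.type = 'I' THEN b.backup_finish_date END) AS last_diff,
        MAX(CASE WHEN b.type = 'L' THEN b.backup_finish_date END) AS last_log
FROM sys.databases AS d
LEFT JOIN msdb.dbo.backupset AS b ON b.database_name = d.name
WHERE d.database_id <> 2             -- tempdb is never backed up
GROUP BY d.name
ORDER BY d.name;

Alert when last_log is older than your RPO, or when last_full is missing entirely.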

Rehearse disasters on purpose

Schedule DR drills like you schedule releases:

  • Simulate a storage failure or server loss

  • Restore databases into a test environment from production backups

  • Time how long it takes (real RTO, not “PowerPoint RTO”)

  • Measure actual data loss (real RPO)

  • Update documentation and checklists after each exercise

No one has ever said “We practice DR too much and our restores are too fast”.

Every real incident and every drill should make the next one less painful. Over time, “disaster strikes” becomes “we know exactly what to do”.


Bonus Tip: When Data Is Accidentally Deleted – Restore Side-by-Side

Not every disaster is a crashed server. Sometimes it’s a quiet, horrifying moment:

DELETE FROM ImportantTable;   -- oops, no WHERE...

Or a TRUNCATE on a critical table in the middle of the day. In these cases, the engine is fine – it’s the data that’s gone.

Instead of overwriting your active production database with a backup, a safer and more flexible approach is:

Restore a copy of the database side-by-side, then copy the lost data back into production.

Step 0: Take a tail-log backup (if you can)

If the database is in FULL or BULK_LOGGED recovery and still accessible:

  1. Take a tail-log backup immediately.

  2. This captures all changes up to “now”, including the accidental delete, so you can restore to just before it.

Step 1: Restore to a new database name

  • Restore full → diff → log backups (including tail-log) to a new database on the same instance or another server.

  • Use STOPAT to restore to just before the accidental delete.

  • Give it a clear name like: ProdDB_Recovery_2025_12_03_1105.
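
A sketch of the side-by-side restore (all names, paths and the STOPAT time are placeholders; the logical file names come from RESTORE FILELISTONLY):

-- Find the logical file names inside the backup first
RESTORE FILELISTONLY FROM DISK = N'X:\Backups\ProdDB_full.bak';

-- Restore under a new name, relocating the files so nothing collides with production
RESTORE DATABASE ProdDB_Recovery_2025_12_03_1105
    FROM DISK = N'X:\Backups\ProdDB_full.bak'
    WITH NORECOVERY,
         MOVE N'ProdDB'     TO N'X:\Data\ProdDB_Recovery.mdf',
         MOVE N'ProdDB_log' TO N'X:\Logs\ProdDB_Recovery.ldf';

RESTORE LOG ProdDB_Recovery_2025_12_03_1105
    FROM DISK = N'X:\Backups\ProdDB_tail.trn'
    WITH RECOVERY, STOPAT = '2025-12-03 11:02:00';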

Step 2: Verify the recovered data in the copy

  • Check row counts and key ranges in the affected tables.

  • Verify you’re at the correct point in time.

  • Optionally run DBCC CHECKDB on the recovery copy as well.

Step 3: Copy the data back into production

Carefully move the missing data:

  • Use INSERT…SELECT, MERGE, SSIS, or other ETL tools.

  • Scope it accurately:

    • Primary keys

    • Date ranges

    • Business-specific filters

  • Wrap operations in transactions.

  • Be careful with:

    • Foreign key constraints

    • Identity columns

    • Triggers

Double-check you are not overwriting newer valid data in production.
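
A hedged sketch of the copy-back for the ImportantTable example, run in the production database (column names and keys are placeholders; IDENTITY_INSERT is only needed if the key is an identity column):

BEGIN TRAN;

SET IDENTITY_INSERT dbo.ImportantTable ON;   -- only if Id is an IDENTITY column

-- Bring back only the rows that are still missing in production
INSERT INTO dbo.ImportantTable (Id, CustomerId, Amount, CreatedAt)
SELECT r.Id, r.CustomerId, r.Amount, r.CreatedAt
FROM ProdDB_Recovery_2025_12_03_1105.dbo.ImportantTable AS r
WHERE NOT EXISTS (SELECT 1
                  FROM dbo.ImportantTable AS p
                  WHERE p.Id = r.Id);        -- never overwrite rows that still exist

SET IDENTITY_INSERT dbo.ImportantTable OFF;

COMMIT TRAN;

The NOT EXISTS guard is what keeps you from overwriting rows that survived the delete or were re-created afterwards.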

Step 4: Clean up

  • Once everything is verified and signed off, drop or archive the recovery database according to your policies.

This “side-by-side restore” approach avoids trashing a healthy production database just to recover a subset of rows. You get:

  • A clean, read-only source of truth from backup

  • Full control over what you put back and how

  • The ability to re-try if you realise you grabbed the wrong rows the first time

Think of it as a time machine where you bring back only the rows you miss – not the whole universe.


Wrap-up

When SQL Server goes sideways, remember the order:

  1. Chill, coffee first.

  2. Triage and understand the problem and priorities.

  3. Activate your DR plan and communicate clearly.

  4. Choose between fix, failover, or restore, based on RPO/RTO and recovery model.

  5. Execute with backups, tail-log strategy, CHECKSUM, VERIFYONLY, CHECKDB, and checklists.

  6. Learn, harden, monitor, and rehearse so next time is faster and less painful.

  7. For accidental deletes, restore side-by-side and copy data back instead of nuking a healthy production database.

That calm first sip of coffee isn’t wasted time – it’s the moment you switch from panic mode to professional, repeatable disaster response.
