SQL Server – What To Do When Disaster Strikes (5-Point Survival Guide)
When your SQL Server goes down, alarms are blaring, and managers are hovering, “don’t panic, follow a plan” might feel like a joke. But that is exactly the right mindset, and exactly what this guide is about.
Here’s a 5-point, practical guide for what to do in the moment when disaster hits – with battle-tested best practices and a bit of DBA humor.
1. Chill (Coffee First), Then Triage the Situation
The worst thing you can do in a disaster is start randomly “fixing” things.
This is not the time for “I’ll just try something in prod”.
Take 2 minutes to breathe, grab that coffee, and then:
Identify the blast radius
Is it one database or the whole instance?
Is it a single server, shared storage, or a whole data center?
Are we dealing with:
Storage failure?
SQL Server service/OS issues?
Corruption?
Accidental delete or bad deployment?
Check the basics first
SQL Server error log
Windows Event Log
Cluster / Availability Group dashboard
Storage / hypervisor alerts
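A quick triage sketch for those checks, using standard catalog views (the log number and search string are just examples; sp_readerrorlog is undocumented but widely used):
-- Which databases are online, recovering, suspect, or offline?
SELECT name, state_desc, user_access_desc
FROM sys.databases
ORDER BY state_desc;
-- Scan the current SQL Server error log (log 0, type 1 = SQL log) for errors
EXEC sys.sp_readerrorlog 0, 1, N'error';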
Clarify what really matters right now
Even if nobody ever formally defined them, you still need to think in terms of:
RPO (Recovery Point Objective) – how much data loss can the business tolerate?
RTO (Recovery Time Objective) – how long can the system be down?
Also clarify:
Which applications/databases are tier 1 (must be up first)?
Which can wait until phase 2 or 3 of the recovery?
Your next steps (fix infrastructure, fail over, or restore) should be driven by these two numbers and the priority list – not by guesswork or whoever shouts the loudest.
Rule of thumb: if you can’t say the RPO/RTO and “what’s restored first” out loud, you’re not ready to touch the server.
Also: when you think “repair”, do not instinctively jump to DBCC CHECKDB ... REPAIR_ALLOW_DATA_LOSS. That option is a last resort only when good backups do not exist, not a standard disaster step.
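If you do suspect corruption, a diagnostic-only run is the safer first step: assess the damage before anyone types the word REPAIR. A minimal sketch (the database name is a placeholder):
-- Report all errors, suppress informational messages, change nothing
DBCC CHECKDB (N'ProdDB') WITH NO_INFOMSGS, ALL_ERRORMSGS;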
2. Activate the DR Plan and Communicate Clearly
A disaster is not a solo hero mission – it’s a team sport.
Trigger the Disaster Recovery (DR) plan
A good DR plan defines:
Who is the incident lead
Who can approve data loss
Escalation paths and vendor contacts (e.g., Microsoft support)
Which playbook to follow for:
Server failure
Storage corruption
Network / data center outage
Logical disasters (accidental delete, bad release)
Priority order for systems
Which databases/applications are recovered first
Which can be temporarily read-only or offline
If your DR plan is “we’ll figure it out on the day”, congratulations – this is that day.
Get the right people in the loop early
Application owners – to validate functionality once you recover
Management / business owners – to make decisions about data loss and downtime
Infrastructure / Network / Storage teams – many “SQL disasters” start below SQL Server
Establish a communication rhythm
Short, regular updates (e.g., every 15 minutes):
What we know
What we’re doing
What’s next
Any decisions needed (data loss, failover, etc.)
This prevents chaos, random calls, and people hovering over your shoulder while you’re trying to type.
Pro tip: “We have a plan, we’re executing it, next update in 15 minutes” is magic for calming a room.
3. Decide: Fix, Fail Over, or Restore?
Once you understand the problem and the business constraints, you have three main options.
Option A: Fix the Primary
Best when:
No corruption
The issue is straightforward (full disk, downed service, OS glitch, misconfigured network)
Underlying infrastructure is basically healthy
Examples:
Fix full or offline disks
Restart the SQL Server service / VM (with change control awareness)
Bring a failing cluster/AG node back in cleanly
Option B: Fail Over to a Secondary
Use this when you have HA/DR in place:
Always On Availability Groups
Failover Cluster Instances (FCI)
Log shipping secondaries
Synchronous replica in another data center/region
Key points:
Confirm how far the secondary is behind – does it meet your RPO?
Perform the failover according to your runbook, not from memory.
Redirect application connections:
DNS aliases
Connection strings
Load balancers / app configuration
Then validate:
Can users log in?
Are critical transactions and reports working end-to-end?
Are background jobs / SQL Agent jobs running as expected?
Are linked servers, endpoints, encryption and certificates working?
Also check for:
Orphaned users (database users whose SIDs no longer map to a server login, which often happens after restoring or failing over to another instance). Fix these so applications don’t get mysterious login errors; a quick detection and fix sketch follows below.
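A minimal sketch for spotting and re-mapping orphaned users (user and login names are placeholders, and the filtering is deliberately kept simple):
-- Database users whose SID has no matching server login
SELECT dp.name AS orphaned_user
FROM sys.database_principals AS dp
LEFT JOIN sys.server_principals AS sp
       ON dp.sid = sp.sid
WHERE dp.type IN ('S', 'U')                 -- SQL and Windows users
  AND dp.sid IS NOT NULL
  AND dp.name NOT IN ('guest', 'dbo', 'INFORMATION_SCHEMA', 'sys')
  AND sp.sid IS NULL;
-- Re-map a user to the server login of the same name
ALTER USER [AppUser] WITH LOGIN = [AppUser];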
Option C: Restore from Backups
Sometimes this is the only safe path, especially with:
Severe corruption
Storage loss without HA
Long-running logical damage (wrong data being written for hours/days)
High-level restore flow:
Stop the bleeding
Disable offending jobs, stop the app, put the system in maintenance mode if needed.
(If possible) Take a tail-log backup
If the database is in FULL or BULK_LOGGED recovery and still accessible, take a tail-log backup first.
This captures all transactions right up to the failure, so you can restore as close as possible to the crash/delete.
Restore in the right order
Full backup
Latest differential (if any)
All log backups (including the tail-log backup) up to the desired point-in-time
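A minimal sketch of that sequence, assuming placeholder backup paths and a single log backup plus the tail-log:
-- 1. Tail-log backup of the damaged database (leaves it in RESTORING state)
BACKUP LOG [ProdDB]
TO DISK = N'D:\Backup\ProdDB_tail.trn'
WITH NO_TRUNCATE, NORECOVERY;
-- 2. Restore the chain, recovering only at the very end
RESTORE DATABASE [ProdDB] FROM DISK = N'D:\Backup\ProdDB_full.bak' WITH NORECOVERY, REPLACE;
RESTORE DATABASE [ProdDB] FROM DISK = N'D:\Backup\ProdDB_diff.bak' WITH NORECOVERY;
RESTORE LOG [ProdDB] FROM DISK = N'D:\Backup\ProdDB_log1.trn' WITH NORECOVERY;
RESTORE LOG [ProdDB] FROM DISK = N'D:\Backup\ProdDB_tail.trn' WITH RECOVERY;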
Include system databases in your planning
master, msdb, model contain:
Logins
Jobs
Credentials / linked servers
Configuration
Without them, your “recovered” system may come up, but nobody can log in and scheduled jobs don’t run.
Run DBCC CHECKDB on the restored database to verify you didn’t just restore corruption onto new hardware.
And yes, run regular CHECKDB on system databases too – corruption there hurts just as much.
If your restore plan doesn’t include test restores and CHECKDB, it’s more of a wish than a strategy.
4. Execute Recovery Using Tested Backups and Checklists
Once you’ve chosen your strategy, execution should be boring and repeatable, not improvised.
If you fail over (AG / FCI / log shipping)
Confirm the replica meets your RPO (sync state, log shipping delay); see the query sketch after this checklist.
Fail over according to a tested runbook.
Update:
Application configuration / connection strings
Jobs that point to a specific server name
Monitoring / alerting targets
Validation checklist (minimum):
Can users log in?
Are critical transactions and reports working end-to-end?
Are SQL Agent jobs running as expected?
Are linked servers and external dependencies working?
Also check for:
Orphaned users and fix them so security mappings are correct.
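The query sketch mentioned above: confirming that an Availability Group replica actually meets your RPO, using the standard Always On DMVs:
-- Per-database replica state: is it synchronized, and how much log is queued?
SELECT ag.name AS ag_name,
       ar.replica_server_name,
       drs.synchronization_state_desc,
       drs.log_send_queue_size,   -- KB of log not yet sent to the secondary
       drs.redo_queue_size        -- KB of log received but not yet redone
FROM sys.dm_hadr_database_replica_states AS drs
JOIN sys.availability_replicas AS ar ON drs.replica_id = ar.replica_id
JOIN sys.availability_groups AS ag ON drs.group_id = ag.group_id;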
If you restore from backup
Tail-log backup first (for FULL/BULK_LOGGED and accessible DB)
This allows restore right up to the moment of failure.
Restore chain in order
Full → differential → all log backups (ending with the tail-log backup, if taken)
Use STOPAT for point-in-time restores in accidental delete scenarios.
Validate backup integrity
Use BACKUP ... WITH CHECKSUM when taking backups.
Use RESTORE VERIFYONLY on backup files as a lightweight sanity check.
Combine this with regular test restores into non-prod for full confidence.
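A minimal sketch of both, with placeholder paths:
-- Take backups with checksums so page errors are caught at backup time
BACKUP DATABASE [ProdDB]
TO DISK = N'D:\Backup\ProdDB_full.bak'
WITH CHECKSUM, COMPRESSION;
-- Lightweight sanity check of the file (not a substitute for a real test restore)
RESTORE VERIFYONLY
FROM DISK = N'D:\Backup\ProdDB_full.bak'
WITH CHECKSUM;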
Run DBCC CHECKDB
On the restored user databases
On system databases as part of your regular maintenance
Re-map logins & jobs
Restore or recreate logins, and fix orphaned users.
Restore SQL Agent jobs from msdb or recreate them if needed.
Script and reapply:
Linked servers
Credentials
Endpoints
Important instance settings (sp_configure, trace flags, etc.)
Use checklists
One for failover
One for restore
One for post-recovery validation
Disaster recovery is not the time to “YOLO into SSMS”. Future you will thank present you for having a checklist.
Recovery models & why they matter
It’s worth calling this out clearly:
FULL / BULK_LOGGED
Supports log backups and point-in-time restore
Required for scenarios like “recover to just before the accidental delete”
SIMPLE
No log backups, no point-in-time restore
You can only restore to the time of the last full/diff backup
If your RPO is “near zero” but your critical databases are in SIMPLE recovery with nightly full backups… that’s not a DR plan, that’s a cliff.
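A quick way to spot that cliff, assuming database names you will adapt to your environment:
-- Which non-system databases cannot do point-in-time restore right now?
SELECT name, recovery_model_desc
FROM sys.databases
WHERE recovery_model_desc <> 'FULL'
  AND database_id > 4;
-- Switch a tier-1 database to FULL (then take a full backup to start the log chain)
ALTER DATABASE [ProdDB] SET RECOVERY FULL;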
5. Learn, Harden, and Rehearse
Once the fire is out and your users are back online, the most valuable work begins.
Run a blameless post-mortem
Focus on learning, not blaming:
What exactly happened?
Did we meet our RPO/RTO or miss them? Why?
What slowed us down?
Missing permissions?
Old documentation?
No DR runbooks?
Backup storage too slow?
Write down:
What worked well
What must change
Concrete action items with owners and deadlines
Upgrade your protection
Use what you learned to strengthen:
Backup strategy
Correct recovery model for each database (FULL/SIMPLE based on RPO)
Full + differential + log backups where needed
WITH CHECKSUM on backups and scheduled RESTORE VERIFYONLY
Backups stored off the main data/log disks
At least one offsite or immutable backup copy:
Cloud/object storage
WORM/immutable backups
“Air-gapped” location
Ransomware can encrypt servers and local backups – it has a much harder time encrypting a tape in a safe.
HA/DR
Always On AGs, FCIs, log shipping, cloud replicas
Clear runbooks for failing over and failing back
Regular DR drills that use your real backups and scripts
Server-level configuration & scripting
Regularly script out:
Logins (including SIDs)
SQL Agent jobs
Linked servers, credentials, endpoints
Important instance-level configuration
Store these scripts in source control (Git, etc.).
This makes rebuilding or comparing environments after a disaster much faster and safer.
Improve monitoring & alerts
A DR plan without monitoring is like an airbag that only works on Tuesdays.
Monitor:
Backup jobs (success/failure)
Backup currency (how old is the latest full/diff/log backup?)
AG replica health and synchronization state
Log shipping delay
Alert when:
Backups fail
There has been no log backup for longer than your RPO
DR replicas fall behind or go unhealthy
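A minimal sketch of a backup-currency check against msdb (thresholds and alerting are up to you; compare the result against your RPO):
-- Newest log backup per database in FULL/BULK_LOGGED recovery
SELECT d.name,
       MAX(b.backup_finish_date) AS last_log_backup
FROM sys.databases AS d
LEFT JOIN msdb.dbo.backupset AS b
       ON b.database_name = d.name AND b.type = 'L'   -- 'L' = log backup
WHERE d.recovery_model_desc <> 'SIMPLE'
  AND d.database_id > 4
GROUP BY d.name
ORDER BY last_log_backup;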
Rehearse disasters on purpose
Schedule DR drills like you schedule releases:
Simulate a storage failure or server loss
Restore databases into a test environment from production backups
Time how long it takes (real RTO, not “PowerPoint RTO”)
Measure actual data loss (real RPO)
Update documentation and checklists after each exercise
No one has ever said “We practice DR too much and our restores are too fast”.
Every real incident and every drill should make the next one less painful. Over time, “disaster strikes” becomes “we know exactly what to do”.
Bonus Tip: When Data Is Accidentally Deleted – Restore Side-by-Side
Not every disaster is a crashed server. Sometimes it’s a quiet, horrifying moment:
DELETE FROM ImportantTable; -- oops, no WHERE...
Or a TRUNCATE on a critical table in the middle of the day. In these cases, the engine is fine – it’s the data that’s gone.
Instead of overwriting your active production database with a backup, a safer and more flexible approach is:
Restore a copy of the database side-by-side, then copy the lost data back into production.
Step 0: Take a tail-log backup (if you can)
If the database is in FULL or BULK_LOGGED recovery and still accessible:
Take a tail-log backup immediately.
This captures all changes up to “now”, including the accidental delete, so you can restore to just before it.
Step 1: Restore to a new database name
Restore full → diff → log backups (including tail-log) to a new database on the same instance or another server.
Use STOPAT to restore to just before the accidental delete.
Give it a clear name like ProdDB_Recovery_2025_12_03_1105.
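A minimal sketch, assuming placeholder logical file names, paths and stop time, and that the stop time falls inside that log backup:
-- Restore a copy next to production under a new name
RESTORE DATABASE [ProdDB_Recovery_2025_12_03_1105]
FROM DISK = N'D:\Backup\ProdDB_full.bak'
WITH MOVE N'ProdDB' TO N'D:\Data\ProdDB_Recovery.mdf',
     MOVE N'ProdDB_log' TO N'L:\Log\ProdDB_Recovery.ldf',
     NORECOVERY;
-- Roll the log forward, stopping just before the accidental delete
RESTORE LOG [ProdDB_Recovery_2025_12_03_1105]
FROM DISK = N'D:\Backup\ProdDB_tail.trn'
WITH STOPAT = N'2025-12-03T11:03:00', RECOVERY;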
Step 2: Verify the recovered data in the copy
Check row counts and key ranges in the affected tables.
Verify you’re at the correct point in time.
Optionally run DBCC CHECKDB on the recovery copy as well.
Step 3: Copy the data back into production
Carefully move the missing data:
Use INSERT…SELECT, MERGE, SSIS, or other ETL tools.
Scope it accurately:
Primary keys
Date ranges
Business-specific filters
Wrap operations in transactions.
Be careful with:
Foreign key constraints
Identity columns
Triggers
Double-check you are not overwriting newer valid data in production.
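A minimal copy-back sketch; table, column and database names are placeholders, and the NOT EXISTS guard is just one way to avoid touching rows that still exist in production:
BEGIN TRANSACTION;
    SET IDENTITY_INSERT dbo.ImportantTable ON;   -- only needed if the table has an identity column
    INSERT INTO dbo.ImportantTable (Id, CustomerId, Amount, CreatedAt)
    SELECT r.Id, r.CustomerId, r.Amount, r.CreatedAt
    FROM [ProdDB_Recovery_2025_12_03_1105].dbo.ImportantTable AS r
    WHERE NOT EXISTS (SELECT 1
                      FROM dbo.ImportantTable AS p
                      WHERE p.Id = r.Id);        -- don't overwrite surviving rows
    SET IDENTITY_INSERT dbo.ImportantTable OFF;
COMMIT TRANSACTION;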
Step 4: Clean up
Once everything is verified and signed off, drop or archive the recovery database according to your policies.
This “side-by-side restore” approach avoids trashing a healthy production database just to recover a subset of rows. You get:
A clean, read-only source of truth from backup
Full control over what you put back and how
The ability to re-try if you realise you grabbed the wrong rows the first time
Think of it as a time machine where you bring back only the rows you miss – not the whole universe.
Wrap-up
When SQL Server goes sideways, remember the order:
Chill, coffee first.
Triage and understand the problem and priorities.
Activate your DR plan and communicate clearly.
Choose between fix, failover, or restore, based on RPO/RTO and recovery model.
Execute with backups, tail-log strategy, CHECKSUM, VERIFYONLY, CHECKDB, and checklists.
Learn, harden, monitor, and rehearse so next time is faster and less painful.
For accidental deletes, restore side-by-side and copy data back instead of nuking a healthy production database.
That calm first sip of coffee isn’t wasted time – it’s the moment you switch from panic mode to professional, repeatable disaster response.

