Connect resiliently to SQL Server

Why you should connect resiliently to SQL Server

Transient failures happen — in the cloud (Azure SQL) and on-prem. A resilient connection strategy lets your app recover gracefully instead of crashing: it waits smartly, retries safely, and doesn’t pound the database when it’s truly unavailable. 


Why care?

  • Less downtime during patching/scaling/failovers

  • Smoother UX — fewer “please try again” moments for users

  • Operational robustness against brief network glitches and load spikes

  • Fewer cascading errors — keep transactions consistent and data clean


Transient vs. persistent errors

Transient errors (good candidates for retry):

  • Brief network interruptions

  • “Server busy”/throttling

  • Deadlocks or timeouts due to temporary load

  • Failovers during maintenance/patching

Persistent errors (should fail fast):

  • Bad connection string

  • Authentication/authorization failures

  • Missing or corrupted DB objects

The point: classify the error, then do the right thing. Retries only help with transient faults.
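
To make the classification concrete, here's a minimal sketch. The error numbers are common illustrative examples, not an exhaustive list, and it checks only the top-level error number for brevity; the full sample further down inspects every SqlError in the exception.

using Microsoft.Data.SqlClient;

// Minimal classification: retry only transient faults, fail fast on everything else.
static bool ShouldRetry(SqlException ex) => ex.Number switch
{
    1205  => true,   // deadlock victim: temporary, safe to retry
    -2    => true,   // timeout: often caused by transient load
    40501 => true,   // Azure SQL service busy / throttling
    18456 => false,  // login failed: retrying will not help
    208   => false,  // invalid object name: fix the code or deployment
    _     => false   // when unsure, fail fast and surface the error
};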


Patterns you actually need

  1. Retry with exponential backoff + jitter (spreads load to avoid stampedes)

  2. Circuit Breaker (stop hammering when the DB is really down; a Polly sketch follows this list)

  3. Connection pooling (reduce login churn, lower latency)

  4. Timeouts & Cancellation (no infinite waits; play nice with callers)

  5. Idempotency (retries must not create duplicates or corrupt state)

  6. Always On optimization: MultiSubnetFailover=True in your connection string
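
Item 2 in practice: a minimal circuit-breaker sketch using Polly (also mentioned in the production tips below). It wraps the retry-enabled repository call from the sample further down; the thresholds here are illustrative assumptions, not recommendations.

using System;
using System.Threading.Tasks;
using Microsoft.Data.SqlClient;
using Polly;
using Polly.CircuitBreaker;

// Open the circuit after 5 consecutive SqlExceptions and keep it open for 30 seconds.
// Tune both values to your workload.
var breaker = Policy
    .Handle<SqlException>()
    .CircuitBreakerAsync(
        exceptionsAllowedBeforeBreaking: 5,
        durationOfBreak: TimeSpan.FromSeconds(30),
        onBreak: (ex, breakDelay) => Console.WriteLine($"Circuit opened for {breakDelay}: {ex.Message}"),
        onReset: () => Console.WriteLine("Circuit closed"));

try
{
    // 'repo' is the DBRepository instance created in the Usage section below
    await breaker.ExecuteAsync(() => repo.ExecuteWithRetryAsync());
}
catch (BrokenCircuitException)
{
    // The database is considered down: fail fast instead of hammering it
}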


Code: Async retry with backoff + jitter (C# / Microsoft.Data.SqlClient)

Below is a C# sample you can paste right in.

using System;
using System.Collections.Generic;
using System.Data;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Data.SqlClient;

public class DBRepository
{
    private readonly string _cs;

    // Typical transient error codes (extend as needed)
    private static readonly HashSet<int> TransientErrors = new()
    {
        4060,   // Cannot open database
        40197,  // Service error (Azure)
        40501,  // Service busy (Azure)
        40613,  // DB not available (Azure)
        49918, 49919, 49920, // Throttling (Azure)
        11001,  // Could not connect (DNS/connection)
        1205,   // Deadlock victim
        -2      // Timeout (command timeout)
    };

    public DBRepository(string connectionString) => _cs = connectionString;

    public async Task ExecuteWithRetryAsync(CancellationToken ct = default)
    {
        const int maxRetries = 6; // or drive via a total time budget
        var rng = new Random();

        for (int attempt = 1; ; attempt++)
        {
            try
            {
                await using var connection = new SqlConnection(_cs);
                await connection.OpenAsync(ct);

                await using var command = new SqlCommand("YourProc", connection)
                {
                    CommandType = CommandType.StoredProcedure,
                    CommandTimeout = 30
                };
                // Example: command.Parameters.AddWithValue("@ParamName", value);

                await command.ExecuteNonQueryAsync(ct);
                return; // success
            }
            catch (SqlException ex) when (IsTransient(ex) && attempt < maxRetries)
            {
                // Exponential backoff + jitter
                var baseDelayMs = (int)Math.Min(200 * Math.Pow(2, attempt - 1), 5000);
                var delay = TimeSpan.FromMilliseconds(baseDelayMs + rng.Next(0, 250));
                // Consider logging attempt, delay, and error numbers for diagnostics
                await Task.Delay(delay, ct);
            }
        }
    }

    private static bool IsTransient(SqlException ex)
    {
        // Check all errors in the exception, not just ex.Number
        return ex.Errors.Cast<SqlError>().Any(e => TransientErrors.Contains(e.Number));
    }
}

Usage

var repo = new DBRepository(
    "Server=tcp:your-sql-listener,1433;" +
    "Database=YourDb;User ID=YourUser;Password=***;" +
    "Encrypt=True;TrustServerCertificate=False;" +
    "MultiSubnetFailover=True"); // recommended with Always On

await repo.ExecuteWithRetryAsync();

Production tips

  • Log with structure (attempt, backoff, error codes, correlation ID).

  • Bound retries (max attempts or a total time budget).

  • Make writes idempotent or wrap in transactions that tolerate retries.

  • Consider a Circuit Breaker (e.g., via Polly) around the most critical calls.

  • Prefer Windows (integrated) authentication over SQL logins where possible, so credentials stay out of connection strings.

  • Leverage SqlClient's built-in retry (RetryLogicProvider) if you want a centralized policy; a sketch follows below.
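
If you prefer that centralized policy, here is a minimal sketch of SqlClient's configurable retry logic (SqlRetryLogicOption plus SqlConfigurableRetryFactory). Exact availability and defaults depend on your Microsoft.Data.SqlClient version, and all values shown are illustrative.

using System;
using Microsoft.Data.SqlClient;

// Earlier SqlClient versions expose configurable retry logic behind this switch;
// setting it is harmless where the feature is already enabled by default.
AppContext.SetSwitch("Switch.Microsoft.Data.SqlClient.EnableRetryLogic", true);

var options = new SqlRetryLogicOption
{
    NumberOfTries = 5,                          // total attempts
    DeltaTime = TimeSpan.FromSeconds(1),        // base delay between attempts
    MaxTimeInterval = TimeSpan.FromSeconds(20), // cap on any single delay
    TransientErrors = new[] { 4060, 40197, 40501, 40613, 49918, 49919, 49920, 1205 }
};

var provider = SqlConfigurableRetryFactory.CreateExponentialRetryProvider(options);

var connectionString = "...same connection string as in the Usage section above...";
await using var connection = new SqlConnection(connectionString);
connection.RetryLogicProvider = provider;       // retries OpenAsync on transient failures
await connection.OpenAsync();

await using var command = connection.CreateCommand();
command.RetryLogicProvider = provider;          // retries Execute* calls on transient failures
command.CommandText = "YourProc";
command.CommandType = System.Data.CommandType.StoredProcedure;
await command.ExecuteNonQueryAsync();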


Bottom line

Resilience isn’t a nice-to-have; it’s part of correctness in data access. With a thin layer of backoff-based retries, clear error classification, cancellation, and the right connection settings, your application will sail through the little storms that would otherwise dent availability and user experience.
