Business Continuity in SQL Database

I have wrote a number of blogs on the topic of business continuity in SQL Database before (HA/DR for Azure SQL Database, Azure SQL Database high availability, Azure SQL Database disaster recovery) but with a number of new features I felt it was time for a new blog on the subject, focusing on disaster recovery and not high availability.

Business continuity in Azure SQL Database and SQL Managed Instance refers to the mechanisms, policies, and procedures that enable your business to continue operating in the face of disruption, particularly to its computing infrastructure. In the most of the cases, SQL Database and SQL Managed Instance will handle the disruptive events that might happen in the cloud environment and keep your applications and business processes running. From a database perspective, these are the major potential disruption scenarios, or disasters:

Data corruption or deletion typically caused by an application bug or human error (user accidentally deleted or updated a row in a table). Such failures are application-specific and typically cannot be detected by the database service
Malicious attacker succeeded to delete data or drop a database
Datacenter outage or temporarily disabled, possibly caused by a natural disaster such as an earthquake. This scenario requires some level of geo-redundancy with application failover to an alternate datacenter
Local hardware or software failures affecting the database node such as a disk-drive failure
Upgrade or maintenance errors, unanticipated issues that occur during planned infrastructure maintenance or upgrades may require rapid rollback to a prior database state

This overview describes the capabilities that SQL Database and SQL Managed Instance provide for business continuity and disaster recovery.

To mitigate the local hardware and software failures, SQL Database includes a high availability architecture, which guarantees automatic recovery from these failures with up to 99.995% availability SLA.

SQL Database and SQL Managed Instance also provide several business continuity features that you can use to mitigate various unplanned scenarios:

Temporal tables enable you to restore row versions from any point in time
To protect your business from data loss, SQL Database and SQL Managed Instance automatically create full database backups weekly, differential database backups every 12 hours, and transaction log backups every 5 – 10 minutes (see Built-in automated backups). The backups are stored in RA-GRS storage for at least 7 days for all service tiers. All service tiers except Basic support configurable backup retention period for point-in-time restore, up to 35 days. Point in Time Restore enables you to restore a complete database to some point in time within the configured retention period
You can restore a deleted database to the point at which it was deleted if the server has not been deleted.
For a corrupted database, you can create a new database from a backup to the same server, usually in less than 12 hours unless it is a very large or very active database (see database recovery time). You can also restore a database to another geographic region, called geo-restore, due to geo-replicated backups, which is where a backed up database is automatically copied to an Azure blob in a different region (note there is a delay between when a backup is taken and when it is geo-replicated to an Azure blob in a different region, so as a result, the restored database can be up to one hour behind the original database). This allows you to recover from a geographic disaster when you cannot access your database or backups in the primary region. It creates a new database on any existing server or managed instance, in any Azure region. See Azure SQL Database and Backups
Long-term backup retention (LTR) enables you to keep the backups up to 10 years (this is in limited public preview for SQL Managed Instance). LTR allows you to restore an old version of the database by using the Azure portal or Azure PowerShell to satisfy a compliance request or to run an old version of the application
Active geo-replication enables you to create readable replicas and manually failover to any replica in case of a datacenter outage or application upgrade (see table below to compare with auto-failover groups). If you have an application that must be taken offline because of planned maintenance such as an application upgrade, check out Manage application upgrades which describes how to use active geo-replication to enable rolling upgrades of your cloud application to minimize downtime during upgrades and provide a recovery path if something goes wrong
Auto-failover group allows the application to automatically recovery in case of a datacenter outage (see table below to compare with active geo-replication)

Auto-failover groups simplify the deployment and usage of active geo-replication and add the additional capabilities as described in the following table:

Feature	Geo-replication	Failover groups
Automatic failover	No	Yes
Fail over multiple databases simultaneously	No	Yes
User must update connection string after failover	Yes	No
SQL Managed Instance support	No	Yes
Can be in same region as primary	Yes	No
Multiple replicas	Yes	No
Supports read-scale	Yes	Yes

Recovering a database

Some details if you need to recover a database due to the very rare case of an Azure datacenter having an outage:

One option is to wait for your database to come back online when the datacenter outage is over. This works for applications that can afford to have the database offline. When a datacenter has an outage, you do not know how long the outage might last, so this option only works if you don’t need your database for a while
Another option is to restore a database on any server in any Azure region using geo-restore, as explained above
Finally, you can quickly recover from an outage if you have configured either geo-secondary using active geo-replication or an auto-failover group for your database or databases, as explained above. Depending on your choice of these technologies, you can use either manual or automatic failover. While failover itself takes only a few seconds, the service will take at least 1 hour to activate it. This is necessary to ensure that the failover is justified by the scale of the outage. Also, the failover may result in small data loss due to the nature of asynchronous replication

As you develop your business continuity plan, you need to understand the maximum acceptable time before the application fully recovers after the disruptive event. The time required for application to fully recover is known as Recovery time objective (RTO). You also need to understand the maximum period of recent data updates (time interval) the application can tolerate losing when recovering from an unplanned disruptive event. The potential data loss is known as Recovery point objective (RPO).

Different recovery methods offer different levels of RPO and RTO. You can choose a specific recovery method, or use a combination of database backups and active geo-replication to achieve full application recovery. The following table compares RPO and RTO of each recovery option:

Recovery method	RTO	RPO
Geo-restore from geo-replicated backups	12 h	1 h
Auto-failover groups	1 h	5 s
Manual database failover (via unplanned mode)	30 s	5 s

Use auto-failover groups if your application meets any of these criteria:

Is mission critical
Has a service level agreement (SLA) that does not allow for 12 hours or more of downtime
Downtime may result in financial liability
Has a high rate of data change and 1 hour of data loss is not acceptable
The additional cost of active geo-replication is lower than the potential financial liability and associated loss of business

For a discussion of design considerations for stand-alone databases and for elastic pools using these business continuity features, see Design an application for cloud disaster recovery and Elastic pool disaster recovery strategies.

The following sections provide an overview of the steps to recover using database backups, active geo-replication, or auto-failover groups. For detailed steps including planning requirements, post recovery steps, and information about how to simulate an outage to perform a disaster recovery drill, see Recover a database in SQL Database from an outage.

Prepare for an outage

Regardless of the business continuity feature you use, you must:

Identify and prepare the target server, including server-level IP firewall rules, logins, and master database level permissions
Determine how to redirect clients and client applications to the new server
Document other dependencies, such as auditing settings and alerts

Failover to a geo-replicated secondary database

If you are using active geo-replication or auto-failover groups as your recovery mechanism, you can configure an automatic failover policy or use manual unplanned failover. Once initiated, the failover causes the secondary to become the new primary and to be ready to record new transactions and respond to queries – with minimal data loss for the data not yet replicated. For information on designing the failover process, see Design an application for cloud disaster recovery. When the Azure datacenter comes back online the old primaries automatically reconnect to the new primary and become secondary databases. If you need to relocate the primary back to the original region, you can initiate a planned failover manually (failback).

Perform a geo-restore

If you are using the automated backups with geo-redundant storage (enabled by default), you can recover the database using geo-restore. Recovery usually takes place within 12 hours – with data loss of up to one hour determined by when the last log backup was taken and replicated. Until the recovery completes, the database is unable to record any transactions or respond to any queries. Note that geo-restore only restores the database to the last available point in time. If the datacenter comes back online before you switch your application over to the recovered database, you can cancel the recovery.

Perform post failover / recovery tasks

After recovery, you must perform the following additional tasks before your users and applications are back up and running:

Redirect clients and client applications to the new server and restored database
Ensure appropriate server-level IP firewall rules are in place for users to connect or use database-level firewalls to enable appropriate rules
Ensure appropriate logins and master database level permissions are in place (or use contained users)
Configure auditing, as appropriate
Configure alerts, as appropriate

If you are using an auto-failover group and connect to the databases using the read-write listener, the redirection after failover will happen automatically and transparently to the application.

More info:

Overview of business continuity with Azure SQL Database

High availability for Azure SQL Database and SQL Managed Instance

Creating and using active geo-replication – Azure SQL Database

Use auto-failover groups to enable transparent and coordinated failover of multiple databases

Configure a failover group for Azure SQL Database

Restore your Azure SQL Database or failover to a secondary

Introduction to Azure SQL Database Auto-Failover Groups

Azure SQL Server auto-failover groups

Introduction to Azure SQL Database Auto-failover groups