Achieving Server Redundancy at Remote Offices

Opportunity

For production systems, uptime is a typical measure of how well we as administrators do our jobs. It’s not the only one, but it’s an important one. I was once told that a DBA who keeps the database up, keeps it fast, and can recover in the event of failure has earned their salary, no matter what it might be. While some may disagree with this, it certainly encapsulates the role we all play in the service we provide to our employers. This article focuses on the third item: recovery.

Background

At the company I work for, we have a centralized data center with many database servers and several branch offices with a couple of database servers in each one. A few months back, we began a project to assess where we were in terms of data loss and recovery. After our review, we found that we were in good shape in our data center, but realized that if one of the servers in a branch office failed (not a disaster, mind you, but just failed) we would be in serious straits with our customers who rely on our systems for data exchange. These servers host databases that didn’t necessarily require 24x7 uptime, but an outage of more than a couple of hours would lead to escalating fines and affect our ratings with our customers. What to do?

Redundancy

The first thing that comes to mind for reducing the risks associated with downtime is installing redundant servers. The idea is that if the production server does fail, the application can be moved to another server and availability restored. If you go this route, you have to be willing to convince your boss that the application is critical enough to warrant letting thousands of dollars’ worth of equipment potentially sit underutilized. If you win this battle, then the next step is to decide exactly how you’re going to achieve that redundancy.

To understand what server redundancy really means, let’s look it up. According to Wikipedia, redundancy can be defined as:

“Redundancy, in general terms, refers to the quality or state of being redundant, that is: exceeding what is necessary or normal, containing an excess. This can have a negative connotation, superfluous, but also positive, serving as a duplicate for preventing failure of an entire system.”

Obviously, for our purposes we’re taking the positive side of the topic and considering those scenarios where applications must be available whenever users need to access them. But due diligence requires that we invest our employer’s money as efficiently as possible. What are our options?

There are several ways to maximize uptime for database systems, each with its own advantages and disadvantages. In this document, I’ll review the three that we considered, then explain why we selected the option we chose.

The options we considered were:

- Redundant Servers – No Synchronization
- Clustering
- Log Shipping

Redundant Servers – No Synchronization

Unsynchronized redundant servers satisfy at least one requirement of maximizing uptime by having replacement hardware onsite. This requires purchasing duplicate hardware for single applications, in case of catastrophic system failure. The advantage is that hardware is always available, ready to be deployed as needed.

The disadvantages include:

- doubling the cost of hardware
- the time needed to restore applications and databases from the production system backups

Due to the trivial nature of this solution, we didn’t see much value in going this way, but we included it for the sake of completeness.

Clustering

Clustering was the second method we looked at and is defined as a grouping of two or more servers into a single logical entity known as a cluster. Clusters typically come in two flavors:

- Active/Active
- Active/Passive

An Active/Active cluster is composed of two or more servers, with all servers in the cluster providing resources to users. In an Active/Passive cluster, one server in the cluster (called the Primary) responds to user resource demands, while the other (called the Secondary) is in standby mode. In the event of one of a number of predetermined failures, the Primary server will transfer processing tasks, or fail over, to the Secondary. In our experience, this transfer disrupts access to the applications installed on the cluster, such as databases, for a period ranging from 5 seconds to 2 minutes. The Active/Active configuration is more complex and can be more expensive than the Active/Passive. Because we were considering the needs of branch offices where no IT staff was stationed, we only considered the Active/Passive type of cluster.

In order to evaluate the fit for clustering at my company, a test environment consisting of an Active/Passive cluster was set up. SQL Server was installed and a typical production database restored. We ran a battery of tests on the cluster and arrived at the following results:

Advantages

- The cluster performed “as advertised” by failing over after the occurrence of a failure event.
- The fail-over itself took a very short period of time, ranging from 10 to 30 seconds.
- The fail-over was automatic, ensuring that in most cases, downtime was minimized.
- Clustering uses a single name and machine address, allowing applications to always connect to the same address, regardless of which server is actually servicing the request.

Disadvantages

- Setup and maintenance of a clustered environment were considerably more complex than for the other solutions. This includes everything from stopping and starting services to stopping and starting the individual servers in the cluster. This was a major drawback for us, again considering that our solution was to be installed at all of our branch offices.
- Cost. Clustering was the most expensive of the three options considered, requiring not only redundant hardware, but also an Enterprise Edition license for Windows Server 2003 (WIN2K3EE) and an Enterprise Edition license for SQL Server 2000 (SQL2KEE) for each server in the cluster. In addition, it is recommended that the two servers in the cluster be identical, for obvious reasons.
- Automatic fail-over. In addition to being an advantage, this feature can be problematic as well, especially when IT staff isn’t experienced in dealing with clusters. Old habits are hard to break. For example, when stopping and starting services, you must do so from the Cluster management console. If you don’t, the cluster will assume that a failure has occurred and fail over. That was a painful lesson.
- Clusters utilize separate server hardware, but share a single disk drive unit. As a result, clusters are as susceptible to disk failures as single servers.

Log Shipping

Log shipping is the third and final option we considered. Log shipping is intended to provide a type of fail-over similar to that of clustering, with one major difference: the fail-over process for log shipping is manual, involving a predefined set of steps to accomplish.

It’s essentially a solution composed of two servers, with one server acting as the Primary, servicing user requests, and the other as the Secondary. The purpose of the Secondary server is to receive and then apply database changes from the Primary transaction log at predefined intervals. By doing so, the Secondary is synchronized with the Primary up to the last interval. This interval is configurable, and for testing purposes we varied the time span from 5 to 15 minutes.
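To make the cycle concrete, here is a minimal sketch of the two statements behind a single interval. It uses SQL Server’s native `BACKUP LOG` and `RESTORE LOG` purely for illustration and is not the set of scripts we deployed; the database name, share path, and file naming convention are all hypothetical.

```python
from datetime import datetime


def log_shipping_statements(database, share, taken_at):
    """Build the T-SQL pair for one log shipping interval.

    Returns (backup_sql, restore_sql). The file naming is a hypothetical
    convention chosen for this example.
    """
    stamp = taken_at.strftime("%Y%m%d%H%M")
    backup_file = f"{share}\\{database}_{stamp}.trn"
    # Run on the Primary: back up the transaction log to a shared folder.
    backup_sql = f"BACKUP LOG [{database}] TO DISK = N'{backup_file}'"
    # Run on the Secondary: apply the log but leave the database in a
    # restoring state so that later log backups can still be applied.
    restore_sql = (
        f"RESTORE LOG [{database}] FROM DISK = N'{backup_file}' WITH NORECOVERY"
    )
    return backup_sql, restore_sql


# Hypothetical database, share, and timestamp.
backup_sql, restore_sql = log_shipping_statements(
    "Sales", r"\\LOGSHIP1\logship", datetime(2005, 1, 1, 12, 15)
)
print(backup_sql)
print(restore_sql)
```

Because each restore leaves the Secondary in a restoring state, the final restore at fail-over would use `WITH RECOVERY` to bring the database online. The interval also bounds the exposure: with a 15-minute interval, up to roughly 15 minutes of changes may not yet have reached the Secondary.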

Log shipping is actually an old technology and can be implemented in various ways, from the very simple to the very robust. You can implement it by “rolling your own” or by using the wizards included in SQL Server 2000 Enterprise Edition (but not Standard). We decided to use a homegrown version that was written by Bill Wunder and graciously provided for free on his website at http://b.wunder.home.comcast.net/15758.htm. These scripts are designed to be used with the SQL LiteSpeed product, which we had already purchased. Thanks, Bill, you rock.

In order to evaluate log shipping, a test environment was set up, SQL Server was installed, and a typical production database restored. We ran a battery of tests on this setup and arrived at the following results:

Advantages

- Once the intricacies of setting up our log shipping routines were overcome, setup was straightforward.
- We were able to accomplish fail-over in the test environment in a matter of minutes. While this time could reasonably be expected to grow to the range of 20-40 minutes (it is a manual fail-over after all), we considered this to be acceptable. It’s important to note here that if this duration is unacceptable, then log shipping probably won’t work for you.
- Manual fail-over. Given our experience in managing our clustering tests, the argument could be made that this is a definite advantage.
- Single machine name and address. Although, strictly speaking, this was not a function of log shipping, by utilizing an alias we were able to eliminate log shipping’s primary drawback: having to reconfigure applications to look at a new machine name and address after fail-over.
- Cost. Log shipping can function using the Windows Server 2003 Standard license and SQL Server 2000 Standard license. You’ll need to purchase two OS and two database licenses, but by using Bill’s homegrown solution and not using the Enterprise version of either product, you still come out ahead.
- Redundancy of disks. Log shipping employs two separate servers to accomplish redundancy, and that redundancy includes two separate disk arrays.
- One Secondary server can service multiple Primary servers.
- Maintenance is much simpler than that required for clustering.

Disadvantages

- Manual fail-over. The duration of system fail-over is longer than that experienced with clustering.

Decision

For us, log shipping provided the best mix of benefit versus cost. We decided to continue with Bill’s homegrown solution, as it had everything that we needed in a single package.

Implementation Details

It’s important to note the specifics of our implementation. We felt that one of the major drawbacks to log shipping is the need to point applications to a new server after the fail-over. Our solution to this was to use an IP alias set up by our network group and point our applications to it instead of the production server. This alias points to a specific IP address, the same as the primary’s.

For example, given the following server setup:

Production Server:
Name : PROD1
IP : XXX.XX.XX.223

Secondary Server:
Name : LOGSHIP1
IP : XXX.XX.XX.224

We create an alias like this:

Alias:
Name : PROD1V
IP : XXX.XX.XX.223

At this point, we changed all references in our application to point to the alias, not the physical server. The reason is that now, at fail-over, we:

- Restore the database on the secondary server (LOGSHIP1)
- Change the IP of the primary to an unused IP address
- Change the IP of the secondary to that of the alias

Once this is complete, client applications can reconnect to the alias (PROD1V) without realizing that they’re actually connecting to the secondary server (LOGSHIP1).
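The effect of the address swap can be modeled in a few lines. This is only an illustrative sketch of the bookkeeping, not a tool that changes real network settings; the spare address `XXX.XX.XX.250` and the function names are hypothetical.

```python
def fail_over(addresses, primary, secondary, alias_ip, spare_ip):
    """Return a new name-to-IP map after the two address changes."""
    if addresses[primary] != alias_ip:
        raise ValueError("primary does not currently hold the alias address")
    updated = dict(addresses)
    updated[primary] = spare_ip      # move the failed primary out of the way
    updated[secondary] = alias_ip    # the secondary now answers for the alias
    return updated


def servers_behind(addresses, ip):
    """Return the server names currently configured with the given IP."""
    return [name for name, addr in addresses.items() if addr == ip]


before = {"PROD1": "XXX.XX.XX.223", "LOGSHIP1": "XXX.XX.XX.224"}
alias_ip = "XXX.XX.XX.223"           # the address the alias PROD1V points to
after = fail_over(before, "PROD1", "LOGSHIP1", alias_ip, "XXX.XX.XX.250")

print(servers_behind(before, alias_ip))  # ['PROD1']
print(servers_behind(after, alias_ip))   # ['LOGSHIP1']
```

The alias definition itself never changes, which is why the client applications need no reconfiguration.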

Of course, for any fail-over to work, there are details to attend to, including:

- Synchronizing data directories and paths
  - We synchronize hourly, using a simple batch script and RoboCopy
- Synchronizing jobs
  - We synchronize daily using DTS
- Synchronizing logins
  - We synchronize daily using DTS
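As a rough idea of what the hourly directory synchronization involves, the sketch below builds a RoboCopy command line, written here in Python rather than as the batch script we use. The paths are hypothetical, and the choice of `/MIR` with short retry settings is an assumption for illustration, not a record of our actual options. Note that `/MIR` removes files from the destination that no longer exist at the source.

```python
import subprocess


def robocopy_command(source, destination, retries=2, wait_seconds=5):
    """Build a robocopy argument list that mirrors source onto destination."""
    return [
        "robocopy", source, destination,
        "/MIR",                 # mirror the tree, purging files deleted at source
        f"/R:{retries}",        # retry count for files that fail to copy
        f"/W:{wait_seconds}",   # seconds to wait between retries
    ]


def copy_succeeded(return_code):
    """Robocopy exit codes below 8 mean no files failed to copy."""
    return return_code < 8


def synchronize(source, destination):
    """Run one synchronization pass (Windows only)."""
    result = subprocess.run(robocopy_command(source, destination))
    return copy_succeeded(result.returncode)


# Hypothetical source directory and destination share.
command = robocopy_command(r"D:\AppData", r"\\LOGSHIP1\D$\AppData")
print(" ".join(command))
```

The exit-code check matters in a scheduled job: RoboCopy returns non-zero values for ordinary successful copies, so treating any non-zero code as a failure would raise false alarms.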

Server/Application redundancy is one of those things that you put in place, then hope that you never have to use in production. We’ve had the occasion to fail over using log shipping and it’s worked well for us.

Let me know if you have any questions or comments.

Mark F. Cook
