Achieving Server Redundancy at Remote Offices

,

Opportunity

For production systems,

uptime is a typical measure of how well we as administrators do our

jobs. Its not the only one, but its an important one. I was once

told that if a DBA keeps her/his database up, keeps it fast, and can

recover in the event of failure that they’ve earned their

salary no matter what it might be. While some may disagree with

this, it certainly encapsulates the role we all play in the service

we provide to our employers. This article focuses on the third item

– recovery.

Background

At the company I work

for, we have a centralized data center with many database servers and

several branch offices with a couple of database servers in each one.

A few months back, we began a project to assess where we were in

terms of data loss and recovery. After our review, we found that we

were in good shape in our data center, but realized that if one of

the servers in a branch office failed – not a disaster mind

you, but just failed – we would be in serious straits with our

customers who rely on our systems for data exchange. These servers

host databases that didn’t necessary require 24x7 uptime, but

an outage of more than a couple of hours would lead to escalating

fines and affect our ratings with our customers. What to do?

Redundancy

The first thing that

comes to mind for reducing the risks associated with downtime is

installing redundant servers. The idea is that if the production

server does fail, the application can be moved to another server and

availability restored. If you go this route, you have to be willing

to convince your boss that the application is critical enough that it

warrants letting thousands of dollars worth of equipment potentially

set underutilized. If you win this battle, then the next step is to

decide exactly how you’re going to achieve that redundancy.

To understand what

server redundancy really means, let’s look it up. According to

Wikipedia, redundancy can be defined as:

Redundancy,

in general terms, refers to the quality or state of being redundant,

that is: exceeding what is necessary or normal, containing an excess.

This can have a negative connotation, superfluous, but also positive,

serving as a duplicate for preventing failure of an entire system

Obviously, for our

purposes we’re taking the positive side of the topic and

considering those scenarios where applications must be available to

users whenever users are available to access those same applications.

But due diligence requires that we invest our employer’s money

as efficiently as possible. What are our options?

There are several ways

to maximize uptime for database systems, each with its own advantages

and disadvantages. In this document, I’ll try to review three

that we considered then try and explain why we selected the option we

choose.

The options we

considered were:

  • Redundant Servers – No Synchronization

  • Clustering

  • Log Shipping

Redundant

Servers – No Synchronization

Unsynchronized

redundant servers satisfy at least one requirement of maximizing

uptime by having replacement hardware onsite. This requires

purchasing duplicate hardware for single applications, in case of

catastrophic system failure. The advantage is

  • hardware is always

    available, ready to be deployed as needed.

The disadvantages

include

  • doubling the cost

    of hardware

  • time needed to

    restore applications and databases from the production system

    backups

Due to the trivial

nature of this solution, we didn’t see much value in going this

way, but we included it for the sake of completeness.

Clustering

Clustering was the

second method we looked at and is defined as a grouping of two or

more servers into a single logical entity known as a cluster.

Clusters typically come in two flavors –

  • Active/Active

  • Active/Passive.

An Active/Active

cluster is composed of two or more servers, with all servers in the

cluster providing resources to users. In an Active/Passive cluster,

one server in the cluster responds to user resource demands –

called the Primary, while the other is in standby mode – called

the Secondary. In the event of one of a number of predetermined

failures, the Primary server will transfer processing tasks –

or fail-over – to the Secondary. In our experience, this

transfer disrupts access to the applications installed on the cluster

– such as databases – for a period ranging from 5 seconds

to 2 minutes. The Active/Active configuration is more complex and

can be more expensive than the Active/Passive. Because we were

considering the needs of branch offices where no IT staff was

stationed, we only considered the Active/Passive type of cluster.

In order to evaluate

the fit for clustering at my company, a test environment consisting

of an Active/Passive cluster was setup. SQL Server was installed and

a typical production database restored. We ran a battery of tests on

the cluster and arrived at the following results:

Advantages

  • the cluster

    performed “as advertised” by failing-over after the

    occurrence of a failure event

  • the fail-over

    itself took a very short period time – ranging from 10 to 30

    seconds

  • the fail-over was

    automatic, ensuring that in most cases, downtime was minimized

  • clustering uses a

    single name and machine address, allowing applications to always

    connect to the same address, regardless of which server is actually

    servicing the request

Disadvantages

  • Setup and

    maintenance of a clustered environment was considerably more complex

    than other solutions. This includes everything from stopping and

    starting services to stopping and starting the individual servers in

    the cluster. This was a major drawback for us, again considering

    that our solution was to be installed at all of our branch offices.

  • Cost –

    clustering was the most expensive of the 3 options considered,

    requiring not only redundant hardware, but an Enterprise Edition

    license for Windows Server 2003 (WIN2K3EE) as well as an Enterprise

    Edition license for SQL Server 2000 (SQL2KEE) for each server in the

    cluster. In addition, the two servers in the cluster are

    recommended to be identical, for obvious reasons.

  • Automatic

    fail-over. In addition to being an advantage, this feature can be

    problematic as well, especially when IT staff isn’t

    experienced in dealing with clusters. Old habits are hard to break.

    For example, when stopping and starting services, you must do so

    from the Cluster management console. If you don’t, the

    cluster will assume that a failure has occurred and fail over. That

    was a painful lesson.

  • Clusters utilize

    separate server hardware, but share a single disk drive unit. As a

    result, clusters are as susceptible to disk failures as single

    servers.

Log Shipping

Log shipping is the

third and final option we considered. Log shipping is intended to

provide a type of fail-over similar to that of clustering with one

major difference - the fail-over process for log shipping is manual,

involving a predefined set of steps to accomplish.

It’s essentially

a solution composed of two servers, with one server acting as the

Primary – servicing user requests - and the other as the

Secondary. The purpose of the Secondary server is to receive and

then apply database changes from the Primary transaction log at

predefined intervals. By doing so, the Secondary is synchronized

with the Primary up to the last interval. This interval is

configurable and for testing purposes we varied the time span from 5

to 15 minutes.

Log shipping is

actually an old technology and can be implemented in various ways,

from the very simple to the very robust. You can implement it by

“rolling your own” or by using the wizards included in

SQL Server 2000 Enterprise Edition (but not Standard). We decided to

use a home grown version that was written by Bill Wunder and

graciously provided for free on his website at

http://b.wunder.home.comcast.net/15758.htm.

These scripts are designed to be used with the SQL LiteSpeed product

which we had already purchased. Thanks Bill, you rock.

In order to evaluate

log shipping, a test environment was setup, SQL Server was installed

and a typical production database restored. We ran a battery of

tests on this setup and arrived at the following results:

Advantages

  • Once the

    intricacies of setting up our log shipping routines were overcome,

    setup was straight forward.

  • We were able to

    accomplish fail-over in the test environment in a matter of minutes.

    While this time could reasonably be expected to grow to the range

    of 20-40 minutes (it is a manual fail-over after all), we considered

    this to be acceptable. It’s important to note here that if

    this duration is unacceptable, then log shipping probably won’t

    work for you.

  • Manual fail-over.

    Given our experience in managing our clustering tests, the argument

    could be made that this is a definite advantage.

  • Single machine

    name and address. Although – strictly speaking – this

    was not a function of log shipping, by utilizing an alias, we were

    able to eliminate log shipping’s primary drawback –

    having to reconfigure applications to look at a new machine name and

    address after fail-over.

  • Cost. Log

    shipping can function using the Windows Server 2003 standard license

    and SQL Server 2000 Standard license. You’ll need to purchase

    two OS and two database licenses, but by using Bill’s home

    grown solution and not using the Enterprise version of either

    product, you still come out ahead.

  • Redundancy of

    disks. Log Shipping employs two separate servers to accomplish

    redundancy – redundancy that includes two separate disk

    arrays.

  • One Secondary

    server can service multiple Primary servers.

  • Maintenance is

    much simpler than that required for clustering

Disadvantages

  • Manual fail-over.

    The duration of system fail-over is longer than that experienced

    with clustering.

Decision

For us, log shipping

provided my company with the best mix of benefit versus cost. We

decided to continue with Bill’s home grown solution, as it had

everything that we needed in a single package.

Implementation

Details

It’s important to

note the specifics of our implementation. We felt that one of the

major drawbacks to log shipping is the need to point applications to

a new server after the fail-over. Our solution to this was to use an

IP alias setup by our network group and point our applications to it

instead of the production server. This alias points to a specific IP

address – the same as the primary.

For example, given the

following server setup:

Production Server:

Name : PROD1 IP : XXX.XX.XX.223

Secondary Server: Name

: LOGSHIP1 IP : XXX.XX.XX.224

We create an alias

like this:

Alias : Name :

PROD1V IP : XXX.XX.XX.223

At this point, we

changed all references in our application to point to the alias, not

the physical server. The reason is that now, at fail-over:

  1. We restore the

    database on the secondary server (LOGSHIP1)

  2. Change the IP of

    the primary to an unused IP address

  3. Change the IP of

    the secondary to that of the alias

Once this is complete,

client applications can re-connect to the alias (PROD1V) without

realizing that they’re actually connecting the secondary server

(LOGSHIP1).

Of course, for any

fail-over to work, there are details to attend to, including

  • Synchronizing data

    directories and paths

    • We synchronize

      hourly, using a simple batch script and RoboCopy

  • Synchronizing jobs

    • We synchronize

      daily using DTS

  • Synchronizing

    Logins

    • We synchronize

      daily using DTS

Server/Application

redundancy is one of those things that you put in place, then hope

that you never have to use in production. We’ve had the

occasion to fail-over using log shipping and its worked well for us.

Let me know if you have

any questions or comments.

Mark F. Cook

Rate

5 (1)

Share

Share

Rate

5 (1)