Click here to monitor SSC
SQLServerCentral is supported by Red Gate Software Ltd.
 
Log in  ::  Register  ::  Not logged in
 
 
 

Achieving Server Redundancy at Remote Offices

By Mark Cook,

Opportunity

For production systems, uptime is a typical measure of how well we as administrators do our jobs. Its not the only one, but its an important one. I was once told that if a DBA keeps her/his database up, keeps it fast, and can recover in the event of failure that they’ve earned their salary no matter what it might be. While some may disagree with this, it certainly encapsulates the role we all play in the service we provide to our employers. This article focuses on the third item – recovery.

Background

At the company I work for, we have a centralized data center with many database servers and several branch offices with a couple of database servers in each one. A few months back, we began a project to assess where we were in terms of data loss and recovery. After our review, we found that we were in good shape in our data center, but realized that if one of the servers in a branch office failed – not a disaster mind you, but just failed – we would be in serious straits with our customers who rely on our systems for data exchange. These servers host databases that didn’t necessary require 24x7 uptime, but an outage of more than a couple of hours would lead to escalating fines and affect our ratings with our customers. What to do?

Redundancy

The first thing that comes to mind for reducing the risks associated with downtime is installing redundant servers. The idea is that if the production server does fail, the application can be moved to another server and availability restored. If you go this route, you have to be willing to convince your boss that the application is critical enough that it warrants letting thousands of dollars worth of equipment potentially set underutilized. If you win this battle, then the next step is to decide exactly how you’re going to achieve that redundancy.

To understand what server redundancy really means, let’s look it up. According to Wikipedia, redundancy can be defined as:

Redundancy, in general terms, refers to the quality or state of being redundant, that is: exceeding what is necessary or normal, containing an excess. This can have a negative connotation, superfluous, but also positive, serving as a duplicate for preventing failure of an entire system

Obviously, for our purposes we’re taking the positive side of the topic and considering those scenarios where applications must be available to users whenever users are available to access those same applications. But due diligence requires that we invest our employer’s money as efficiently as possible. What are our options?

There are several ways to maximize uptime for database systems, each with its own advantages and disadvantages. In this document, I’ll try to review three that we considered then try and explain why we selected the option we choose.

The options we considered were:

  • Redundant Servers – No Synchronization

  • Clustering

  • Log Shipping

Redundant Servers – No Synchronization

Unsynchronized redundant servers satisfy at least one requirement of maximizing uptime by having replacement hardware onsite. This requires purchasing duplicate hardware for single applications, in case of catastrophic system failure. The advantage is

  • hardware is always available, ready to be deployed as needed.

The disadvantages include

  • doubling the cost of hardware

  • time needed to restore applications and databases from the production system backups

Due to the trivial nature of this solution, we didn’t see much value in going this way, but we included it for the sake of completeness.

Clustering

Clustering was the second method we looked at and is defined as a grouping of two or more servers into a single logical entity known as a cluster. Clusters typically come in two flavors –

  • Active/Active

  • Active/Passive.

An Active/Active cluster is composed of two or more servers, with all servers in the cluster providing resources to users. In an Active/Passive cluster, one server in the cluster responds to user resource demands – called the Primary, while the other is in standby mode – called the Secondary. In the event of one of a number of predetermined failures, the Primary server will transfer processing tasks – or fail-over – to the Secondary. In our experience, this transfer disrupts access to the applications installed on the cluster – such as databases – for a period ranging from 5 seconds to 2 minutes. The Active/Active configuration is more complex and can be more expensive than the Active/Passive. Because we were considering the needs of branch offices where no IT staff was stationed, we only considered the Active/Passive type of cluster.

In order to evaluate the fit for clustering at my company, a test environment consisting of an Active/Passive cluster was setup. SQL Server was installed and a typical production database restored. We ran a battery of tests on the cluster and arrived at the following results:

Advantages

  • the cluster performed “as advertised” by failing-over after the occurrence of a failure event

  • the fail-over itself took a very short period time – ranging from 10 to 30 seconds

  • the fail-over was automatic, ensuring that in most cases, downtime was minimized

  • clustering uses a single name and machine address, allowing applications to always connect to the same address, regardless of which server is actually servicing the request

Disadvantages

  • Setup and maintenance of a clustered environment was considerably more complex than other solutions. This includes everything from stopping and starting services to stopping and starting the individual servers in the cluster. This was a major drawback for us, again considering that our solution was to be installed at all of our branch offices.

  • Cost – clustering was the most expensive of the 3 options considered, requiring not only redundant hardware, but an Enterprise Edition license for Windows Server 2003 (WIN2K3EE) as well as an Enterprise Edition license for SQL Server 2000 (SQL2KEE) for each server in the cluster. In addition, the two servers in the cluster are recommended to be identical, for obvious reasons.

  • Automatic fail-over. In addition to being an advantage, this feature can be problematic as well, especially when IT staff isn’t experienced in dealing with clusters. Old habits are hard to break. For example, when stopping and starting services, you must do so from the Cluster management console. If you don’t, the cluster will assume that a failure has occurred and fail over. That was a painful lesson.

  • Clusters utilize separate server hardware, but share a single disk drive unit. As a result, clusters are as susceptible to disk failures as single servers.

Log Shipping

Log shipping is the third and final option we considered. Log shipping is intended to provide a type of fail-over similar to that of clustering with one major difference - the fail-over process for log shipping is manual, involving a predefined set of steps to accomplish.

It’s essentially a solution composed of two servers, with one server acting as the Primary – servicing user requests - and the other as the Secondary. The purpose of the Secondary server is to receive and then apply database changes from the Primary transaction log at predefined intervals. By doing so, the Secondary is synchronized with the Primary up to the last interval. This interval is configurable and for testing purposes we varied the time span from 5 to 15 minutes.

Log shipping is actually an old technology and can be implemented in various ways, from the very simple to the very robust. You can implement it by “rolling your own” or by using the wizards included in SQL Server 2000 Enterprise Edition (but not Standard). We decided to use a home grown version that was written by Bill Wunder and graciously provided for free on his website at http://b.wunder.home.comcast.net/15758.htm. These scripts are designed to be used with the SQL LiteSpeed product which we had already purchased. Thanks Bill, you rock.

In order to evaluate log shipping, a test environment was setup, SQL Server was installed and a typical production database restored. We ran a battery of tests on this setup and arrived at the following results:

Advantages

  • Once the intricacies of setting up our log shipping routines were overcome, setup was straight forward.

  • We were able to accomplish fail-over in the test environment in a matter of minutes. While this time could reasonably be expected to grow to the range of 20-40 minutes (it is a manual fail-over after all), we considered this to be acceptable. It’s important to note here that if this duration is unacceptable, then log shipping probably won’t work for you.

  • Manual fail-over. Given our experience in managing our clustering tests, the argument could be made that this is a definite advantage.

  • Single machine name and address. Although – strictly speaking – this was not a function of log shipping, by utilizing an alias, we were able to eliminate log shipping’s primary drawback – having to reconfigure applications to look at a new machine name and address after fail-over.

  • Cost. Log shipping can function using the Windows Server 2003 standard license and SQL Server 2000 Standard license. You’ll need to purchase two OS and two database licenses, but by using Bill’s home grown solution and not using the Enterprise version of either product, you still come out ahead.

  • Redundancy of disks. Log Shipping employs two separate servers to accomplish redundancy – redundancy that includes two separate disk arrays.

  • One Secondary server can service multiple Primary servers.

  • Maintenance is much simpler than that required for clustering

Disadvantages

  • Manual fail-over. The duration of system fail-over is longer than that experienced with clustering.

Decision

For us, log shipping provided my company with the best mix of benefit versus cost. We decided to continue with Bill’s home grown solution, as it had everything that we needed in a single package.

Implementation Details

It’s important to note the specifics of our implementation. We felt that one of the major drawbacks to log shipping is the need to point applications to a new server after the fail-over. Our solution to this was to use an IP alias setup by our network group and point our applications to it instead of the production server. This alias points to a specific IP address – the same as the primary.

For example, given the following server setup:

Production Server: Name : PROD1 IP : XXX.XX.XX.223

Secondary Server: Name : LOGSHIP1 IP : XXX.XX.XX.224

We create an alias like this:

Alias : Name : PROD1V IP : XXX.XX.XX.223

At this point, we changed all references in our application to point to the alias, not the physical server. The reason is that now, at fail-over:

  1. We restore the database on the secondary server (LOGSHIP1)

  2. Change the IP of the primary to an unused IP address

  3. Change the IP of the secondary to that of the alias

Once this is complete, client applications can re-connect to the alias (PROD1V) without realizing that they’re actually connecting the secondary server (LOGSHIP1).

Of course, for any fail-over to work, there are details to attend to, including

  • Synchronizing data directories and paths

    • We synchronize hourly, using a simple batch script and RoboCopy

  • Synchronizing jobs

    • We synchronize daily using DTS

  • Synchronizing Logins

    • We synchronize daily using DTS

Server/Application redundancy is one of those things that you put in place, then hope that you never have to use in production. We’ve had the occasion to fail-over using log shipping and its worked well for us.

Let me know if you have any questions or comments.

Mark F. Cook

Total article views: 5198 | Views in the last 30 days: 1
 
Related Articles
FORUM

Redundancy in SQL Server

about redundancy in sql server

ARTICLE

SQL Server Redundancy for SMBs

Building a highly fault tolerant and available SQL Server is expensive and difficult. This brings ab...

FORUM

SQL Server 2000 - Redundancy via Replication

I am looking for a design/solution fo redundancy using replication

BLOG

RAID (Redundant Array Independent Disk)

RAID (Redundant Array Independent Disk) As disk/storage plays a very important role in any applicati...

FORUM

SQL Server Cluster failing over but not showing databases

SQL Server Cluster failing over but not showing databases

Tags
 
Contribute

Join the most active online SQL Server Community

SQL knowledge, delivered daily, free:

Email address:  

You make SSC a better place

As a member of SQLServerCentral, you get free access to loads of fresh content: thousands of articles and SQL scripts, a library of free eBooks, a weekly database news roundup, a great Q & A platform… And it’s our huge, buzzing community of SQL Server Professionals that makes it such a success.

Join us!

Steve Jones
Editor, SQLServerCentral.com

Already a member? Jump in:

Email address:   Password:   Remember me: Forgotten your password?
Steve Jones