As time goes by, what was the latest product yesterday is old today. SQL Server is not an exception, and the day has arrived when the production cluster is out-dated. The intention of this series of articles is to outline a quick guide for upgrading SQL clustered instances to 2008. The main goal is to identify the steps needed to reduce the chances of failure. To do so, there are three articles that will cover each part of the upgrade process: 1) Before the Upgrade; 2) Upgrading the Cluster; 3) Post Upgrade Actions.
Before the Upgrade
The cluster I needed to upgrade consisted of four nodes, having three active ones and one passive. The active nodes had one instance running, hosting databases for different products like Sharepoint, Biztalk, Reporting Services and also custom databases. This cluster was critical to the organization, which meant that it was imperative that the upgrade was successful.
When you need to perform such a big change on a production system, you need to understand everything about your environment and also document your steps and review them over and over again before starting. If you have a co-worker or a colleague that can review your plan, don’t hesitate to share it with that person...
First, you need to know which SQL jobs are enabled and which are disabled. This is important because you want to disable all SQL jobs before you perform the upgrade. This is mainly because when you are upgrading a clustered SQL Server instance, you’ll need to perform a failover of the instance, so it’s better to have all jobs disabled and avoid the interruption of a SQL job while it’s executing.
Also, it’s a good idea to export the history of the jobs to understand if failure executions are frequent or normal. That can help you if a job starts failing after the upgrade because you can check the history of it and compare the behavior with the one before the upgrade.
Custom libraries, tasks and components
It’s common to have custom SSIS tasks deployed or custom libraries for a particular database or process. If you do, you need to ensure it will work on the new version, meaning that you’ll need to either refactor the custom library or contact the company who provided it to get an updated version compatible with SQL 2008.
Additionally you need to document the process for upgrading those libraries or tasks. Sometimes it can be as easy as replacing a dll file, reinstalling it using a “setup” program or even changing registry values. What is important here is that no matter which process is involved, you need to write down the steps and validate that it’s all you need.
Prepare the servers
Before starting the upgrade process, you need to ensure that the servers are ready for the upgrade. The first and most basic task you need to perform is a failover of the resources, to validate all instances can be online on all nodes.
As I mentioned earlier, this article describes the upgrade of a 4 node cluster with 3 active nodes having one SQL instance per active node. Testing the failover means that you need to failover Instance A to nodes one, two, three and four. Next, do the same with Instance B and C. If you cannot failover one group to a particular node or nodes, then you should not start the upgrade as it will fail.
Another thing you should check is the drivers and firmware of the servers, specifically the ones related to disks, storage cards and NICs. You should validate if the versions you have are up to date and investigate if there’s a known issue with those drivers (either the new ones or the drivers you already have) with SQL Server.
You also need to ensure that the drivers and firmware are the same on all the nodes, so it is worth spending some time checking the versions of the drivers of the NICs, Disks, SCSI controllers, etc. If you don’t have a program to verify this, just go to the Computer Management, Device Manager and select the item you want to check. Then, right click it, select Properties and go to the Driver tab. Take note of the driver’s details and go to the next one. Then, do the same on the other nodes and compare the values.
After that, disable the antivirus software on the cluster’s nodes and any other software that can install applications in an unattended way, like Windows Updates. After those applications and services are disabled, reboot each node. This is important because you don’t want to fail on the upgrade because of a "pending restart" situation.
Last but not least, ensure you’re not missing any important files in C:\Windows\Installer folder on each node. It’s not uncommon to find people deleting that folder’s contents because of disk space issues. If you do that, the upgrade will fail for sure. The way to validate this, I’m afraid, is a long manual process, and as any other manual process, it requires documentation and review.
For identifying these files, you must open the Registry in one cluster node and navigate to the following path: HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows\CurrentVersion\Installer\UserData\S-1-5-18\Products. Once there, search for Microsoft SQL. You’ll find something similar to this:
As you can see, this example is for SQL Server Analysis Services. The LocalPackage key has the information of the filename inside C:\Windows\Installer folder. In this case, its name is 17724a1.msi. Now go check if that file exists. If it doesn’t, then you need to grab the CD of the SQL Server 2005 installation and locate the Analysis Services msi file. Then, just copy it to the C:\Windows\Installer folder with the name that shows the registry key.
Now, if you look at the left of that screen, you’ll see that under InstallProperties folder there’s a Patches folder. If you expand that, you’ll see a key called AllPatches that hosts all the patches installed to this product. Also you’ll find a folder per patch installed with the description of the patch (the KB number, for instance). Next, you need to locate that folder on the following path: HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows\CurrentVersion\Installer\UserData\S-1-5-18\Patches and get the information of the msi filename again. If that file does not exists in the C:\Windows\Installer folder, then download the patch using the KB number as reference, locate the msi there and copy it to the C:\Windows\Installer folder with the name found on the registry, repeating this for each patch you have installed.
The following step then, is to locate the next product on the registry and perform this process all over again; validating the msi files are on the C:\Windows\Installer folder. Once you check all the products (SQL Server, SQL Analysis Service, SQL Integration Services, SQL Server Agent, SQL Client Tools) and patches, then you need to perform this on the other nodes. If you have some scripting knowledge, you can spend some time finding the way to automate this on one server and then run it on the other nodes. If not, then you need to manually check this.
Although this does not guarantee you’re not missing a file inside the C:\Windows\Installer folder, at least it does guarantee that the main files are there. For instance, I have had a problem once (standalone server, applying a Service Pack) because of the Office Web Components. The msi was missing from the C:\Windows\Installer so it failed. I could not reinstall it because of the missing file, so I had to extract the msi from the installer package and put it on the C:\Windows\Installer folder. Then, just to be sure, I uninstalled it using the latest version and installed it again. So, you can miss a file, or this can be a huge waste of time because all your files inside C:\Windows\Installer are there, but remember it’s better to waste time before doing the upgrade than wasting time trying to rebuild a cluster because of a missing file.
When you’re about to make such a change as a SQL Server upgrade, you should be prepared to face a potential failure, no matter how sure you are or how straight forward the process is. That preparation includes a valid backup of all your databases before you do the upgrade. A valid backup means that you perform the backup of the database and that you ensure the backup’s content is readable. If you perform a backup to disk and, for whatever reason it’s corrupted, then that backup is worthless.
Another type of backup you can do is a copy of the mfd and ldf files of the master and msdb databases. You can do that by shutting down the SQL Server instance and copying the files somewhere else (preferably on another server). In case if the upgrade fails badly and the instance is completely useless because of corruption on the master database (and yes, I’m thinking of weird and bad scenarios), you can shut down the instance and replace the mdf/ldf of the master database.
As a nice to have, you should perform a full backup of the whole servers to tape devices. As I said, you should be prepared for failure, as that’s the only way to act quickly and get the environment up and running if something goes wrong.
Get Everybody Out
Reducing or avoiding users and applications that are not connected to the SQL Instance is not a minor detail. As a general rule, you’ll perform SQL upgrades on maintenance hours, when there aren’t users online. Depending on the applications your company runs that use the SQL instance as a backend, during the maintenance windows the connections to SQL server are reduced, but you can’t rely just on that.
On my environment, besides custom applications executed on user’s machines relying on SQL Server, there are web servers, Sharepoint servers, Biztalk server, etc. running 24x7. Regardless if a user is connected or not to a Sharepoint site, there are custom tasks that Sharepoint performs based on its own schedule, that connects to its core databases and run transactions. The same applies to Biztalk, where you have active connections while its services are up.
For scenarios like this, you should shut down or disable this kind of application to avoid having active connections while you perform the upgrade. Also, try to disable any other custom applications that could connect to the SQL Server instance being upgraded. Another way is to disable accounts, or deny the ability to log on, to prevent the access to the instance.
For a successful upgrade, you need to plan, review the plan, execute the plan on a sandbox and review the plan again. Only by being prepared for an upgrade failure, can you act quickly if something goes wrong. If you check over and over every little piece of your environment before even starting the upgrade, you’ll reduce the chances of failure.
In the next part of this series, I’ll cover the actual upgrade process of the clustered SQL Server instances.