Having applied dozens and dozens of Service Packs to various SQL Server clusters, I still never cease to learn that not everything always goes as planned. J So, a funny thing happened on the way to apply my service pack to a SQL Server 2008 R2 cluster. Well, it really wasn’t all that funny. In fact, I was for the moment, somewhat sad at event viewer log pronouncing the untimely demise of my SQL Server:
“[sqsrvres] CheckServiceAlive: Service is dead”
Famous last words, “He’s dead Jim.” (youtube clip)
Can’t help to think the author/developer of that error msg, was a Trekkie (noun - a fan of the US science fiction television program Star Trek.)
So here’s a harrowing tale with a happy ending. Follow me on a rollercoaster ride of twists and turns on my journey through the event logs, and it’s very informative error messages. It turned out ok, but it temporarily gave me a “cluster headache” The solution was fairly simple, but not obvious.
I know I executed this service pack upgrade flawlessly! For a quick review here, I performed the following steps:
Identified the passive node
Log onto any of the SQL nodes -> Open command prompt -> Type ‘Cluster Group” and Enter.
If the cluster groups are hosted between the two nodes, you need to run the command “ group “<Group Name>” /move” where <Group Name> is the name of each cluster group e.g “cluster group “Available Storage” /move“ to move the group named Available Storage to the second node.
Type the command “cluster resource” and ensure all resources are online and owned by only one of the nodes.
2. Install the Service Pack on the Passive Node
Identify the correct service pack/build for your SQL Server version http://sqlserverbuilds.blogspot.com/
Download or copy the service pack locally (on each node)
Right-click à ‘Run as Administrator
Follow the prompts, select the instance/features, accept the license agreement, check for warnings, click <Update> (off you go!)
Several minutes later, Node 1 is successfully updated! Voila:
Great! Now all we need to do is apply the SQL service pack on the second node which is currently holding all the cluster resources. The next action is to failover all the cluster groups to the passive node which we just updated.
You can use the command line options as follows:
- cluster group “Cluster Group” /move
- cluster group “Available Storage” /move
- cluster group “SQL Server (Instance)” /move
When all groups have been moved over to the other node, verify all the resources using the command ‘cluster res’ Follow the above steps to install the service pack on the second node (which is now the passive node) so both servers are running the same service pack. When the second node had been updated, test the failover a few times to ensure that your failover clustering is working as expected.
So, I get the above success and completion screen on Node2, and all looks good. All I needed to do, was validate the services were online, and databases accessible. Launch SSMS, and try to connect. No go.
“17187 SQL Server is not ready to accept new client connections. Wait a few minutes before trying again....”
Ok, I can deal with that. I’m as patient as the next guy. I see another helpful error msg, allegedly referring to the state of my SQL Service:
“19019 [sqsrvres] OnlineThread: Could not connect to server but server is considered up.”
Yeah, well, try telling that to the users! “Uh, you may not be able to connect to the server, but the server is considered up” :-O
Checking the Failover Cluster Manager, I see the SQL Services, still offline. I attempt to bring it online, and monitoring the event viewer I get another more serious game changing error:
“error 912 Script level upgrade for database 'master' failed because upgrade step 'sqlagent100_msdb_upgrade.sql' encountered error 33009, state 2, severity 16. This is a serious error condition which might interfere with regular operation and the database will be taken offline. If the error happened during upgrade of the 'master' database, it will prevent the entire SQL Server instance from starting. Examine the previous errorlog entries for errors, take the appropriate corrective actions and re-start the database so that the script upgrade steps run to completion”
So, this one is a “serious error”. I guess the other ones should NOT be taken too seriously, and just a light-hearted “YOU’RE SCREWED” but don’t worry error! Then again, it must be a serious error, there’s 82 words in this one! Of course, it’s a good thing the error message lets me know that, the fact that the ‘master’ database failed, “might interfere with regular operation and the database will be taken offline.” Do you think!??
Anyways, this cannot be going in a positive direction. At this time, I was looking for that motrin, because my cluster headache was underway.
Not having a good feeling about this, I sat breathlessly while I scrolled through the event logs, uncovering more and more red-alert level messages. The simultaneous IMs, email, and ringing phone to let me know the server is down, was not as helpful. And yes, there was an approved and scheduled change control ticket, as well as an fyi email, before I began.
Along the way, I see that, as expected, the SQLAgent could also not start, but was sure glad I caught that extremely informative error message:
“SQLServerAgent could not be started (reason: Unable to connect to server '(local)'; SQLServerAgent cannot start).”
So, let me try to understand this. The reason why SQLServerAgent could not be started is because….(long pause for dramatic effect) …SQLServerAgent cannot start. OK, that makes perfect sense! Moving on.
In my continued travel to unravel (ha a rhyme) this mysterious cluster conundrum, I attempted to failover to the other node. Hey, maybe it will come online there. Was worth a shot.
No dice! I peered into the event viewer there, to see what, if any, new error message would clue me in:
And then, the dreaded of all messages,
“error 3417 Cannot recover the master database. SQL Server is unable to run. Restore master from a full backup, repair it, or rebuild it. For more information about how to rebuild the master database, see SQL Server Books Online.”
Noooooooooooooooooooo! Ok, chill out, collect yourself Pearl, and well, find the installation media, and verify the most recent backup of the master database is at hand. This was going to be a long day. And, I didn’t even have lunch yet!
Just before, I was to embark on the point of no return, I wanted to see if there was anything, something, I could possibly do to avoid having to rebuild & restore the master.
I hopped on twitter, and used the most helpful hash tags of all social media: #SQLHelp I also used my bingoogle-fu, (a term I think was coined by one of the SQL Server MVPs). Oh, and I used my deductive powers of reasoning by looking over the error messages again.
Let me pause, and say thank you to Argenis Fernandez (@DBArgenis) and Joe Fleming (@Muad_DBA) for their quick replies. They mentioned to check the registry, compare keys on both nodes, as well as provided some links to look at.
Having possibly seen an error 435 along the way, Joe suggested this link: SQLAgent 435. I reference it here, but don’t think this was the issue. However, it reminded me of the upgrade error mentioned earlier:
“error 912 Script level upgrade for database 'master' failed because upgrade step 'sqlagent100_msdb_upgrade.sql'”
Hmm, SQLAgent, msdb, failed to upgrade. It does seem to involve something failing in the msdb_upgrade.sql script. I came upon this link, which was a triple-play, in the sense that it contained the three error messages I saw, 912, 3417, and 33009, respectively. Keep in mind, the referenced link discusses a different scenario where an upgrade from SQL 2000 to SQL 2008 failed. But it was useful in giving me clues towards the final resolution.
I did find one more msdn blog, which was a direct hit on the error message, “Script level upgrade….failed” Script! That’s it, my upgrade script failed! Ok, I thought I knew that. So, if the script failed, let’s try to disable script execution. How do we do that? The article tells us to start the server by using trace flag –T902, which basically its purpose is to bypass upgrade scripts.
In the meanwhile, the twitter conversation was continuing, and @Muad_DBA asked me about the tempdb. What is the size and configuration of the tempdb? Yikes, it was only the default 8,192 or 8MB, one file. He suggested that the upgrade was running out of temp space to do the upgrade, and the tempdb growth is impacting the upgrade process. I increased the size of the tempdb, just to be sure, but didn’t see anything indicating it ran out of space or couldn’t auto-grow.
I also tried one other article I found, that reflected the current state of affairs, ie, SQL Server 2008 Service fails to start after Service Pack Installation. It also mentions the Script level upgrade 'sqlagent100_msdb_upgrade.sql' failed error. It has you check and create some registry entries (specifically to add the DefaultData & DefaultLog paths). This didn’t seem to make a difference.
Back to –T902. By using –T902 as startup parameter, I was able to get the SQL Service running, via the services.msc, and log in via sqlcmd, as well as launch a SSMS instance.
Yet, another similar hit, but not exactly my build, I stumbled closer to the issue, and here was the closest yet. SQL Server 2008 R2 Unable to Start After Applying CU1. Bringing you attention to Step 3, listed here:
“ALTER AUTHORIZATION ON DATABASE::MSDB TO SA”
When I looked at the owner of database ‘msdb’, I did in fact see that it was an owner, other than ‘sa’. It was never restored, so not sure how it changed, but apparently this WAS the key.
I ran the above statement, and also, sp_changedbowner ‘sa’ (which I believe is essentially the same thing, other that Alter Authorization, (see the highlighted hyperlink) applies to greater securables in SQL Server, and provides more granularity.
Once I changed the ownership of msdb back to ‘sa’, I restarted the services via the Failover Cluster Manager, and….several hours later, it’s online, up and running. Failed over back and forth a few times, to make sure.
So, in a nutshell, remember to check the ownership of the system databases before any upgrade, and ensure that it is set to ‘sa’, because you too will end up with a cluster headache.
Quite anti-climatic for sure, but don’t avoid updating your SQL Servers. There is no patch for stupidity. I hope the above information is helpful.
May your SQL Servers Live Long, and Prosper !
You could and should follow me on Twitter! @Pearlknows
Please inquire about our comprehensive SQL Server Health Check assement. We will provide you with a plan to optimize your slow performance! firstname.lastname@example.org