
Cluster Headache - Applying Service Pack and SQL Fails to Come On-line

Having applied dozens and dozens of Service Packs to various SQL Server clusters, I never cease to learn that not everything always goes as planned. :-)

So, a funny thing happened on the way to applying my service pack to a SQL Server 2008 R2 cluster. Well, it really wasn’t all that funny. In fact, I was, for the moment, somewhat sad at the event viewer log pronouncing the untimely demise of my SQL Server:

“[sqsrvres] CheckServiceAlive: Service is dead”

Famous last words: “He’s dead, Jim.” (YouTube clip)

Can’t help but think the author/developer of that error message was a Trekkie (noun: a fan of the US science fiction television program Star Trek).

So here’s a harrowing tale with a happy ending. Follow me on a rollercoaster ride of twists and turns on my journey through the event logs and their very informative error messages. It turned out OK, but it temporarily gave me a “cluster headache.” The solution was fairly simple, but not obvious.

I know I executed this service pack upgrade flawlessly! For a quick review, here are the steps I performed:

1. Identify the passive node
   1. Log onto any of the SQL nodes -> open a command prompt -> type ‘cluster group’ and press Enter.
   2. If the cluster groups are hosted across both nodes, run the command cluster group "<Group Name>" /move for each group, where <Group Name> is the name of the cluster group; e.g., cluster group "Available Storage" /move moves the group named Available Storage to the second node.
   3. Type the command ‘cluster resource’ and ensure all resources are online and owned by only one of the nodes.
2. Install the Service Pack on the passive node
   1. Identify the correct service pack/build for your SQL Server version: http://sqlserverbuilds.blogspot.com/ (a quick query to check your current build follows this list).
   2. Download or copy the service pack locally (on each node).
   3. Right-click -> ‘Run as Administrator’.
   4. Follow the prompts, select the instance/features, accept the license agreement, check for warnings, and click <Update> (off you go!).
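
Before you kick off the installer, it helps to note the build you are currently on (and, afterwards, to confirm the patch actually took). A minimal check, run from SSMS or sqlcmd against the instance, looks something like this:

    -- Check the current build/patch level of the instance
    SELECT SERVERPROPERTY('ProductVersion') AS ProductVersion,  -- e.g. 10.50.xxxx for SQL Server 2008 R2
           SERVERPROPERTY('ProductLevel')   AS ProductLevel,    -- RTM, SP1, SP2, ...
           SERVERPROPERTY('Edition')        AS Edition;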

Several minutes later, Node 1 is successfully updated! Voila:

Great! Now all we need to do is apply the SQL service pack on the second node, which is currently holding all the cluster resources. The next action is to fail over all the cluster groups to the passive node, which we just updated.

You can use the command-line options as follows:

- cluster group “Cluster Group” /move
- cluster group “Available Storage” /move
- cluster group “SQL Server (Instance)” /move
- cluster group “MSDTC” /move

When all groups have been moved over to the other node, verify all the resources using the command ‘cluster res’. Follow the above steps to install the service pack on the second node (which is now the passive node), so both servers are running the same service pack. When the second node has been updated, test the failover a few times to ensure that your failover clustering is working as expected.
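
While you are bouncing the groups back and forth, one quick sanity check (just a sketch, run against the clustered instance) is to confirm which physical node currently owns the instance and that the patch level is what you expect:

    -- Which physical node is hosting the clustered instance right now, and at what patch level?
    SELECT SERVERPROPERTY('ComputerNamePhysicalNetBIOS') AS CurrentHostNode,
           SERVERPROPERTY('ProductVersion')              AS ProductVersion,
           SERVERPROPERTY('ProductLevel')                AS ProductLevel;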

So, I get the above success and completion screen on Node 2, and all looks good. All I needed to do was validate that the services were online and the databases accessible. Launch SSMS, and try to connect. No go.

“17187 SQL Server is not ready to accept new client connections. Wait a few minutes before trying again....”

Ok, I can deal with that. I’m as patient as the next guy. I see another helpful error message, allegedly referring to the state of my SQL Service:

“19019 [sqsrvres] OnlineThread: Could not connect to server but server is considered up.”

Yeah, well, try telling that to the users! “Uh, you may not be able to connect to the server, but the server is considered up.” :-O

Checking the Failover Cluster Manager, I see the SQL Server services are still offline. I attempt to bring them online, and, monitoring the event viewer, I get another, more serious, game-changing error:

“error 912 Script level upgrade for database 'master' failed because upgrade step 'sqlagent100_msdb_upgrade.sql' encountered error 33009, state 2, severity 16. This is a serious error condition which might interfere with regular operation and the database will be taken offline. If the error happened during upgrade of the 'master' database, it will prevent the entire SQL Server instance from starting. Examine the previous errorlog entries for errors, take the appropriate corrective actions and re-start the database so that the script upgrade steps run to completion.”

So, this one is a “serious error.” I guess the other ones should NOT be taken too seriously, and were just light-hearted “YOU’RE SCREWED, but don’t worry” errors! Then again, it must be a serious error; there are 82 words in this one! Of course, it’s a good thing the error message lets me know that the failure of the ‘master’ database “might interfere with regular operation and the database will be taken offline.” Do you think!??

Anyways, this cannot be going in a positive direction. At this point, I was looking for the Motrin, because my cluster headache was underway.

Not having a good feeling about this, I sat breathlessly while I scrolled through the event logs, uncovering more and more red-alert-level messages. The simultaneous IMs, emails, and ringing phone letting me know the server was down were not as helpful. And yes, there was an approved and scheduled change control ticket, as well as an FYI email, before I began.

Along the way, I see that, as expected, the SQL Agent could also not start, but I was sure glad I caught that extremely informative error message:

“SQLServerAgent could not be started (reason: Unable to connect to server '(local)'; SQLServerAgent cannot start).”

So, let me try to understand this. The reason SQLServerAgent could not be started is because… (long pause for dramatic effect) …SQLServerAgent cannot start. OK, that makes perfect sense! Moving on.

In my continued travel to unravel (ha, a rhyme) this mysterious cluster conundrum, I attempted to fail over to the other node. Hey, maybe it will come online there. It was worth a shot.

No dice! I peered into the event viewer there, to see what, if any, new error message would clue me in. And then, the most dreaded message of all:

“error 3417 Cannot recover the master database. SQL Server is unable to run. Restore master from a full backup, repair it, or rebuild it. For more information about how to rebuild the master database, see SQL Server Books Online.”

Noooooooooooooooooooo! Ok, chill out, collect yourself, Pearl, and, well, find the installation media and verify the most recent backup of the master database is at hand. This was going to be a long day. And I didn’t even have lunch yet!

Just before I was to embark on the point of no return, I wanted to see if there was anything, something, I could possibly do to avoid having to rebuild & restore the master.

I hopped on Twitter and used the most helpful hashtag in all of social media: #SQLHelp. I also used my bingoogle-fu (a term I think was coined by one of the SQL Server MVPs). Oh, and I used my deductive powers of reasoning by looking over the error messages again.

Let me pause and say thank you to Argenis Fernandez (@DBArgenis) and Joe Fleming (@Muad_DBA) for their quick replies. They mentioned checking the registry and comparing keys on both nodes, and also provided some links to look at.

Having possibly seen an error 435 along the way, Joe suggested this link: SQLAgent 435. I reference it here, but I don’t think this was the issue. However, it reminded me of the upgrade error mentioned earlier:

“error 912 Script level upgrade for database 'master' failed because upgrade step 'sqlagent100_msdb_upgrade.sql'”

Hmm, SQL Agent, msdb, failed to upgrade. It does seem to involve something failing in the msdb_upgrade.sql script. I came upon this link, which was a triple play, in the sense that it contained the three error messages I saw: 912, 3417, and 33009, respectively. Keep in mind, the referenced link discusses a different scenario, where an upgrade from SQL 2000 to SQL 2008 failed, but it was useful in giving me clues toward the final resolution.

I did find one more MSDN blog, which was a direct hit on the error message, “Script level upgrade….failed”. Script! That’s it, my upgrade script failed! Ok, I thought I knew that. So, if the script failed, let’s try to disable script execution. How do we do that? The article tells us to start the server using trace flag -T902, whose basic purpose is to bypass the upgrade scripts.

In the meanwhile, the Twitter conversation was continuing, and @Muad_DBA asked me about the tempdb. What is the size and configuration of the tempdb? Yikes, it was only the default 8,192 KB, or 8 MB, with one file. He suggested that the upgrade was running out of temp space, and that tempdb growth was impacting the upgrade process. I increased the size of the tempdb, just to be sure, but didn’t see anything indicating it ran out of space or couldn’t auto-grow.
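
For reference, here is roughly what that check and bump look like (a minimal sketch; the logical file name 'tempdev' is the default and the 512 MB figure is just an illustrative value, not a recommendation):

    -- Check the tempdb file sizes and growth settings (size is stored in 8 KB pages)
    USE tempdb;
    SELECT name, size * 8 / 1024 AS size_MB, growth, is_percent_growth
    FROM sys.database_files;

    -- Grow the primary data file
    ALTER DATABASE tempdb
    MODIFY FILE (NAME = N'tempdev', SIZE = 512MB);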

I also tried one other article I found that reflected the current state of affairs, i.e., SQL Server 2008 Service fails to start after Service Pack Installation. It also mentions the Script level upgrade 'sqlagent100_msdb_upgrade.sql' failed error. It has you check and create some registry entries (specifically, to add the DefaultData & DefaultLog paths). This didn’t seem to make a difference.

Back to -T902. By using -T902 as a startup parameter, I was able to get the SQL Server service running via services.msc, log in via sqlcmd, and launch an SSMS instance.
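
Once connected under the trace flag, a couple of quick checks (again, just a sketch) confirm that the flag is actually in effect and that the instance will at least talk to you:

    -- Confirm trace flag 902 is enabled globally
    DBCC TRACESTATUS (902, -1);

    -- Basic sanity check: can we see the databases and their states?
    SELECT name, state_desc
    FROM sys.databases;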

Yet another similar hit, not exactly my build, but I stumbled closer to the issue, and here was the closest yet: SQL Server 2008 R2 Unable to Start After Applying CU1. Bringing your attention to Step 3, listed here:

“ALTER AUTHORIZATION ON DATABASE::MSDB TO SA”

When I looked at the owner of the database ‘msdb’, I did in fact see that the owner was something other than ‘sa’. It was never restored, so I’m not sure how it changed, but apparently this WAS the key.

I ran the above statement, and also sp_changedbowner ‘sa’ (which I believe is essentially the same thing, other than that ALTER AUTHORIZATION (see the highlighted hyperlink) applies to more securables in SQL Server and provides more granularity).
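
For the record, the fix itself boils down to a one-liner, run while connected under -T902 (the commented line is the older, essentially equivalent form):

    -- Hand ownership of msdb back to sa
    ALTER AUTHORIZATION ON DATABASE::msdb TO sa;
    -- USE msdb; EXEC sp_changedbowner 'sa';   -- older, essentially equivalent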

Once I changed the ownership of msdb back to ‘sa’, I restarted the services via the Failover Cluster Manager, and… several hours later, it’s online, up and running. Failed over back and forth a few times, to make sure.

So, in a nutshell, remember to check the ownership of the system databases before any upgrade and ensure that it is set to ‘sa’; otherwise, you too may end up with a cluster headache.
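
If you want to bake that into your pre-patch checklist, a simple query along these lines (just a sketch) will flag any system database whose owner has drifted away from ‘sa’:

    -- Flag system databases not owned by sa before patching
    SELECT name, SUSER_SNAME(owner_sid) AS owner
    FROM sys.databases
    WHERE database_id <= 4
      AND SUSER_SNAME(owner_sid) <> 'sa';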

Quite anti-climactic, for sure, but don’t avoid updating your SQL Servers. There is no patch for stupidity. I hope the above information is helpful.

May your SQL Servers Live Long, and Prosper!

=============================================================

You could and should follow me on Twitter! @Pearlknows

Please inquire about our comprehensive SQL Server Health Check assessment. We will provide you with a plan to optimize your slow performance! pearlknows@yahoo.com
