SQLServerCentral is supported by Red Gate Software Ltd.
MS-SQL failover instance did not start after a planned failover/hotfix installation.
Posted Monday, April 15, 2013 2:00 PM
Old Hand


Group: General Forum Members
Last Login: Yesterday @ 3:09 PM
Points: 384, Visits: 1,262
Ok,

I am running SQL 2012 on Windows 2008 R2 SP1 servers. I planned a short downtime to apply KB2718576 on my SQL cluster, but when it was time to move SQL from the active node (still not patched) to the passive one (already patched), MS-SQL refused to start.

It bounced once or twice, running for a few seconds before going down. The disks were online, though. I tried to start it manually and it failed about two more times. I checked the Windows and SQL error logs without finding an actual indicator of the problem; they just said the node was down or rebooted, which is "normal" since I did it.

MS-SQL stayed online after the third attempt, and I kept an eye on the patched node, now active. After a while, and after reviewing the logs, my IT guy and I decided to pull the trigger and apply the patch on the passive node. I rebooted twice. Then I tried to move MS-SQL to the other (now patched) node and had the same problem; MS-SQL refused to start. This time, though, it did not start on the other node either. We had to shut down the passive node, and while operating with one node only, I started SQL and this time it stayed alive for good.

I am having some issues finding the root cause of this. Can someone point me in the right direction? Right now, I'm afraid I can't tell whether the cluster would survive a potential downtime due to a server crash. We do have to test failover again, but in the meantime I would like to hear suggestions, or whether someone has experienced this problem before.

Just in case, I am describing my patching steps. Let me know if I missed a step or did something wrong:

-Pause passive node
-Patch passive node
-Reboot
-Change registry key
-Reboot
-Check Win logs
-Unpause passive node
-Failover to passive  <-- here's where the problem started

The remaining steps are the same, but for the other node.
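For reference, the sequence above maps roughly onto the FailoverClusters PowerShell cmdlets available on Windows 2008 R2. This is only a sketch: the node name and the clustered group name below are placeholders, not the actual names in this cluster.

```powershell
# Sketch of the rolling-patch sequence using the FailoverClusters module.
# "NODE2" and "SQL Server (MSSQLSERVER)" are placeholder names.
Import-Module FailoverClusters

Suspend-ClusterNode -Name "NODE2"      # pause the passive node
# ... apply KB2718576 on NODE2, change the registry key, reboot twice,
#     check the Windows logs ...
Resume-ClusterNode -Name "NODE2"       # unpause the passive node

# Failover: move the SQL group to the freshly patched node
Move-ClusterGroup -Name "SQL Server (MSSQLSERVER)" -Node "NODE2"
```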

One thing I noticed when I moved the SQL instance for the first time is that the iSCSI Initiator, under "Volumes and Devices", did not show the actual Windows drive letters but the global identifiers instead. I cleared that up after clicking "Auto Configure". I do believe the active node should show the actual drive letters, not the GUIDs.

This is what I am planning to do, though I need scheduled downtime:

a) Test failover again from node2 (currently the active node) to node1, then back to node2. SQL should come online every time.
b) Run Cluster Validation again.
c) Turn the active node off; SQL should move to the passive one and start without any issues.


Any comment or help is appreciated!
Post #1442487
Posted Monday, April 15, 2013 3:03 PM


SSCertifiable


Group: General Forum Members
Last Login: Yesterday @ 10:52 AM
Points: 6,291, Visits: 13,552
sql-lover (4/15/2013)
I tried to start manually and failed like 2 more times. I checked Windows and SQL errors without an actual indicator of the problem, just saying the node was down or rebooted, which it is "normal" as I did it.

Can you detail the steps you took for the manual failover?
I find it difficult to believe that there's no info in the Windows application log. Clear the log down and then try another failover.



sql-lover (4/15/2013)
Just in case , I am describing my patching steps. Let me know if I missed a step or did something wrong:

-Pause passive node
-Patch passive node
-Reboot
-Change registry key
-Reboot
-Check Win logs
-Unpause passive node
-Failover to passive  <-- here's where the problem started

Pausing the passive node is not my preferred route. To avoid the instance attempting a failover to a node where an upgrade is in progress, modify the clustered instance's virtual network name and remove the node you are upgrading as a possible owner. Add it back once the upgrade has completed and before you attempt failover.
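Removing and re-adding a node as a possible owner can also be done from PowerShell. A sketch, assuming default-style resource names; the actual network name resource and node names will differ in your cluster:

```powershell
# Restrict possible owners of the SQL network name while NODE2 is patched.
# "SQL Network Name (SQLCLUST)", "NODE1" and "NODE2" are placeholders.
Import-Module FailoverClusters

# Remove NODE2 as a possible owner during the upgrade
Set-ClusterOwnerNode -Resource "SQL Network Name (SQLCLUST)" -Owners "NODE1"

# ... patch NODE2 ...

# Add NODE2 back before attempting failover
Set-ClusterOwnerNode -Resource "SQL Network Name (SQLCLUST)" -Owners "NODE1","NODE2"
```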



sql-lover (4/15/2013)
One thing I noticed , when I moved the SQL instance for 1st time, is that iSCSI initiator, "volume and devices", did not show the actual Windows drive letters but the Global Identifiers instead. I clear that up after I clicked "auto configure". I do believe, active node should show the actual drive letters, not the GUI.

A little more detail about your shared storage would be ideal, are you using volume mount points at all?



sql-lover (4/15/2013)

This is what I am planning to do, but I need scheduled downtime though:

a) Test failover again from node2 (currently the active node) to node 1. Then back again, to node2. SQL should come online every time.
b) Run Cluster Validation again
c) Turn active node off, and SQL should move to the passive one, and starts without any issues.


Any comment or help is appreciated!

Definitely run the validation and check the results.
Perform a manual failover using the following steps and check where the failure occurs:

  • take the group offline on the active node

  • right click the group and move to partner node

  • right click the network name and bring online

  • right click storage and bring online

  • right click sql service and bring online

  • right click sql agent and bring online
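The same step-by-step failover can be scripted, which makes it easier to see exactly which resource fails. The resource names below are the typical defaults and are assumptions; list yours first with Get-ClusterResource:

```powershell
# Manual failover, bringing each resource online one at a time.
# All resource/node names below are placeholders; verify with Get-ClusterResource.
Import-Module FailoverClusters

Stop-ClusterGroup -Name "SQL Server (MSSQLSERVER)"                # take the group offline
Move-ClusterGroup -Name "SQL Server (MSSQLSERVER)" -Node "NODE2"  # move to the partner node

# Bring resources online individually to see where the failure occurs
Start-ClusterResource -Name "SQL Network Name (SQLCLUST)"
Start-ClusterResource -Name "Cluster Disk 1"
Start-ClusterResource -Name "SQL Server"
Start-ClusterResource -Name "SQL Server Agent"
```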



-----------------------------------------------------------------------------------------------------------

"Ya can't make an omelette without breaking just a few eggs"
Post #1442501
Posted Monday, April 15, 2013 3:13 PM
Old Hand


Group: General Forum Members
Last Login: Yesterday @ 3:09 PM
Points: 384, Visits: 1,262
Perry,

Thanks for reply.

Back on the night when I troubleshot the first issue, which my SAN guy and I believe was due to the hotfix we applied, I ended up removing all the resources and adding them back to the cluster. It was a long and crazy night! Well... I just checked right now and found this: please see the attached image.

It looks like Windows did not recreate the MS-SQL failover dependencies; the disk dependencies are gone, for the SQL instance I mean. I do believe that's the reason why MS-SQL did not start after I applied the patch and moved to the other node. I may be wrong, but it looks like it...
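Missing disk dependencies can be checked and restored from PowerShell. A sketch only; the disk and network name resource names below are placeholders for whatever the cluster actually contains:

```powershell
# Inspect and repair the dependency expression on the SQL Server resource.
# Disk and network name resource names are placeholders.
Import-Module FailoverClusters

# Show the current dependency expression for the SQL Server resource
Get-ClusterResourceDependency -Resource "SQL Server"

# Re-add the dependencies lost when the resources were recreated:
# SQL Server should depend on its network name and on every disk it uses
Set-ClusterResourceDependency -Resource "SQL Server" `
    -Dependency "[SQL Network Name (SQLCLUST)] and [Cluster Disk 1] and [Cluster Disk 2]"
```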


  Post Attachments 
dependencies.png (8 views, 11.63 KB)
Post #1442507
Posted Thursday, April 18, 2013 8:29 AM
Old Hand


Group: General Forum Members
Last Login: Yesterday @ 3:09 PM
Points: 384, Visits: 1,262
Well,

A couple of things. I ran the validation test and it passed with a few warnings. Some of them say that I have multiple NICs on the same network segment, but this is normal, as I use "teaming" for the public NICs. It also says the patching level is not the same, but that's not true; it is. So I'm not sure if this is a bug or whether Cluster Validation is not properly reading the registry entries.

I also took the whole SQL failover group down, all resources. I moved it to the other node and started turning everything on one by one, and it worked. I turned the whole group off again, moved it back, and started turning the services and resources on one by one, and it worked. However, when I attempt to fail over on the fly, from one node to the other, all resources come online, including SQL, but after a few seconds it fails. So I dug into the SQL log and found this entry, just before it went down:

2013-04-17 18:09:40.42 spid11s     Could not create tempdb. You may not have enough disk space available. Free additional disk space by deleting other files on the tempdb drive and then restart SQL Server. Check for additional errors in the event log that may indicate why the tempdb files could not be initialized.
2013-04-17 18:09:40.42 spid11s SQL Trace was stopped due to server shutdown. Trace ID = '1'. This is an informational message only; no user action is required.
2013-04-17 18:09:40.45 Logon Error: 18456, Severity: 14, State: 38.



The only other error at startup was this:

2013-04-17 18:09:26.69 Server      The SQL Server Network Interface library could not register the Service Principal Name (SPN) [ MSSQLSvc/xxxxxx] for the SQL Server service. Windows return code: 0x2098, state: 20. Failure to register a SPN might cause integrated authentication to use NTLM instead of Kerberos. This is an informational message. Further action is only required if Kerberos authentication is required by authentication policies and if the SPN has not been manually registered.


But I do not believe that is the issue here.

I guess my question now is: is tempdb too small for my SQL 2012 cluster, failing to initialize and then bringing the SQL service down? How do I calculate or measure that? tempdb is recreated after a restart, and I've checked during normal hours and it looks fine.
Post #1443902