SQL Clone
SQLServerCentral is supported by Redgate
 
Log in  ::  Register  ::  Not logged in
 
 
 


MS-SQL failover instance did not start after a planned failover/hotfix installation.


MS-SQL failover instance did not start after a planned failover/hotfix installation.

Author
Message
sql-lover
sql-lover
SSCarpal Tunnel
SSCarpal Tunnel (4.4K reputation)SSCarpal Tunnel (4.4K reputation)SSCarpal Tunnel (4.4K reputation)SSCarpal Tunnel (4.4K reputation)SSCarpal Tunnel (4.4K reputation)SSCarpal Tunnel (4.4K reputation)SSCarpal Tunnel (4.4K reputation)SSCarpal Tunnel (4.4K reputation)

Group: General Forum Members
Points: 4441 Visits: 1930
Ok,

I am running SQL2012 on Windows 2008R SP1 servers. Planned a short downtime to apply KB2718576 on my SQL Cluster, but when it was time to move SQL from the active (still not patched node) to the passive one (patched already) MS-SQL refused to start.

It bounced one or two times, running for a few seconds, then it went down. Disks were online tough. I tried to start manually and failed like 2 more times. I checked Windows and SQL errors without an actual indicator of the problem, just saying the node was down or rebooted, which it is "normal" as I did it.

M-SQL stayed online after third attempt and I kept an eye on the patched node, now active. After a while and reviewing the logs, my IT guy and I decided to pull the trigger and apply patch on passive node. I rebooted two times. And then I tried to move MS-SQL to the other now (now patched) and had same problem; MS-SQL refused to start. This time tough, did not start on the other node either. We had to shutdown the passive node and while operating with one node only, I started SQL and this time stayed alive for good.

I am having some issues finding the root cause of this. Can someone point me to the right direction? Right now, I'm afraid I can tell if the Cluster may or may not survive a potential downtime due a server crash. We do have to test fail over again, but in the meantime, would like to hear suggestions or if someone experienced this problem before.

Just in case , I am describing my patching steps. Let me know if I missed a step or did something wrong:

-Pause passive node
-Patch passive node
-Reboot
-Change registry key
-Reboot
-Check Win logs
-Unpause passive node
-Failover to passive < -- here's where the problem started

Remaining steps are the same, but for other node.

One thing I noticed , when I moved the SQL instance for 1st time, is that iSCSI initiator, "volume and devices", did not show the actual Windows drive letters but the Global Identifiers instead. I clear that up after I clicked "auto configure". I do believe, active node should show the actual drive letters, not the GUI.

This is what I am planning to do, but I need scheduled downtime though:

a) Test failover again from node2 (currently the active node) to node 1. Then back again, to node2. SQL should come online every time.
b) Run Cluster Validation again
c) Turn active node off, and SQL should move to the passive one, and starts without any issues.


Any comment or help is appreciated!
Perry Whittle
Perry Whittle
SSC Guru
SSC Guru (53K reputation)SSC Guru (53K reputation)SSC Guru (53K reputation)SSC Guru (53K reputation)SSC Guru (53K reputation)SSC Guru (53K reputation)SSC Guru (53K reputation)SSC Guru (53K reputation)

Group: General Forum Members
Points: 53095 Visits: 17672
sql-lover (4/15/2013)
I tried to start manually and failed like 2 more times. I checked Windows and SQL errors without an actual indicator of the problem, just saying the node was down or rebooted, which it is "normal" as I did it.

Can you detail the steps you took for the manual failover?
I find it difficult to believe that theres no info in the Windows application log, clear the log down and then try another failover.



sql-lover (4/15/2013)
Just in case , I am describing my patching steps. Let me know if I missed a step or did something wrong:

-Pause passive node
-Patch passive node
-Reboot
-Change registry key
-Reboot
-Check Win logs
-Unpause passive node
-Failover to passive < -- here's where the problem started

Pausing the passive node is not my preferred route. To avoid the instance attempting a failover to a node where an upgrade is in progress, modify the clustered instances virtual networkname and remove the node you are upgrading as a possible owner. Add it back once the upgrade has been completed and before you attempt failover.



sql-lover (4/15/2013)
One thing I noticed , when I moved the SQL instance for 1st time, is that iSCSI initiator, "volume and devices", did not show the actual Windows drive letters but the Global Identifiers instead. I clear that up after I clicked "auto configure". I do believe, active node should show the actual drive letters, not the GUI.

A little more detail about your shared storage would be ideal, are you using volume mount points at all?



sql-lover (4/15/2013)

This is what I am planning to do, but I need scheduled downtime though:

a) Test failover again from node2 (currently the active node) to node 1. Then back again, to node2. SQL should come online every time.
b) Run Cluster Validation again
c) Turn active node off, and SQL should move to the passive one, and starts without any issues.


Any comment or help is appreciated!

Definitely run the validation and check the results.
Perform a manual failover using the following steps and check where the failure occurs

  • take the group offline on the active node

  • right click the group and move to partner node

  • right click networkname and bring online

  • right click storage and bring online

  • right click sql service and bring online

  • right click sql agent and bring online



-----------------------------------------------------------------------------------------------------------

"Ya can't make an omelette without breaking just a few eggs" ;-)
sql-lover
sql-lover
SSCarpal Tunnel
SSCarpal Tunnel (4.4K reputation)SSCarpal Tunnel (4.4K reputation)SSCarpal Tunnel (4.4K reputation)SSCarpal Tunnel (4.4K reputation)SSCarpal Tunnel (4.4K reputation)SSCarpal Tunnel (4.4K reputation)SSCarpal Tunnel (4.4K reputation)SSCarpal Tunnel (4.4K reputation)

Group: General Forum Members
Points: 4441 Visits: 1930
Perry,

Thanks for reply.

Back to the night when I troubleshoot the 1st issue, that my SAN guy and I believe was due the hotfix we applied, I ended removing all resources and adding those back to Cluster. It was a long and crazy night! Well ... I just check right now and found this: please see attached image.

It looks Windows did not recreate the MS-SQL failover dependencies; disk dependencies are gone, for SQL instance I mean. I do believe, that's the reason why MS-SQL did not start after I applied the patch and moved to other node. I may be wrong, but it looks like ...
Attachments
dependencies.png (26 views, 11.00 KB)
sql-lover
sql-lover
SSCarpal Tunnel
SSCarpal Tunnel (4.4K reputation)SSCarpal Tunnel (4.4K reputation)SSCarpal Tunnel (4.4K reputation)SSCarpal Tunnel (4.4K reputation)SSCarpal Tunnel (4.4K reputation)SSCarpal Tunnel (4.4K reputation)SSCarpal Tunnel (4.4K reputation)SSCarpal Tunnel (4.4K reputation)

Group: General Forum Members
Points: 4441 Visits: 1930
Well,

Couple of things. I ran the validation test and passed with a few warnings. Some of them saying that I have mutiple NICs on same network segment, but this is normal, as I use "teaming" for the public NICs. It also says the patching level it's not the same, but that's not true, it is. So not sure if this is a bug or the Cluster Validation it's not properly reading the registry entries.

I also turned the whole SQL failover group down, all resources. I moved to other node, and started turning on one by one, it worked. I turned the whole group off again, moved back, and started turning services and resources one by one, it worked. However, when I attempt to failover on the fly, from one node to another one, all resources come online, including SQL, but after a few seconds, it fails. So I dug into SQL log and found this entry, just before going down:


2013-04-17 18:09:40.42 spid11s Could not create tempdb. You may not have enough disk space available. Free additional disk space by deleting other files on the tempdb drive and then restart SQL Server. Check for additional errors in the event log that may indicate why the tempdb files could not be initialized.
2013-04-17 18:09:40.42 spid11s SQL Trace was stopped due to server shutdown. Trace ID = '1'. This is an informational message only; no user action is required.
2013-04-17 18:09:40.45 Logon Error: 18456, Severity: 14, State: 38.



The only other error when starting was this


2013-04-17 18:09:26.69 Server The SQL Server Network Interface library could not register the Service Principal Name (SPN) [ MSSQLSvc/xxxxxx] for the SQL Server service. Windows return code: 0x2098, state: 20. Failure to register a SPN might cause integrated authentication to use NTLM instead of Kerberos. This is an informational message. Further action is only required if Kerberos authentication is required by authentication policies and if the SPN has not been manually registered.


But I do not believe that may be the issue here.

I guess that my question now is,
Is tempdb too small for my SQL2012 Cluster, failing to initialize, and then bring SQL service down? How to calculate or measure that. tempdb is cleared up after a restart. And I've checked during normal hours and it looks fine.
Go


Permissions

You can't post new topics.
You can't post topic replies.
You can't post new polls.
You can't post replies to polls.
You can't edit your own topics.
You can't delete your own topics.
You can't edit other topics.
You can't delete other topics.
You can't edit your own posts.
You can't edit other posts.
You can't delete your own posts.
You can't delete other posts.
You can't post events.
You can't edit your own events.
You can't edit other events.
You can't delete your own events.
You can't delete other events.
You can't send private messages.
You can't send emails.
You can read topics.
You can't vote in polls.
You can't upload attachments.
You can download attachments.
You can't post HTML code.
You can't edit HTML code.
You can't post IFCode.
You can't post JavaScript.
You can post emoticons.
You can't post or upload images.

Select a forum

































































































































































SQLServerCentral


Search