Cluster Errors - No Auto Fail over??

  • I am somewhat of a beginner when it comes to clustering. I am learning on the job and looking into good training to speed up this process. That being said, I had a cluster fail this morning and I am not sure why. The questions I have been researching all morning are: what caused this error (which I think we may know, and the VMware/SAN guy is looking into it), why didn't the automatic failover happen, and what can I do to mitigate this issue?

    My environment is 64-bit Windows Server 2012 Standard Edition running SQL Server 2012 SP1 Standard Edition (64-bit) in an active/passive cluster. Unfortunately, I had no involvement in building this cluster. The person who built it is no longer with the company and is not responsive to questions.

    This morning the following errors popped up:

    Event ID: 1230 - Cluster resource 'Databases' (resource type 'Physical Disk', DLL 'clusres.dll') did not respond to a request in a timely fashion. Cluster health detection will attempt to automatically recover by terminating the Resource Hosting Subsystem (RHS) process running this resource. This may affect other resources hosted in the same RHS process. The resources will then be restarted.

    The suspect resource 'Databases' will be marked to run in an isolated RHS process to avoid impacting multiple resources in the event that this resource failure occurs again. Please ensure services, applications, or underlying infrastructure (such as storage or networking) associated with the suspect resource is functioning properly.

    AND

    Event ID: 1146 - The cluster Resource Hosting Subsystem (RHS) stopped unexpectedly. An attempt will be made to restart it. This is usually associated with recovery of a crashed or deadlocked resource. Please determine which resource and resource DLL is causing the issue and verify it is functioning properly.

    When I saw these errors, I brought the "Databases", "Logs", and "Quorom" disks back online and all was well. I then checked the Application and System logs and noticed there were some VMware/SAN errors that caused a CPU spike, resulting in the cluster errors.

    Regarding my questions, why a failover didn't occur and what I can do to mitigate this are now what I am focusing on. Researching the above errors, I discovered that you can set a policy that controls what happens in a situation like the one this morning. You can set policies at the Role level, the "Server" level, and the disk level.

    The Role policy seems to be a little less involved; the screenshot RoleProperties.png shows these settings. I assume these are defaults. The server name properties are set as shown in the screenshot Servername_Properties_Policy.png. This seems correct to me, but this is where my knowledge is lacking. I think this means that if it fails at this level, all associated resources, disks, etc. will fail over to node 2.

    Now, where I think the issue is: the third screenshot, Database_Properties_Policy.png. There is no policy set for the situation where a disk resource fails, which I believe is what happened this morning. Some questions:

    1. Do I need to set this similar to what is in the ServerName screenshot?

    2. What happens if I check the box for "If restart is unsuccessful, fail over all resources in the role"? Will this only fail over this one disk? If so, it doesn't seem like that would be ideal. Would it fail over at the Role level and move over all associated resources?

    I am also not sure what values I should set for the period for restarts, maximum restarts, delay between restarts, etc.

    I have not had a lot of luck finding information on this online. Clicking on "More about restart policies" gives good explanations, but unfortunately it's not enough for me to be confident in what I am about to do.

    My apologies in advance for the long post, but ANY and ALL help will be greatly appreciated!!!

  • It's entirely likely that the service did fail over. If the SAN (or the switch network supporting the SAN) was having problems, this could affect both servers at the same time, meaning neither server could host the service properly. After a few tries, the cluster service will give up and leave a set of resources down. This is better than the infinite ping-pong game you could have on older clusters (personally experienced on Windows 2000).

    This is why the recommendation is for each node to be connected to its own distinct network switch.
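
    To make the restart and failover behaviour a bit more concrete, here is a rough Python sketch of how a per-resource policy could play out. It is only a simplified model, not the actual cluster service logic. The class and function names are made up for illustration, the numbers are hypothetical rather than recommended values, and I am assuming that a disk with "no policy set" behaves like a resource with restarts turned off.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    RESTART = "restart the resource on the current node"
    FAIL_OVER_ROLE = "fail over every resource in the role"
    LEAVE_FAILED = "leave the resource failed"


@dataclass
class ResourcePolicy:
    # Field names are illustrative labels, not real cluster property names.
    restart_on_failure: bool  # attempt a restart on the current node at all?
    max_restarts: int         # maximum restarts allowed within the period
    period_seconds: int       # length of the period used for counting restarts
    fail_over_role: bool      # "If restart is unsuccessful, fail over all resources in the role"


def next_action(policy, earlier_restarts, now):
    """Decide what happens when the resource fails at time `now` (seconds).

    `earlier_restarts` holds the times of restarts that were already attempted.
    """
    if not policy.restart_on_failure:
        return Action.LEAVE_FAILED
    # Only restarts inside the current period count against the limit.
    recent = [t for t in earlier_restarts if now - t < policy.period_seconds]
    if len(recent) < policy.max_restarts:
        return Action.RESTART
    if policy.fail_over_role:
        # The whole role moves to the other node, not just the one resource.
        return Action.FAIL_OVER_ROLE
    return Action.LEAVE_FAILED


# Hypothetical settings, not recommendations.
disk_no_policy = ResourcePolicy(False, 0, 0, False)
disk_with_failover = ResourcePolicy(True, 1, 900, True)

print(next_action(disk_no_policy, [], now=0).value)         # leave the resource failed
print(next_action(disk_with_failover, [], now=0).value)     # restart the resource on the current node
print(next_action(disk_with_failover, [0], now=60).value)   # fail over every resource in the role
```

    In this model, a second failure within the period skips another restart and moves the whole role. If the storage problem is visible from both nodes, the resource will just fail again on the other node, which is how you can end up with resources left down.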

  • Thanks for the reply Matt!!

    To your point about the SAN being unavailable: if that were the case, I should be able to log onto node 2 and see errors pertaining to that. Logging onto node 2 and combing through the Application and System logs, I don't see any issues. To me it looks like node 2 was ready and the failover just didn't happen.

    I appreciate the feedback...

  • In reply to GBeezy (6/25/2014):

    The policy for the shared disk does not conform to the standard policy applied during resource creation. Someone has changed it, likely your non-responsive predecessor 😉

    In this situation, if it were me (and I have done this before 😉), I would pull the nodes from the cluster one at a time and rebuild them to my exacting standards. That way I'd be confident of the system I'm supporting.

    -----------------------------------------------------------------------------------------------------------

    "Ya can't make an omelette without breaking just a few eggs" 😉
