Failover restart policies


Sean cavanagh

Old Hand

Points: 324
January 21, 2013 at 7:00 am

#393982

Helpful folks,
I'm in an odd situation where, in my new job, I don't have a test cluster in which I can experiment; otherwise I could probably answer this question myself. In my previous job, the system admins configured the SQL cluster config policies, so I had no opportunity to experiment with this myself.
My question concerns the failure restart policies, specifically the config values displayed on the Policies tab of the Properties dialog for a given clustered resource.
In the "Response to resource failure" area of this dialog, when the "If resource fails, restart on current node" radio button is selected, there are two parameters that can be configured:
1) Period for restarts (mm:ss) and, 2) Maximum restarts in the specified period.
I have searched the web for hours for references to these parameters hoping to find a comprehensive explanation of how they work together. Every reference I've found seems to assume choices of these values should be self-evident, but to me they seem ambiguous, especially considering the default values MS recommends (15 minutes for the restart period and 1 for the maximum restarts.) So, I'm hoping that someone on this forum can answer, or point me to the answers, to these questions:
For a given clustered resource, say the SQL instance itself, if "Period for restarts" is set to 15 minutes and "Maximum restarts in the specified period" is, say, 3, does this mean that after 3 attempts to restart the instance have failed, and if these 3 restart attempts take, say, 5 minutes, the cluster service will wait another 10 minutes before failing over the cluster?
* OR *
Is the logic: make 3 attempts to restart OR wait 15 minutes, whichever takes the least amount of time? Such that if the 3 restarts took 5 minutes to attempt, the cluster would fail over after the third attempt failed, regardless of whether a total of 15 minutes has passed?
Any help or advice would be greatly appreciated.


Perry Whittle SSC Guru Points: 233832 · Answer 1

Sean cavanagh (1/21/2013)
Helpful folks,
I'm in an odd situation where, in my new job, I don't have a test cluster in which I can experiment.

No excuse: you can build a virtual sandpit quite easily using my guide at these links: Part 1, Part 2, Part 3.

Sean cavanagh (1/21/2013)
In my previous job, the system admins configured the SQL cluster config policies, so I had no opportunity to experiment with this myself.
My question concerns the failure restart policies, specifically the config values displayed on the Policies tab of the Properties dialog for a given clustered resource.
In the "Response to resource failure" area of this dialog, when the "If resource fails, restart on current node" radio button is selected, there are two parameters that can be configured:
1) Period for restarts (mm:ss) and, 2) Maximum restarts in the specified period.
I have searched the web for hours for references to these parameters hoping to find a comprehensive explanation of how they work together. Every reference I've found seems to assume choices of these values should be self-evident, but to me they seem ambiguous, especially considering the default values MS recommends (15 minutes for the restart period and 1 for the maximum restarts.) So, I'm hoping that someone on this forum can answer, or point me to the answers, to these questions:
For a given clustered resource, say the SQL instance itself, if "Period for restarts" is set to 15 minutes and "Maximum restarts in the specified period" is, say, 3, does this mean that after 3 attempts to restart the instance have failed, and if these 3 restart attempts take, say, 5 minutes, the cluster service will wait another 10 minutes before failing over the cluster?
* OR *
Is the logic: make 3 attempts to restart OR wait 15 minutes, whichever takes the least amount of time? Such that if the 3 restarts took 5 minutes to attempt, the cluster would fail over after the third attempt failed, regardless of whether a total of 15 minutes has passed?
Any help or advice would be greatly appreciated.

The help file gives this information:

Specify the number of times that you want the Cluster service to try to restart the resource during the period you specify. If the resource cannot be started after this number of attempts in the specified period, the Cluster service will take actions as specified by other fields of this tab.

For example, if you specify 3 for Maximum restarts in the specified period and 15:00 for the period, the Cluster service attempts to restart the resource three times in a given 15 minute period. If the resource still does not run, instead of trying to restart it a fourth time, the Cluster service will take the actions that you specified in the other fields of this tab.
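One way to read the help text above is as a counter over a time window. The sketch below is only a model of that reading, not the Cluster service's actual implementation: the class name `RestartPolicy`, the method `on_failure`, and the choice of a sliding window rather than a fixed one are all assumptions made for illustration.

```python
from collections import deque


class RestartPolicy:
    """Hypothetical model of 'Maximum restarts in the specified period'."""

    def __init__(self, period_s, max_restarts):
        self.period_s = period_s          # "Period for restarts", in seconds
        self.max_restarts = max_restarts  # "Maximum restarts in the specified period"
        self._restarts = deque()          # times of restarts still inside the window

    def on_failure(self, now_s):
        """Return 'restart' or 'escalate' for a failure observed at now_s."""
        # Forget restarts that have aged out of the window (assumed sliding).
        while self._restarts and now_s - self._restarts[0] >= self.period_s:
            self._restarts.popleft()
        if len(self._restarts) < self.max_restarts:
            self._restarts.append(now_s)
            return "restart"
        # Restart budget used up: hand over to the other options on the tab.
        return "escalate"


policy = RestartPolicy(period_s=15 * 60, max_restarts=3)
for t in (0, 100, 200, 300):
    print(t, policy.on_failure(t))
```

Under this reading, with 3 restarts allowed in 15:00, a fourth failure five minutes in escalates at once instead of waiting out the remaining ten minutes.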

-----------------------------------------------------------------------------------------------------------

"Ya can't make an omelette without breaking just a few eggs" 😉

Sean cavanagh Old Hand Points: 324 · Answer 2

Perry,

Thanks for your response. I will definitely look into your guide for creating a virtual sandpit.

Regarding the restart policies, I should have presented a more comprehensive question. The other two options in the same dialog area are a source of confusion to me also. The help text explaining these are:

If restart is unsuccessful, fail over all resources in this service or application

Use this box to control the way the Cluster service responds if the maximum restarts fail:

Select this box if you want the Cluster service to respond by failing the clustered service or application over to another node.

Clear this box if you want the Cluster service to respond by leaving this clustered service or application running on this node (even if this resource is in a failed state).

If all the restart attempts fail, begin restarting again after the specified period (hh:mm)

Select this box if you want the Cluster service to go into an extended waiting period after attempting the maximum number of restarts on the resource. Note that this extended waiting period is measured in hours and minutes. After the waiting period, the Cluster service will begin another series of restarts. This is true regardless of which node owns the clustered service or application at that time.

From these descriptions, these two options appear to be mutually exclusive, yet they are presented as check boxes such that both can be selected. I would have expected radio buttons such that only one could be selected. Would you know how these work together?

Thanks, Sean

Perry Whittle SSC Guru Points: 233832 · Answer 3

Sean cavanagh (1/23/2013)
Perry,
Thanks for your response. I will definitely look into your guide for creating a virtual sandpit.

Do, you'll find it extremely useful 😉

As you are aware, the help states the following:

Failover Cluster Manager snap-in help
If restart is unsuccessful, fail over all resources in this service or application
Use this box to control the way the Cluster service responds if the maximum restarts fail:
Select this box if you want the Cluster service to respond by failing the clustered service or application over to another node.
Clear this box if you want the Cluster service to respond by leaving this clustered service or application running on this node (even if this resource is in a failed state).
If all the restart attempts fail, begin restarting again after the specified period (hh:mm)
Select this box if you want the Cluster service to go into an extended waiting period after attempting the maximum number of restarts on the resource. Note that this extended waiting period is measured in hours and minutes. After the waiting period, the Cluster service will begin another series of restarts. This is true regardless of which node owns the clustered service or application at that time.

Sean cavanagh (1/23/2013)
From these descriptions, these two options appear to be mutually exclusive, yet they are presented as check boxes such that both can be selected. I would have expected radio buttons such that only one could be selected. Would you know how these work together?
Thanks, Sean

Not mutually exclusive at all; you are specifying the actions that the Cluster service should take if a restart of the failed resource is unsuccessful.

If all the attempted restarts fail too (i.e. the 3 retries within 15 minutes), then these options specify what should happen next.
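To make the "not mutually exclusive" point concrete, here is a minimal sketch of how the two check boxes could combine, based only on the help text quoted above. The function name `actions_after_restarts_exhausted` and its parameters are hypothetical, and the list order does not claim anything about the timing the Cluster service really uses.

```python
from datetime import timedelta
from typing import List, Optional


def actions_after_restarts_exhausted(
    fail_over: bool, retry_after: Optional[timedelta]
) -> List[str]:
    """List what each check box asks for once the restart budget is spent.

    fail_over   -- "If restart is unsuccessful, fail over all resources..."
    retry_after -- "If all the restart attempts fail, begin restarting again
                   after the specified period", or None if the box is cleared
    """
    actions = []
    if fail_over:
        actions.append("fail the service or application over to another node")
    else:
        actions.append("leave the service or application on this node")
    if retry_after is not None:
        # Per the help text, this applies whichever node owns the group by then.
        actions.append(f"wait {retry_after}, then begin another series of restarts")
    return actions


for action in actions_after_restarts_exhausted(True, timedelta(hours=6)):
    print(action)
```

With both boxes selected, the two requests sit side by side: one decides where the group runs, the other decides whether a failed resource is tried again later.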


Sean cavanagh Old Hand Points: 324 · Answer 4

Well, the wording seems pretty ambiguous to me:

X - "If restart is unsuccessful, fail over ... "

X - "If all the restart attempts fail, begin restarting again ... "

That sounds mutually exclusive. But regardless, I'm assuming the last choice probably trumps the first one.

The reason I'm delving into this is to troubleshoot a cluster incident that happened back in September, before I was hired. I'm looking at the cluster config parameters to try to determine the possible causes. Apparently a cluster resource failed, but the cluster never failed over. My suspicion is the restart policies that are set:

"If Resource Fails, attempt restart on current node" is selected, with values:

Period for restarts: 15:00 mins

Maximum restarts: 1

"If restart is unsuccessful, fail over all resources in this service or application" is selected

"If all the restart attempts fail, begin restarting again after the specified period (hh:mm)" is selected with a value of 6 hours (!?!)

I suspect this last parameter is what shot them in the foot.
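As a rough illustration of why the 6-hour value looks suspicious, the sketch below lists when a new series of restarts would begin over one day. It assumes the resource never comes back, that each failed series ends almost immediately, and that the waiting period is counted from the start of the failed series; the function name `restart_series_starts` is hypothetical, and none of this is confirmed Cluster service behaviour.

```python
from datetime import timedelta


def restart_series_starts(retry_after, horizon):
    """Start times of each restart series, measured from the first failure."""
    starts = []
    t = timedelta(0)
    while t <= horizon:
        starts.append(t)
        t += retry_after  # extended waiting period before the next series
    return starts


# Settings from the incident: 1 restart per series, retry after 6 hours.
for start in restart_series_starts(timedelta(hours=6), timedelta(hours=24)):
    print(start)
```

With a maximum of 1 restart per series, that would mean a single restart attempt every six hours for as long as the resource stays failed.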

BTW, thanks for the quick and informative replies.