Ownership of cluster disk 'Cluster Disk xxx' has been unexpectedly lost by this node.

sql-lover

SSCoach

Points: 18530
April 23, 2013 at 8:09 am

#274505

My Cluster went down again. I don't have to say... I am having a not so good morning already ... 🙁
Here's the Cluster's error:
Ownership of cluster disk 'Cluster Disk xxx' has been unexpectedly lost by this node. Run the Validate a Configuration wizard to check your storage configuration.
This looks to me like an OS or SAN error: the LUN disappears, then SQL goes down. Following our SAN admin's advice, we applied this patch: http://support.microsoft.com/?id=2718576 ... but it does not look like it resolved the main issue.
We started having this issue a few weeks ago. Before that it ran fine for about two months, maybe a bit more, but with less workload.
Has anyone experienced this problem before?

Perry Whittle SSC Guru Points: 233915 · Answer 1

Are you using iSCSI attached storage and MPIO?

Can you provide a little more info on the storage and the connectivity from the nodes?

-----------------------------------------------------------------------------------------------------------

"Ya can't make an omelette without breaking just a few eggs" 😉

sql-lover SSCoach Points: 18530 · Answer 2

Hi Perry,

The shared storage is a Dell Compellent SC8000 SAN, connected via iSCSI / MPIO to both nodes. The Windows Cluster runs on Win2008R2 SP1. MS-SQL runs SQL2012 Standard.

I also found this error in the Windows log:

Connection to the target was lost. The initiator will attempt to retry the connection.

It clearly looks like an iSCSI / MPIO issue. In both incidents, the iSCSI mapping was lost, then SQL went down.

Our SAN expert's advice is to remove MPIO ???? :ermm: ... but I've helped configure and deploy dozens of SQL Clusters with MPIO before, and this is the first time I've seen this problem. Moreover, I believe removing MPIO will create Cluster validation issues and data corruption.

Perry Whittle SSC Guru Points: 233915 · Answer 3

sql-lover (4/23/2013)
Hi Perry,
The shared storage is a Dell Compellent SC8000 SAN, connected via iSCSI / MPIO to both nodes. The Windows Cluster runs on Win2008R2 SP1. MS-SQL runs SQL2012 Standard.

I'm assuming you're using the Microsoft iSCSI initiator?

Are you using the default MPIO driver or a Dell DSM?

If the MS driver, which policy are you using?

sql-lover (4/23/2013)
Our SAN expert's advice is to remove MPIO ???? :ermm: ....

Some expert huh? 😉

Without multipathing, things could be a whole lot worse.

sql-lover SSCoach Points: 18530 · Answer 4

The MCS policy was set to "round robin".

Now, I do believe we are using the default Microsoft MPIO driver, but where can I check that on Windows to confirm? I do not remember where ...

Also, I forgot to mention (and I was not aware of this until yesterday) that we do not have two switches but only one, and both nodes are connected to that same switch. That actually defeats part of MPIO's purpose, I think. Not sure why our IT resource set it up that way.

Perry Whittle SSC Guru Points: 233915 · Answer 5

sql-lover (4/24/2013)
The MCS policy was set to "round robin".

You use either MPIO or MCS, not both. So, are you using MCS or MPIO?

MCS is specific to the Microsoft iSCSI Initiator and uses a single session with multiple connections.

MPIO uses multiple sessions.
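
As a rough illustration of what a "round robin" policy means when several paths lead to the same disk, here is a toy Python model. It is only a sketch of the idea (rotate through the paths, skip any that are down); it is not how the Microsoft DSM or the iSCSI Initiator is implemented, and the class name and path labels are made up for the example.

```python
class RoundRobinPaths:
    """Toy model of a round-robin multipath policy (illustrative only)."""

    def __init__(self, paths):
        if not paths:
            raise ValueError("at least one path is required")
        self.paths = list(paths)
        self.healthy = set(self.paths)
        self._next = 0  # index of the path to try next

    def mark_failed(self, path):
        # A failed path is skipped until it is restored.
        self.healthy.discard(path)

    def mark_restored(self, path):
        if path in self.paths:
            self.healthy.add(path)

    def pick(self):
        """Return the next healthy path, rotating through all paths in order."""
        for _ in range(len(self.paths)):
            candidate = self.paths[self._next]
            self._next = (self._next + 1) % len(self.paths)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("all paths to the target are down")


# Hypothetical path labels: two NICs, two target ports.
rr = RoundRobinPaths(["nic1->portA", "nic2->portB"])
print([rr.pick() for _ in range(4)])  # alternates between the two paths
rr.mark_failed("nic1->portA")
print([rr.pick() for _ in range(2)])  # only the surviving path is used
```

With a single path there is nothing to rotate to, so one failure takes the disk away, which is the situation the cluster error describes.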

For more info on iSCSI see my article at this link.

sql-lover (4/24/2013)
The MCS policy was set to "round robin". Now, I do believe we are using the default Microsoft MPIO driver, but where can I check that on Windows to confirm? I do not remember where ...

Open the Microsoft iSCSI Initiator console, select the disk device and open its properties. You should see the MPIO button, which opens the MPIO properties to view or change.

sql-lover (4/24/2013)
Also, I forgot to mention (and I was not aware of this until yesterday) that we do not have two switches but only one, and both nodes are connected to that same switch. That actually defeats part of MPIO's purpose, I think. Not sure why our IT resource set it up that way.

When using storage multipathing, one would sort of hope that the hardware would be in place to support the topology; otherwise a switch hardware failure will leave MPIO redundant!!

You should have more than 2 switches for your iSCSI network. A typical topology would have at least 2 core switches, with edge switches feeding off these, to provide multiple redundant paths down to your storage. This is all detailed in my article linked above.
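
To make the single-switch problem concrete, here is a minimal Python sketch that checks a set of paths for switches whose failure would cut every path at once. The NIC and switch names are hypothetical, and the model deliberately ignores everything except which switches each path crosses.

```python
def surviving_paths(paths, failed_switch):
    """Return the paths that do not traverse the failed switch."""
    return [p for p in paths if failed_switch not in p["switches"]]


def single_points_of_failure(paths):
    """List every switch whose failure would cut all paths at once."""
    switches = sorted({s for p in paths for s in p["switches"]})
    return [s for s in switches if not surviving_paths(paths, s)]


# Hypothetical layout 1: two NICs, but both cabled to the same switch.
one_switch = [
    {"nic": "iscsi1", "switches": ["sw1"]},
    {"nic": "iscsi2", "switches": ["sw1"]},
]

# Hypothetical layout 2: each NIC goes through its own switch.
two_switches = [
    {"nic": "iscsi1", "switches": ["sw1"]},
    {"nic": "iscsi2", "switches": ["sw2"]},
]

print(single_points_of_failure(one_switch))    # ['sw1']
print(single_points_of_failure(two_switches))  # []
```

In the first layout MPIO still sees two paths, but both disappear together if the one switch fails.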

The whole point of multipathing is to allow Windows Server to host highly available local SAN disks; otherwise the OS would see the multiple paths as separate disk devices, which they are not.

With 10GbE available you're exceeding the capabilities of a standard FC setup 😉

sql-lover SSCoach Points: 18530 · Answer 6

Perry Whittle (4/24/2013)
When using storage multipathing, one would sort of hope that the hardware would be in place to support the topology; otherwise a switch hardware failure will leave MPIO redundant!!
You should have more than 2 switches for your iSCSI network. A typical topology would have at least 2 core switches, with edge switches feeding off these, to provide multiple redundant paths down to your storage. This is all detailed in my article linked above.
The whole point of multipathing is to allow Windows Server to host highly available local SAN disks; otherwise the OS would see the multiple paths as separate disk devices, which they are not.
With 10GbE available you're exceeding the capabilities of a standard FC setup 😉

You are correct and I understand that! It has been very difficult to explain and support my arguments though. I've been questioned a lot (knowing this by experience) and it is really FRUSTRATING! 🙁

Anyway, I appreciate the follow up. I can check those other settings you mention, I'll post once I get that ...

Perry Whittle SSC Guru Points: 233915 · Answer 7

sql-lover (4/24/2013)
It has been very difficult to explain and support my arguments though. I've been questioned a lot (knowing this by experience) and it is really FRUSTRATING! 🙁

Point them to my article 😉

sql-lover SSCoach Points: 18530 · Answer 8

Just in case someone else is reading this thread and faces a similar issue.

Our IT guy / SAN expert contacted Microsoft. He had a meeting with them and the Microsoft engineer reviewed the whole Cluster implementation. He did not find anything wrong with MS-SQL and its configuration but suggested these two OS changes:

-Change the network binding order. Put HeartBeat second and SAN last (SAN was 2nd and heartbeat the last one)

-Assign fixed IP values on the iSCSI initiator properties

I was absent from the meeting, so I do not understand the 1st suggestion. It is usually how I set up my Cluster implementations. I'll give the second one a try though.

Perry Whittle SSC Guru Points: 233915 · Answer 9

sql-lover (5/2/2013)
-Change the network binding order. Put HeartBeat second and SAN last (SAN was 2nd and heartbeat the last one)

I can sort of see why, but I don't see this as too relevant; the cluster communication can still take place over the public network (the default setting).

sql-lover (5/2/2013)
-Assign fixed IP values on the iSCSI initiator properties

Now this is relevant. Your heartbeat and iSCSI adapters should be set to not register themselves in DNS. Always provide fixed IP details to the initiator disk device connection to ensure the correct adapters are bound. My article linked above details this.
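
One way to sanity-check the "correct adapters are bound" point is to compare each session's addresses against the dedicated iSCSI subnet. The Python sketch below does that with the standard `ipaddress` module. The subnet, the IP addresses, and the session dictionaries are all hypothetical; in practice you would copy the initiator and target portal addresses from the iSCSI Initiator session details by hand.

```python
import ipaddress


def misbound_sessions(sessions, iscsi_subnet):
    """Return sessions whose initiator or target address is outside the iSCSI subnet."""
    net = ipaddress.ip_network(iscsi_subnet)
    flagged = []
    for s in sessions:
        initiator = ipaddress.ip_address(s["initiator_ip"])
        target = ipaddress.ip_address(s["target_ip"])
        if initiator not in net or target not in net:
            flagged.append(s)
    return flagged


# Hypothetical values: 10.10.10.0/24 is assumed to be the dedicated iSCSI network.
sessions = [
    {"name": "path-1", "initiator_ip": "10.10.10.11", "target_ip": "10.10.10.50"},
    {"name": "path-2", "initiator_ip": "192.168.1.20", "target_ip": "10.10.10.51"},
]

for s in misbound_sessions(sessions, "10.10.10.0/24"):
    print(s["name"], "is not using the dedicated iSCSI adapter")
```

A session flagged this way would be one where the initiator was left to pick its own source adapter and ended up on the public or heartbeat network.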

sql-lover SSCoach Points: 18530 · Answer 10

Perry Whittle (5/3/2013)
sql-lover (5/2/2013)
-Change the network binding order. Put HeartBeat second and SAN last (SAN was 2nd and heartbeat the last one)
I can sort of see why, but I don't see this as too relevant; the cluster communication can still take place over the public network (the default setting).
sql-lover (5/2/2013)
-Assign fixed IP values on the iSCSI initiator properties
Now this is relevant. Your heartbeat and iSCSI adapters should be set to not register themselves in DNS. Always provide fixed IP details to the initiator disk device connection to ensure the correct adapters are bound. My article linked above details this.

Agree.

So I will give the 2nd one a try (the 1st is done). That setting was initially configured by our IT resource, by the way.

Ramasankar Molleti Ten Centuries Points: 1044 · Answer 11

May 3, 2013 at 9:56 pm

#1612364

Did your problem get resolved?

sql-lover SSCoach Points: 18530 · Answer 12

sankar276 (5/3/2013)
Did your problem get resolved?

No.

My cluster has been stable for the past 7 days or so, but I have not changed the iSCSI setting yet.

Ramasankar Molleti Ten Centuries Points: 1044 · Answer 13

Seems to me it is an OS issue.

http://support.microsoft.com/kb/972797

Better test before changing.

sql-lover SSCoach Points: 18530 · Answer 14

sankar276 (5/3/2013)
Seems to me it is an OS issue.
http://support.microsoft.com/kb/972797
Better test before changing.

Interesting.

We already applied a hotfix, but it did nothing; the problem happened again. I'm not sure whether it was this same hotfix or not.

I'll take note though, thanks for the reply.