So ... let me give a brief description of the problem, before going into details.
Three weeks ago, I got a cellphone alert from my monitoring system indicating that the SQL failover instance had gone down. When I connected and checked, SQL had failed over to the other node and everything was up and running. After digging further, I found that one of the LUNs had been disconnected and, of course, without disks SQL fails. The issue did not recur that day, but it happened again the next day: the first occurrence was on a Saturday, the second on Sunday. I should say the LUNs/disks came back online on their own after each event, so no manual intervention was needed. I just checked database recovery, and everything was up and running.
We waited a few days so our SAN guy could review and take action, and he decided to upgrade the SAN's firmware (Dell Compellent with automated tiering). Since then, the issue has become worse. Now I can't even run FULL or DIFF backups, because the backup disconnects the LUNs and triggers a failover, and the disks stay down after that; they no longer come back on their own. I tried running backups on small databases and they go fine, but when the job hits a medium or large database... boom... down again.
Our SAN expert says it's an MPIO problem and he wants to get rid of it.
I say this is a SAN firmware or OS driver problem, and we should not get rid of MPIO or keep changing the configuration. Instead, we should find the root cause and fix it.
Here are the typical OS errors logged before SQL fails over to the passive node:
Connection to the target was lost. The initiator will attempt to retry the connection.
The initiator could not send an iSCSI PDU. Error status is given in the dump data.
\Device\MPIODisk25 is currently in a degraded state. One or more paths have failed, though the process is now complete.
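To see how often each of these three errors shows up before a failover, and which MPIO disks are affected, a rough Python sketch like the one below could tally them. It assumes the System event log has been exported to plain text with one message per line. That format, and the names `classify` and `tally`, are assumptions for illustration only, not anything Windows or the SAN tooling provides.

```python
import re
from collections import Counter

# Substrings taken from the three messages quoted above.
# The category names on the left are made up for this sketch.
PATTERNS = {
    "connection_lost": re.compile(r"Connection to the target was lost"),
    "pdu_send_failed": re.compile(r"could not send an iSCSI PDU"),
    "mpio_degraded": re.compile(
        r"\\Device\\(MPIODisk\d+) is currently in a degraded state"
    ),
}


def classify(line):
    """Return (category, device) for a known message, or None otherwise."""
    for name, pattern in PATTERNS.items():
        match = pattern.search(line)
        if match:
            # Only the MPIO pattern captures a device name.
            device = match.group(1) if match.groups() else None
            return name, device
    return None


def tally(lines):
    """Count messages per category and degraded events per MPIO disk."""
    counts = Counter()
    devices = Counter()
    for line in lines:
        hit = classify(line)
        if hit is None:
            continue
        name, device = hit
        counts[name] += 1
        if device:
            devices[device] += 1
    return counts, devices


if __name__ == "__main__":
    # Assumed input: the three messages above, one per line.
    sample = [
        "Connection to the target was lost. The initiator will attempt to retry the connection.",
        "The initiator could not send an iSCSI PDU. Error status is given in the dump data.",
        r"\Device\MPIODisk25 is currently in a degraded state. One or more paths have failed, though the process is now complete.",
    ]
    counts, devices = tally(sample)
    print(dict(counts))
    print(dict(devices))
```

If one disk dominates the degraded-state count, that would point at a specific LUN or path rather than at MPIO as a whole.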
I have never seen SAN problems this frequent in any of my SQL cluster implementations.
Right now I have horrible performance issues due to disk and iSCSI bottlenecks (which may be exacerbating the problem), and this has now been added to my list of SAN problems. I can't even run backups. It is really frustrating, to say the least.
Has anyone seen this issue before? I am running Windows Server 2008 R2 SP1 with SQL Server 2012 Standard Edition, also SP1.