Backup jobs (FULL or DIFFs) bring the Cluster down

Question

sql-lover

SSCoach

Points: 18530
August 16, 2013 at 9:17 am

#276281

OK,
So ... let me give a brief description of the problem, before going into details.
Three weeks ago, I got a cellphone alert from my monitoring system indicating that the SQL failover instance had gone down. When I connected and checked, SQL had failed over to the other node, and everything was up and running. After digging further, I found that one of the LUNs had been disconnected and, of course, without disks SQL fails. The issue did not recur that day, but it happened again the next day. The first incident was on a Saturday, the second on a Sunday. I should say the LUNs/disks came back online by themselves after each event, so no manual intervention was needed. I just checked database recovery and everything was up and running.
We waited a few days so our SAN guy could review and take action, and he decided to upgrade the SAN's firmware (DELL Compellent with automated tiering). Since then, the issue has become worse. Now I can't even run FULL or DIFF backups, because they disconnect the LUNs and trigger a failover, and the disks stay down afterwards; they no longer come back online on their own. I tried running backups on small databases and they go fine, but when the job hits a medium-sized or big one... booom... down again.
Our SAN expert says it's an MPIO problem and he wants to get rid of it. :ermm:
I say this is a SAN firmware or OS driver problem, and we should not get rid of MPIO or keep changing the configuration. Instead, we should address the problem and fix it.
Here are the typical OS errors logged before SQL fails over to the passive node:
Connection to the target was lost. The initiator will attempt to retry the connection.
The initiator could not send an iSCSI PDU. Error status is given in the dump data.
\Device\MPIODisk25 is currently in a degraded state. One or more paths have failed, though the process is now complete.
I've never seen so many SAN problems so frequently in my life with any of my SQL Cluster implementations, ever.
Right now I have a horrible performance issue due to disk and iSCSI bottlenecks (which may be exacerbating the problem), and this has been added to the list of SAN problems I have. I can't even run backups. It is really frustrating, to say the least.
Has anyone seen this issue before? I am running Windows Server 2008 R2 SP1 with SQL Server 2012 Standard Edition, SP1 as well.

PretendDBA SSCrazy Points: 2060 · Answer 1

Have you logged a case with the hardware vendors of both the servers and the SAN? It might be worth checking that the NIC drivers are up to date. Maybe get MS involved as well if the SAN team thinks it's MPIO-related?

HanShi SSC-Dedicated Points: 33506 · Answer 2

Hi,

What happens if you copy a large file to the backup location? Will this also bring the SAN disk down? If so, it looks like the problem is caused by large I/O throughput and/or buffer cache of the SAN. If the problem also occurs with a file copy action, you know it is not SQL related.
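To make that test repeatable, here is a minimal sketch in Python of a large sequential write to the backup location. It is not something from the original thread; the function name `write_test_file`, the default sizes, and the example path in the comment are all assumptions for illustration. It only approximates the write side of a backup and does not reproduce SQL Server's own I/O pattern.

```python
import os
import tempfile
import time


def write_test_file(target_dir, total_mb=1024, chunk_mb=4):
    """Write about total_mb of random data into target_dir.

    Returns (path, throughput in MB/s). Hypothetical helper for illustration.
    """
    chunk = os.urandom(chunk_mb * 1024 * 1024)
    chunks = max(1, total_mb // chunk_mb)
    # Create a uniquely named file inside the directory under test.
    fd, path = tempfile.mkstemp(prefix="io_test_", suffix=".bin", dir=target_dir)
    start = time.perf_counter()
    with os.fdopen(fd, "wb") as f:
        for _ in range(chunks):
            f.write(chunk)
        f.flush()
        os.fsync(f.fileno())  # push the data past the OS cache to the disk
    elapsed = time.perf_counter() - start
    written_mb = chunks * chunk_mb
    rate = written_mb / elapsed if elapsed > 0 else float("inf")
    return path, rate


# Example call (the path is an assumption; use the real backup LUN):
#   path, rate = write_test_file(r"X:\Backup", total_mb=20480)
#   print(f"wrote {path} at {rate:.1f} MB/s")
#   os.remove(path)
```

If the LUN drops during a run like this, SQL Server is out of the picture. If it survives, try a larger `total_mb` before drawing conclusions, since the small backups also completed without trouble.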

** Don't mistake the ‘stupidity of the crowd’ for the ‘wisdom of the group’! **

sql-lover SSCoach Points: 18530 · Answer 3

After checking online with a DELL representative, and checking with a former coworker too, we are going to try the "flow control" setting on both sides: at the NIC level and on the actual switch the SAN is connected to.

Based on the errors, the backups (which any DBA knows are pretty read-intensive) are overloading the iSCSI connections because there is too much TCP/IP traffic on that SAN, which disconnects the LUNs. Flow control may provide a way to slow down the backup traffic a bit and avoid the overload.

A second alternative (suggested by our SAN guy) is removing MPIO for good. I am not a big fan of that, and I don't even know whether a SQL Cluster can run without it or whether the Cluster will break after MPIO is removed. If he goes ahead, he will remove it from the passive node and reconfigure the NICs and related settings there; then we will fail over to that node and test backups. I do not think this is doable, though...

ELLEN-610393 Ten Centuries Points: 1244 · Answer 4

September 17, 2013 at 9:45 am

#1650840

Did the experiment work, or did the problem continue?