SQLServerCentral is supported by Redgate
 
Backup jobs (FULL or DIFFs) bring the Cluster down



sql-lover
OK,

So ... let me give a brief description of the problem, before going into details.

Three weeks ago, I got a cellphone alert from my monitoring system indicating that the SQL failover instance went down. When I connected and checked, SQL had failed over to the other node, and everything was up and running. After digging further, I found that one of the LUNs had been disconnected and, of course, without disks SQL fails. The issue did not recur that day, but it happened again the next day: the first incident was on a Saturday, the second on Sunday. I should say the LUNs/disks came back online by themselves after each event, so no manual intervention was needed. I just checked database recovery, and everything was up and running.

We waited a few days so our SAN guy could review and take action, and he decided to upgrade the SAN's firmware (Dell Compellent with automated tiering). Since then, the issue has gotten worse. Now I can't even run FULL or DIFF backups, because doing so disconnects the LUNs and triggers a failover, and the disks stay down afterwards; they no longer come back online by themselves. Backups of small databases run fine, but as soon as the job hits a medium-sized or big one... booom... down again.

Our SAN expert says it's an MPIO problem and he wants to get rid of MPIO altogether. Ermm...

I say this is a SAN firmware or OS driver problem, and we should not get rid of MPIO or keep changing the configuration. Instead, we should find the actual cause and fix it.

Here are the typical OS errors logged before SQL fails over to the passive node:


Connection to the target was lost. The initiator will attempt to retry the connection.



The initiator could not send an iSCSI PDU. Error status is given in the dump data.



\Device\MPIODisk25 is currently in a degraded state. One or more paths have failed, though the process is now complete.


In all my SQL cluster implementations, I have never seen SAN problems this frequent.

Right now I have horrible performance issues due to disk and iSCSI bottlenecks (which may be exacerbating the problem), and this problem has been added to the list of SAN problems I already have. I can't even run backups. It is really frustrating, to say the least.

Has anyone seen this issue before? I am running Windows Server 2008 R2 SP1 with SQL Server 2012 Standard Edition, also SP1.
PretendDBA
Have you logged a case with the hardware vendors of both the servers and the SAN? It might be worth checking that the NIC drivers are up to date. Maybe get MS involved as well, if the SAN team thinks it's MPIO related?
HanShi
Hi,
What happens if you copy a large file to the backup location? Does this also bring the SAN disk down? If so, it looks like the problem is caused by high I/O throughput and/or the SAN's buffer cache. If the problem also occurs with a plain file copy, you know it is not SQL related.
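
If a scripted test is easier to repeat than a manual copy, here is a minimal Python sketch of the same idea. It writes a large file in fixed-size chunks to a target folder and reports the throughput. The function name `write_test_file`, the file name `io_test.bin`, and the chunk size are placeholders I made up, not anything from your environment.

```python
import os
import time


def write_test_file(target_dir, total_bytes, chunk_bytes=4 * 1024 * 1024):
    """Write total_bytes to target_dir in chunks; return (path, seconds, MB/s)."""
    path = os.path.join(target_dir, "io_test.bin")  # hypothetical file name
    chunk = os.urandom(chunk_bytes)  # random bytes rather than zeros
    written = 0
    start = time.perf_counter()
    with open(path, "wb") as f:
        while written < total_bytes:
            n = min(chunk_bytes, total_bytes - written)
            f.write(chunk[:n])
            written += n
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push buffered data to the disk
    elapsed = time.perf_counter() - start
    mb_per_s = (written / (1024 * 1024)) / elapsed if elapsed > 0 else float("inf")
    return path, elapsed, mb_per_s
```

Something like `write_test_file(r"X:\backup_test", 20 * 1024**3)` would write roughly 20 GB; the drive letter and folder are only an example. If the LUN drops during the write, SQL Server is out of the picture.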

** Don't mistake the ‘stupidity of the crowd’ for the ‘wisdom of the group’! **
sql-lover
After checking online with a Dell representative, and checking with a former coworker as well, we are going to try the "flow control" setting on both sides: at the NIC level and, by running the command, on the actual switch the SAN is connected to.

Based on the errors, the backups (which, as any DBA knows, are pretty read-intensive) are overloading iSCSI: there is too much TCP/IP traffic on that SAN, and the LUNs get disconnected. Flow control may provide a way to slow the backup traffic down a bit and avoid the overload.

A second alternative (suggested by our SAN guy) is to remove MPIO for good. I am not a big fan of that idea, and I don't even know whether a SQL cluster can run without it or whether the cluster will break once MPIO is removed. But if he goes ahead, he will remove it from the passive node and reconfigure the NICs and related settings there, and then we will fail over to that node and test backups. I do not think this is doable though ...
ELLEN-610393
Did the experiment work, or did the problem continue?