sql server 2012 cluster take way too long to backup/restore

  • Hi experts,

    I need some directions on what sort of things should I look into why backup and restore take way too long, e.g. 130GB db take 4 hrs to backup and 4hrs to restore.

    We have 2 brand new physical servers with identital hw (12-core, 288GB ram). We use VMware to create 1 virtual machine on each server and give 8-core and 260GB ram to each machine. From these 2 virtual machines, I just setup 2-node active/active failover cluster on windows 2008R2 and SQL Server 2012 Enterprise. Everything goes smoothly with installation and setup. I test out failover and stuff and things work great. I configure the windows/sql server:

    - enable "Instant File Inititalzation"

    - set sql server 100GB max mem

    - 4 tempdb files

    Our plan is to replace the current single sql 2005 with this 2012 failover cluster.

    When I start poking around with backup/restore on this cluster, i notice it take way too long, e.g. 130GB db take 4 hrs to backup and 4hrs restore to the SAN. First I blame the SAN for IO latency, but when I try to back up to the 'nul' device, it take around the same time too. These 2 new servers are on the same VLAN as the old server. The old server is 4-core 32GB ram only takes 10ms per read/write. These 2 new servers take around 80s per read/write. These 2 new servers connect to SAN via fibre channel with 1Gbs connection. I also poke around with backup/restore options such as BufferCount, MaxTransferSize, BlockSize, compression, etc... but it does not make any big difference. Of course from sql error log, i got lots of warning about 'IO take more than 15s'. I even try to backup/restore to/from multiple files and still does not help. I notice the restore speed only 8MB/sec (e.g. RESTORE DATABASE successfully processed 14701610 pages in 14676.626 seconds (7.825 MB/sec))

    Just a note, when I copy the 130GB file(s) from SAN to the server, take me around 20 min. BTW, all disks storage from server are allocated from the SAN, including C: drive.

    Another note, I've applied CU3 to sql server 2012 cluster

    What else should I look into? please help.

    Thanks very much,

    Thanh

  • I've never noticed clustering affecting backup speed before. So, I'd be more suspicious of the virtual machines preventing you from getting proper throughput to the SAN, not the cluster causing the problem.

    By the way, why bother with SQL Server clustering if we're just talking two VMs on the same virtual box. Clustering is meant as a mechanism to cover for hardware failure. In this instance, if your hardware fails, both virtual servers are gone.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Thanh Ngay Nguyen (9/16/2012)


    Of course from sql error log, i got lots of warning about 'IO take more than 15s'.

    This is a clear indication that something is wrong. Without knowing a lot about your setup it's going to be diffcult to provide any useful help. Start by checking all routes through to the storage and try to get some performance metrics both through windows and the VMWare console.

    Grant Fritchey (9/17/2012)


    By the way, why bother with SQL Server clustering if we're just talking two VMs on the same virtual box. Clustering is meant as a mechanism to cover for hardware failure. In this instance, if your hardware fails, both virtual servers are gone.

    They're not, each VM's is on a separate host. At least that is what the OP has indicated.

    Clustering across hosts can be useful and provide redundancy at the VM OS layer rather than the Host OS layer.

    -----------------------------------------------------------------------------------------------------------

    "Ya can't make an omelette without breaking just a few eggs" 😉

  • Perry Whittle (9/17/2012)


    Grant Fritchey (9/17/2012)


    By the way, why bother with SQL Server clustering if we're just talking two VMs on the same virtual box. Clustering is meant as a mechanism to cover for hardware failure. In this instance, if your hardware fails, both virtual servers are gone.

    They're not, each VM's is on a separate host. At least that is what the OP has indicated.

    Clustering across hosts can be useful and provide redundancy at the VM OS layer rather than the Host OS layer.

    Oops, misunderstood. And I'm fine with clustering the VM, just missed the two machines part.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Our SAN administrator monitors the NetApp SAN while I'm doing the db restore and he sees the IO latency is around 10-20ms which is normal, but he sees very little activity hitting the SAN and only sees the write speed around 8-9MB/s which is consistent with what I see from the sql server side. I've used the following tools to measure IO performance from the windows/sql side:

    - Windows perfmon: Avg disk sec/write, Avg disk sec/read, CPU, mem

    - sqliosim

    - sql server error log

    all those 3 gives me very similar IO result which is around 80-100 sec per read/write.

    Yes. I understand this can be million things can go wrong with this issue. I just need some general guideline/direction on where else should i look into.

    Thanks,

    Thanh

  • Try using multiple destination files for your backup. Also, you haven't mentioned if you're using compression or not. These should significantly help the backup and restore speeds. I'm not saying there might not be some underlying system issue, but these are things I usually do to speed up backups/restores in any case.

  • To thoroughly test the storage for I\O patterns use SQLIO not sqliosim

    -----------------------------------------------------------------------------------------------------------

    "Ya can't make an omelette without breaking just a few eggs" 😉

  • Yes I did use backup/restore to/from multiple files and yes with compression, but all these didn't make any difference.

    Anyways, I got the issue resolved. We found out the problem was at the VMWare. Let me share my resolution here just in case anyone out there run into same issue i got. Here's the comments from VMWare consultant guy who helped us resolve the issue:

    The configuration of the VMware / NetApp MPIO had to be modified to work with Microsoft Cluster Services. The default is ALUA with Round Robbin PSP on the vSphere hosts. MSCS places permanent scsi reservations on LUNs and requires only a single active path per device per host. All IO on the "other" active path in RR would result in Blocked IO errors in the /var/log/messages.

Viewing 8 posts - 1 through 7 (of 7 total)

You must be logged in to reply to this topic. Login to reply