Slow Disc Access on Single Cluster Node

  • Hi there,

    We currently have an issue with one of our SQL 2008 R2 Clusters. The cluster contains 2 virtual nodes with 2 SQL instances, and both of the nodes are identical in terms of build and updates/hotfixes etc. (The only difference is that the primary node has more memory and processors allocated to it, as the second node is just kept as a passive node).

    The issue is that when all the resource groups are running on the primary node, the disk performance seems very poor. For example, when the databases are backed up (whether to another server or even to a local drive) it runs slowly, as the read/write performance is so slow. Also, even when just copying and pasting files between drives on the server it is very slow, so it does not appear to be just a SQL issue.

    Initially it appeared there was an issue with the disks, but when the cluster is failed over to the secondary node the read/write performance is suddenly fine, so it seems the issue is with the primary node.

    We have removed, rebuilt and re-added the node to the cluster but the issue remains. I'm not quite sure where to go next....any thoughts?

    Thanks,

    Matt

  • matt.gyton (4/23/2013)

    ... both of the nodes are identical in terms of build and updates/hotfixes etc. (The only difference is that the primary node has more memory and processors allocated to it, as the second node is just kept as a passive node). ...

    Both nodes should have the same RAM and CPU, especially on a two-node cluster. Yes, you can use different CPU and RAM specs, but in my personal experience that's not recommended for a two-node cluster.

    Having more RAM will give you a boost in performance. The node with more RAM will do less paging to disk, which makes the server faster so it won't feel so slow. Also, if that node has a faster CPU, it will process things faster.

    Schedule a downtime window, if you can, and run SQLIO. Follow the instructions on this link, given by Brent Ozar: http://www.brentozar.com/archive/2008/09/finding-your-san-bottlenecks-with-sqlio/

    Put the same amount of RAM on both nodes, then check again.

  • Thanks - I'll give SQLIO a try. I've heard of it several times but never used it - that guide looks good though!

    The irony is that the slow node is the one with twice the memory and three times the number of virtual CPUs of the other node, and it's the smaller node that runs fine.

  • Can you provide a little more info on the storage itself and the connectivity to the storage from each node?

    -----------------------------------------------------------------------------------------------------------

    "Ya can't make an omelette without breaking just a few eggs" 😉

  • The storage consists of separate RAID arrays with high-speed discs for Data (4 discs, RAID10), Logs (2 discs, RAID1), and Quorum and MSDTC (2 discs, RAID1). The nodes are virtual and each runs on a separate ESX host. Both hosts are connected to the storage via Fibre Channel. Both nodes map to the same RDMs, and the servers' local discs (i.e. the C: drives) also exist on the same datastore.

    That's about as far as my knowledge of the storage goes I'm afraid...!

    I ran SQLIO on both nodes and got the following results, which just confirms my suspicions:

    SQLIO test              Node 1 (problem node)     Node 2
                            IOs/sec    MBs/sec        IOs/sec    MBs/sec
    8KB random writes        971.79       7.59        1149.83       8.98
    8KB random reads         887.03       6.92        1645.81      12.85
    64KB sequential writes   157.00       9.81        2319.13     144.94
    64KB sequential reads    154.86       9.67        2081.33     130.08

    I have now caved in and logged a Support Call with Microsoft as I think I have exhausted all my ideas!
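
For anyone comparing the two sets of SQLIO figures, here is a rough sketch (not part of the original posts) that checks the reported MBs/sec against IOs/sec times block size, and works out how far apart the nodes are on each test. The numbers are copied from the results above. The names `RESULTS`, `throughput_mb` and `node_ratio` are made up for illustration. By this arithmetic, Node 1 stays within about 2x of Node 2 on the 8KB random tests, but is roughly 13 to 15 times slower on the 64KB sequential tests, where it tops out just under 10 MB/s.

```python
# SQLIO figures as posted in this thread: (IOs/sec, MBs/sec) per test and node.
RESULTS = {
    "8KB random writes":      {"node1": (971.79, 7.59), "node2": (1149.83, 8.98)},
    "8KB random reads":       {"node1": (887.03, 6.92), "node2": (1645.81, 12.85)},
    "64KB sequential writes": {"node1": (157.00, 9.81), "node2": (2319.13, 144.94)},
    "64KB sequential reads":  {"node1": (154.86, 9.67), "node2": (2081.33, 130.08)},
}


def block_kb(test_name):
    """Extract the IO size in KB from a test name such as '64KB sequential reads'."""
    return int(test_name.split("KB")[0])


def throughput_mb(iops, kb):
    """Throughput in MB/s implied by an IO rate and an IO size in KB."""
    return iops * kb / 1024.0


def node_ratio(test_name):
    """How many times faster Node 2 is than Node 1, based on reported MBs/sec."""
    node1_mb = RESULTS[test_name]["node1"][1]
    node2_mb = RESULTS[test_name]["node2"][1]
    return node2_mb / node1_mb


for name, nodes in RESULTS.items():
    implied = throughput_mb(nodes["node1"][0], block_kb(name))
    print(f"{name}: node1 implied {implied:.2f} MB/s, "
          f"node2 is {node_ratio(name):.1f}x node1")
```

A gap that shows up mainly on large sequential IO rather than small random IO is the kind of detail worth passing on to whoever looks at the host and storage path.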

  • Hi Matt,

    Is there any update on this issue? Did you get it sorted out?

  • Hi - I'm still waiting for a callback from Microsoft but I will certainly give you an update once it's resolved...

  • Firstly, I would start with the ESX hosts: check the HBA types and settings on each one.

  • Please check the ESX hosts: the HBAs (Fibre Channel), the multipathing settings and the drivers.
