Clustering conundrum

  • Hi,

    We are currently planning to upgrade and consolidate 2 of our active/passive database clusters into a 3 node N+1 cluster using larger, more powerful nodes. Between them, the 2 clusters host approximately 140 databases of various sizes, from 500MB up to 200GB, with a total of 1.2TB of data, and at peak times they process approximately 1500 batch requests per second. The aim of the upgrade is to provide additional processing resource, as the number of databases we need to host is constantly growing, while also reducing the amount of money we spend on the hardware that provides the redundancy in the environment (i.e. cutting the number of passive nodes from 2 to 1). I'm thinking of using 4 socket servers (e.g. Dell R910) but starting with 2 CPUs in each node and adding additional RAM, CPUs and SQL licences as requirements grow. Once this project is complete we are also planning to migrate databases from 2 further A/P clusters to the new multi-node cluster.
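
    For reference, the peak figure comes from sampling the Batch Requests/sec counter from the DMVs rather than PerfMon. A minimal sketch of how we take the measurement (the 10 second window is arbitrary; the counter is cumulative, so it has to be read twice):

    ```sql
    -- Sample the cumulative Batch Requests/sec counter twice, 10 seconds
    -- apart, and work out the per-second rate over that interval.
    DECLARE @first BIGINT, @second BIGINT;

    SELECT @first = cntr_value
    FROM sys.dm_os_performance_counters
    WHERE counter_name = 'Batch Requests/sec'
      AND object_name LIKE '%SQL Statistics%';

    WAITFOR DELAY '00:00:10';

    SELECT @second = cntr_value
    FROM sys.dm_os_performance_counters
    WHERE counter_name = 'Batch Requests/sec'
      AND object_name LIKE '%SQL Statistics%';

    SELECT (@second - @first) / 10 AS batch_requests_per_sec;
    ```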

    A second option that I have been considering is to start off small and expand the cluster horizontally as resource requirements grow. This would mean starting with a 3 node N+1 cluster built from smaller nodes and adding further nodes as processing requirements dictate, so we could end up with a 6 node N+2 cluster in a year or two.

    So far I have not worked on anything more complicated than 2 node active/passive clusters, and I was wondering if anyone had any experience or advice they could share on the potential advantages and pitfalls of working with large multi-node clusters?

    My feeling is that we would get a better return on our investment using the larger nodes, and that the administration of a 3 node cluster would be significantly easier than a 6 node cluster. However, my concern with the larger nodes is that any problem that occurs on a node will affect many more clients than it would with the smaller nodes.

    Any advice would be great!!

    Thanks,

    Dave

  • I've recently had to support a 4 node cluster (all active) that was built, and is still being built, using your plan of adding nodes over time. In general, administration once built shouldn't be a big issue as long as the build is consistent and well documented. That also means keeping the documentation up to date as nodes are added.

    A few recommendations I'd make based on my experience are:

    1) Keep the design and naming as simple and consistent as possible. The kind of thing that caused problems for me was that the server names and SQL Server instance names didn't always correspond, so coming in as a consultant I had to use a cheat sheet for weeks to find my way around the various nodes. (There's a quick way to check this from inside an instance; see the first sketch after this list.)

    2) If possible, match IP addresses to names, e.g. xxx.xx.xx.01 relates to Servername\instance01, or something along these lines.

    3) Have a definite road map for the whole process. The cluster I came in to support seemed to have run out of steam, and in another year or so they would need to redesign. The eventual plan was to get to 8 nodes; I don't think they will get there before the hardware is obsolete.

    4) Plan very carefully how you balance the databases across the nodes. Any single node needs to be able to carry the load of all the instances it may have to support, or very nearly. In a disaster people may accept a drop in performance, but clustering is for high availability, and users tend to expect the same level of performance. (The second sketch after this list is one way to size this up.)

    5) Remember that clusters with more than 2 nodes require Enterprise Edition. Is this what you really want? Why not stick with Standard Edition, go to SQL 2008 R2 and run 2 Active/Active nodes? This may be more cost-effective and just as easy to manage, if not easier. It would also give you "smaller nodes" and resolve the client impact issue you raise. If it were my choice I would tend to go for multiple 2 node A/A clusters. I think this gives a bit more flexibility in terms of new hardware, upgrades etc. and less interdependence.

    6) Watch out for putting DEV/Test/UAT systems on servers with production nodes. The primary problems I saw were IO and CPU contention during testing. Some of this can be managed, but it takes extra admin and design. Also watch out for mixing data warehouse and OLTP databases on the same node.
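
    On point 1, here is a minimal sketch of the kind of check I mean, using standard SERVERPROPERTY values (SQL 2005 onwards). On a cluster it shows the virtual name clients connect to alongside the physical node actually hosting the instance, which would have saved me the cheat sheet:

    ```sql
    -- Map the instance clients see to the physical node hosting it right now.
    -- MachineName is the virtual network name; ComputerNamePhysicalNetBIOS
    -- is the cluster node the instance is currently running on.
    SELECT
        SERVERPROPERTY('MachineName')                 AS virtual_server_name,
        SERVERPROPERTY('InstanceName')                AS instance_name,
        SERVERPROPERTY('ComputerNamePhysicalNetBIOS') AS current_physical_node;
    ```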
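
    On point 4, a rough sketch for sizing up what each instance carries, using sys.master_files. Run it on every instance and add the numbers up for each failover scenario you need to survive. Size is only a rough proxy for load, but it is a starting point for the balancing exercise:

    ```sql
    -- Total allocated size per database, largest first. size is reported
    -- in 8KB pages, so multiply by 8 and divide by 1024 to get MB.
    SELECT
        DB_NAME(database_id)                          AS database_name,
        CAST(SUM(size) * 8 / 1024.0 AS DECIMAL(10,1)) AS size_mb
    FROM sys.master_files
    GROUP BY database_id
    ORDER BY size_mb DESC;
    ```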

    Large clusters sound impressive, but they are definitely a bit more complex to manage than multiple small clusters, and I tend to think you need a really good reason to go for a big cluster. If all you are doing is reducing the number of passive nodes, go to Active/Active. As long as you keep an eye on the total average CPU load across the 2 nodes (ignoring peaks) and have the RAM and NIC capacity to run on 1 node only, you should be sweet. And you reduce the passive nodes from 2 to 0.
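
    One way of keeping that eye on CPU without setting up a PerfMon collection: the scheduler monitor ring buffer records utilisation once a minute, so a sketch along these lines (SQL 2005 SP2 or later), run on both nodes, gives you the recent figures to add together:

    ```sql
    -- Last 60 one-minute CPU samples from the scheduler monitor ring buffer:
    -- SQL Server's own CPU, plus whatever else is running on the node.
    SELECT TOP (60)
        record.value('(./Record/@id)[1]', 'int') AS record_id,
        record.value('(./Record/SchedulerMonitorEvent/SystemHealth/ProcessUtilization)[1]', 'int') AS sql_cpu_pct,
        100
          - record.value('(./Record/SchedulerMonitorEvent/SystemHealth/SystemIdle)[1]', 'int')
          - record.value('(./Record/SchedulerMonitorEvent/SystemHealth/ProcessUtilization)[1]', 'int') AS other_cpu_pct
    FROM (
        SELECT CONVERT(xml, record) AS record
        FROM sys.dm_os_ring_buffers
        WHERE ring_buffer_type = N'RING_BUFFER_SCHEDULER_MONITOR'
          AND record LIKE N'%<SystemHealth>%'
    ) AS rb
    ORDER BY record_id DESC;
    ```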

    Sorry, this is written a bit like I would present it to the business rather than to a DBA. I hope it helps.

    Leo

    Nothing in life is ever so complicated that with a little work it can't be made more complicated.

  • Thanks for the reply Leo.

    I hadn't considered A/A clusters before, as we have always wanted some level of passive redundancy, but now I'm not sure that it is worth the additional cost for the number of times they get used. I'm going to look into our options for running multiple small A/A clusters like you suggested. I think they could offer the level of flexibility and resilience that we need.

    You also raised a valid point on the hardware lifetime: with the current pace of hardware development, I suspect that by the time we got to the point of needing more nodes there would be a new generation of servers/CPUs available, which would offer further improvements in ROI.

    Regards

    Dave

  • Hi,

    I actually run a 2 node A/A cluster with 1200 databases on each instance, with dual quad cores and 32GB of RAM in each node. It works out pretty well, but it all depends on your SQL load.

  • dave.hobbs (7/13/2010)

    Thanks for the reply Leo.

    I hadn't considered A/A clusters before, as we have always wanted some level of passive redundancy, but now I'm not sure that it is worth the additional cost for the number of times they get used.

    Dave

    Interesting comment about the "level of passive redundancy". I can't really see any significant benefit in running an A/P cluster 100% of the time unless you particularly want to, or you only have 1 SQL instance. A/A gives much better use of the hardware and more flexibility. About the only time you need to go A/P is for patching and failover testing.
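
    For what it's worth, you can confirm the node layout and where an instance is actually running straight from T-SQL, which is handy for checking that a failover for patching really moved things. A quick sketch:

    ```sql
    -- List the cluster nodes that can host this instance...
    SELECT NodeName FROM sys.dm_os_cluster_nodes;

    -- ...and confirm which node it is on right now.
    SELECT
        SERVERPROPERTY('ComputerNamePhysicalNetBIOS') AS current_node,
        SERVERPROPERTY('IsClustered')                 AS is_clustered;
    ```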

    Cheers

    Leo


  • Hi Leo,

    I think we always just opted for the safest option, and having the passive node gave us the reassurance that we were completely covered, but now I'm not convinced that it is worth the money. I'm fairly sure that A/A is the way to go, and I'm going to explore the costs with our hosting provider. Now all I need to do is put together a convincing proposal for management!

    Thanks

    Dave
