High Availability Query

  • We are planning to migrate SQL Server from 2012 to 2017. Currently we have pre-prod and prod environments. Ideally it would be good to have dev, test, pre-prod and prod, but the decision has already been made to replicate a similar environment.

    Now the question is around High Availability. We don't currently have it, but would like to in the new environment. The reason is that in the past it took two months to get a server ready after one failed. That was when it was a traditional on-site SQL Server; with Azure this has changed significantly, and Microsoft can provide a server within hours if needed.

    I have the following queries:

    1. Am I right in thinking that when an Azure server fails, getting a new one is very quick, so we could avoid HA?
    2. If we get HA for prod, do we really need HA for pre-prod? I assume we can have the same spec on both servers but without HA on pre-prod, and it shouldn't create any issue of something not working on prod when deployed. The concern is that both servers need to be the same to minimise the risk, but HA is a very different piece of functionality and shouldn't affect any deployment.
    3. I assume the cost is significantly higher with HA on any server.

    Thanks

  • What are you thinking of with Azure? Be careful about talking about servers in the cloud without specifying exactly what you mean. The nomenclature gets confusing, and there are lots of options.

    If you are planning on a virtual machine in Azure to run SQL Server, you can spin up a new one in minutes: literally a 64-core, 400GB RAM machine running SQL Server. Deployment time before you can access it is probably on the order of 10-15 minutes.

    That doesn't mean you can necessarily avoid HA. You might be able to reduce some of the need, but get an architect to help you ensure you can get the disks set up correctly. There is also the time to reconfigure and restore; understand the cost of downtime versus the cost of something like an Availability Group (AG) setup.
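
    To put a number on the "reconfigure and restore" time, below is a minimal T-SQL sketch of restoring from a backup held in Azure blob storage. All names are hypothetical, and it assumes a SAS-based credential for the container URL already exists on the new instance:

        -- Restore the most recent full backup, leaving the database ready for
        -- log backups; restore time grows with database size, regardless of
        -- how quickly the VM itself was provisioned.
        RESTORE DATABASE [YourDb]
        FROM URL = 'https://yourstorageacct.blob.core.windows.net/backups/YourDb.bak'
        WITH NORECOVERY, STATS = 10;

        -- Apply any log backups here, then bring the database online:
        RESTORE DATABASE [YourDb] WITH RECOVERY;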

    For pre-prod, depending on what your HA is, you likely don't need it.

    High is relative. Until you understand the cost of uptime/downtime and labor, it's hard to decide what counts as high. HA is more expensive, but whether that's high depends on what it means to your organization.

  • I would not under any circumstances run without HA, even if you are using Azure virtual machines. There have been a couple of outages in Azure that took the main machines down, or made them impossible to boot, and then also made backup storage inaccessible for days.

    I'd recommend running a node in AWS and one in Azure. If you have the money for only one site, I would recommend AWS. AWS will be more reliable, and the quality of their standard support is so much better than the best you can buy from Azure that it is absolutely ridiculous.

    The notion that a new machine can be spun up very quickly to replace an existing VM is sound logic under normal circumstances, but there have been cases where machines couldn't be provisioned at all.

    For HA in staging, it's up to you to determine whether you need it. I have only ever set up HA outside of production when doing trial runs of patches and application deployments. HA can sometimes introduce other nuances to patching, and if there is a peculiarity in your environment that causes some unexpected behavior to be revealed, it can be useful to figure that out there before you do it in production.

    Cost doesn't really need to be that high. If you use BYOL and have Software Assurance with License Mobility, you'd only need to license your live nodes. SQL Server Developer Edition is free, and you could keep your non-production machines turned off most of the time.

  • dva2007 wrote:

    1. Am I right in thinking that when an Azure server fails, getting a new one is very quick, so we could avoid HA?
    2. If we get HA for prod, do we really need HA for pre-prod? I assume we can have the same spec on both servers but without HA on pre-prod, and it shouldn't create any issue of something not working on prod when deployed. The concern is that both servers need to be the same to minimise the risk, but HA is a very different piece of functionality and shouldn't affect any deployment.
    3. I assume the cost is significantly higher with HA on any server.

    1. Failover in Azure is going to be very much like failover within your local environment; there's nothing magically faster up in the cloud (assuming we're talking Availability Groups). That's also assuming failover within a region. If you cross regions (not a bad idea), the failover time will go up, but you'll be getting all the benefits of multiple regions, so it's worth the time. If you start to talk about platform as a service, Azure SQL Database, and the capabilities there, the failover time is higher, but again, their HA solution (not counting the internal design HA capabilities) is multi-region. See the failover sketch after this list.

    2. Tougher answer. Development environments are the production environments for the development team. Should they be set up the same as your production system? Probably not. Does that mean no protections or plans for HA/DR of any kind? Also probably not. You have to strike a balance here. For most people, in most circumstances, no, your Continuous Integration server, just as an example, doesn't need an HA solution. However, you do need to define what degree of pain comes with the CI server being offline for an hour, a day, a week, and then plan accordingly.

    3. Yes. Same as with on-premises. More servers, more money, more licensing, more management costs. Cost of doing business.
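
    To make point 1 concrete, here is a minimal T-SQL sketch of a manual Availability Group failover. The AG name is hypothetical, and both statements are run on the secondary replica you want to promote:

        -- Planned failover with no data loss (run on a synchronized secondary):
        ALTER AVAILABILITY GROUP [YourAg] FAILOVER;

        -- Last resort when the primary is gone and the secondary is not
        -- synchronized; this can lose data, so treat it as a DR-only action:
        ALTER AVAILABILITY GROUP [YourAg] FORCE_FAILOVER_ALLOW_DATA_LOSS;

    Cross-region secondaries typically run in asynchronous commit mode, which is part of why cross-region failover takes longer.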

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • BrownCoat42 wrote:

    I'd recommend running a node in AWS and one in Azure. If you have the money for only one site, I would recommend AWS. AWS will be more reliable, and the quality of their standard support is so much better than the best you can buy from Azure that it is absolutely ridiculous.

    Do you have hard numbers from an independent source to back up these statements? I've seen lots of evaluations done by reputable third parties, and when it comes to a straight VM-to-VM comparison, they're pretty much equal on speed, cost and reliability.

    There have been a number of large-scale outages of AWS. The one where a developer took down the entire eastern seaboard comes to mind. I'm not knocking AWS, but I'd need to see hard numbers to back up the blanket statements you made here.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Personal experience. And I mean you can just compare the instances. With all due respect to any reputable third party performing performance-to-cost analysis, all of the ones I have seen so far universally tend to ignore pretty critical metrics like the disk I/O available in an instance or the processor type in an instance. There is an absolutely massive difference between a general-purpose AWS m5.xlarge instance with 4 cores and 16 GB of RAM that supports about 400 megabytes per second of disk I/O, using a Xeon Scalable Platinum series processor, and a general-purpose Azure D4s_v3 instance with 4 cores and 16 GB of RAM with a max of about 96 megabytes per second of disk I/O, using a Broadwell-based Xeon, even though Azure is only about 10 dollars per month more expensive. There is also a pretty big difference among the more basic Azure instances, which may provision with an Opteron 8400 processor one day and a Haswell the next.

    Many of Azure's newer instances also oversubscribe their processors 2:1, one virtual proc to one hyper-thread, while AWS passes the hyper-threading ability through into the VM. This may seem like splitting hairs, but considering that .NET and most commodity applications are hyper-threading aware, there can be a pretty big performance difference in scheduling when you have no idea what other applications are running on the same physical server your equipment is running on.
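
    One quick sanity check here, if you want to see what the guest was actually handed: query sys.dm_os_sys_info from inside the VM (a sketch; the ratio only reflects what the hypervisor chooses to expose):

        -- Logical CPUs visible to SQL Server, and the logical-to-physical core
        -- ratio the host reports; on an oversubscribed VM these numbers won't
        -- tell the whole story, but they show what the guest sees.
        SELECT cpu_count, hyperthread_ratio
        FROM sys.dm_os_sys_info;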

    I have never opened a support ticket with AWS that they failed to respond to within the SLA, and I have never had a ticket with them that they couldn't close satisfactorily. I have somewhat regularly had Azure miss their response SLA by days, and have conservatively had a dozen tickets that they closed when they got tired of working on them.

    Azure also seems to rush products into production before they are ready, either from a technical standpoint or from a support standpoint. Microsoft Azure Backup Server was very poorly documented (as of about a year ago) even though it had already been released for two or more years. There was a bug in MABS that randomly impacted writes to backup media on Server 2016 that I worked a support ticket on for almost 4 months before I had to come up with a workaround for the client instead of continuing to spend money fixing it. The web application firewall appliance had a problem where, once the config started to get big, it could take as long as an hour to save a config change. It took them almost a year to fix that. The last time I worked on one, the Virtual Network Gateway would randomly change its egress IP address every so often, breaking network connectivity. A randomly changing egress IP is a huge problem when you are using RADIUS authentication in the appliance and can't just open up all IP addresses to your internal network because you are also peering with partner organizations off of the same appliance.

    And stuff certainly happens, but in the case of the AWS outage you mentioned, which happened in the Virginia datacenter about 2-3 years ago, customers who paid for geo-redundancy actually received geo-redundancy. In the Azure outage in the Austin datacenter about 1-2 years ago, few customers who paid for geo-redundancy actually received it, because Microsoft had provisioned more resources in their Austin datacenter than their DR site had the capacity to carry. Full recovery took something like a week and a half, and I had clients who weren't fully up for almost that entire time. The CyrusOne datacenter very close to the Azure datacenter managed to escape any outage at all. This outage also happened just a few months after another outage in Austin was caused by Microsoft over-provisioning their virtual infrastructure, during which no 2-core or 4-core Ds-series virtual machines would boot. Stuff is going to happen every once in a while, but outages caused by over-provisioning are completely unacceptable.

  • Thank you all - this has been very useful.

  • One more point to add... I have set up a VM in Azure with backups going to storage in a different geo location in Azure. I have also used the disaster recovery feature to replicate the VM to a different geo location, and have regularly tested the DR process (every three months; so far so good).
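
    For the cross-region backup piece, this is roughly the T-SQL involved (hypothetical names; it assumes a SAS credential matching the container URL already exists, with geo-redundancy coming from the storage account's replication setting or region rather than from SQL Server):

        -- Back up straight to blob storage in another region (or to a
        -- geo-redundant storage account); the container credential must exist.
        BACKUP DATABASE [YourDb]
        TO URL = 'https://yourdrstorage.blob.core.windows.net/backups/YourDb.bak'
        WITH COMPRESSION, CHECKSUM, STATS = 10;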

    Also, if none of the options for a VM works to your satisfaction, look at Azure SQL Managed Instance. All the DR and HA is taken care of by Azure, and you can also set up a read-only replica in a secondary location.
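
    As a small illustration of the read-only secondary: with a failover group, read-intent connections carry ApplicationIntent=ReadOnly in the connection string and land on the secondary, which you can verify with a sketch like this (hypothetical database):

        -- Run over a connection with ApplicationIntent=ReadOnly; this returns
        -- READ_ONLY on the secondary and READ_WRITE on the primary.
        SELECT DATABASEPROPERTYEX(DB_NAME(), 'Updateability') AS updateability;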
