SLAs and Architectural Design

  • It's all very well agreeing SLAs for SQL services in terms of nines, but how do you ever translate an SLA (e.g. 99.9% or 99.99%) into 'which SQL architecture can actually achieve this'?

    If I have 3 VMs in a SQL AG, one in DR, all on top of some storage snapshot tech, in VMware clusters, what SLA is that actually fulfilling?

    Interested to have your thoughts.

  • It is not easy (or perhaps even possible) to guarantee the numbers you speak of mathematically or statistically. While that level of redundancy certainly gives you a good chance of achieving that SLA, there is no guarantee that you will. If you try to calculate the probabilities statistically, you'll have to take into account the MTBF of each individual component in your redundant architecture. It will be very difficult to factor in the MTBFs of individual disk drives, HBA cards/iSCSI interfaces, memory, CPUs, motherboards, network switches and everything else in the complex architecture that is any corporate computer/network system today. Patching, system upgrades and application software upgrades also need to be factored into this picture (as if hardware wasn't enough :-))
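
    For what it's worth, the textbook version of that calculation can be sketched as below. This is only an illustration: every MTBF/MTTR figure is a made-up placeholder, the function names are hypothetical, and the parallel formula assumes failures are independent, which real systems (shared storage, shared power, the same bad patch on every node) often violate.

```python
from math import prod


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability of one component: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def series(*parts: float) -> float:
    """Every part must be up for the service to be up, so availabilities multiply."""
    return prod(parts)


def parallel(*parts: float) -> float:
    """Redundant parts: the group is down only when all parts are down.

    Assumes the parts fail independently of each other.
    """
    return 1.0 - prod(1.0 - a for a in parts)


# Hypothetical component figures, NOT vendor data.
server = availability(mtbf_hours=8_000, mttr_hours=8)
switch = availability(mtbf_hours=50_000, mttr_hours=4)
storage = availability(mtbf_hours=100_000, mttr_hours=12)

# One server versus two redundant servers, both behind the same switch and storage.
single_node = series(server, switch, storage)
two_nodes = series(parallel(server, server), switch, storage)

print(f"single node: {single_node:.5%}")
print(f"two nodes:   {two_nodes:.5%}")
```

    Even in this toy model the non-redundant switch and storage cap the result, and nothing here accounts for patching, upgrades or human error.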

    Instead, the SLA is what you strive to achieve, by being proactive and managing the redundant hardware, architecture and software so that you head off potential failures before they can affect your uptime. Ultimately, you can only measure what you have achieved from the actual results, i.e. your uptime and downtime over a given period. 99.99 percent uptime means downtime of no more than 4 minutes 19 seconds in a 30-day month, or about 52.5 minutes a year. Several SLA calculators are available on the internet if you want to check other targets.
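
    The downtime arithmetic is simple enough to do yourself. A minimal sketch, assuming a 30-day month and a 365-day year (the function name is just illustrative):

```python
from datetime import timedelta


def downtime_budget(availability_pct: float, period: timedelta) -> timedelta:
    """Maximum downtime allowed within `period` at the given availability target."""
    return period * (1 - availability_pct / 100)


month = timedelta(days=30)   # assumption: 30-day month
year = timedelta(days=365)   # assumption: non-leap year

for target in (99.9, 99.99, 99.999):
    print(f"{target}%: {downtime_budget(target, month)} per month, "
          f"{downtime_budget(target, year)} per year")
```

    For 99.99% this works out to roughly 4 minutes 19 seconds over a 30-day month and about 52.6 minutes over a 365-day year.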

  • It's a shame that there are no reference architectures for SQL for different target SLAs.

    Thanks

  • SLA isn't just about architecture. It's about people and process as well.

    If I have a requirement for no more than 30 minutes of downtime and no more than 2 hours of data loss, that's trivially achievable with log shipping. However, if the failover process has never been tested and the people don't know what needs to be done to the application to redirect it to the second server, then the downtime in the event of a failure could be many hours.

    Meeting an SLA isn't a case of putting the 'right' architecture in and magically all will work.
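
    To make the people-and-process point concrete, here is a rough sketch of checking one failover drill against those two targets. The field names and all the timings are hypothetical; in practice they would come from a stopwatch during a real test, not from the architecture diagram.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class DrillResult:
    """Timings measured during one failover test (all fields are hypothetical)."""
    detect: timedelta             # time until someone notices the outage
    restore_secondary: timedelta  # time to bring the log-shipped secondary online
    redirect_app: timedelta       # time to repoint the application at the secondary
    data_loss: timedelta          # age of the newest log backup that was restored

    @property
    def downtime(self) -> timedelta:
        """Total outage as the business sees it: technical AND human steps."""
        return self.detect + self.restore_secondary + self.redirect_app


def meets_targets(drill: DrillResult,
                  max_downtime: timedelta,
                  max_data_loss: timedelta) -> bool:
    """True only if both the downtime and the data-loss targets were met."""
    return drill.downtime <= max_downtime and drill.data_loss <= max_data_loss


MAX_DOWNTIME = timedelta(minutes=30)
MAX_DATA_LOSS = timedelta(hours=2)

# Same architecture in both cases; only the application redirect step differs.
rehearsed = DrillResult(timedelta(minutes=5), timedelta(minutes=10),
                        timedelta(minutes=5), timedelta(minutes=15))
unrehearsed = DrillResult(timedelta(minutes=5), timedelta(minutes=10),
                          timedelta(hours=4), timedelta(minutes=15))

print(meets_targets(rehearsed, MAX_DOWNTIME, MAX_DATA_LOSS))
print(meets_targets(unrehearsed, MAX_DOWNTIME, MAX_DATA_LOSS))
```

    The two drills use identical infrastructure; only the untested redirect step pushes the second one past the downtime target.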

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass

  • Typing this on a train. Apologies for curtness and typos.

    You can't say for certain how many 9s you're satisfying until after the fact. But the way you target a set of 9s is to determine how much downtime you can support, then work through your architecture and process (Gail is so right, as usual) to see what happens if you pull the plug on a server. Then do that. You have to test your recovery and failover processes or you don't actually know how many 9s you're supporting. Yes, it's best to do this before you're in production, but if you haven't, you may need to arrange to do it anyway. Better to find out in a controlled experiment than in an uncontrolled one.
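
    Measuring the 9s after the fact is straightforward once you keep an outage log. A minimal sketch, assuming outages are recorded as non-overlapping (start, end) pairs; the dates and the 45-minute outage are invented for illustration:

```python
from datetime import datetime, timedelta


def achieved_availability(outages, period_start: datetime, period_end: datetime) -> float:
    """Percentage of the period the service was up.

    `outages` is an iterable of (start, end) datetimes, assumed not to overlap.
    """
    period = period_end - period_start
    down = timedelta()
    for start, end in outages:
        # Clip each outage to the reporting period before counting it.
        clipped_start = max(start, period_start)
        clipped_end = min(end, period_end)
        if clipped_end > clipped_start:
            down += clipped_end - clipped_start
    return 100 * (1 - down / period)


# Hypothetical month with a single 45-minute outage.
october = (datetime(2015, 10, 1), datetime(2015, 11, 1))
outage_log = [(datetime(2015, 10, 13, 2, 0), datetime(2015, 10, 13, 2, 45))]

print(f"{achieved_availability(outage_log, *october):.4f}%")
```

    One 45-minute outage in a 31-day month already lands just under 99.9%, which is why the failover test matters more than the diagram.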

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • I've never heard of the concept of "nines" in regards to SLAs. But regarding the architecture piece of SLAs, usually the system is architected for redundancy and to remove single points of failure for hardware issues.

    So under the assumption you are talking hardware failure SLAs and not software / application SLAs (which are a totally different kettle of fish)... I recommend considering clusters, high availability solutions, restore scenarios and multiple data centers with offsite backups created to meet the restore scenarios.

    Once all this is set up, simulate as many failures as you can during maintenance windows so everyone involved understands what roles they are to play, where the pain points of recovery are, and what parts of your recovery plan need fixing. You can't truly come up with a realistic SLA until you understand all the working parts. Once you do come up with this SLA, PAD PAD PAD it with at least another 4-6 hours. That way if everything goes completely fubar during a real emergency, you've given yourselves enough grace to fix the fubar (you hope).

    Background Note: I used to work for Iron Mountain Records Management, a company that specializes in record retention, business continuity, and disaster recovery. When 9/11 happened, our NYC office was there for several of the World Trade Center companies and we got them their off-site backups, so many of them were running at alternate sites within hours of the disaster. Now that's a heck of an SLA to meet.

    Brandie Tarvin, MCITP Database Administrator
    LiveJournal Blog: http://brandietarvin.livejournal.com/
    On LinkedIn!, Google+, and Twitter.
    Freelance Writer: Shadowrun
    Latchkeys: Nevermore, Latchkeys: The Bootleg War, and Latchkeys: Roscoes in the Night are now available on Nook and Kindle.

  • Part of the SLA is your best guess at downtime and the hardware, processes, and people you will need to support different levels.

    The business determines how much they are willing to pay for these scenarios that you outline.

    The business does not care whether it is hardware or software - downtime is still downtime.

    And they need to do some work on their side.

    For example, hot site vs. cold site.

    You supply anticipated hours in each scenario, and they determine how much business process they can conduct.

    Can they still manufacture, take orders, ship, etc. and enter transactions later?

    So in the end, both business and IT are weighing cost and risk, and have a way to measure and see if they met their goals.

  • Greg Edwards-268690 (10/15/2015)


    The business does not care whether it is hardware or software - downtime is still downtime.

    I would point out that not all SLAs are downtime related. Sometimes they are related to productivity issues. Example: At month end, certain processes are expected to finish in X amount of hours so the next department can do their thing. Then they have X number of hours before the next department picks up and does their thing.

    Hence my comment about application / software being a different kettle of fish.

    Brandie Tarvin, MCITP Database Administrator

  • gingerdazza (10/13/2015)


    It's all very well agreeing SLAs for SQL services in terms of nines, but how do you ever translate an SLA (e.g. 99.9% or 99.99%) into 'which SQL architecture can actually achieve this'?

    If I have 3 VMs in a SQL AG, one in DR, all on top of some storage snapshot tech, in VMware clusters, what SLA is that actually fulfilling?

    Interested to have your thoughts.

    So your server room loses all power due to a natural disaster.

    What is your expected recovery time?

    Does it differ if the building is flooded or flattened?

    Are other applications connecting to the SQL Servers? How are these affected?

    Would recovery be onsite or at a remote site?

    An SLA may include or exclude some things like this, or you may have different SLAs to cover them.

    Only by testing some scenarios can you take an educated guess as to what this would translate into for your environment.

    Then measure the results to see if you met the SLA between you and the business.

  • Greg Edwards-268690 (10/15/2015)


    So your server room loses all power due to a natural disaster.

    What is your expected recovery time?

    Does it differ if the building is flooded or flattened?

    Are other applications connecting to the SQL Servers? How are these affected?

    Would recovery be onsite or at a remote site?

    An SLA may include or exclude some things like this, or you may have different SLAs to cover them.

    Only by testing some scenarios can you take an educated guess as to what this would translate into for your environment.

    Then measure the results to see if you met the SLA between you and the business.

    All excellent points. There is going to be more than one SLA for disaster recovery and business continuity. If there isn't, then the planning hasn't been thorough enough.

    Brandie Tarvin, MCITP Database Administrator
