SLA Contracts

  • Gosh. We should set up a poll. I wonder how many DBAs subscribe to the idea that it's OK to lose data on a production system... SLA or not... ever. 😉 I'll agree that you may not be able to keep a system up 100% of the time over the years because of budget constraints, etc., but losing data? There's just no excuse in my book, especially with all the tools built into SQL Server. 🙂

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Personally I have asked/educated the business if they need transaction log backups or only full backups. But that's it...

    However I think it is necessary to know how much downtime they can take and I'm sure more questions can be asked to obtain vital information for disaster recovery.

    To get this discussion going :-D: What information do you gather? Or do you have formal documents with SLAs?

  • Jeff Moden (3/3/2011)


    I wonder how many DBAs subscribe to the idea that it's ok to lose data on a production system.

    I don't think any REAL DBA (i.e., those who chose the career, as opposed to those who were told to "look" after the database) would subscribe to the idea of losing data, but...

    I am flabbergasted at the number of companies (well, managers anyway) who brush this sort of thing under the carpet as soon as they hear there is a cost involved.

    I've only ever worked for one customer (police) where this was specifically written into the SLA, and thoroughly tested as part of their test criteria before accepting the system into production.

    The majority seem to think it's worth the risk, given budgetary constraints and the reliability of modern hardware/software.

  • I find this topic funny, in a sick way. We do have SLAs, and they're a big deal to management. However, the DBAs and developers are not told what the agreements/priorities are. In my organization, it's more about looking/sounding good to the customers than actually doing a good job. There is absolutely no communication.

    Thankfully we do not sell/use these products outside our own company, so none of you need to worry about getting junk from us! :w00t:

  • I don't think anyone would ever want to lose data, but it's possible and requires some expense to avoid. Synchronous mirroring is the only way I see to guarantee no data loss at all, and that requires Enterprise Edition, which not everyone has.

  • I always ask two questions of the business entity that would be considered the "owner" of the data: 1) How much data can you re-enter, measured in time? (typical would be one hour); and 2) How long can you be without your system, working via manual procedures? (typical 1/2 to two hours). The answers to these questions establish your "worst case scenario" and govern how you protect via backup strategy (full and incremental). If the business entity can justify a replicated or clustered environment, then that becomes the first option for DR purposes; but any strategies you have in place should always protect according to the SLA.
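
    The two questions above can be turned into a simple check. This is only a rough sketch: the class names, field names and the sample numbers are made up for illustration, and it assumes the worst-case data loss equals the gap between your most frequent backups.

```python
from dataclasses import dataclass


@dataclass
class Sla:
    # Hypothetical names; the values come from the two questions asked of the data owner.
    max_data_loss_min: float   # question 1: how much work can be re-entered
    max_downtime_min: float    # question 2: how long manual procedures can last


@dataclass
class BackupPlan:
    backup_interval_min: float     # time between the most frequent backups
    estimated_restore_min: float   # measured (ideally tested) time to restore service


def plan_meets_sla(sla: Sla, plan: BackupPlan) -> bool:
    # Assumption: the failure happens just before the next backup would have run,
    # so everything since the previous backup has to be re-entered.
    worst_case_loss_min = plan.backup_interval_min
    return (worst_case_loss_min <= sla.max_data_loss_min
            and plan.estimated_restore_min <= sla.max_downtime_min)


sla = Sla(max_data_loss_min=60, max_downtime_min=120)
print(plan_meets_sla(sla, BackupPlan(backup_interval_min=15, estimated_restore_min=90)))
print(plan_meets_sla(sla, BackupPlan(backup_interval_min=240, estimated_restore_min=90)))
```

    With a one-hour re-entry limit, the 15-minute plan passes and the 4-hour plan does not. The restore estimate is only as good as your last restore test.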

  • Jon-413357 (3/4/2011)


    Personally I have asked/educated the business if they need transaction log backups or only full backups...

    I never ask end users about transaction log backups vs full backups or backups vs no backups at all, because I assume that they have no understanding of the ramifications of this choice. I assume that they do not really care until there is a problem, at which point it will be vital to recover the data.

    One user told me that database backups were not necessary. I asked him if it would be OK if I dropped the database right now. He said that it was vital to the business so of course it wasn't OK to drop it. I asked how he thought we would get the database back after it was dropped by accident or was lost due to hardware failure.

  • As a consultant, I'm seeing a lot of companies going the VM path before they call out a DBA. The typical problem that I keep seeing (besides all databases except model in Simple recovery mode <sigh>) is that even the full backups are not safe.

    The network guys apparently think it is safe to carve off a partition or a vdisk which sits on the same underlying shared storage as the data, log files and backups. Aaaahhh.

    Every shop I walk into gets a big education on the ramifications of their folly and that the word RAID does not mean that they'll never again lose data, but each company is willing to make the changes to protect their data. At least, after I explain it with questions similar to those in the posts above.

    Jim

    Jim Murphy
    http://www.sqlwatchmen.com
    @SQLMurph

  • I think some of you may be looking at the question of losing data differently than I do.

    I'm not saying you should turn off the transaction logs, I'm saying that some of you may realize that it is acceptable to lose a certain amount of data under some catastrophic scenarios.

    If a meteor hit my data center, or a truck full of explosives was driven into it (and a pickup truck would be too small) then we would lose all data back to the last off site backup. This is up to about 30 hours(1) and varies depending on when a particular backup is taken and the "send off-site" schedule.

    If the server dies because someone put a bullet through it, (which involves getting past police officers and through hefty physical barriers) then we're still only talking about seconds of data loss (because the data is on a SAN or NAS).

    I think those are acceptable numbers. We(2) (my servers) aren't Amazon/priceline/eBay... a huge ongoing cost against a minuscule risk isn't reasonable. I think that's where Steve was heading when he said, "[You should] efficiently plan what level of resources will be devoted to your systems."

    Steve - to answer your SLA question I "sort of" know what our SLA is because I know the relative importance of each system. There are some systems that are disaster recovery (DR) tested twice a year, and there are others where there is a DR, but the clients don't formally 'accept' the success or failure of the DR.

    So, while I may not have access to the SLA documents - I do know what order to recover the systems in, hence the phrase "sort of".

    As a side note, at one point a few years ago I was asked what it would take to cut the 'catastrophic' scenario data loss time down to under a minute. After several meetings with the platform people and network specialists about how we could ship logs in real time between the data centers and have redundant hardware in each, we came up with a cost estimate. This number was about nine times what we were currently spending.

    (1) - Backups are now also shipped electronically to the other datacenter, so that number is now lower than 30, but I don't have an estimate of what it is.

    (2) - In case you know where I work, I feel I should mention that there are areas here that have a much lower tolerance for data loss, I'm just talking about my area.
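
    For anyone wanting to work out their own number for the catastrophic scenario above, here is a rough sketch of the arithmetic. The function name and the 24-hour/6-hour split are assumptions made up for illustration; the post only gives the total of about 30 hours.

```python
def worst_case_offsite_loss_hours(backup_interval_h: float, offsite_delay_h: float) -> float:
    # Worst case: the site is lost just before the newest backup arrives off site.
    # The newest off-site copy was then taken one full backup interval plus one
    # shipping delay ago, so the loss is the sum of the two.
    return backup_interval_h + offsite_delay_h


# Assumed example: a backup every 24 hours that takes 6 hours to reach the off-site store.
print(worst_case_offsite_loss_hours(24, 6))
```

    Shipping backups electronically, as in footnote (1), shrinks the delay term; only more frequent backups shrink the interval term.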

  • A recent discussion with a client brought up a point that is vital to understand clearly - the difference between (1) hardware failure, e.g. a drive dies, and (2) data corruption, e.g. accidental DROP or DELETE, software/application issues.

    RAID seems to be sprinkled around like holy water as a general solution to data loss, but in case (2) you'll just end up with two perfect copies of the same corrupted data.

    I'm sure this is IT 101 for most readers here, but I've been surprised how many (even supposed IT) people don't really understand this distinction well.

  • A lot of folks on this thread keep talking about "HUGE COSTS". Why do you people think that there are such huge costs involved for simple data safety? :blink:

    --Jeff Moden

  • That is really close to the point I make as well.

    The end user HAS to be involved in the decisions that govern how we will respond and protect their systems and data. We have multiple platforms (SQL, Oracle, Cache) with different SLAs to different entities, but each one has at least two recovery strategies in place. Normally, we end up with some type of replication/shadowing as our primary; then full/incremental backups as secondary -- but both need to fulfill the SLA requirements. Different options also have different cost requirements, and typically, the business entity will have to have these monies in their budget to support the strategies chosen.

    It's all about accountability, and our accountability to an SLA is what drives us to test strategies. We have to know that we're protecting our rear ends should one of these systems go down. The users need to know that should this happen, they'll be able to continue to perform for their customers.

  • Jeff Moden (3/5/2011)


    A lot of folks on this thread keep talking about "HUGE COSTS". Why do you people think that there are such huge costs involved for simple data safety? :blink:

    It doesn't have to be HUGE costs to put people off. Even the cost of another server to store transaction log backups away from the main server is too much for some companies.

    At a recent customer, a quick look at the hardware at each remote site showed that, although the database and backups were on different drive letters, they were all part of the same RAID array. Even though this was high profile enough for a loss of data to make the newspapers, no amount of throwing hands up in horror could do anything about the fact that there simply was no budget to add a few more disks to split the array or add an external disk, let alone add another server.

  • We have a reasonably well established structure relating to DR, including data loss.

    DR planning has been conducted, including the sequence of actions from the disaster, through declaration of disaster, requirement to switch to DR, and data/system recovery and switch over. Identification of individuals and their backup has been completed.

    We have about 100 systems (and growing) both internally and externally focussed which have been assessed. Each has been assessed on a financial basis. Assessment incorporates T's and C's of SLAs where they exist, be they in external contracts or internal business requirements. Other issues assessed include cost of data loss/re-entry, financial losses including loss of profits, loss of clients and loss of reputation. I remember including the statutory liability that may arise if our Financial System were lost in a reporting period due to missing deadlines at the Stock Exchange.

    Having numbers on potential losses in front of a manager focuses their attention. Ditto signed-off numbers in a client contract for SLA failure penalties. Cost the solutions, and they can then decide which of the data security scenarios they are prepared to fund. From an IT perspective I just love this mechanism - their system, their money, their decision. We serve to research, advise and implement.

    DR testing on select systems, where possible, occurs regularly. I have seen a number of systems tested on 6-monthly cycles. We have different solutions for different requirements. Mostly we have primary and secondary data centres and ship logs from one to the other.
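
    The costing approach described above can be sketched as a simple comparison of expected yearly loss against the yearly cost of a solution. This is only an illustration: the function names and every figure below are hypothetical, and a real assessment would include the contract penalties and reputational losses mentioned in the post.

```python
def annual_loss_expectancy(incidents_per_year: float, loss_per_incident: float) -> float:
    # Expected yearly loss = how often the incident is expected times what it costs each time.
    return incidents_per_year * loss_per_incident


def worth_funding(solution_cost_per_year: float, ale_before: float, ale_after: float) -> bool:
    # A solution pays for itself when the expected loss it removes exceeds its yearly cost.
    return (ale_before - ale_after) > solution_cost_per_year


# Hypothetical figures: one incident every four years, costing 400,000 without
# the solution and 40,000 with it; the solution costs 30,000 a year.
before = annual_loss_expectancy(0.25, 400_000)
after = annual_loss_expectancy(0.25, 40_000)
print(before, after, worth_funding(30_000, before, after))
```

    The point is not the precision of the figures but that the manager sees both sides of the comparison before deciding what to fund.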
