MTTD

  • Comments posted to this topic are about the item MTTD

  • I don't think that you can get a 1 minute diagnosis on a newish system.

    Continuous improvement and engineering excellence are the way to get there. The cycle I would expect to go through would look something like this:

    1. Work out what you think can go wrong
    2. Work out how you would detect it
    3. Define useful error messages, alerts and communication paths
    4. Work out how to engineer out the things in #1

    Even the best of us get surprised by the way things manage to go wrong in unanticipated ways. The important thing is to do the root cause analysis and feed that into the four-step process above.

    In my experience continuous improvement naturally leads to refactoring and simplification. This makes systems less likely to go wrong in the first place and much quicker to diagnose when they do.

    There is a lot to learn from The Clean Coder by Robert C. Martin.

  • HADR used to be difficult to set up in previous versions of SQL Server and Windows. However, the many newer HADR configurations keep improving and are easier to administer, not just on premises but also in the cloud. Further, HADR topologies combined with integration and migration into non-Microsoft HADR solutions are driving better and better designs while minimising or automating the administration burden, and so maximising uptime. In short, five nines are more than achievable on today's HADR ecosystems by diverting high administration costs into automated management and monitoring.
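
    To put a number on that target, here is a rough sketch of the yearly downtime budget implied by an availability of N nines. The function name and the 365.25-day year are assumptions made for illustration, not figures from the post.

```python
# Downtime budget implied by an availability target of N nines.
MINUTES_PER_YEAR = 365.25 * 24 * 60  # assumption: average year length


def downtime_budget_minutes(nines: int) -> float:
    """Allowed downtime per year, in minutes, for an availability of N nines."""
    unavailability = 10 ** -nines  # e.g. 5 nines -> 0.00001
    return MINUTES_PER_YEAR * unavailability


for n in (3, 4, 5):
    print(f"{n} nines: {downtime_budget_minutes(n):.2f} minutes/year")
```

    At five nines the budget works out to about 5.26 minutes a year, so a single incident that takes one minute just to detect already uses roughly a fifth of it.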

     

  • I'm unsure whether AOG and other replication or scale-out technologies increase or decrease database availability. Designing the applications (and monitoring) to be more fault tolerant can increase system availability in terms of end user perception and uptime reporting.
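
    One common way to make an application more fault tolerant is to retry transient failures, so that a short failover looks like a slow request rather than an outage. The sketch below is only an illustration: `TransientError` and `call_with_retry` are hypothetical names, and deciding which real errors count as transient is left to the application.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (e.g. a failover in progress)."""


def call_with_retry(operation, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Run operation(); on TransientError wait and retry, doubling the delay each time."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == attempts:
                raise  # out of retries: let the failure surface to monitoring
            sleep(base_delay * 2 ** (attempt - 1))
```

    The last failure is re-raised rather than swallowed, so monitoring still sees outages that outlast the retry window.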

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • David.Poole wrote:

    I don't think that you can get a 1 minute diagnosis on a newish system.

    Even the best of us get surprised by the way things manage to go wrong in unanticipated ways.

    Agreed, but I have come across a few scenarios where a DBA has advised a developer that "this is a huge mistake waiting to happen" and been overruled.

    On these occasions you have your monitoring in place and can prove the issue in minutes (hopefully a good DBA would also have the rollback plan ready to go).

    I'm running a server consolidation project at the minute with lots of linked servers involved. There's no way it's going live without every scenario I can conceive being tested and lots of scripts ready to protect us.

    Fingers crossed we can respond in 1 minute (but I think I just jinxed us).

     

    MVDBA

  • HA/DR can lower uptime by adding complexity. Loose coupling and independence can help. We could argue that spending time on a broken secondary vs. a broken primary might increase downtime, but in most situations I think uptime is as high or higher.

  • Eric M Russell wrote:

    I'm unsure whether AOG and other replication or scale-out technologies increase or decrease database availability. Designing the applications (and monitoring) to be more fault tolerant can increase system availability in terms of end user perception and uptime reporting.

    To my mind, data availability is as important as database availability. Obviously your DB has to be up and running, but if your scaled out DB has just replicated a delete without a where clause then you are just as stuffed as if the scaled out cluster went down.

    I know that a lot of people are sadder and wiser for having experienced the horrors of BASE rather than ACID.

  • David.Poole wrote:

    Eric M Russell wrote:

    if your scaled out DB has just replicated a delete without a where clause then you are just as stuffed as if the scaled out cluster went down.

    Four words that need to be used in any delete situation:

    begin tran

    rollback tran (or commit tran if your rowcount is good)

    I never put a commit tran in a script until I know my where clause is good. I keep a special cupboard where we lock the naughty developers who forget this 🙂
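
    The same habit can be sketched as a small helper: run the delete inside a transaction and commit only when the row count matches what you expected. This is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for SQL Server; `guarded_delete` and the `orders` table are hypothetical names invented for the example.

```python
import sqlite3


def guarded_delete(conn, sql, params, expected_rows):
    """Run a DELETE inside a transaction; commit only if the row count matches."""
    cur = conn.cursor()
    cur.execute("BEGIN")
    cur.execute(sql, params)
    if cur.rowcount == expected_rows:
        cur.execute("COMMIT")
        return True
    cur.execute("ROLLBACK")  # wrong number of rows: undo everything
    return False


# isolation_level=None leaves transaction control to the explicit statements above
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany(
    "INSERT INTO orders (status) VALUES (?)",
    [("open",), ("open",), ("cancelled",)],
)

# Forgotten WHERE clause: three rows are hit but one was expected, so it rolls back
print(guarded_delete(conn, "DELETE FROM orders", (), expected_rows=1))

# Correct WHERE clause: one row is hit, as expected, so it commits
print(guarded_delete(conn, "DELETE FROM orders WHERE status = ?", ("cancelled",), expected_rows=1))
```

    The expected row count has to come from somewhere you trust, for example a SELECT COUNT(*) with the same where clause run beforehand.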

    MVDBA
