The Weather Outside Is Frightful

  • Comments posted to this topic are about the item The Weather Outside Is Frightful

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Most of us know Murphy's law that 'anything that can go wrong will go wrong', but fewer of us realise that Murphy was an optimist.

    Back in 2018 we had implemented a local cloud under Hyper-V with multi-site replication for servers and distributed availability groups for SQL. Everything ran on the latest kit with all solid-state storage. Performance was wonderful, and BCO failover was tested and worked. We thought we were in a good place.

    We wanted to add a new batch of NVMe storage to the system, the latest and fastest stuff, so we followed the same process we had used when we last added some SSD storage. At this stage Murphy came to lend a hand.

    When adding storage to a Hyper-V cluster, the MS software takes a snapshot of the storage topology before putting the storage online, so it can roll back if there is a problem. We had just about the fastest kit available back then, and the MS snapshot had a 'timing issue' and could not get itself to a stable state. Both the before and after versions were corrupt, so it proceeded to drop the volume topology information.

    We watched with increasing horror as server after server reported it had crashed. Then the server reporting the crashes crashed.  It was like having a necklace 1 mile long where the string had been cut and all the beads were on the floor.  All our storage was there and unharmed, but we had lost all the information needed to string it together in the right sequence.

    Fortunately our BCO site was not affected, but BCO practice now turned real. The main business-critical systems were back in use within about 2 hours. The last systems, including our Dev environment, took about 30 hours before they were all OK. As far as we could tell, we had no data loss.

    Our storage vendor was able to replicate our problem on one of their test environments, and could also reliably lose all their storage. All this went to MS, who decided that maybe a bit of software that had had no problems in almost 20 years of use needed a bit of an upgrade. So, a few months later, the whole MS world got a fix to the MS snapshot component that allowed it to work on the fastest kit around.

    All this happened at a charity with a GBP 70m income, a medium-sized business. Because we had planned and practised, when a bigger problem than anything we expected happened, we were able to get through it at just the cost of some staff time.

    A couple of years later we had another forced BCO failover, caused by a JCB digger and city-level data cables that were not located where the plans said they were. Murphy will come and help in all sorts of situations, so it can be useful to be prepared for him.

    Original author: https://github.com/SQL-FineBuild/Common/wiki/ 1-click install and best practice configuration of SQL Server 2019, 2017, 2016, 2014, 2012, 2008 R2, 2008 and 2005.

    When I give food to the poor they call me a saint. When I ask why they are poor they call me a communist - Archbishop Hélder Câmara

  • In the real world, one has to hope for the best, plan for the worst and insure for anything else!

    😎

    Weather is just one of many factors: solar storms, volcanic eruptions, forest fires, seismic activity, etc., right down to assessing the delivery reliability of all the base services (power, network, people...).

  • EdVassie wrote:

    Most of us know Murphy's law that 'anything that can go wrong will go wrong', but fewer of us realise that Murphy was an optimist.

    ...

    Good gosh. That's incredible. Well done. And yeah, Murphy was an optimist.

  • Eirikur Eiriksson wrote:

    In the real world, one has to hope for the best, plan for the worst and insure for anything else! 😎 Weather is just one of many factors: solar storms, volcanic eruptions, forest fires, seismic activity, etc., right down to assessing the delivery reliability of all the base services (power, network, people...).

    Don't forget extra-solar radiation fields that we occasionally pass through as we journey through the universe. Those have negatively impacted hard drives in the past. The universe is actively out to get us.

  • Grant Fritchey wrote:

    The universe is actively out to get us.

    This made me laugh.

    It's not paranoia if it's true, right!?  😉

  • JJ B wrote:

    Grant Fritchey wrote:

    The universe is actively out to get us.

    This made me laugh.

    It's not paranoia if it's true, right!?  😉

    Exactly.

    I couldn't agree more about being prepared ahead of time. When I read your post, what immediately came to mind were two words: CrowdStrike 🙂 That horrible morning is still in my brain, when I was the unlucky soul on call that week.

    My contribution to our recovery that day was not even a SQL solution. I requested, and received, ESX access to our VMs in order for our team to be able to fix and reboot our hundreds of SQL VMs, while other teams in the org worked on the thousands of non-SQL servers. We were 80% recovered by noon, and fully back before 5pm.

     

  • Randy Rabin wrote:

    I couldn't agree more about being prepared ahead of time. When I read your post, what immediately came to mind were two words: CrowdStrike 🙂 That horrible morning is still in my brain, when I was the unlucky soul on call that week.

    My contribution to our recovery that day was not even a SQL solution. I requested, and received, ESX access to our VMs in order for our team to be able to fix and reboot our hundreds of SQL VMs, while other teams in the org worked on the thousands of non-SQL servers. We were 80% recovered by noon, and fully back before 5pm.

    Respect.

    I was in an airport trying to get home. Happily, my airline, American, didn't suffer as much as others (sheer dumb luck; AA isn't any better than most other airlines, it's just the one I use). It all worked out OK.
