Backup Plans for Data Loss

  • Comments posted to this topic are about the item Backup Plans for Data Loss

  • The rumour here in the UK about the British Airways systems failures was that they had disbanded an existing, long-standing team in favour of outsourcing the work to India ("a spokeswoman from India's Tata Consulting Services (TCS) told El Reg: "As BA has already confirmed, the problems over the weekend were caused by a power supply issue and not due to outsourcing of IT services, so we can't comment further. BA and all its partners including TCS have been working very hard to restore the services fully." https://www.theregister.co.uk/2017/06/02/british_airways_data_centre_configuration/).  This is quite common over here in the UK.  I like to think that it's not just justifying our own jobs to say that it normally results in a less professional service with communication issues (it seems to be hard for those contractors to admit they don't understand the code or requirements, and it can take a long while for such issues to get back to the UK team).  In our experience (myself, my husband, and nearly 30 years as developers) the outsourced services then return in-house, at least until a new company secretary adds up the IT salaries and the whole cycle starts again...  Hopefully this helps explain some of the issues around the outage.

  • I work for a multinational that has tried to automate and offshore all of its servers.  We had a power blip.  The servers survived it.  The air conditioning system did not.
    The problem hit on a hot Saturday night, with nobody there.  Did I mention automate and offshore?  The server room got hot enough that the servers started having heat failures: intermittent, then back working, then something melts.  The problems continued through several incremental backups.  Monday morning was when there were enough failures that someone walked into the room.
    Recovery took much longer than if the room had been on fire or if the power lines had simply been cut.

  • I think it would be interesting to know exactly what went down with the British Airways failure and I know you, Steve, have made appeals many times to be open with that type of information so we can all learn from it.  And maybe, over time - when it hurts less, they will come forward with the full story.

    One of the challenges in a server/site-down situation is when and how to make the call to switch to the offsite location.  It may have been the type of thing where the staff on-site kept saying "5 more minutes" and the switchover may have taken 2 hours, so they kept trying to get the onsite server back up.

    Alas, we can only speculate at this time ...

  • IowaDave - Thursday, June 22, 2017 8:06 AM

    I think it would be interesting to know exactly what went down with the British Airways failure and I know you, Steve, have made appeals many times to be open with that type of information so we can all learn from it.  And maybe, over time - when it hurts less, they will come forward with the full story.

    One of the challenges in a server/site-down situation is when and how to make the call to switch to the offsite location.  It may have been the type of thing where the staff on-site kept saying "5 more minutes" and the switchover may have taken 2 hours, so they kept trying to get the onsite server back up.

    Alas, we can only speculate at this time ...

    True, though given some of the reporting, I suspect this was a problem with data corruption. Could powering things on have caused this? Sure, but not being able to handle some set of machines failing, even from data corruption, seems like a separate data management failure.

    My guess? Maybe it was power, but they were running some sort of mirroring-type setup, moving data for HA but not for DR.

  • In my previous job, because there were only two of us, we wore all the hats. We had a DR plan; we wrote it up and tested it. But the job I'm in now is a large IT shop where every position is specialized. I'm a developer here, so I've no idea what the DR plans are or whether they're practiced. I would assume that they exist and have been practiced at least once. I kind of hate asking about it, though, as I feel like I'm walking on eggshells and also because I know I'll get the "this isn't any of your business" speech. Which, in fairness, it isn't.

    Kindest Regards, Rod
    Connect with me on LinkedIn.

  • IowaDave - Thursday, June 22, 2017 8:06 AM

    One of the challenges in a server/site-down situation is when and how to make the call to switch to the offsite location.  It may have been the type of thing where the staff on-site kept saying "5 more minutes" and the switchover may have taken 2 hours, so they kept trying to get the onsite server back up.

    This makes a lot of sense to me. A restore feels like an option of last resort. I've certainly been reluctant to call for one, trying to fix the problem another way and in the end, wasting more time than if I'd gone for a restore right away. Probably a reflection of not practicing restores enough & not being 100% confident that a restore won't cause other problems. After all, no-one wants to be the guy from Jurassic Park who restarted the power and let the velociraptors out.
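
    One thing that's helped me build that confidence is occasionally restoring the latest backup as a throwaway copy on a test instance, just to prove the backup and the process both work. A rough sketch of what that looks like in T-SQL, with placeholder database, path, and logical file names:

    -- Check the logical file names inside the backup first.
    RESTORE FILELISTONLY
    FROM DISK = N'D:\Backups\Sales_full.bak';

    -- Restore the backup as a differently named test database.
    RESTORE DATABASE Sales_RestoreTest
    FROM DISK = N'D:\Backups\Sales_full.bak'
    WITH MOVE N'Sales_Data' TO N'D:\TestRestore\Sales_RestoreTest.mdf',
         MOVE N'Sales_Log'  TO N'D:\TestRestore\Sales_RestoreTest.ldf',
         STATS = 10;

    -- Confirm the restored copy is healthy before dropping it.
    DBCC CHECKDB (Sales_RestoreTest) WITH NO_INFOMSGS;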

    Leonard
    Madison, WI

  • Rod at work - Thursday, June 22, 2017 9:33 AM

    In my previous job, because there were only two of us, we wore all the hats. We had a DR plan; we wrote it up and tested it. But the job I'm in now is a large IT shop where every position is specialized. I'm a developer here, so I've no idea what the DR plans are or whether they're practiced. I would assume that they exist and have been practiced at least once. I kind of hate asking about it, though, as I feel like I'm walking on eggshells and also because I know I'll get the "this isn't any of your business" speech. Which, in fairness, it isn't.

    I'd make sure I can recreate my stuff. Can I reproduce my dev work? Is my VCS being backed up (I'd ask)? And more. Asking might trigger others to make sure they can recover.

  • Steve Jones - SSC Editor - Thursday, June 22, 2017 11:36 AM

    Rod at work - Thursday, June 22, 2017 9:33 AM

    In my previous job, because there were only two of us, we wore all the hats. We had a DR plan; we wrote it up and tested it. But the job I'm in now is a large IT shop where every position is specialized. I'm a developer here, so I've no idea what the DR plans are or whether they're practiced. I would assume that they exist and have been practiced at least once. I kind of hate asking about it, though, as I feel like I'm walking on eggshells and also because I know I'll get the "this isn't any of your business" speech. Which, in fairness, it isn't.

    I'd make sure I can recreate my stuff. Can I reproduce my dev work? Is my VCS being backed up (I'd ask)? And more. Asking might trigger others to make sure they can recover.

    Years ago, I worked at a place where the disk storage for the Team Foundation Server crashed and there were no backups (disclaimer: the TFS server was managed at the time by the QA team, and I was a developer). We lost not only our VCS but also the work items and change control history for several projects. Fortunately, between all the team members, we were at least able to salvage the most recent branch of every project from our local work folders, and I was able to retrieve a few prior versions of some files from my local \AppData\Local\Temp folder and by scripting out objects from backups of the development and production database servers.
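
    If anyone ever ends up doing the same, once a backup is restored somewhere, the current definition of every procedure, view, function, and trigger can be pulled straight out of the catalog views. A rough sketch (run in the restored copy of the database):

    -- Script out module definitions from a restored copy of the database.
    SELECT  OBJECT_SCHEMA_NAME(m.object_id) AS schema_name,
            OBJECT_NAME(m.object_id)        AS object_name,
            m.definition
    FROM    sys.sql_modules AS m
    ORDER BY schema_name, object_name;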

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • kerry_hood - Thursday, June 22, 2017 6:16 AM

    The rumour here in the UK about the British Airways systems failures was that they had disbanded an existing, long-standing team in favour of outsourcing the work to India ("a spokeswoman from India's Tata Consulting Services (TCS) told El Reg: "As BA has already confirmed, the problems over the weekend were caused by a power supply issue and not due to outsourcing of IT services, so we can't comment further. BA and all its partners including TCS have been working very hard to restore the services fully." https://www.theregister.co.uk/2017/06/02/british_airways_data_centre_configuration/).  This is quite common over here in the UK.  I like to think that it's not just justifying our own jobs to say that it normally results in a less professional service with communication issues (it seems to be hard for those contractors to admit they don't understand the code or requirements, and it can take a long while for such issues to get back to the UK team).  In our experience (myself, my husband, and nearly 30 years as developers) the outsourced services then return in-house, at least until a new company secretary adds up the IT salaries and the whole cycle starts again...  Hopefully this helps explain some of the issues around the outage.

    Now that's super-interesting.  Being in the US, I hadn't heard that, so thank you.  Like the editorial said, I didn't believe the official statement either - never did.

    I once worked for a company that made the decision to outsource its infrastructure management.  Management did their due diligence, conducting tests and doing an ROI workup, then ignored the failures and went ahead with it anyway.  They spent millions on a 7-year contract and ended up buying out the last 3 years of it just to kick the vendor out, because it was such a disaster.  There's nothing like having a team with a real interest on board.

  • One should always verify backups on a regular basis and make sure there's a good disaster recovery plan in place.
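
    At a minimum, checks like these can be scheduled against recent backup files (the path is a placeholder); they confirm the file is complete and readable, though only an actual test restore proves it can be brought back:

    -- Show what's in the backup file and verify it is readable.
    RESTORE HEADERONLY FROM DISK = N'D:\Backups\Sales_full.bak';

    -- WITH CHECKSUM assumes the backup was taken WITH CHECKSUM; drop it otherwise.
    RESTORE VERIFYONLY FROM DISK = N'D:\Backups\Sales_full.bak' WITH CHECKSUM;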

  • Markus - Friday, June 23, 2017 7:02 AM

    Do NOT think a disaster can't happen to you and your company.  Last month we had a catastrophic failure of our SAN; many others and I worked around the clock for 10 days to get everything back up, running, and current.  It basically corrupted 12 of our production SQL Servers, on which we had to reinstall and restore everything.  Thank goodness I had great documentation and had already practiced reinstalling and restoring master and all the databases.  One thing that I learned is to have a secondary place to keep all of the install media.  The server that I had everything on was toast.  So I did a lot of DVD installs and burned some of the Service Packs and CUs to DVD media.  Some of it I had to re-download.

    Once the OS and SQL Server are reinstalled, one must then redo all the configuration settings, accounts, permissions, and other dependencies to make the environment match what was there before. That's why it's a good idea to take image backups of the OS / Program Files drive(s) in addition to database backups.
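
    A cheap bit of insurance is to regularly dump the server-level settings and logins somewhere off the server, so there's at least a reference to rebuild from. A rough sketch of the kind of queries involved (save or schedule the output however suits your shop):

    -- Snapshot sp_configure settings for documentation.
    SELECT name, value_in_use
    FROM   sys.configurations
    ORDER BY name;

    -- Snapshot SQL logins and Windows logins/groups.
    SELECT name, type_desc, default_database_name, is_disabled
    FROM   sys.server_principals
    WHERE  type IN ('S', 'U', 'G')
    ORDER BY name;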

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Eric M Russell - Friday, June 23, 2017 7:27 AM

    Markus - Friday, June 23, 2017 7:02 AM

    Do NOT think a disaster can't happen to you and your company.  Last month we had a catastrophic failure of our SAN; many others and I worked around the clock for 10 days to get everything back up, running, and current.  It basically corrupted 12 of our production SQL Servers, on which we had to reinstall and restore everything.  Thank goodness I had great documentation and had already practiced reinstalling and restoring master and all the databases.  One thing that I learned is to have a secondary place to keep all of the install media.  The server that I had everything on was toast.  So I did a lot of DVD installs and burned some of the Service Packs and CUs to DVD media.  Some of it I had to re-download.

    Once the OS and SQL Server are reinstalled, one must then redo all the configuration settings, accounts, permissions, and other dependencies to make the environment match what was there before. That's why it's a good idea to take image backups of the OS / Program Files drive(s) in addition to database backups.

    I would agree.
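
    One piece worth practicing in particular is master itself, since it can't be restored like a user database. A minimal sketch, assuming the instance has already been restarted in single-user mode and using a placeholder backup path:

    -- The instance must be started in single-user mode (e.g. the -m startup option) first.
    RESTORE DATABASE master
    FROM DISK = N'D:\Backups\master_full.bak'
    WITH REPLACE;
    -- The instance shuts itself down after master is restored; restart it
    -- normally, then restore msdb and the user databases.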

  • You should make a note now, if you haven't already, of the issues you encountered and what wasn't available, or what you didn't have a backup of (a linked server, a login, etc.). Then start building jobs to ensure you have backups of those.

    Not all of your backups will be database backups or Windows backups. You might need some scripted backups that capture those things.
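
    For example, even a scheduled job that dumps the list of linked servers into a table gives you something to rebuild from. A rough sketch, assuming a hypothetical DBA_Admin utility database to hold the output; login mappings and passwords still need to be handled separately:

    -- Refresh a simple inventory of linked servers (DBA_Admin is hypothetical).
    IF OBJECT_ID('DBA_Admin.dbo.LinkedServerInventory') IS NOT NULL
        DROP TABLE DBA_Admin.dbo.LinkedServerInventory;

    SELECT  s.name, s.product, s.provider, s.data_source, s.catalog,
            GETDATE() AS captured_at
    INTO    DBA_Admin.dbo.LinkedServerInventory
    FROM    sys.servers AS s
    WHERE   s.is_linked = 1;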

  • Steve Jones - SSC Editor - Thursday, June 22, 2017 11:36 AM

    Rod at work - Thursday, June 22, 2017 9:33 AM

    In my previous job, because there were only two of us, we wore all the hats. We had a DR plan; we wrote it up and tested it. But the job I'm in now is a large IT shop where every position is specialized. I'm a developer here, so I've no idea what the DR plans are or whether they're practiced. I would assume that they exist and have been practiced at least once. I kind of hate asking about it, though, as I feel like I'm walking on eggshells and also because I know I'll get the "this isn't any of your business" speech. Which, in fairness, it isn't.

    I'd make sure I can recreate my stuff. Can I reproduce my dev work? Is my VCS being backed up (I'd ask)? And more. Asking might trigger others to make sure they can recover.

    Now that's something we didn't do at my old job - put the database into source control. As it turns out, it didn't bite us, but it could have. We dodged that bullet, at least until I was laid off; I can't say what happened after that.

    As far as where I work currently, I don't know whether the databases are in source control or not.

    Kindest Regards, Rod
    Connect with me on LinkedIn.
