Earlier this week, Jeff Atwood suffered the sort of misfortune we all dread. The server on which his popular CodingHorror blog was hosted suffered a hard drive failure and, as a result, the virtual machine hosting the blog was corrupted. No backups were available, so the result was 100% data loss. Jeff initially railed against his provider, lambasting their lack of RAID arrays and the fact that their standard backup procedures did not extend to virtual machines. Ultimately, Jeff realised that it was, after all, his data, and so he felt obliged to accept full responsibility for failing to have his own backups. And so began the slow and painful process of recovering the blog posts and their images.
Actually, I'm with Jeff up to the point at which he stopped railing at his service provider. Obviously, one cannot comment on the specific arrangement that Jeff had with his ISP; there may be all sorts of issues and agreements that we don't know about. On the more general question of whether one is justified in expecting one's service provider to back up a live virtual machine, however, there is more to go on.
To run a production virtual server environment, you need to perform backups of the images at least once a week. You also have to separately back up the contents of the virtual machines, including RDMs, configuration files and virtual disks. This is in the manual. It is part of the service. Jeff is right in believing that he should have kept his own copy of his intellectual property: the content itself. However, the maintenance of the service itself can't easily be delegated to the publisher of the content. The backup of a virtual server image is a task that is just too near to the bowels of the server environment.
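To make the division of labour concrete, the kind of file-level backup described above can be sketched in a few lines. This is a minimal, hypothetical illustration, assuming the virtual machine is powered off or snapshotted and that all its files (configuration and virtual disks) sit in one directory; the paths and naming scheme are invented for the example, and it is in no way a substitute for a vendor's actual backup tooling.

```python
# Sketch: archive a (quiesced) VM's files -- configuration and virtual
# disks -- into a timestamped, compressed tar archive. Paths and naming
# are hypothetical; a real environment would snapshot the VM first.
import tarfile
import time
from pathlib import Path

def backup_vm(vm_dir: str, backup_dir: str) -> Path:
    """Create a compressed archive of everything under vm_dir."""
    src = Path(vm_dir)
    dest = Path(backup_dir)
    dest.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    archive = dest / f"{src.name}-{stamp}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        # arcname keeps paths in the archive relative to the VM's folder
        tar.add(src, arcname=src.name)
    return archive
```

The point is not the ten lines of code, but who runs them and when: a weekly job of this kind belongs to whoever operates the server environment, not to the author of the blog posts inside it.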
Clarity about who is supposed to do what is important. When an organisation lacks clear divisions of responsibility, or a shared understanding of where those responsibilities lie, even the most sophisticated disaster recovery plan can fail. When one scales up from Jeff's misfortune to a large company, this chain of responsibilities can seem arcane and tiresome, but it is the result of decades of bitter experience, learning from incidents where things have gone wrong. This is why DBAs usually have the ultimate responsibility for the integrity of data, and for a speedy full recovery from all foreseeable disasters. It is good that there is a general acceptance of this within the industry, but it is a role that requires great vigilance, and other distractions conspire to make it difficult. There is a daily bombardment of requests and complaints from users, developers and managers, regarding connection problems, slow applications, deployment requests, data migration requests, and so on, as well as the constant battles that have to be waged against inefficient queries, blocking SPIDs, expanding files and vanishing disk space.
Managing it all often involves a constantly active pager, many a broken night's sleep, 12-hour days at the weekend, and enduring the worst that the company vending machine has to offer. And yet, most DBAs I know accept these responsibilities gladly, and execute their tasks with pride and unsung efficiency.
As we approach the festive season, I hope most of the DBAs reading this will get to take at least some "downtime" to spend with family and friends, perhaps running a few extra "test restores" before you leave! I also know that there are many others who will be working, keeping things ticking over in lonely data centres, or tending to applications and databases that never rest, even when most people do.
To all of you, I raise a glass in admiration. Enjoy what holiday you have, and see you all in 2010!