SQLServerCentral Editorial

Problems With Database Problems


GitLab had a database problem recently. I'm sure you read about it. There have been commentaries from many people, including Brent Ozar and Mike Walsh. There are many ways to look at this outage and data loss (the extent of which is not known), but I'd like to stop and focus on a couple of items that I think stand out: competence and care. I don't know how we prevent problems, but I certainly think these items are worth pondering.

First, there is the question of competence. I have no idea what skills or experience the GitLab staff who responded to the event have. They certainly seem to understand something about replication or backup, but are they skilled enough to understand the mechanics of PostgreSQL (or their scripting) deeply enough to determine where things were broken? I have no idea, and without more information I don't question their competence. The thing to be aware of, whether for this incident or your own, is whether the people working the problem are well enough trained to deal with the issues. Perhaps most important, do they realize when they have reached the limit of their expertise? Do they know when to call in someone else or contact a support resource, and are they willing to do so?

I saw a note from Brent Ozar that the GitLab job description for a database specialist doesn't mention backups. It does call for a solid understanding of the parts of the database, which should include backups. I'd hope that anyone hiring a database specialist would ask how a candidate deals with backups, especially in a distributed environment. It's great to give database staff a chance to work on the application, tune code, and build interesting solutions to help the company, but their core responsibility and focus needs to be keeping the database stable, which includes DR situations.

The second item I worry about is the care someone takes when performing a task. In this case, any of us might have been tired at 9pm, especially after spending the day working on a replication setup, which can be frustrating. Responding to a page, especially for a security incident, can be stressful. Solving an issue like that, and then having performance problems crop up, is disturbing. Anyone might question their actions, wondering if they had made a mistake and caused the issue. I know that when multiple problems appear in a short time, many of us would struggle to decide whether two issues are coincidental or correlated. I'm glad that, after the mistakes, the individual responsible handed off control to others. As with any job, once you've made a serious mistake, you may not perform at the level you normally do, and it's good to step back. Kudos, once again.

The ultimate mistake, and one that many of us have made, is to run a command on the wrong server. Whether you use a GUI or the command line, it's easy to mistake db1 for db2. I've tried color coding connections, separate accounts for production, even getting in the habit of looking at the connection string before running a command, but in the heat of the moment, nothing really works. People will make mistakes, which is why it becomes dangerous to allow any one person to respond alone in a production crisis. As a manager, I've wanted employees to take care and use a partner to double-check code before they actually execute anything.
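If mistakes can't be prevented entirely, the destructive path can at least be made harder to trigger. Below is a minimal sketch of that idea in Python: a wrapper that refuses to run a statement against a host on a production list until the operator retypes the hostname. The host names and the execute callback are placeholder assumptions for illustration, not anything GitLab actually used.

```python
# Minimal "wrong server" guard (sketch). Before running a statement against a
# host that appears in a production list, the operator must retype the hostname.
# PRODUCTION_HOSTS and the execute callback are hypothetical placeholders.

PRODUCTION_HOSTS = {"db1.example.com", "db2.example.com"}  # assumed names


def guarded_execute(host: str, statement: str, execute) -> None:
    """Run execute(statement) only after confirming the target host."""
    print(f"Target host: {host}")
    print(f"Statement:   {statement}")
    if host in PRODUCTION_HOSTS:
        typed = input(f"This is a PRODUCTION host. Retype '{host}' to continue: ")
        if typed.strip() != host:
            print("Hostname mismatch -- aborting.")
            return
    execute(statement)


if __name__ == "__main__":
    # Stand-in execute function; a real client call would go here.
    guarded_execute(
        "db2.example.com",
        "DROP TABLE staging_orders;",
        execute=lambda sql: print(f"(would run) {sql}"),
    )
```

The same pattern can wrap a psql or sqlcmd invocation in a shell script; the point is that the confirmation is tied to the target server, not to the operator's memory under pressure.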

And above all, log your actions. I have to say I'm very impressed with GitLab's handling of the incident and their live disclosure. This is what I like to see in a war room: lots of notes, open disclosure, and a timeline that allows us to re-examine the incident later and learn from the response. This is an area that too few companies want to spend resources on, but learning from good and bad choices helps distribute knowledge and prepare more people for the future. I'd like to see more post-incident reviews disclosed by companies, especially cloud vendors. I can understand not releasing too much information while a crisis is underway, as I'd worry some security-related information might get out, but afterwards, I think customers deserve to know just how well their vendor deals with issues.
