Many of you likely use Atlassian products such as Jira, Confluence, or Opsgenie. You might have been affected by the large outage they suffered recently (post incident blog, Company Q&A, TechRepublic report), which lasted at least 9 days. I don't know whether all customers have their data back and are working again, but according to a number of customer reports, this was a surprisingly poorly handled incident. There's a great write-up from the outside that you might want to read.
The bottom line is that Atlassian set out to deactivate a legacy product with a script, but they apparently didn't communicate well among their teams. The script ended up using the wrong customer IDs, and it also marked the sites for permanent removal rather than temporary removal (a soft delete). While they supposedly test their restore capabilities, they weren't prepared for partial restores of subsites. I'm guessing this amounts to a partial database restore, which many of us know is far more complex than a full database restore.
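To make the soft-versus-hard delete distinction concrete, here is a minimal sketch using SQLite with hypothetical table and column names (`sites`, `deactivated_at`): a soft delete flags the row and leaves the data recoverable, while a hard delete removes it, so only a restore can bring it back.

```python
import sqlite3

# Hypothetical schema: a sites table with a deactivated_at flag column.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE sites (
        site_id        INTEGER PRIMARY KEY,
        customer       TEXT NOT NULL,
        deactivated_at TEXT  -- NULL means the site is active
    )
""")
conn.executemany(
    "INSERT INTO sites (site_id, customer) VALUES (?, ?)",
    [(1, "acme"), (2, "globex"), (3, "initech")],
)

# Soft delete: flag the row; the data stays and the change is reversible.
conn.execute(
    "UPDATE sites SET deactivated_at = datetime('now') WHERE site_id = ?", (2,)
)

# Hard delete: the row is gone; only a restore can bring it back.
conn.execute("DELETE FROM sites WHERE site_id = ?", (3,))

remaining = conn.execute("SELECT site_id, deactivated_at FROM sites").fetchall()
print(remaining)  # site 2 is flagged but recoverable; site 3 is gone
```

A script that hard deletes when it was meant to soft delete turns an undo into a restore operation, which is exactly the position Atlassian found themselves in.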
Leave aside the issue of a software-as-a-service (SaaS) company failing its customers, and the lack of communication with those customers. The more interesting thing for me is the internal challenge of poor coding and communication. Clearly, the project to deactivate the legacy app wasn't well planned or tested, and the code was probably executed at too wide a scale initially.
When we deploy code changes to a large number of items, we want to test them on a small number first. Whether we are deploying to multiple databases, against many customers, or across different systems, a standard method of making changes at scale is working in rings. Azure DevOps describes this in its docs, and the team actually uses rings to change the platform. We used the same pattern 20 years ago for software and database updates to many systems. We would deploy internally to a few users to look for issues. A week later we would deploy to a small number of systems to check for unexpected problems. Then we would typically deploy to most systems in the third ring, with a fourth ring a week later to catch stragglers that needed more time to prepare.
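The ring pattern above can be sketched in a few lines. This is a minimal illustration, not a real deployment tool: `deploy()` and `healthy()` are placeholder functions you would replace with your actual rollout and health checks, and the ring membership is invented.

```python
import time

# Hypothetical ring membership: widen the audience one ring at a time.
RINGS = [
    ["internal-01"],                    # ring 0: internal users
    ["cust-01", "cust-02"],             # ring 1: a few friendly systems
    ["cust-03", "cust-04", "cust-05"],  # ring 2: most systems
    ["cust-legacy-01"],                 # ring 3: stragglers
]

def deploy(system: str) -> None:
    print(f"deploying to {system}")     # placeholder for the real deployment

def healthy(system: str) -> bool:
    return True                         # placeholder health check

def deploy_in_rings(rings, soak_seconds=0):
    deployed = []
    for ring_number, ring in enumerate(rings):
        for system in ring:
            deploy(system)
            deployed.append(system)
        # Stop at the first unhealthy ring rather than pushing a bad
        # change out to every remaining system.
        failures = [s for s in ring if not healthy(s)]
        if failures:
            raise RuntimeError(f"ring {ring_number} failed on {failures}")
        time.sleep(soak_seconds)        # soak time before widening the rollout
    return deployed

order = deploy_in_rings(RINGS)
```

The key design point is the check between rings: a bad change is caught while its blast radius is still one ring wide, not the whole customer base.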
I find many customers, especially those with sharded/federated databases or many systems, unwilling to spread out deployments in this manner. Often they yield to pressure from business users who want everyone to get the same update at the same time. I would never recommend that approach, as we need to watch our scripts run in a controlled environment, or even two, before we deploy them widely. I'd be even more cautious with one-off administrative scripts that make a change similar to the one Atlassian attempted. Those are often not tested seriously enough.
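One-off administrative scripts deserve guard rails even when they will only run once. Here is a minimal sketch of two cheap ones, with hypothetical names throughout (`deactivate_site`, the verified ID list): a dry-run-by-default mode, and a refusal to touch any ID that wasn't explicitly verified, since the Atlassian incident began with the wrong customer IDs being fed to a script.

```python
# Hypothetical allow list: IDs double-checked against the work order.
VERIFIED_SITE_IDS = {101, 102}

def deactivate_site(site_id: int) -> None:
    print(f"deactivated {site_id}")  # placeholder for the real action

def run(site_ids, execute=False):
    # Refuse any ID not on the verified list before doing anything at all.
    unknown = [s for s in site_ids if s not in VERIFIED_SITE_IDS]
    if unknown:
        raise SystemExit(f"refusing unverified site IDs: {unknown}")
    actions = []
    for site_id in site_ids:
        if execute:
            deactivate_site(site_id)
            actions.append(("deactivated", site_id))
        else:
            # Dry run: report what would happen, change nothing.
            actions.append(("dry-run", site_id))
    return actions

# Dry run by default; a separate, explicit flag is required to do real work.
plan = run([101, 102])
print(plan)
```

Neither check is sophisticated, but either one would have stopped a wrong-IDs script before it destroyed anything.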
At the very least, any of us working with multiple customers in a single database, or in multiple databases, ought to ensure we can back up and restore a single customer, and more importantly, a group of customers. If you make a mistake like Atlassian's, which scripting allows us to do extremely rapidly, can you recover a partial set of data? Many of us don't test this, but it's something we ought to consider whenever we work with scripts designed to change only some of the data. Most of us don't experience complete failures, but partial ones, usually because of human error. We ought to know how to deal with these situations.
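A partial-restore drill can be rehearsed on a small scale. This sketch uses SQLite with invented table and customer names; the idea transfers to any platform: prove you can copy one affected customer's rows from a backup back into the live database without touching anyone else's data.

```python
import sqlite3

def make_db(rows):
    """Build a small multi-tenant table for the drill."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    return conn

# Backup taken before the mistake; live copy where acme's rows were deleted.
backup = make_db([(1, "acme", 10.0), (2, "globex", 20.0), (3, "acme", 30.0)])
live   = make_db([(2, "globex", 20.0)])

def restore_customers(live, backup, customers):
    """Copy only the named customers' rows from backup into live."""
    placeholders = ",".join("?" for _ in customers)
    rows = backup.execute(
        f"SELECT id, customer, amount FROM orders "
        f"WHERE customer IN ({placeholders})",
        customers,
    ).fetchall()
    live.executemany("INSERT OR REPLACE INTO orders VALUES (?, ?, ?)", rows)
    return len(rows)

restored = restore_customers(live, backup, ["acme"])
print(restored)  # rows brought back for the affected customer only
```

In a real system the hard parts are the ones this sketch skips, foreign keys, identity values, and rows that changed after the backup was taken, which is exactly why the drill is worth running before you need it.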