I heard a joke years ago that went something like this: when a developer gets a pull request for code review that's 100 lines long, they open the file(s), look at the code, and ensure standards are being followed. They might run some code, walk through the algorithm in their mind, and spend 10, 20, or more minutes examining how the change is built. Give them a code review of 1,000 lines, and they'll just assume it's great and approve it in seconds.
I know that many developers don't find that funny. I often meet people who consider themselves professionals and spend the time needed to examine changes and ensure they work. I'm sure many people do this, and some are very meticulous in evaluating a change. However, everyone gets busy, and depending on who submitted the PR and how busy the reviewers are, the joke often reflects reality. The longer the PR, the less likely someone will either a) spend the time to carefully review it, or b) catch small mistakes.
There have been a few high-profile outages in 2023, one of which was an Azure DevOps outage in Brazil. I'm not picking on Microsoft, as AWS, GitHub, GCP, and others have had issues. I know GitHub is part of Microsoft, but it's also a separate enterprise that largely runs on its own. The point is that there will be issues, and some of them are related to the rapid changes of a DevOps or GitOps workflow, where PRs aren't always reviewed carefully and cleanly.
In this case, there was a typo in how a process worked. A cleanup process was supposed to delete databases in Azure, but the typo had it deleting the logical servers. Those servers held many databases, not just old ones that needed cleanup. The PR, however, contained a lot of changes, as there was an upgrade to swap out older Azure Service Manager packages for Resource Manager packages. I don't know whether the cleanup job change was part of that large PR, but in any case, the PR was reviewed as in the joke above. It was approved, and things started failing.
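The difference between "delete a database" and "delete the server" can come down to a single identifier in code, which is exactly the kind of thing a rushed reviewer skims past in a big PR. Here is a minimal hypothetical sketch of that failure mode; the client class and method names are invented for illustration and are not real Azure SDK calls:

```python
class FakeCloudClient:
    """Stand-in for a cloud SDK client; names are illustrative only."""

    def __init__(self):
        # One logical server holding stale and production databases.
        self.servers = {"sql-br-01": ["old_db", "prod_db_1", "prod_db_2"]}

    def list_stale_databases(self, server):
        # Pretend anything prefixed 'old' is flagged for cleanup.
        return [db for db in self.servers[server] if db.startswith("old")]

    def delete_database(self, server, db):
        self.servers[server].remove(db)

    def delete_server(self, server):
        # Deleting the logical server takes every database with it.
        del self.servers[server]


def cleanup(client):
    for server in list(client.servers):
        for db in client.list_stale_databases(server):
            client.delete_database(server, db)  # intended call
            # client.delete_server(server)      # the typo-style call: one
            #                                   # identifier off, and the
            #                                   # whole server is gone


client = FakeCloudClient()
cleanup(client)
print(client.servers)  # only 'old_db' is removed; prod databases survive
```

Swap the commented line in for the intended one and the production databases vanish along with the stale one, even though the diff is a handful of characters. That is the kind of change a test hitting a representative server would catch, and a 1,000-line review might not.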
This wasn't caught in testing because test coverage wasn't great. You can say Microsoft should have more tests, and they should, but there will never be enough test coverage. There also weren't any systems in their ring 0 (first) deployment that triggered the typo, so no one realized there was an issue. Ring 0 systems might not be representative of larger rings; that's another reality we aren't likely to fix in every situation.
Microsoft recovered the data, but it took a long time. I don't know how easy or feasible it is to create smaller PRs for something like this, when you are upgrading packages in many files. I just know that when I make large-scale changes in code, with large PRs, I often find a problem somewhere, especially if the changes aren't all related. If too many different things get included, the potential for mistakes and problems goes up.
I think this is actually a good DevOps story. They ran their process and there was a mistake. They fixed it and have started adjusting their process to add more testing in this area and help ensure this doesn't happen again. The logging helped them diagnose the issue quickly once it was reported. Their ability to deploy on demand meant that once the problem was understood, a fix could be quickly deployed. That's what DevOps is: it's not perfect, but it does allow us to understand, learn, and adapt quickly.
Now we just need to ensure that humans use the process in a way that other humans can more easily understand, with smaller PRs.