SQLServerCentral Editorial

Serious Software Glitches


Recently Robert Sterbal pointed out a podcast to me. The link is to Apple Podcasts, but the show, the Journal, is on other platforms too (I listened on Spotify). It's the story of a computer glitch in UK post office software, which resulted in quite a few local postmasters being criminally prosecuted, many being convicted, and even a few committing suicide. It's a sad story, and it's complex, but there are some technology-related elements.

First, the overall story is that Fujitsu sold the UK a point-of-sale system for post offices. A computer glitch in that system incorrectly calculated lots of totals and showed postmasters owing more money than they should. They were upset, called support, and got nowhere, and many were held liable for money they didn't owe. The UK postal management hid information about the widespread nature of the problem while prosecuting many local postmasters. Fujitsu support didn't disclose to callers that others were experiencing the same issue. This also coincided with an unrelated change in the law, which said that computer systems were presumed correct and anyone accused of a crime had to prove the computer was wrong.

Without a doubt, there are human failings here with support people, management, a vendor, and likely others. I don't want to minimize those, and I do think quite a few people involved, especially management, should face charges. However, since this is a database-related site, I wanted to focus on the code quality here. I don't know the exact nature of the calculation issue, but there is clearly a bug somewhere in the system. Do we, as technologists, think we're better developers or database people than those at Fujitsu? Would we not produce calculation bugs that might be hidden in aggregations? I have to say that I see this stuff all the time, and not just in development. I run into these bugs in production, and I think this is often because we don't embrace enough testing. I see this in all sorts of systems, with developers at many different levels of experience.

While application developers have gotten very good at unit testing, that same habit hasn't been as widely adopted among database developers. What's more, I find that people writing aggregation queries for reports often use lots of live data, and they don't write tests or even perform calculations by hand to ensure complex formulas are correct. If you've ever done complex aggregations in SQL or DAX, you might find there can be strange effects from filters, from NULLs, and even from the way a window or range of rows is processed. It's easy to say that a report on 1,000 rows of data out of 100,000 is roughly correct with some total, when you haven't actually verified that calculation manually.
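As a small illustration of the NULL effect, here is a minimal sketch using Python's built-in sqlite3 module. The `sales` table, its columns, and the four rows are made up for this example; the point is only that an average can mean two different things once a NULL is present, and that a handful of rows you can total by hand makes the difference visible.

```python
import sqlite3

# Hypothetical table and data, small enough to verify by hand.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (branch TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales (branch, amount) VALUES (?, ?)",
    [("A", 100.0), ("A", None), ("A", 200.0), ("B", 50.0)],
)

row = conn.execute(
    """
    SELECT COUNT(*),               -- every row for the branch
           COUNT(amount),          -- only rows with a non-NULL amount
           SUM(amount),            -- NULL amounts are skipped
           AVG(amount),            -- divides by the non-NULL count
           SUM(amount) / COUNT(*)  -- divides by the full row count
    FROM sales
    WHERE branch = 'A'
    """
).fetchone()

all_rows, non_null_rows, total, avg_skipping_nulls, avg_over_all_rows = row
print(all_rows, non_null_rows, total, avg_skipping_nulls, avg_over_all_rows)
```

Branch A has three rows, one of them NULL, so the hand calculation is 300 / 2 for `AVG(amount)` and 300 / 3 for the version that divides by `COUNT(*)`. Neither number is wrong on its own; the bug is when the report's reader assumes one and the query computes the other.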

I certainly think Fujitsu deserves a lot of blame in this case. Ultimately, they are the source of the issues. Those who covered up the problems, both at the UK government organization and at Fujitsu, should be prosecuted and held liable, but the programmers and testers are also at fault. They didn't do a good job testing their software, and worse, didn't do the job of tracking down the bugs, finding issues, and correcting them. I hope those issues are fixed now, but they weren't addressed promptly, as this situation took place across years.

I often work with companies trying to build software better, but I find it hard to get them to test database software. I know the testing frameworks are immature, the tooling is poor, and honestly, too few of us have a good test data management process in place. However, we can start to learn to add unit tests to our code. At the very least, we ought to write some repeatable, automated test when a bug is reported. Clearly, in that situation, we (as a team) didn't write good code if a bug was found, either because of our tech skills or because we didn't get the specification correct. In either case, we need to improve, and automated tests that ensure we don't make the same mistake again are a way to start getting better.
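A repeatable test for a reported bug doesn't need much tooling. The sketch below again uses Python's sqlite3 module and plain assertions; the `transactions` table, the `branch_total` function, and the data are all hypothetical stand-ins for whatever query the bug report was about, not anything taken from a real system.

```python
import sqlite3


def branch_total(conn, branch):
    """Hypothetical query under test: total of a branch's transactions."""
    row = conn.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM transactions WHERE branch = ?",
        (branch,),
    ).fetchone()
    return row[0]


def make_test_db():
    """Build a tiny, known data set instead of relying on live data."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE transactions ("
        "id INTEGER PRIMARY KEY, branch TEXT, amount INTEGER)"
    )
    conn.executemany(
        "INSERT INTO transactions (branch, amount) VALUES (?, ?)",
        [("north", 1000), ("north", 250), ("north", -300), ("south", 75)],
    )
    return conn


def test_branch_total_matches_hand_calculation():
    conn = make_test_db()
    # Worked out by hand: 1000 + 250 - 300 = 950
    assert branch_total(conn, "north") == 950


def test_branch_with_no_rows_totals_zero():
    conn = make_test_db()
    # An empty SUM is NULL in SQL; the query is expected to report 0.
    assert branch_total(conn, "west") == 0


if __name__ == "__main__":
    test_branch_total_matches_hand_calculation()
    test_branch_with_no_rows_totals_zero()
    print("all tests passed")
```

The functions are named so that a runner such as pytest would pick them up, but they also run as a plain script. Once a test like this exists for a reported bug, it stays in the suite and fails loudly if the same mistake creeps back in.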

Much of the software I've worked on doesn't directly affect human lives. That's probably true for most of you unless you write software that controls some sort of vehicle movement or a medical device that dispenses care or drugs. My son works on rocket flight software, and he takes that seriously since people will be riding those rockets, but for most of us, the work we do isn't critical to anyone living or dying.

However, this story shows that we might still affect human lives. We ought to take that responsibility seriously and ensure we are doing the best job we can to produce quality software. Having some testing (and good test data) is a way to double-check ourselves and our team. It's worked well to raise the quality level of mobile software dramatically. We database people ought to learn from that and adopt better testing.
