Today we have a guest editorial as Steve is out of town. This editorial was originally published on 9 Aug 2017.
It's the nature of being in IT that things go wrong, often for less-than-obvious reasons. We start with the ritual question of "what changed?" and get the ritual answer of "nothing", and then we move on to figuring out the source of the problem (knowing that something did change!). The easy ones are easy; the hard ones test the culture, because everyone is trying to quickly say "it's not my part of the stack".
You know how it goes. Application performance degrades and the devs yell DATABASE! It's unlikely (hmm) to be us, so we yell STORAGE and NETWORKING, because, well, those guys are always causing problems. Soon it's apparent that none of us caused the problem - it must be the code!
Not long ago I was working on a project to enhance security for a large company, and part of that involved moving some servers to new subnets and tightening the firewall rules. We got all the teams (silos) together, planned the change carefully, and did the implementation. Things went smoothly and all the apps were working fine. Then, overnight, a database job took six times its normal run time. The conversation about the slowness went something like this:
- Database: Nothing changed on our side, job has been steady for months. It's not us.
- Firewall: The app works, it's not us.
- Network: The app works, it's not us.
My suggestion was that since the db had not changed, it seemed reasonable to think about how the changes to the firewall and network could be the cause. There was no give at all from those teams; they were certain it was not them. So we rolled it all back, putting the db back in the old subnet. The next night, performance was back to normal. Faced with proof that the db really was working, the other teams went back and looked again. This time, looking harder, they found that packet inspection was enabled on the new subnet, but not the old one, and it was maxing out the CPU on the switch when the job ran. They turned that off, we moved everything back, and all was good.
More recently, a server VM I use for remote admin began running really slowly. Slow as in 15 minutes to boot. The Windows guys blamed me, saying it had to be the tools I installed (seemed unlikely to me). The storage team said everything was normal. Same for the network. Weeks went by (weeks!) and the problem seemed to come and go. Turning off AV seemed to help, which to me pointed to some kind of network/storage issue. Finally it happened on a different server, and then everyone took a harder look. It turned out there was a bad cable on the switch. Because the port was used in some kind of round-robin fashion, we only saw the problem at random times, and it was worse when we did a lot of IO to network storage over that bad cable (like the AV scan and booting).
Sometimes it's them, sometimes it's us. I think that because our work involves skewed data, plans falling out of cache, and fragmented indexes - the kinds of changes that don't cause a change management ticket to be created - we look a little harder.
If everyone says "not me", the only reasonable approach is for everyone to assume "it is me" and look again, and keep looking until the cause is identified. It's interesting to think about why that's so often not the case, isn't it?