From the article:
The same thing should be done for our relational systems. The Ops part of DevOps needs to be using monitoring and instrumentation to measure performance, adding capacity as appropriate, which should be before users realize there's an issue. With today's virtual systems, adding CPU and RAM usually is fairly easy, and it's easy in the cloud as well.
Of course, all this monitoring isn't just to add capacity. Having a better sense of what's going on can help you pinpoint poor code. Getting someone to fix that code becomes a lot easier if you can show that better code would cost less for our systems. It can be amazing how much more developers care about their code when the CFO gets involved.
I realize that this article is mostly about hardware and that it wasn't intentional but I think it a bit funny that you have the order of those two paragraphs as they are and mention code almost as if it should be a second thought rather than the first.
The company that I work at went through a pretty good hardware upgrade in May of 2010. It didn't help their performance problems. When I started working there in Nov 2011, they had multiple daily "outages" on the floor while batch jobs ran and, when they weren't running, the screen returns were slow (frequently >5 seconds and many up to 22 seconds). We fixed a lot of code in the year after that and things were much improved but a lot of screens still to up to 22 seconds to return. The batch runs ran faster because they rewrote the RBAR to be more column based and many of the runs dropped from 8 hours in duration to just 4 or so. Notice I didn't say "set based"; they still used Scalar and Multi-Statement functions a lot.
Over the years, the business grew a huge amount and so did the database (started with 65GB... now have multiple DBs that exceed a Terabyte). The perception was that we needed more memory and the cost of memory for the now 7 year old machines exceeded what it would cost to stand up more modern machines and so we went through another hardware upgrade (replacement, actually) in 2016. The hardware is great. Monster SAN, really good/modern on premise box with 32 CPUs instead of just 16, 256GB of ram instead of just 128, and a monster 3TB SSD disk caching system. Basically, we doubled the size and resources available. Everyone expected great things... except me, because I've been through this more than a couple of times.
After the upgrade, no real improvement was realized by the front end. That's because the code doesn't read from disk... it reads from memory and a lot of people forget that and the fact that memory speed didn't double. Worse yet, some of the screens still took almost 22 seconds to return. Some of the heavy lifting night batch code did run twice as fast... yeah... that was great... instead of something taking 4 hours to run, it now "only" took 2. CPU was still sitting at 22% average during most of the work day (we're not a 24 hour shop).
Over the summer, we ran into a tipping point. The summary story is that 32 CPUs started slamming against the wall at 85% usage with major blocking. They added another 128GB of RAM and and addition 16 CPUs (48 core, total). No help at all. In fact, it got worse. I'd never worked on a box that big before and it was interesting to watch 48 CPUs slammed into the wall, kind of like watching a burning building is interesting. That turned out to be MARS enabled connection strings. They fixed that but CPU was still at the 22% level during normal work hours even though we now had a machine with 50% more capability.
To make a much longer story shorter, they finally got around to fixing some of the code that I had identified years prior to the latest upgrade. (There were only two pieces of code that needed to be rewritten but they "didn't have the time" (even though it only took a couple of days to write and regression test) and they expected the hardware improvements to fix it) The 22 second screens now return in the proverbial blink of an eye and CPU has decreased from 22% to an average of 6 to 8%. They also started to fix the batch jobs. A segment of import/validation code that was taking 40 minutes to execute was reduced to just a couple of minutes (20 times faster) and other parts of the code was measured to be 80 times faster. The hardware upgrades had no such fantastic improvements.
With apologies for the long winded bit of history, this all makes me believe in the cloud a bit more but not for the reason most people would expect. The company thought hardware would be the answer and it took a long time to prove it wasn't. If we were in the cloud, we could have spun up more memory and more core virtually instantly to very quickly prove that it wasn't the hardware and that true performance is where it has always been...
...in the code.