Let's face it: when it comes to computers, there's a 100% certainty that something is eventually going to break. Maybe it's a hardware problem like a drive failure, maybe an OS patch will break something, or maybe a new release of software contains a memory leak. This kind of stuff is part of the reason production guys like me have a job; after all, if everything worked perfectly all the time there would be no need for a production DBA, right?
Throughout many years of working in production environments I've seen my fair share of problems. I've also seen many different ways that people chose to fix them – most good but some very bad. Perhaps the biggest mistake that I see repeated constantly is when people "fix" a problem with a band-aid or some kind of workaround. I submit the following examples:
Problem: A memory leak in a custom ISAPI DLL causes IIS to stop responding to requests after 8-10 days.
Solution: Dev can't reproduce the problem. Reboot the affected servers once a week; dev is working on a complete rewrite of the software that will be out in 3 months anyway, and the problem will go away.
Problem: An application log file is filling up all available space on the drive it lives on.
Solution: Move the log file to a drive with more space.
Problem: A SQL Agent job is failing intermittently with an out of memory error.
Solution: Set the job to retry 10 times with a 5 minute wait in between retries.
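The retry workaround in that last example can be sketched generically. Here's a minimal Python sketch (the function and job names are hypothetical, not SQL Agent's actual API) of a "retry N times with a wait in between" loop — note how an eventual success hides how many times the job really failed:

```python
import time

def run_with_retries(job, max_retries=10, wait_seconds=300):
    """Re-run a flaky job until it succeeds or retries run out.

    Mirrors the 'retry 10 times, wait 5 minutes' job setting: if any
    attempt succeeds, the scheduler reports success and the earlier
    failures go unnoticed.
    """
    failures = 0
    for attempt in range(1, max_retries + 1):
        try:
            return job(), failures
        except Exception:
            failures += 1
            if attempt == max_retries:
                raise  # out of retries; the failure finally surfaces
            time.sleep(wait_seconds)

# Hypothetical flaky job: fails twice, then succeeds.
calls = {"n": 0}
def flaky_job():
    calls["n"] += 1
    if calls["n"] < 3:
        raise MemoryError("out of memory")
    return "done"

result, hidden_failures = run_with_retries(flaky_job, wait_seconds=0)
```

The job "succeeds," but two out-of-memory failures were silently swallowed — which is exactly the masking problem described below.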
The issue I have with each of the proposed "solutions" to these problems is that they are not solutions at all – they're band-aids. In real life, when we get a cut or scratch and put a band-aid on it, all we're doing is covering the wound while our bodies work on the real fix underneath. Similarly, when something on a server is failing and we employ a workaround, all we're doing is masking the real problem.
In the first example, the problem was actually happening on 80 servers. Asking the production team to reboot each one once a week meant that one person spent an entire day of every work week doing reboots. Further, since dev couldn't reproduce the problem, they didn't understand it; how, then, could they guarantee that their rewrite would fix it? And what if the project timeline slipped from 3 months to 6 because of feature creep, bugs, or mismanagement?
In the second example, the application log was growing because of database corruption. Moving the log file to another drive would have just given it more room to grow until the corruption was fixed.
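To put numbers on why moving the log only buys time, here's a quick sketch. The growth rate and drive sizes are illustrative assumptions, not figures from the original incident:

```python
def days_until_full(free_gb, growth_gb_per_day):
    """How long until a runaway log fills the remaining free space."""
    return free_gb / growth_gb_per_day

# Illustrative numbers only: corruption drives 5 GB/day of log growth.
growth = 5.0
small_drive = days_until_full(50, growth)   # original drive: 10 days
big_drive = days_until_full(200, growth)    # bigger drive: 40 days
```

A drive four times the size postpones the outage by a month; it doesn't prevent it. Until the corruption is fixed, the fill date is just a function of free space.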
In the third example, the memory errors happened because no page file had been configured in the OS. Setting the job to retry may have allowed it to eventually run successfully, but the real problem would have just manifested itself in other (potentially worse) ways.
To be fair, I'm not against using a workaround…as long as you continue working on a real solution to the problem. What I am saying here is don't consider a workaround a solution. When things break it's usually for a reason, so don't ignore what your servers are trying to tell you by masking the problem and calling it fixed.