I've run across several incidents in the last couple of months regarding something that is typically called "tribal knowledge." This is when something isn't written down because everyone knows "it." The problem is that most of the time folks that know "it" don't realize that there are folks that don't know "it." Or worse, the folks that know "it" have forgotten "it" because they aren't using a checklist or written procedure.
If you've been in IT any length of time, you've probably been guilty of this yourself. I know that when I managed Active Directory, the handful of folks (two) that handled group policy knew the procedures, but we never wrote them down. When a new guy came in who needed to be brought up to speed, we realized how difficult it was to do so quickly because there had only ever been the two of us and we worked out the procedures over time. The procedures were solid, they just weren't formally written into a checklist. Thankfully, we attacked that problem immediately, and the procedure was documented.
A situation over the past two weeks involved an installation where a simple registry value addition was missed. Because of the nature of my organization's environment, this registry value is critical. It means the difference between a key process working and not working. However, the process only has an issue in the non-production area. Production isn't affected. As a result, the need for the registry value slipped folks' minds when we built out several new servers for a project. Other folks, who assumed everyone knew about the registry value, put it in on the servers they were directly working with. So we had a group of servers that worked, and another group of servers that didn't. And unless you knew about the registry value, you had no idea why.
We burned quite a few hours on this, even getting to the point where we were doing packet trace analysis of Kerberos communications and verifying the secure channel between the servers and Active Directory. We were seeing two distinct behaviors between the two sets of servers, one of which didn't make sense from an OS perspective. Basically, the servers that didn't work were asking an additional question, the wrong question at that, and that was what was leading to the failure. What we couldn't figure out was why they were asking the question in the first place. When Microsoft was engaged, they were similarly puzzled.
It turned out to be a third-party app that was causing the question to be asked. And the registry value stopped the app from asking said question. When the app didn't ask the question, it worked correctly. Strange, but true. However, only after several days of troubleshooting, multiple meetings, a half-dozen senior engineers getting involved, and Microsoft being engaged did one of the folks who knew about the registry value ask whether it had been put in on the servers that didn't work. Of course it hadn't. I didn't know about said registry value, so my first question was, "We knew about this?" Yes, yes, we did. It was an honest mistake, but one that could have been avoided.
If there had been a checklist for the installation of said application, we would never have run into the issue. The registry value would have been entered as part of the checklist procedure and all would be well. Even if it was missed, the first thing folks do is review the checklist. It would have been caught then. And that would have saved us numerous hours and real dollars because the Microsoft support call wasn't related to a security vulnerability.
The point is to never rely on tribal knowledge. If you have to do something more than once, or if there is a chance of having to do something more than once, document how it should be done. As that procedure is modified, modify the documentation accordingly. And always, always follow the documentation that's generated. Don't assume that because you've done it so many times, you know it cold. This is when mistakes are made. Preferably, build a checklist along with the documentation so that the installer can ensure every step is marked off as it's done. Also, if possible, have someone other than the installer verify the checklist. This minimizes mistakes even further.
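For folks who would rather keep the checklist in a script than on paper, here is a minimal sketch in Python of the idea above: each step gets marked off by the installer, and a second person has to verify it. Everything in it is an assumption for illustration. The class names, the step descriptions, and the names "alice" and "bob" are hypothetical, and it doesn't touch the registry or any real application. It only tracks who did what.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Step:
    description: str
    done_by: Optional[str] = None      # installer who marked the step off
    verified_by: Optional[str] = None  # second person who confirmed it


@dataclass
class Checklist:
    name: str
    steps: List[Step] = field(default_factory=list)

    def mark_done(self, index: int, installer: str) -> None:
        """Record that the installer completed the step."""
        self.steps[index].done_by = installer

    def verify(self, index: int, reviewer: str) -> None:
        """Record verification; the reviewer must not be the installer."""
        step = self.steps[index]
        if step.done_by is None:
            raise ValueError(f"step not done yet: {step.description}")
        if reviewer == step.done_by:
            raise ValueError("verifier must be someone other than the installer")
        step.verified_by = reviewer

    def outstanding(self) -> List[str]:
        """Return every step that is not both done and verified."""
        return [
            s.description
            for s in self.steps
            if s.done_by is None or s.verified_by is None
        ]


# Hypothetical checklist; the step descriptions are placeholders.
checklist = Checklist(
    "Install third-party app (hypothetical)",
    [
        Step("Run the application installer"),
        Step("Add the required registry value"),
    ],
)

checklist.mark_done(0, "alice")
checklist.verify(0, "bob")

# The missed step is the only one left on the list.
print(checklist.outstanding())  # ['Add the required registry value']
```

The design choice that matters is that a step isn't considered finished until two different people have signed it. A forgotten step stays on the outstanding list where anyone reviewing the build can see it, instead of living in someone's head.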