• I think it's also important that developers build baseline diagnostic record into the app. We generally need some way to prove the code during development anyway, but those checkpoints should be built with the intention of being left in place for gathering statistics during normal operation. Then when issues of scale or new usage patterns develop there is a baseline from which to compare the new behavior. It's difficult to assess user complaints of "slowness" if operations complete in under 2 seconds - but if normal operation before some critical event was 0.25 seconds then its clear that an 8x lag is worthy of investigation.

    I had a backup job recording start/stop/size with each nightly operation on 15 remote locations. When some of the backups were not complete by the start of the day, I checked the logs: throughput was down to 20% of normal. That was enough clue to look for what was consuming bandwidth overnight. 'Found two machines inside our network had become zombies/botted. I would not have even known to look for that if I hadn't been keeping logs/baseline to see the change.