• I think logging to a central repository in a meaningful way, or having a central process that scans the logs, is a good thing.

    I have had very bad experiences with Microsoft MOM/SCOM: it suffered outages and simply missed a bunch of stuff (e.g. not reporting on blocking for over 15 minutes).

    I have had very good experiences with Nagios and have heard good things about Splunk.

    Still, there are two issues that no one has brought up yet.

    The first is: what is monitoring the monitoring systems?

    The second is: how do we analytically tie together disparate errors in a meaningful way to indicate some causal connection? This is particularly a challenge in a large enterprise. How do we correlate things that are actually related, and how do we avoid spurious correlations?
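
    To make the correlation question concrete, here is one rough way it could be approached, offered as a sketch and not as how any particular product does it. It assumes the central repository can hand back (timestamp, error type) pairs; the function name, the 60-second window and the minimum-support threshold are all made up for illustration. Errors are bucketed into time windows, and each pair of error types is scored by "lift": how much more often they show up together than chance alone would predict. A lift near 1.0 is the spurious case, and it says nothing about which error caused which.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_lift(events, window_seconds=60, min_support=3):
    """Score pairs of error types by how often they share a time window.

    events: iterable of (timestamp_seconds, error_type) tuples
            (a hypothetical shape for what a central log store returns).
    Returns {(type_a, type_b): lift} for pairs seen together in at least
    min_support windows. Lift near 1.0 means the pair co-occurs about as
    often as chance predicts; well above 1.0 hints at a real link.
    """
    # Bucket events into fixed windows; each window holds a set of types.
    windows = {}
    for ts, error_type in events:
        windows.setdefault(int(ts // window_seconds), set()).add(error_type)

    # Simplification: only windows containing at least one event are counted.
    n = len(windows)
    single = Counter()
    pair = Counter()
    for types in windows.values():
        single.update(types)
        pair.update(combinations(sorted(types), 2))

    scores = {}
    for (a, b), together in pair.items():
        if together < min_support:
            continue  # too few observations to trust either way
        expected = single[a] * single[b] / n  # co-occurrences expected by chance
        scores[(a, b)] = together / expected
    return scores
```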