i've been slowly building a monitoring system over the last few years.
around 2 years ago, due to SOX, we had a requirement to save all security logs from domain controllers and some servers. i set up a system to dump them into a database and use SSRS to present the data to people.
the first year was mostly learning. this year i wrote some more reports and moved them to a new scale-out SSRS deployment we did, which also made it possible to email data to people. then i added to the system by exporting application logs as well.
every morning i get an email from SSRS with any application log errors from all our SQL servers in the last week. i don't check it every day, which is why the report goes a week back.
another report has security log events from SQL servers, and there is another one for failed jobs.
for security i also get a few emails about wrong passwords for admin accounts as well as any AD group changes. this past week i caught someone adding a person to one of our AD groups. we use that group for Windows Authentication on a server that is in SOX scope, and it gives rights to change revenue data on several servers and databases. the policy is that adding anyone to that group requires a ticket that has been approved.
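as an illustration only, a check like the AD group one could be sketched in Python as below. this is not the actual report. the event dictionary layout, the function name and the sample group and account names are all made up. the event IDs are the "member added to a security-enabled group" IDs on Windows Server 2008 and later; older Windows versions use different numbers.

```python
# "member added" events for security-enabled global, local and universal groups
# (Windows Server 2008 and later; older versions log different IDs)
MEMBER_ADDED_IDS = {4728, 4732, 4756}

def watched_group_additions(events, watched_groups):
    """Return the events where someone was added to a watched group.

    events: list of dicts with hypothetical keys
            'event_id', 'group', 'member', 'changed_by'.
    watched_groups: group names that need an approved ticket.
    """
    watched = {name.lower() for name in watched_groups}
    hits = []
    for event in events:
        if event["event_id"] not in MEMBER_ADDED_IDS:
            continue  # not a group membership addition
        if event["group"].lower() in watched:
            hits.append(event)
    return hits

# made-up sample events, as they might look after export to a central table
events = [
    {"event_id": 4728, "group": "SOX-Revenue-Writers", "member": "jdoe", "changed_by": "helpdesk1"},
    {"event_id": 4728, "group": "Printer-Users", "member": "asmith", "changed_by": "helpdesk1"},
    {"event_id": 4624, "group": "", "member": "", "changed_by": ""},
]

hits = watched_group_additions(events, ["sox-revenue-writers"])
for hit in hits:
    print(hit["changed_by"], "added", hit["member"], "to", hit["group"])
```

the group name comparison is case-insensitive here because AD group names are not case sensitive.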
for backups i have a daily job that exports the tables from msdb to a central database, and i query that. i get emails for any database that has never been backed up, any database with no full backup in 7 days, a general report of the latest full/diff backups for all servers and databases, and a few others i made up. i used to audit backups once every 6 months or so and always found databases not being backed up. sometimes it was a developer creating a database on a server they have access to and not telling anyone. other times it was a mistake made when changing a script. NetBackup isn't very good at reporting the backup status of databases, so i had to write my own process.
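the "never backed up" and "no full backup in 7 days" checks boil down to comparing the list of databases against the latest full backup date for each one. a minimal Python sketch of that logic follows, assuming the data has already been pulled from the central database. the function name, the input shapes and the sample servers are hypothetical, not the real process.

```python
from datetime import datetime, timedelta

def audit_backups(databases, full_backups, now, max_age_days=7):
    """Return (never_backed_up, stale) lists of (server, database) pairs.

    databases: (server, database) pairs that currently exist.
    full_backups: dict mapping (server, database) to the datetime
                  of the latest full backup, if there is one.
    """
    cutoff = now - timedelta(days=max_age_days)
    never, stale = [], []
    for key in sorted(databases):
        last_full = full_backups.get(key)
        if last_full is None:
            never.append(key)   # no full backup on record at all
        elif last_full < cutoff:
            stale.append(key)   # latest full backup is too old
    return never, stale

# made-up sample data
now = datetime(2010, 1, 15, 8, 0)
databases = [("SQL01", "Sales"), ("SQL01", "DevScratch"), ("SQL02", "Revenue")]
full_backups = {
    ("SQL01", "Sales"): datetime(2010, 1, 14, 23, 0),
    ("SQL02", "Revenue"): datetime(2010, 1, 2, 23, 0),
}

never, stale = audit_backups(databases, full_backups, now)
print("never backed up:", never)
print("no full backup in 7 days:", stale)
```

starting from the list of databases rather than the list of backups is what catches the database a developer created without telling anyone, since it has no backup rows at all.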
for performance i've been collecting perfmon counters for 9 months now and email an hourly report. we also bought a third party tool to monitor servers that does this as well, except it started emailing alerts and we had no data of our own since it was controlled by someone else. so i wrote a report to query the last few hours of perfmon data and send it out hourly. it used to send only values outside the accepted range, but i changed it because the third party tool was already sending out alerts. i'm going to code another report just for alert data.
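an "only what is out of the accepted range" filter over collected counter samples could look roughly like this in Python. it is a sketch under assumptions: the counter labels, the limits and the sample values are invented, and the real report is not shown in this post.

```python
def out_of_range(samples, limits):
    """Return the samples whose value falls outside the accepted range.

    samples: list of (server, counter, value) tuples.
    limits: dict mapping counter name to (low, high) accepted bounds.
    """
    flagged = []
    for server, counter, value in samples:
        bounds = limits.get(counter)
        if bounds is None:
            continue  # no accepted range defined for this counter
        low, high = bounds
        if value < low or value > high:
            flagged.append((server, counter, value))
    return flagged

# made-up accepted ranges and samples
limits = {
    "% Processor Time": (0.0, 85.0),
    "Available MBytes": (512.0, float("inf")),
}
samples = [
    ("SQL01", "% Processor Time", 42.0),
    ("SQL02", "% Processor Time", 97.5),
    ("SQL02", "Available MBytes", 300.0),
    ("SQL02", "Disk Queue Length", 9.0),  # no limit defined, so never flagged
]

alerts = out_of_range(samples, limits)
for server, counter, value in alerts:
    print(server, counter, value)
```

keeping the limits in a separate table means the same collected data can feed both a full hourly report and an alert-only report.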
i also have a report that sends out hourly the number of commands waiting to be replicated. i have plans to write another one for the number of commands at the distributor waiting to be replicated.
and the final report is an hourly report of all SSRS report modifications. our BI devs have access to create/modify reports and we've had a few tickets where people complained that some report didn't work. i set this up so we know if anyone has been modifying a report people are complaining about.
all this is done using logparser and normal SQL Server features, with a central SQL Server used to store the data. i wanted to use PowerShell but version 1 had some limitations, and i'm looking to see if i can use version 2. once in a while i get calls about buying some expensive monitoring software, and there is never any value compared to what you can do yourself.
some things, like backup monitoring, i coded from examples in the articles here and just modified them. other reports, like querying log data, i wrote myself, and i used http://www.ultimatewindowssecurity.com for explanations of what all the event IDs mean.