SQLServerCentral Article

Customizable Error Log Scanning

,

Background

The SQL Server error log has a wealth of information contained inside.

Sometimes it is too much, sometimes it is not enough. And then situations occur

when you do not know if there is enough in the log or not until well after you

have a resolution. In my opinion, things are even more complex with 2005, as

there seems to be a lot of messages that end with "This is an

informational message only; no user action is required." If that is the

case, then I may or may not want to be alerted about this, right? User action

may not be "required", but what if it is actually

"desired"?

Like many shops, we developed an in-house system for monitoring our database

servers. Built over the past two years, the system is called DBA_Perform. One of the very first items we incorporated

into DBA_Perform was the error log scan. Code for all

of the tables and stored procedures for the solution can be found in the

downloadable zip files. There is also a readme file

included to assist with your implementation. I have tried to keep things as

general as possible, but some additional modifications may be necessary based

upon your specific environment.

The Need To Know

I believe that many shops have processes in place to scan the logs for

review. Some shops are small enough (less than ten total servers) that no

process is really necessary because they can simply review the logs by hand as

needed. However, I would argue that using an automated process to scan the log

periodically is the smart thing to do, so that you can be alerted should

something occur. That is exactly what we set out to build over two years ago.

What we found was that setting up a process to scan the error log is very

easy to do. There were lots of examples available from community sites. We

sampled some and found that while they did work, there was always something

missing. What one solution would have, another solution would not. So, we took

the best of everything we could find, added in our own ideas, and ended up with

the solution being presented. What set ours apart from others at the time was

the ability to include (or exclude) certain keywords that would appear in the

error log.

This was very important for our group as the number of servers we managed

went from around ten to over seventy in less than fifteen months. The

combination of production, test, and development servers also meant that the

errors we would be concerned about varied. For example, our production servers

are set to log any failed logins, but test and development are not. Any solution

we would roll out to a new server had to work regardless of environment. But do

we want to be alerted by email each and every time someone fails to log in? No,

probably not. But what if that failed attempt was by someone trying to login as

"sa", or "admin", or

"root"? That could be something that we want to know about quickly,

as it could be an attack or a virus.

The Magic

First, get the code. This has been tested with SQL Server 2000 and SQL Server 2005.

We needed to build a process that would scan and dump the contents of the

error log, clean up the contents based upon keywords we wanted to include

and/or exclude, and then send us an email if necessary. The procedure we

currently use is named Usp_error log_Notification,

and is included the zip files. The procedure makes use of the xp_readerror log extended stored procedure in order to

gather all details from the log into a temporary table as an initial starting

point. The process does a check to see what version of MS SQL Server you are

running, because Microsoft changed the output of the xp_readerrorlog

extended stored procedure in SQL 2005.

While checking the version, we also verify that the reading of the error log

results in a valid date that will ultimately be inserted before our emails are

sent. This is essential because the very next step does a calculation of the

maximum date, so we need to make certain the date is valid in order for the max() function to return a proper result. If this is the

first time a scan has been run, then a default date is assigned. Otherwise, it

will retrieve the max date from an earlier run of the process from the error

tables and that date will be used to filter out rows from the temporary tables

later in the process.

Before that cleanup happens, a cursor is built to go through the entries in

the temporary table and filter out the rows that do not have a keyword that we

want to scan for. The keywords for this part are found in the ErrorlogScanKeyword table, and include words and phrases

such as 'stack', 'is full', and 'error'. If a row exists in the #Errors

temporary table that do not have any of the words

listed in this table, then they are filtered out.

After filtering the temporary table, leaving behind only rows that have

keywords we are scanning for, we then cleanup the table by using the calculated

date mentioned in the previous paragraph. As our #Errors table gets smaller, we

next need to force the inclusion of specific words. We take the current rows in

#Errors and insert them into a separate temporary table named #ErrorsHOLD. At this point we remove rows from #ErrorsHOLD that do not have words and phrases found in the ErrorlogForceInclude table. In our experience, we found

that is not sufficient to only exclude certain words, there also needs to be

the forced inclusion of keywords as well. By forcing the inclusion of certain

keywords (for example, the word 'admin' would be a forced include), we can tell

the system to be more flexible. This is the magic. We can tell the system to

not bother us every time the phrase "Login failed for user", but to

make certain we know whenever "Login failed for user 'sa'"

appears in the error log.

After setting those rows off to the side, we do one final scan on the #Errors

temporary table to exclude certain keywords outright (for example, certain

error numbers above 50000, and this is where we would exclude the generic

'Login failed for user'). From there, it is a quick reinsert into the #Errors

from #ErrorsHOLD and we have the desired result set. The

last step is to generate an email and notify our team that there is something

in the error log that may be worth further investigation.

Create the Job

The last thing we needed to do was to put together a job that would run

every ten minutes. We created a job that simply calls the Usp_error

log_Notification procedure (also included in the zip

file). Feel free to modify the job as you see fit, changing the frequency of

the schedule or whatever you desire.

Conclusions

As your environment gets more and more complex, you may find that the number

of alerts increase substantially. Some of the processes that you build while

you have a small shop may not be as effective as more and more servers come

under your control. If you start with a customizable toolbox you can save

yourself some headaches later. With the above solution, you can decide for

yourself what you want to be notified about during an error log scan, and not

have to worry about filtering through the white noise that we all seem to be

swimming in these days.

Rate

5 (3)

You rated this post out of 5. Change rating

Share

Share

Rate

5 (3)

You rated this post out of 5. Change rating