SQLServerCentral is supported by Redgate
 
 
 
 

Error Logging Gone Wild

By Andy Warren

It’s a common pattern to log application errors to a SQL Server table. The insert volume can range from a handful per day to hundreds per hour depending on the number of concurrent users and what goes wrong. It’s simple, fast, effective and adds little additional load on the server, which makes it cost effective. But what happens when something really goes wrong?

Let’s set the stage. Here’s a simple ErrorLog table:

create table ErrorLog (
  ErrorLogID int identity (1,1) not null primary key clustered,
  ErrorDateTime datetime2 default sysdatetime(),
  ErrorID int,
  Message nvarchar(1000),
  ApplicationName varchar(50),
  ApplicationVersion varchar(50)
)

We will use this stored procedure to log all errors:

Create proc LogError 
  @ErrorID int,
  @Message nvarchar(1000),
  @ApplicationName varchar(50),
  @ApplicationVersion varchar(50)

As

Set nocount on

Insert into ErrorLog (
  ErrorID,
  Message,
  ApplicationName,
  ApplicationVersion)
Values (
  @ErrorID,
  @Message,
  @ApplicationName,
  @ApplicationVersion)

And here’s a sample of calling the procedure (we’ll use this each time we test a change):

--insert 4 errors
truncate table ErrorLog
Exec LogError 500, 'This is a low volume error', 'MyApp', '1.1.1.2'
Exec LogError 501, 'This is the high volume error', 'MyApp', '1.1.1.2'
Exec LogError 501, 'This is the high volume error', 'MyApp', '1.1.1.2'
Exec LogError 501, 'This is the high volume error', 'MyApp', '1.1.1.2'
select * from ErrorLog

Vanilla stuff, right? Now imagine you’ve gone home after a solid week at work, and then...at 1 am on Saturday morning something goes wrong on an application server and it starts logging errors. A lot of errors. About 500,000 errors every 10 minutes. What happens next?

  1. If you get an alert for every error, madness. Even if you respond and stop the logging at the 10 minute mark, you'll get the last beep on your phone almost a week later (500,000 alerts at one per second is close to six days).

  2. If you get an alert based on unusual volume (that's very good) you can log in and start to diagnose. How long will it take to troubleshoot and fix/restart/reboot whatever has gone bad? Hopefully not too long!

  3. You enjoy your weekend blissfully unaware. The big dashboard monitor at the office shows an unusual number of errors when you arrive Monday - around 144 million, give or take (500,000 every 10 minutes is 3 million per hour, times roughly 48 hours), assuming you don't run out of space or values in the identity.

Note: It’s also possible you would get alerted based on a symptom resulting from the inserts - timeouts for example.

Option 2 is interesting and worth doing, because it’s better to know than not, and no one wants to start Monday like that! We could set a job to run every 10 minutes and throw an error if the total for the 10 minute period exceeded some threshold. For this article I’ve set it to throw an error if there are more than 2 errors in the past 10 minutes.

if (select count(*) from ErrorLog with (NOLOCK) where ErrorDateTime > dateadd(mi, -10, SYSDATETIME())) > 2
   throw 99999, 'Warning - unusual error log volume in previous 10 minutes', 1;
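One caveat: THROW is only available in SQL Server 2012 and later. On older versions the same check can raise the alert with RAISERROR instead - a minimal sketch of the equivalent (severity 16 so a job step running it will fail and fire its notification):

if (select count(*) from ErrorLog with (NOLOCK) where ErrorDateTime > dateadd(mi, -10, SYSDATETIME())) > 2
   raiserror('Warning - unusual error log volume in previous 10 minutes', 16, 1)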

In most cases it will be someone else who has to fix it, but while the diagnosing and fixing is in progress we’re still inserting 500k rows every 10 minutes. What options do we have?

  1. Do nothing and keep logging (presumably we have the space and performance is not overly impacted). It’s simple, if not quite risk-free.

  2. Turn off error logging entirely. You can stop logging all errors by just putting a RETURN before the insert in the stored procedure (easy).

  3. Turn off error logging only for specific messages. Check the message first and only RETURN if it matches. It's not zero risk, but it's close - just run the proc manually with matching and non-matching cases immediately after the change to confirm the behavior. This hides any other errors that happen to match the filter, though you could still see them being sent to the stored procedure via Profiler/Extended Events.

  4. Create a job to monitor insert volume and if a severe surge occurs, enable a switch that activates the RETURN condition. Higher complexity, but a better outcome. If you go a step further you could disable logging of the high volume error for x minutes and then check again, enabling or disabling the inserting of that particular message.

  5. Create an INSTEAD OF trigger on the table to do the practical equivalent of #2 or #3, depending on how you implement it. It adds a bit of overhead for the trigger to fire. Note that this is probably most useful if the app is inserting directly into the table instead of going through the stored procedure.

  6. REVOKE execute permission on the proc (or insert permission on the ErrorLog table). It's hard to know how the app or apps logging to the table will handle the resulting permission errors. Not recommended.

  7. Add a WAITFOR to the procedure or INSTEAD OF trigger to slow everything down. I add this for the sake of completeness because there might be a time when you need to slow down the entire process that is failing and not just the error logging (and this assumes the error logging is synchronous from the app - it shouldn't be, though). The WAITFOR would go before or after the implicit insert transaction to minimize blocking. This is NOT a preferred option!
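Option #5 isn't shown later in the article, so here's a minimal sketch of what the trigger version might look like (the trigger name and the hard-coded ErrorID are illustrative - in practice you'd check the exclusion table instead). Note that an insert into the same table from inside its own INSTEAD OF trigger does not fire the trigger again:

create trigger trg_ErrorLog_Filter on ErrorLog
instead of insert
as
begin
  set nocount on
  --only write rows that aren't part of the high volume incident (501 is just an example)
  insert into ErrorLog (ErrorID, Message, ApplicationName, ApplicationVersion)
  select ErrorID, Message, ApplicationName, ApplicationVersion
  from inserted
  where ErrorID <> 501
end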

Let’s look at the simplest way to disable all logging (#2 above), the one where we just RETURN:

alter proc LogError 
  @ErrorID int,
  @Message nvarchar(1000),
  @ApplicationName varchar(50),
  @ApplicationVersion varchar(50)

As

Set nocount on

--log no errors while high volume incident in progress 
RETURN

Insert into ErrorLog (
  ErrorID,
  Message,
  ApplicationName,
  ApplicationVersion)
Values (
  @ErrorID,
  @Message,
  @ApplicationName,
  @ApplicationVersion)

If we run our test script again we should have zero rows in the table.

We can opt to only ignore certain errors (#3 above). For this example I’ve done what you might do in a production scenario - just handle the one specific case. You might need to use ErrorID plus the message, or a LIKE on the message, or some other combination, but this illustrates the basic implementation:

If @ErrorID = 501
  RETURN

Now, if we run our 4 calls to LogError we should have only one new row, the one for error 500. The three for error 501 were ignored.
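If the ErrorID alone isn't selective enough, the same pattern extends to the combinations mentioned above - a sketch using the ErrorID plus a LIKE on the message (the pattern text is just an assumption based on our test data):

If @ErrorID = 501 and @Message like 'This is the high volume%'
  RETURN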

A better solution is to move the list of errors to be excluded into a table and check the table for exclusions on each insert. This means that if the problem happens we change the data in the ErrorLogExcludeList table instead of code, always a better option. It adds a tiny bit of overhead to the error logging, but I think it’s worth it. We need to create the table, add the exclusion for errorid 501, and alter our LogError proc one more time.

Create table ErrorLogExcludeList (
  ID int identity (1,1),
  ErrorID int)
Go

Insert into ErrorLogExcludeList (ErrorID) values (501)
Go

--change the proc to check to see if we should exclude this error
If exists (select * from ErrorLogExcludeList with (NOLOCK) where ErrorID = @ErrorID)
  RETURN

Running our four error insert test will result in one row for errorid 500 being written. If we remove the row from ErrorLogExcludeList and repeat we will log all four rows as expected.

This gives us a good solution, but it requires us to manually add or remove rows from the table. Let’s take it one step further and create the code we could run in a job so that instead of just throwing the error for the alert we can automatically exclude (and un-exclude) high volume errors each time the job runs. We also log our own error so that we know something has changed.

set nocount on

declare @HighVolumeErrors table (ErrorID int)
declare @ExclusionChanges table (ErrorID int, Action varchar(10))
declare @ExcludeIfErrorCountPerErrorExceeds int = 2
declare @NumberOfMinutesToCheck int = 10

--get high volume errors
insert into @HighVolumeErrors (ErrorID)
select ErrorID
from ErrorLog with (NOLOCK)
where ErrorDateTime > dateadd(mi, -@NumberOfMinutesToCheck, SYSDATETIME())
group by ErrorID
having count(*) > @ExcludeIfErrorCountPerErrorExceeds

--remove any exclusions that no longer exceed the threshold
delete e
output Deleted.ErrorID, 'Removed' into @ExclusionChanges
from ErrorLogExcludeList e
where not exists (select 1 from @HighVolumeErrors h where h.ErrorID = e.ErrorID)

--add any net new errors that should be excluded
insert into ErrorLogExcludeList (ErrorID)
output Inserted.ErrorID, 'Added' into @ExclusionChanges
select h.ErrorID
from @HighVolumeErrors h
where not exists (select 1 from ErrorLogExcludeList e where e.ErrorID = h.ErrorID)

--log what we changed so can monitor
insert into ErrorLog (
  ErrorID,
  Message,
  ApplicationName,
  ApplicationVersion)
select
  case when Action = 'Added' then 99998 else 99999 end as ErrorID,
  case when Action = 'Added'
    then 'Error Volume Warning: Excluding error ' + convert(varchar(10), ErrorID) + ' due to high volume'
    else 'Error Volume Warning: Including error ' + convert(varchar(10), ErrorID) + ' due to reduced volume'
  end as Message,
  'ErrorVolumeMonitor' as ApplicationName,
  '1.0.0.0' as ApplicationVersion
from @ExclusionChanges
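To run this automatically, the script can be wrapped in a SQL Server Agent job scheduled every 10 minutes - a sketch, assuming the script has been saved as a stored procedure named MonitorErrorVolume (the job, step, schedule, and database names here are all illustrative):

use msdb
go
exec sp_add_job @job_name = 'Monitor Error Volume'
exec sp_add_jobstep @job_name = 'Monitor Error Volume',
  @step_name = 'Check and update exclusions',
  @subsystem = 'TSQL',
  @command = 'exec dbo.MonitorErrorVolume',
  @database_name = 'MyAppDB'    --illustrative database name
exec sp_add_jobschedule @job_name = 'Monitor Error Volume',
  @name = 'Every 10 minutes',
  @freq_type = 4,               --daily
  @freq_interval = 1,
  @freq_subday_type = 4,        --interval unit is minutes
  @freq_subday_interval = 10
exec sp_add_jobserver @job_name = 'Monitor Error Volume'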

I hope the scenario and some potential solutions for it get you thinking about how (or if) you want to plan now for the rainy day that might or might not happen. I’ve attached scripts for all the code examples.

 

Resources:

Scripts.zip
 