Error Logging Gone Wild

  • Comments posted to this topic are about the item Error Logging Gone Wild

  • Nice article!

    And we should not forget Sommarskog's great articles about error handling: http://www.sommarskog.se/error_handling/Part1.html

    Thanks

  • Thanks for the article.

  • Hi Andy,

    Let me run this past you.

    So, the problem as I understand it is that we really don't need to be informed 1,000,000 times that (for example) an ETL process attempted to insert Feb 30, 2015 into a date column, even if the process in question was stuck in a loop and attempted 1,000,000 inserts. When it comes to logging errors, it's not as if we're counting cars at a toll booth, and all we typically need is a single golden record that can be used to quickly nail down the problem.

    I have a simple solution that improves the signal-to-noise ratio when logging errors, and it can be used for other types of logging as well, such as debug tracing or website clickstream analysis.

    I'll start with the ErrorLog table you presented in your article.

    create table ErrorLog (
        ErrorLogID int identity (1,1) not null primary key clustered,
        ErrorDateTime datetime2 default sysdatetime(),
        ErrorID int,
        Message nvarchar(1000),
        ApplicationName varchar(50),
        ApplicationVersion varchar(50)
    );

    We'll now add a non-persisted computed column, which is the checksum of ErrorDateTime, ErrorID, and Message. This can be thought of as a hash of a natural key, and the assumption is that any rows repeating this column combination would be redundant data. It's essential that an appropriate column combination be chosen for the application at hand.

    Note that this solution hinges on hashing ErrorDateTime at a lower resolution. For this example I'm casting to DATETIMEOFFSET(0), which gives us a resolution of one second (a coarser per-minute variant is sketched below).

    alter table ErrorLog add ErrorCheckSum
    as checksum(
        cast(ErrorDateTime as datetimeoffset(0))
        , ErrorID
        , Message
    );
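
    If one row per second is still more detail than a given application needs, the same pattern works at a coarser grain. The following is just a sketch of a possible variant, not one of the steps above, and the ErrorCheckSumPerMinute column name is made up for illustration: it floors ErrorDateTime to the minute with DATEADD/DATEDIFF before hashing, so at most one row per minute survives for a given ErrorID and Message.

    -- Sketch only: a coarser, per-minute hash (ErrorCheckSumPerMinute is a hypothetical name).
    -- DATEADD/DATEDIFF floors the timestamp to the minute before it is hashed.
    alter table ErrorLog add ErrorCheckSumPerMinute
    as checksum(
        dateadd(minute, datediff(minute, 0, ErrorDateTime), 0)
        , ErrorID
        , Message
    );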

    We'll then create a unique index on this hash column. The solution hinges on the IGNORE_DUP_KEY option. Because we're limiting the number of inserts into the table and the index is on a single integer column, this should be an efficient lookup with little overhead.

    create unique index uix_ErrorLog_IgnoreDupKey
    on ErrorLog ( ErrorCheckSum )
    with (data_compression = page, ignore_dup_key = on );

    Now we're done. That's all there is to it.
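
    As a quick sanity check before the volume test (my own sketch, not part of the article, and assuming both statements execute within the same second): with IGNORE_DUP_KEY = ON the second insert is silently discarded, and SQL Server only raises the informational "Duplicate key was ignored." message rather than an error, so the calling code is not affected.

    -- Two identical inserts in the same second: only the first row is kept; the
    -- second is dropped with the "Duplicate key was ignored." warning, not an error.
    insert into ErrorLog ( ErrorID, Message, ApplicationName, ApplicationVersion )
    values ( 500, 'Duplicate demo', 'MyApp', '1.1.1.2' );
    insert into ErrorLog ( ErrorID, Message, ApplicationName, ApplicationVersion )
    values ( 500, 'Duplicate demo', 'MyApp', '1.1.1.2' );
    select count(*) as RowsKept from ErrorLog where Message = 'Duplicate demo';  -- expect 1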

    To test this, we will insert 100,000 rows in a tight loop.

    declare @i int = 0;
    while @i < 100000
    begin;
        insert into ErrorLog ( ErrorID, Message, ApplicationName, ApplicationVersion )
        values ( 500, 'This is a low volume error', 'MyApp', '1.1.1.2' );
        select @i = @i + 1;
    end;

    On my PC, this ran for about 7 seconds, so the ErrorLog table retained a total of 7 rows (one per second). At this rate it would log about 600 rows every 10 minutes. That's more than enough detail to track the issue over time, but without so much redundancy that it consumes all available storage and crashes the server or the process.

    select * from ErrorLog;

    ErrorLogID | ErrorDateTime               | ErrorID | Message                    | ApplicationName | ApplicationVersion | ErrorCheckSum
    1          | 2016-10-11 09:32:26.5822773 | 500     | This is a low volume error | MyApp           | 1.1.1.2            | -1716654948
    9918       | 2016-10-11 09:32:27.5022589 | 500     | This is a low volume error | MyApp           | 1.1.1.2            | -248222563
    27356      | 2016-10-11 09:32:28.5022389 | 500     | This is a low volume error | MyApp           | 1.1.1.2            | 1755245725
    44501      | 2016-10-11 09:32:29.5022189 | 500     | This is a low volume error | MyApp           | 1.1.1.2            | -1008636770
    61797      | 2016-10-11 09:32:30.5021989 | 500     | This is a low volume error | MyApp           | 1.1.1.2            | 994831518
    78581      | 2016-10-11 09:32:31.5021789 | 500     | This is a low volume error | MyApp           | 1.1.1.2            | -1831703393
    96015      | 2016-10-11 09:32:32.5021589 | 500     | This is a low volume error | MyApp           | 1.1.1.2            | 170192031

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Eric, that's interesting. Thinking about it, I think it's reasonable for the high-volume scenario, and definitely easy to implement. Low volume... say 3 a second for a few seconds... it still works of course, but maybe harder to analyze/alert on? I'm thinking of the dashboard view, where the counts are what I look at, usually for the last hour.

    A nice alternative!
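
    For the dashboard angle, a rough sketch of what is still recoverable (assuming Eric's per-second ErrorLog above): the raw attempt counts are gone, but counting the surviving rows per minute over the last hour still gives an approximate rate to chart or alert on, since each row represents at least one occurrence in that second.

    -- Sketch: approximate error rate per minute for the last hour from the
    -- deduplicated log. count(*) is capped at 60 per minute for a given
    -- ErrorID and Message, so it is a floor on activity, not a true count.
    select
        ErrorID,
        dateadd(minute, datediff(minute, 0, ErrorDateTime), 0) as ErrorMinute,
        count(*) as SecondsWithError
    from ErrorLog
    where ErrorDateTime >= dateadd(hour, -1, sysdatetime())
    group by ErrorID, dateadd(minute, datediff(minute, 0, ErrorDateTime), 0)
    order by ErrorMinute, ErrorID;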
