Error Logging Gone Wild

  • Comments posted to this topic are about the item Error Logging Gone Wild

  • Nice article!

    And we should not forget Sommarskog's great articles about error handling: http://www.sommarskog.se/error_handling/Part1.html

    Thanks

  • Thanks for the article.

  • Hi Andy,

    Let me run this past you.

    So, the problem as I understand it is that we really don't need to be informed 1,000,000 times that (for example) an ETL process attempted to insert Feb 30, 2015 into a date column, even if the process in question was stuck in a loop and attempted 1,000,000 inserts. When it comes to logging errors, it's not as if we're counting cars at a toll booth, and all we typically need is a single golden record that can be used to quickly nail down the problem.

    I have a simple solution that improves the signal-to-noise ratio when logging errors, and it can be used for other types of logging as well, such as debug tracing or website clickstream analysis.

    I'll start with the ErrorLog table you presented in your article.

    create table ErrorLog (
        ErrorLogID int identity (1,1) not null primary key clustered,
        ErrorDateTime datetime2 default sysdatetime(),
        ErrorID int,
        Message nvarchar(1000),
        ApplicationName varchar(50),
        ApplicationVersion varchar(50)
    );

    We'll now add a non-persisted computed column, which is the checksum of ErrorDateTime, ErrorID, and Message. This can be thought of as a hash of a natural key, and the assumption is that any rows repeating this column combination would be redundant data. It's essential that an appropriate column combination be chosen for the application at hand.

    Note that this solution hinges on hashing ErrorDateTime at a lower resolution. For this example I'm casting to DATETIMEOFFSET(0), which gives us a resolution of one second (a coarser per-minute variant is sketched below).

    alter table ErrorLog add ErrorCheckSum
    as checksum(
        cast(ErrorDateTime as datetimeoffset(0))
        , ErrorID
        , Message
    );
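
    If one row per second is still more detail than a given application needs, the same pattern works at a coarser grain. The following is just a sketch of a possible variant, not one of the steps above, and the ErrorCheckSumPerMinute column name is made up for illustration: it floors ErrorDateTime to the minute with DATEADD/DATEDIFF before hashing, so at most one row per minute survives for a given ErrorID and Message.

    -- Sketch only: a coarser, per-minute hash (ErrorCheckSumPerMinute is a hypothetical name).
    -- DATEADD/DATEDIFF floors the timestamp to the minute before it is hashed.
    alter table ErrorLog add ErrorCheckSumPerMinute
    as checksum(
        dateadd(minute, datediff(minute, 0, ErrorDateTime), 0)
        , ErrorID
        , Message
    );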

    We'll then create a unique index on this hash column. The solution hinges on the IGNORE_DUP_KEY option. Because we're limiting the number of inserts into the table and the index is on a single integer column, this should be an efficient lookup with little overhead.

    create unique index uix_ErrorLog_IgnoreDupKey
    on ErrorLog ( ErrorCheckSum )
    with (data_compression = page, ignore_dup_key = on );

    Now we're done. That's all there is to it.
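
    As a quick sanity check before the volume test (my own sketch, not part of the article, and assuming both statements execute within the same second): with IGNORE_DUP_KEY = ON the second insert is silently discarded, and SQL Server only raises the informational "Duplicate key was ignored." message rather than an error, so the calling code is not affected.

    -- Two identical inserts in the same second: only the first row is kept; the
    -- second is dropped with the "Duplicate key was ignored." warning, not an error.
    insert into ErrorLog ( ErrorID, Message, ApplicationName, ApplicationVersion )
    values ( 500, 'Duplicate demo', 'MyApp', '1.1.1.2' );
    insert into ErrorLog ( ErrorID, Message, ApplicationName, ApplicationVersion )
    values ( 500, 'Duplicate demo', 'MyApp', '1.1.1.2' );
    select count(*) as RowsKept from ErrorLog where Message = 'Duplicate demo';  -- expect 1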

    To test this, we will insert 100,000 rows in a tight loop.

    declare @i int = 0;
    while @i < 100000
    begin;
        insert into ErrorLog ( ErrorID, Message, ApplicationName, ApplicationVersion )
        values ( 500, 'This is a low volume error', 'MyApp', '1.1.1.2' );
        select @i = @i + 1;
    end;

    On my PC, this ran for about 7 seconds, so the ErrorLog table retained a total of 7 rows (one per second). At this rate it would log about 600 rows every 10 minutes. That's more than enough detail to track the issue over time, but without so much redundancy that it consumes all available storage and crashes the server or the process.

    select * from ErrorLog;

    ErrorLogID | ErrorDateTime               | ErrorID | Message                    | ApplicationName | ApplicationVersion | ErrorCheckSum
    1          | 2016-10-11 09:32:26.5822773 | 500     | This is a low volume error | MyApp           | 1.1.1.2            | -1716654948
    9918       | 2016-10-11 09:32:27.5022589 | 500     | This is a low volume error | MyApp           | 1.1.1.2            | -248222563
    27356      | 2016-10-11 09:32:28.5022389 | 500     | This is a low volume error | MyApp           | 1.1.1.2            | 1755245725
    44501      | 2016-10-11 09:32:29.5022189 | 500     | This is a low volume error | MyApp           | 1.1.1.2            | -1008636770
    61797      | 2016-10-11 09:32:30.5021989 | 500     | This is a low volume error | MyApp           | 1.1.1.2            | 994831518
    78581      | 2016-10-11 09:32:31.5021789 | 500     | This is a low volume error | MyApp           | 1.1.1.2            | -1831703393
    96015      | 2016-10-11 09:32:32.5021589 | 500     | This is a low volume error | MyApp           | 1.1.1.2            | 170192031

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Eric, that's interesting. Thinking about it, I think it's reasonable for the high-volume scenario, and definitely easy to implement. Low volume... say 3 a second for a few seconds... it still works of course, but maybe harder to analyze/alert on? I'm thinking of the dashboard view, where the counts are what I look at, usually for the last hour.

    A nice alternative!
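
    For the dashboard angle, a rough sketch of what is still recoverable (assuming Eric's per-second ErrorLog above): the raw attempt counts are gone, but counting the surviving rows per minute over the last hour still gives an approximate rate to chart or alert on, since each row represents at least one occurrence in that second.

    -- Sketch: approximate error rate per minute for the last hour from the
    -- deduplicated log. count(*) is capped at 60 per minute for a given
    -- ErrorID and Message, so it is a floor on activity, not a true count.
    select
        ErrorID,
        dateadd(minute, datediff(minute, 0, ErrorDateTime), 0) as ErrorMinute,
        count(*) as SecondsWithError
    from ErrorLog
    where ErrorDateTime >= dateadd(hour, -1, sysdatetime())
    group by ErrorID, dateadd(minute, datediff(minute, 0, ErrorDateTime), 0)
    order by ErrorMinute, ErrorID;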
