 # A Lightweight, Self-adjusting, Baseline-less Data Monitor

,

## Introduction

As database professionals we all probably at some point had to deal with some kind of data monitoring. It could be system data from the very servers we manage or business operational data. And the main purpose of such monitoring usually is to answer one seemingly simple question – if the data shows the signs of being abnormal. The standard approach to tackle this is to have a baseline for the metric you are concerned with and compare the current value with the baseline.

That could be a really challenging task. Just imagine that your data has a high seasonality (low months vs. holiday season) or changes greatly between night and day hours. You might end up with collecting your baseline data hourly for an entire year. To make it worse - once the baseline is established you need to decide how much deviation is too much. In other words, passing what threshold is enough to warrant an alarm? You don’t want to be woken up in the middle of the night just to learn that nothing special is going on with your system; in that case you just set your threshold too low. Nor do you want to learn someday that your system has not worked as expected for a while but you were not aware, as you set your threshold too high.

So is there any other way? Can data itself give us a hint that something is going on, without baselines and thresholds? The answer is yes – if we use a statistical approach and let the data speak. In order to do so we will need...

## A Bit of a Theory

The best approach would be to judge the observed value based on its probability, as it’s much easier to deal with probabilities rather than the absolute values themselves. With probability one always can tell that the probability of 0.1% is indeed low and the probability of 99.9% is high. No questions here. Compare this to the case when you only know that your observed value differs from your average by 25%. This could mean anything: no issue at all if your data is highly dispersed and possibly critical situation if your data is very homogeneous.

To figure out what the probability is of seeing the observed value, we need to know the probability density function for your data. Simple. Well, not exactly. The function is known only for a few distributions, of which the Normal is the most widely used. If your data is distributed normally (or close enough), you can calculate probability for your observed value either by using the analytical probability density function or by looking through pre-calculated values in special tables.

For instance, to find out the value with the probability of 0.15% or less you just need to calculate the average value and the standard deviation of your data set (the functions are usually built-in in many program languages, including T-SQL). Once you know the average and the deviation it’s easy to find the value: AVG-3*STDEV (a well-known 68-95-99.7 rule). The rule refers to the both ends of the distribution and you are interested in only one, (either low or high values), thus you take only a half of the value. The problem though that there is a good chance that your data is not distributed normally.

Though you still can work from an assumption of normality I would like to introduce another approach. We can build our probability density function (table valued of course, not analytical) out of data at hand and use it in the exactly same way as one for the Normal Distribution to calculate probabilities. So what is a probability density function?

As complicated as it sounds it’s actually a data histogram on steroids. Imaging that you measure some characteristic of your system periodically. The value of each individual measure becomes a data point in your dataset (called time series). And a histogram simply displays how many of your data points fall to specific range of values you measured. That is shows a dependency of the quantities of data points (Axis Y) from measured values (Axis X).

To build a histogram we need to estimate a bin size (we will use Freedman-Diaconis rule as a guidance) and “pour” the data into the bins. More details can be found in the attached code. Once the histogram is completed we need to normalize the quantities inside the every bin by total number of data points in your data set. And the last step in converting your histogram to a probability density function is to divide normalized quantities by their respective value ranges (like “spreading” them over Axis X).

One can say – OK, I have the function but what about the probability? The probability is the area under the curve. That is to find the probability to see a value equal or greater than V, we need to find the area under the curve from V (on Axis X) to the very end (the right end) of the curve.

Now that we know how to go about it, we can get started.

## The Problem

Imagine that you have a server (or more likely a farm of servers) where users upload their files. We have a table where each record represents one upload with a start time, an end time, a file size and maybe a status with success or failure. We would like, from the data alone, to be able to tell if our system is healthy or if there is trouble brewing.

One really important thing here is to choose a proper metric. Though a few can be employed, we will use a ratio of uncompleted uploads in a period of time to the total number of uploads in the same period. We just need to make sure that the chosen interval is such that the substantial number of uploads normally get finished within it so the ratio is not too close to 1. Usually about a half should suffice.

Due to the natural limitations of the selected metric (it’s always between 0 and 1), the values will not be distributed normally, but nevertheless the distribution likely will resemble the normal one (well up to some degree of course). The figures below represent an example of a real probability density function where the time interval was chosen to have about half of uploads completed for Figure 1 and ~70% for Figure 2 (it should be mentioned that the data was collected at different times). You can see that the probability density function on Figure 1 is quite similar to the one of the Normal Distribution and the function on Figure 2 is noticeably different (due to the longer right tail). And indeed if you calculate the threshold value for the probability 0.15% using the Normal Distribution and this article approach for Figure 1 dataset you won’t find any significant difference while for Figure 2 dataset the difference will be quite noticeable.

## Bits and pieces

OK, let’s start coding! First of all as we will put all our objects to a schema called monitor we need to create it.

```create schema monitor
go
```

Next we will create a couple of new types that we will be using to represent our dataset:

```create type monitor.timeseries as table(value float)
go
```

and our probability density table function:

```create type monitor.pfunction as table(right_boundary float, cumulative_probability float, id int)
go
```

Now the core procedure that generates the probability density function:

```create procedure [monitor].[get_pfunction]
as
/*
Generates cumulative probablity table function
Returns table of type monitor.pfunction

Note that cumulative probability function is descending from 1 to 0.
*/
set nocount on
declare @t table(value float, id int identity(1,1))
insert @t(value)
select value from @timeseries order by 1
declare @bw float, @min float, @max float, @n float
select @n = count(*), @max = max(value), @min = min(value) from @t
-- finding initial bin size
declare @iqr float, @n1 float, @n3 float, @q1 float, @q3 float, @f1 float, @f2 float, @nb int
set @n1 = (@n +1)/4.00
set @n3 = 3*(@n +1)/4.00
if( @n1 != ceiling(@n1) )
begin
select @f1 = value from @t where id = floor(@n1)
select @f2 = value from @t where id = ceiling(@n1)
set @q1 = (@f1+@f2)/2
end
else
select @q1 = value from @t where id = @n1
if( @n3 != ceiling(@n3) )
begin
select @f1 = value from @t where id = floor(@n3)
select @f2 = value from @t where id = ceiling(@n3)
set @q3 = (@f1+@f2)/2
end
else
select @q3 = value from @t where id = @n3
-- this is interquartile value
set @iqr = @q3-@q1
-- using Freedman-Diaconis rule as a "guidance" to calculate initial bin size:
set @bw = 2*@iqr*power(cast(@n as float), -1/3.00)
-- if bin width is zero (which is a rare case) will use some rather arbitrary number
if( @bw = 0 ) set @bw = 2*sqrt(@n)
declare @x table(f float, bw float, id int identity(1,1))
declare @h1 table(bw float, id int identity(1,1) primary key)
declare @h2 table(rb float, id int identity(1,1) primary key)
declare @hst table(rb float, frequency int, id int identity(1,1) primary key)
declare @hist table(rb float, frequency int, density float, old_density float, id int identity(1,1) primary key)
declare @tp table(rb float, density float, cumulativeProb float, rba float, id int identity(1,1) primary key)
set @nb = ceiling((@max-@min)/@bw)
-- we are using variable bin size (proportional to value) if otherwise we have nonsensical number of bins
if( @nb > @n/3 /*to keep # of bins reasonably low - rather arbitrary*/ )
begin
insert @x(f, bw)
select case when value/@bw > 1 then ceiling(value/@bw) else 1 end, case when value/@bw > 1 then ceiling(value/@bw) *@bw else @bw end
from @t
order by value
insert @h1(bw)
select distinct bw from @x order by bw
end
else
begin
insert @h1(bw) select top(cast(ceiling(@max/@bw) as int)) @bw from @t
end
declare @id int, @idmax int, @rb float = 0
select @id = min(id), @idmax = max(id) from @h1
while @id <= @idmax
begin
select @rb = @rb +bw from @h1 where id = @id
insert @h2(rb) values(@rb)
set @id = @id +1
if( @rb>@max) break
end
-- histogramm table
insert @hst(rb, frequency)
select h.rb, sum(case when t.value is not null then 1 else 0 end)
from @h2 h
left join @h2 hp on hp.id = h.id-1
left join @t t on h.rb >= t.value and (hp.id is null or hp.rb < t.value)
group by h.rb
order by 1
-- p density table
insert @hist(rb, frequency, density)
select h1.rb, h1.frequency, h1.frequency/@n/(h1.rb-isnull(h2.rb, 0))
from @hst h1
left join @hst h2 on h2.id = h1.id-1
order by h1.id
-- approximate "tail" to the right
declare @pd float, @x0 float
select @pd = density from @hist where id = (select max(id) from @hist)
select top 1 @id = id from @hist where density > @pd order by id desc
-- linear regression (y = ax +b)
declare @xa float, @ya float, @a float, @b float
select @xa = avg(rb), @ya = avg(density) from @hist where id >= @id
select @a = sum((rb-@xa)*(density-@ya))/sum(power((rb-@xa),2)) from @hist where id >= @id
set @b = @ya -@a*@xa
-- this will be the last point where p density = 0
set @x0 = -@b/@a
select @idmax = max(id) from @hist
-- adjust previous points p density according to regression (so that total probablity = 1)
while @id <= @idmax
begin
update @hist set density = @a*rb +@b, old_density = density where id = @id
set @id = @id +1
end
-- now we need to adjust density due to increased length
declare @mrb float
select @mrb = rb from @hist where id = @idmax
if( @x0 > @mrb )
begin
declare @dx float
set @dx = @mrb -(select rb from @hist where id = @idmax-1)
update @hist set density = density*@dx/(@x0 -@mrb) where id = @idmax
end
delete from @hist where rb > @x0
insert @hist(rb, density) values(@x0, 0)
insert @tp(rb, density)
select rb, density from @hist order by id
-- calculate cumulative probability
declare @cp float, @idmin int
select @idmin = min(id), @idmax = max(id) from @tp
set @id = @idmax
while @id > @idmin
begin
select @cp = sum((h1.rb-isnull(h2.rb, 0))*h2.density)
from @tp h1
join @tp h2 on h2.id = h1.id-1
where h1.id >= @id

update @tp set cumulativeProb = @cp where id = @id-1
set @id = @id -1
end
update @tp set cumulativeProb = 0 where id = (select max(id) from @tp)
set @id = @idmin
while @id <= @idmax
begin
-- the first record with none zero density should have probability 1 as well
update @tp set cumulativeProb = 1 where id = @id
select @pd = density from @tp where id = @id
if( @pd > 0 ) break
set @id = @id +1
end
-- finally we have a cumulative probability table function
select right_boundary = rb, cumulative_probability = cumulativeProb, id from @tp order by id
go
```

The procedure accepts the time series dataset as a table of monitor.timeseries type and returns a dataset of monitor.pfunction type. It’s important to note that this function expects that you are interested in the right tail of the density curve (high end values) rather than the close to 0 part and will extrapolate to find x-intercept value on the right only.

You can find the full code of this procedure (that contains some useful comments) in the attached script “functionality.sql”.

Finally we need two twin functions: one to calculate the probability of a value, and another to calculate a value by its probability.

```create procedure monitor.get_probability_by_value
@value       float,
@probability float output
as
/*
Returns probability of provided observed value (based on @pfunction)
*/
set nocount on
declare @id2 int, @v2 float, @p2 float, @v1 float, @p1 float

select top 1 @id2 = id, @v2 = right_boundary, @p2 = cumulative_probability
from @pfunction
where right_boundary >= @value
order by right_boundary
set @probability = @p2 -- in case if @value == rb
if( (select right_boundary from @pfunction where id = @id2) > @value )
begin
-- approximate lineary
select @v1 = right_boundary, @p1 = cumulative_probability
from @pfunction
where id = @id2 -1
if( @p1 != @p2 and @v1 != @v2 )
set @probability = (@value-@v2)*(@p2-@p1)/(@v2-@v1) +@p2
end
-- if provided value is beyond x-intercept
if( @probability is null ) set @probability = 0
go
create procedure monitor.get_value_by_probability
@probability float,
@value       float output
as
/*
Returns value estimate for provided probability (based on @pfunction)
This sp requires the following temp table to be created in the same session before sp call:
*/
set nocount on
declare @id1 int, @v2 float, @p2 float, @v1 float, @p1 float
select top 1 @id1 = id, @p1 = cumulative_probability, @v1 = right_boundary
from @pfunction
where cumulative_probability >= @probability
order by cumulative_probability
set @value = @v1 -- in case if cprob == @probability
if( (select cumulative_probability from @pfunction where id = @id1) > @probability )
begin
-- approximate lineary
select @v2 = right_boundary, @p2 = cumulative_probability
from @pfunction
where id = @id1 +1
if( @p1 != @p2 and @v1 != @v2 )
set @value = (@probability -@p2)*(@v2 -@v1)/(@p2 -@p1) +@v2
end
go
```

Again the full code of these functions can be found in the attached script “functionality.sql”.

So far we created objects necessary for probability calculation. Next step is to create objects for the monitor itself. The table to store configuration values:

```create table monitor.time_series_config(
series_id int not null identity(1,1),
series_name varchar(256) null,
window_width_seconds int not null,
max_number_of_points int not null,
threshold_probability float not null,
notify_if_found_more_than int not null,
db_mail_profile varchar(128) null,
email_recipients varchar(1000) null,
email_subject varchar(512) null,
constraint PK_time_series_config primary key(series_id)
)
end
go
/*
window_width_seconds:      should be selected so that a substantial number of records (a half for instance) within it
have their complete_date set (that is completed)
max_number_of_points:      arbitrary number, large enough to provide sufficient number of point for our math (400-600)
threshold_probability:     the target probability we consider low enough to warrant an alert
notify_if_found_more_than: normally should be 0 but sometimes we want to make sure that it was not just a freak event
but something that persists, especially useful if you run your scan quite frequently
*/
```

The time series table:

```create table monitor.time_series(
id bigint not null identity(1,1),
series_id int not null,
create_time datetime not null,
value float not null,
probability float null,
constraint PK_time_series primary key(id)
)
go
```

The procedure that generates a new data point, evaluates its probability, adds a new record to the time series table (see above) and trims the time series table if necessary:

```create procedure [monitor].[new_point]
@series_id int
as
/*
Metric: a ratio of uncompleted uploads in a period to the total number of uploads in the same period
Adds new observed value (point) to the time series and calculates its probability.
Also implements "sliding window" by removing the very first point (if applicable)
*/
set nocount on
declare @timeseries monitor.timeseries
declare @pfunction  monitor.pfunction
declare @now datetime = getdate()
declare @window_width_seconds int, @max_number_of_points int, @threshold_probability float
declare @error_status int = -1
select @window_width_seconds = window_width_seconds,
@max_number_of_points = max_number_of_points,
@threshold_probability = threshold_probability
from monitor.time_series_config
where series_id = @series_id
declare @d1 datetime, @d2 datetime
set @d2 = @now
set @d1 = dateadd(ss, -@window_width_seconds, @now)
declare @current_value float
select @current_value = 1 -sum(case when complete_date is not null and status_id != @error_status then 1 else 0 end)
/cast(case when count(*) > 0 then count(*) else 1 end as float)
where create_date >= @d1
and create_date < @d2
and isnull(file_size,0) > 0
declare @current_probability float
-- finding current probability
insert @timeseries(value)
select value
from monitor.time_series
where series_id = @series_id
-- re-calc cumulative probabilities (only if we have sufficient # of points, the number is rather arbitrary - you can use a different one)
if( (select count(*) from @timeseries) >= 48 )
begin
insert @pfunction(right_boundary, cumulative_probability, id)
exec monitor.get_pfunction @timeseries = @timeseries
exec monitor.get_probability_by_value @pfunction = @pfunction, @value = @current_value, @probability = @current_probability output
end
-- add current point to the time series
insert monitor.time_series(series_id, create_time, value, probability)
values(@series_id, @now, @current_value, @current_probability)
-- if # of point > max limit remove the very first one
if( (select count(*) from monitor.time_series where series_id = @series_id) > @max_number_of_points )
begin
delete from monitor.time_series where id = (select min(id) from monitor.time_series where series_id = @series_id)
end
go
```

The procedure that checks the probability of the last observed value and sends a notification if applicable:

```create procedure [monitor].[scan]
@series_id int
as
/*
Scans time series to find if alarm/notification is warranted
*/
set nocount on
declare @t table(probability float, value float)
declare @notify_if_found_more_than int, @threshold_probability float
declare @series_name varchar(256), @db_mail_profile varchar(128), @recipients varchar(1000), @subject varchar(1000)
select @notify_if_found_more_than = isnull(notify_if_found_more_than, 0),
@threshold_probability = threshold_probability,
@series_name = isnull(series_name, cast(series_id as varchar(50))),
@db_mail_profile = db_mail_profile,
@recipients = email_recipients,
@subject = email_subject
from monitor.time_series_config
where series_id = @series_id
insert @t(probability, value)
select top(@notify_if_found_more_than +1) probability, value from monitor.time_series where series_id = @series_id order by id desc
if( (select count(*) from @t where probability < @threshold_probability) <= @notify_if_found_more_than )
return
if( @db_mail_profile is not null and @recipients is not null )
begin
declare @value float, @probability float
select top(1) @probability = probability, @value = value from @t where probability < @threshold_probability
set @subject = replace(replace(replace(@subject, '', @series_name), '', @value), '', @probability)
exec msdb.dbo.sp_send_dbmail
@profile_name = @db_mail_profile,
@recipients = @recipients,
@subject = @subject
end
else
print 'Value with less than threshold probability encountered!'
go
```

The full code for all these objects also can be found in the attached script “functionality.sql”.

The last component would be a job that will be obtaining a new time series data point and checking if sending a notification is warranted. You could schedule this job as you like. The only thing the job should do is to run the following:

```declare @series_id int
set @series_id = 1 /*or any other id*/
exec [monitor].[new_point] @series_id = @series_id
exec [monitor].[scan] @series_id = @series_id
go
```

Also in the attached script file, "distributions used in the article.sql", you can find scipt that generates a normally distributed dataset (to use it as a reference or play around with), and the one that generates distributions used in Figure 1 and Figure 2 . For your convenience I also included “test data.sql” script that creates the uploads table and populates it with some test data in case you would like to try the functionality but do not have anything immediately suitable in your system.

In conclusion I would like to discuss applicability and limitations.

## Applicability and Limitations

Though it’s designed to be “agnostic” to underlying process and based solely on the data, it’s still important that you know your data (and the system) well. The ratio metric likely will not work as expected if your data collecting interval is too long comparing to the average processing time and you have too few records within. It likely will be hyper-sensitive as any single “out of normal” record will set off the alarm. In this case you might try to choose another metric like a process speed instead.

Note the fact that get_pfunction is designed to work with the right tail (high end values) of the density curve. There are two reasons for that. The first one is that the most metrics we can come up with will be naturally limited from the left by 0 (like speed and count) and thus the curve might be steep there and not be able to produce a valuable insight (remember that we have to extrapolate). The second reason is that in case you aggregate your data and look for low values you might lose the signal among the high values “noise”.

Understanding the sensitivity is another important matter here. You don’t want it be either extremely sensitive or numb. It can be easily shown that the sensitivity of the density function from Figure 2 is significantly higher than the one from Figure 1. Imaging that you divided your records into a smaller subsets arbitrary or based on some characteristic of your system (for instance records coming from different servers) and calculated the ratio for each one. Then, given a target probability threshold of 0.15% and defining a failure as ratio = 1.0 , the Figure 2 function will “sense” the failure when about 50% of such subsets failed whereas the one from Figure 1 will need almost 80% of such. And it can be generalized that the further to left the curve peak is, the more sensitive it will be.

This approach (generating a probability function out of data samples, and using it to estimate the probability of the observed value) is universal and not limited to any particular process. The metric (the ratio of uncompleted uploads to all) needs to be adjusted to your concrete implementation however. For instance it could be a ratio of failed to succeeded events, or even a ratio of suspended to running processes for SQLServer sysprocesses system view.

5 (11)

5 (11)