Background
When we talk about statistics what we are talking about is a means to evaluate
an outcome or potential outcome against a model
The model is unlikely to provide and absolute answer but should be robust enough to
give a good indicator of behaviour in the real world.
The purpose of this article is to provide some thoughts on the use of statistical
functions provided by SQL Server and what they actually mean.
SQL Server statistical functions
- COUNT
- SUM
- AVG
- STDEV
- STDEVP
- VAR
- VARP
The first two you will definitely be familiar with
AVG
To the non-mathematical an average is simply the sum of the
numbers divided by the number of numbers. To a mathematician this is called the "Arithmetic Mean".
A mathematician/statistician probably won’t use the term
"average" as average is a generic term that can mean a variety of things. A statistician can use any of the terms below
to mean average.
Average Term | Meaning |
Arithmetic mean | The AVG function. Sum of the numbers divided by the number of numbers. |
Assumed mean | An estimated mean. This can be used when information is given in bands i.e. sales figures in bands by customer age 15 to 21 year olds, 22 to25 year olds etc. |
MEDIAN | The middle value when the values are arranged in order. When there are an even number of values take the mid-point between the two in the middle. |
MODE | The most common value. |
An average can even be a number picked at random within a
range of values, though this isn’t generally used.
Problems with the AVG function.
The problem with averages is that what we are trying to do
is describe a typical occurrence.
Let us suppose that you want to know the average price of an
item that a customer buys.
Take the example of a proper hi-fi shop. A shop selling hi-fi the purchase of which
can be considered reasonable grounds for a divorce!
A typical customer may buy a £1,000 amplifier and £100 of
connecting leads. Average item spend is
£550. But no-one spends £550! Your
average figure is completely useless!
To use an analogy, a man with his feet in the fridge and his
head in the oven is, on average, perfectly comfortable.
As another example take one of those catalogues selling
brushes and household knick-knacks. Prices range from £5 to £15 and all products sell equally well.
Average item spend is £10. But there is nothing special about items costing £10! They sell equally well as items selling for
£5 or £15.
Knowing an average value is only useful if the values that
you are measuring tend to cluster around that average.
Catering for extremes
Look at the unit prices in the products table in the NORTHWIND
database.
SELECT AVG(UnitPrice)
FROM Products
Returns £28.87
If you look at the actual products in the table you will see
that there is one product priced way over and above the others. This single product causes the average to be
an artificially high value.
Really we should exclude this extreme and other extreme
values to get a more representative figure for the average. The SQL statement
below chops 3 percent off either end of the selection.
SELECT TOP 97 PERCENT DT.*
FROM
( SELECT TOP 97 PERCENT *
FROM Products
ORDER BY UnitPrice ASC
) AS DT
ORDER BY UnitPrice DESC
The average of this new selection is £25.05
Even this could be considered high when you consider that
the median value (the point at which 50% of products cost more than this and
50% of products cost less)
SELECT TOP 1 DT.*
FROM
( SELECT TOP 50 PERCENT *
FROM Products
ORDER BY UnitPrice ASC
) AS DT
ORDER BY UnitPrice DESC
This returns £19.50!
Inter Quartile Range
Another way of looking at extremes is to use something called the inter-quartile range
The inter-quartile range is worked out by ordering the values that you want to evaluate
and splitting them into 4 equal parts. The inter-quartile range is the range of values represented
by the middle two parts.
The SQL to work out what this range is as follows
SELECT MIN(UnitPrice) AS LowerRange , MAX(UnitPrice) AS UpperRange
FROM (
SELECT D1.ProductId
FROM
( SELECT TOP 25 PERCENT ProductId
FROM Products
ORDER BY UnitPrice ASC
) AS D1
UNION
SELECT D2.ProductId
FROM
( SELECT TOP 25 PERCENT ProductId
FROM Products
ORDER BY UnitPrice DESC
) AS D2
) AS DT RIGHT JOIN Products ON DT.ProductId = Products.ProductId
WHERE DT.ProductId IS NULL
This tells us that the middle 50% or inter-quartile range of prices is between £14.00 & £32.80.
NB The combination of the TOP 25 PERCENT clause and duplicate prices the above SQL
doesn't find the true inter-quartile range. If you know the total number of records within your recordset
then you can use the TOP n clause instead to get a more accurate answer.
As the inter-quartile range is a broad brush measure rather than a precise measure we don't need to be
too fussy about it.
Within the SPSS stats package there was a useful charting tool that produced a graph called a box plot.
The box plot utilised the inter-quartile range and was a very handy way of illustrating extremes. An example is shown below.
The graph shows box plots for the prices of items bought as a result of two adverts.
- The green box represents the inter-quartile range.
- The line through the green box represents the median point.
- The T shape above and below illustrate the "outliers". In SPSS an outlier was determined as a
point 1.5 times the inter-quartile range.
- Any points beyond these "outliers" are considered extremes.
It doesn't matter which method you use, remember the goal is to produce a model that
acts as the most effective predictor or most accurate representation of real world behaviour.
VAR, VARP, STDEV & STDEVP
The VAR, VARP, STDEV & STDEVP functions all give a single figure that describes how
much the figures we are investigating deviate from our "average".
A point to make at this stage is that the way in which a computer or electronic calculator
works out the VAR,VARP, STDEV and STDEVP is an approximation.
Let us take VARP as an example
True VARP | Computer VARP |
---|---|
An explanation of the symbols is as follows:
The difference in the results between these two formulae are too small to be of much concern.
The reason the computer version of the
formula is different dates back to the days when computers were considerably
less powerful than they are today. Basically, in the computer version of the
formula you can calculate your VARP in one pass through the recordset, where as
the true version you would need to first calculate your average on one
iteration, and then calculate your VARP on the 2nd iteration.
The difference between VAR and VARP
The table below shows the difference between the two functions from a mathematical perspective
Function | Formula | Description |
---|---|---|
VARP | The variance within the entire population. | |
VAR | The variance within a sample of that population. |
Let us suppose that it is not possible to measure the variance of an entire population.
What we would do would be to take, what we hope would be, a representative sample of that
population and measure that instead.
If we were taking a sample, then wouldn't we expect to build in some factor to allow for
discrepancies in our sample?
Of course we would that is why the two functions have differing
denominators (bottom half of the fractions). The bottom half of VAR is the same as n^{2}-n and is known as Bessels correction.
As with any fraction, the bigger the denominator the smaller the eventual number so as we
increase our sample size so the differences between what the functions return becomes smaller.
Standard Deviation
The Standard Deviation is the square root of the Variance therefore
- STDEV is simply the square root of VAR
- STDEVP is simply the square root of VARP
Both the Variance and Standard Deviation can be represented by the Greek symbol Sigma as shown below
The standard deviation provides statisticians with a very useful tool. For example, let us go back to
the scenario where it is not possible to work with the full dataset and are therefore working with a sample.
If you take the average of your sample you also have to allow for some variability between the average of the
population and the average of your sample.
In statistics there is a term for this called the standard error of mean. and this has a formula as follows
Let us suppose that we have a sample of 10,000 with an average of 5 and a standard deviation of 0.5.
Our standard error of mean would be 0.005
This means that from our sample we know that the true average within the population could be expected
to fall +/-0.005 of our sample average of 5, or in other words between 4.995 and 5.005.
The Normal Distribution
If we were to measure the weight of a random sample of 500 employees within a company and plot them on a graph
then we would find that the graph forms a bell shape.
In statistics we can produce a model of this shape called the normal distribution.
One of the properties of the normal distribution is that 95% of all occurences occur within
approximately 2 standard deviations of the average.
If our 500 employees weigh on average 200lbs and the standard deviation is 10lbs then we know that
95% of all employees will weigh 200lbs +-20lbs.
In fact any high school maths student will tell you that there are three points that are
generally used in statistics.
Number of Standard Deviations | Percentage of occurrences. |
---|---|
+-1.65 | 90% |
+-1.96 | 95% |
+-2.58 | 99% |
The figures listed above are used for large samples such as those you would get when measuring business to consumer
activity. In certain cases i.e. when measuring business to business marketing statistics
where the samples are considerably smaller, we would even be interested in the 80% and 85% cases.
Returning to our weight example, if we wanted to cater for 99% of employees we could say that
we would expect an employee to weigh 200lbs +-26lbs.
Conclusion
From just the brief descriptions here we can see that knowing the average and standard deviation allows us to describe the
characteristics of our sample in more depth than providing a single figure.
We can also see that it is important to prevent our samples being distorted by extreme values.
I tried to apply the above techniques to measuring the performance of SQL Server, however I found that,
even with no-one else connected to a server, the performance of the server was apparently random.
I measured the time per itteration for 1,000 itterations for an UPDATE statement to UPDATE 10,000 rows.
Examining the results showed a huge variation in results and no particular pattern.
The techniques did work well on examining the characteristics and purchasing habits of customers.