# SQL Server Statistical Functions

By David Poole, 2004/01/12

Statistical Functions

### Background

When we talk about statistics what we are talking about is a means to evaluate an outcome or potential outcome against a model

The model is unlikely to provide and absolute answer but should be robust enough to give a good indicator of behaviour in the real world.

### SQL Server statistical functions

• COUNT
• SUM
• AVG
• STDEV
• STDEVP
• VAR
• VARP

The first two you will definitely be familiar with

### AVG

To the non-mathematical an average is simply the sum of the numbers divided by the number of numbers. To a mathematician this is called the "Arithmetic Mean".

A mathematician/statistician probably won’t use the term "average" as average is a generic term that can mean a variety of things. A statistician can use any of the terms below to mean average.

 Average Term Meaning Arithmetic mean The AVG function. Sum of the numbers divided by the number of numbers. Assumed mean An estimated mean. This can be used when information is given in bands i.e. sales figures in bands by customer age 15 to 21 year olds, 22 to25 year olds etc. MEDIAN The middle value when the values are arranged in order. When there are an even number of values take the mid-point between the two in the middle. MODE The most common value.

An average can even be a number picked at random within a range of values, though this isn’t generally used.

### Problems with the AVG function.

The problem with averages is that what we are trying to do is describe a typical occurrence.

Let us suppose that you want to know the average price of an item that a customer buys.

Take the example of a proper hi-fi shop. A shop selling hi-fi the purchase of which can be considered reasonable grounds for a divorce!

A typical customer may buy a £1,000 amplifier and £100 of connecting leads. Average item spend is £550. But no-one spends £550! Your average figure is completely useless!

To use an analogy, a man with his feet in the fridge and his head in the oven is, on average, perfectly comfortable.

As another example take one of those catalogues selling brushes and household knick-knacks. Prices range from £5 to £15 and all products sell equally well.

Average item spend is £10. But there is nothing special about items costing £10! They sell equally well as items selling for £5 or £15.

Knowing an average value is only useful if the values that you are measuring tend to cluster around that average.

### Catering for extremes

Look at the unit prices in the products table in the NORTHWIND database.

SELECT AVG(UnitPrice)
FROM Products

Returns £28.87

If you look at the actual products in the table you will see that there is one product priced way over and above the others. This single product causes the average to be an artificially high value.

Really we should exclude this extreme and other extreme values to get a more representative figure for the average. The SQL statement below chops 3 percent off either end of the selection.

SELECT TOP 97 PERCENT DT.*
FROM
(    SELECT TOP 97 PERCENT *
FROM Products
ORDER BY UnitPrice ASC
) AS DT
ORDER BY UnitPrice DESC

The average of this new selection is £25.05

Even this could be considered high when you consider that the median value (the point at which 50% of products cost more than this and 50% of products cost less)

SELECT TOP 1 DT.*
FROM
(    SELECT TOP 50 PERCENT *
FROM Products
ORDER BY UnitPrice ASC
) AS DT
ORDER BY UnitPrice DESC

This returns £19.50!

### Inter Quartile Range

Another way of looking at extremes is to use something called the inter-quartile range

The inter-quartile range is worked out by ordering the values that you want to evaluate and splitting them into 4 equal parts. The inter-quartile range is the range of values represented by the middle two parts.

The SQL to work out what this range is as follows

SELECT MIN(UnitPrice) AS LowerRange , MAX(UnitPrice) AS UpperRange
FROM (
SELECT D1.ProductId
FROM
( SELECT TOP 25 PERCENT ProductId
FROM Products
ORDER BY UnitPrice ASC
) AS D1

UNION

SELECT D2.ProductId
FROM
( SELECT TOP 25 PERCENT ProductId
FROM Products
ORDER BY UnitPrice DESC
) AS D2
) AS DT RIGHT JOIN Products ON DT.ProductId = Products.ProductId
WHERE DT.ProductId IS NULL

This tells us that the middle 50% or inter-quartile range of prices is between £14.00 & £32.80.

NB  The combination of the TOP 25 PERCENT clause and duplicate prices the above SQL doesn't find the true inter-quartile range. If you know the total number of records within your recordset then you can use the TOP n clause instead to get a more accurate answer.

As the inter-quartile range is a broad brush measure rather than a precise measure we don't need to be too fussy about it.

Within the SPSS stats package there was a useful charting tool that produced a graph called a box plot. The box plot utilised the inter-quartile range and was a very handy way of illustrating extremes. An example is shown below.

The graph shows box plots for the prices of items bought as a result of two adverts.

• The green box represents the inter-quartile range.
• The line through the green box represents the median point.
• The T shape above and below illustrate the "outliers". In SPSS an outlier was determined as a point 1.5 times the inter-quartile range.
• Any points beyond these "outliers" are considered extremes.

It doesn't matter which method you use, remember the goal is to produce a model that acts as the most effective predictor or most accurate representation of real world behaviour.

### VAR, VARP, STDEV & STDEVP

The VAR, VARP, STDEV & STDEVP functions all give a single figure that describes how much the figures we are investigating deviate from our "average".

A point to make at this stage is that the way in which a computer or electronic calculator works out the VAR,VARP, STDEV and STDEVP is an approximation.

Let us take VARP as an example

True VARPComputer VARP

An explanation of the symbols is as follows:

The difference in the results between these two formulae are too small to be of much concern.

The reason the computer version of the formula is different dates back to the days when computers were considerably less powerful than they are today. Basically, in the computer version of the formula you can calculate your VARP in one pass through the recordset, where as the true version you would need to first calculate your average on one iteration, and then calculate your VARP on the 2nd iteration.

### The difference between VAR and VARP

The table below shows the difference between the two functions from a mathematical perspective

VARP

The variance within the entire population.

VAR

The variance within a sample of that population.

Let us suppose that it is not possible to measure the variance of an entire population. What we would do would be to take, what we hope would be, a representative sample of that population and measure that instead.

If we were taking a sample, then wouldn't we expect to build in some factor to allow for discrepancies in our sample?

Of course we would that is why the two functions have differing denominators (bottom half of the fractions). The bottom half of VAR is the same as n2-n and is known as Bessels correction.

As with any fraction, the bigger the denominator the smaller the eventual number so as we increase our sample size so the differences between what the functions return becomes smaller.

### Standard Deviation

The Standard Deviation is the square root of the Variance therefore

• STDEV is simply the square root of VAR
• STDEVP is simply the square root of VARP

Both the Variance and Standard Deviation can be represented by the Greek symbol Sigma as shown below

The standard deviation provides statisticians with a very useful tool. For example, let us go back to the scenario where it is not possible to work with the full dataset and are therefore working with a sample. If you take the average of your sample you also have to allow for some variability between the average of the population and the average of your sample.

In statistics there is a term for this called the standard error of mean. and this has a formula as follows

Let us suppose that we have a sample of 10,000 with an average of 5 and a standard deviation of 0.5.
Our standard error of mean would be 0.005

This means that from our sample we know that the true average within the population could be expected to fall +/-0.005 of our sample average of 5, or in other words between 4.995 and 5.005.

### The Normal Distribution

If we were to measure the weight of a random sample of 500 employees within a company and plot them on a graph then we would find that the graph forms a bell shape.

In statistics we can produce a model of this shape called the normal distribution.

One of the properties of the normal distribution is that 95% of all occurences occur within approximately 2 standard deviations of the average.

If our 500 employees weigh on average 200lbs and the standard deviation is 10lbs then we know that 95% of all employees will weigh 200lbs +-20lbs.

In fact any high school maths student will tell you that there are three points that are generally used in statistics.

Number of Standard DeviationsPercentage of occurrences.

+-1.65

90%

+-1.96

95%

+-2.58

99%

The figures listed above are used for large samples such as those you would get when measuring business to consumer activity. In certain cases i.e. when measuring business to business marketing statistics where the samples are considerably smaller, we would even be interested in the 80% and 85% cases.

Returning to our weight example, if we wanted to cater for 99% of employees we could say that we would expect an employee to weigh 200lbs +-26lbs.

### Conclusion

From just the brief descriptions here we can see that knowing the average and standard deviation allows us to describe the characteristics of our sample in more depth than providing a single figure.

We can also see that it is important to prevent our samples being distorted by extreme values.

I tried to apply the above techniques to measuring the performance of SQL Server, however I found that, even with no-one else connected to a server, the performance of the server was apparently random. I measured the time per itteration for 1,000 itterations for an UPDATE statement to UPDATE 10,000 rows. Examining the results showed a huge variation in results and no particular pattern.

The techniques did work well on examining the characteristics and purchasing habits of customers.

Total article views: 15146 | Views in the last 30 days: 13

Related Articles
FORUM

### average number of days

average number of days

FORUM

### Caculating Average

Average across time/overall average

FORUM

### Averaging between days

Looking for help with a query to average the number of days between multiple dates

FORUM

code sample

FORUM

### Workout the average

Help with average calculation

Tags
 miscellaneous t-sql

## Join the most active online SQL Server Community

### SQL knowledge, delivered daily, free:

#### You make SSC a better place

As a member of SQLServerCentral, you get free access to loads of fresh content: thousands of articles and SQL scripts, a library of free eBooks, a weekly database news roundup, a great Q & A platform… And it’s our huge, buzzing community of SQL Server Professionals that makes it such a success.