Gotcha! SQL Aggregate Functions and NULL

,

Introduction

Recently, I noticed a lot of questions about how SQL aggregate functions

handle NULL values on a discussion board for an article here at SQL Server

Central Understanding the difference between IS NULL and =NULL discussion board. This article was written to help answer some of those questions by explaining how SQL aggregate functions handle NULLs, and to describe some other "gotcha's" when dealing with SQL aggregate functions. Each section of the article ends with a summary of "gotcha's" to look out for when using these function.

The ANSI SQL-92 Standard defines five aggregate functions (which are also referred to in the SQL-92 standard as set functions.) SQL NULLs receive special internal handling that the database designer/programmer needs to be aware of when using these functions. This article describes how ANSI SQL-92 aggregate functions handle NULLs, per the ANSI SQL-92 standard.

I reference the ANSI SQL-92 Standard throughout the article, and the sample code requires the Microsoft SQL Server Northwind Sample Database to be run. All samples were run on SQL Server 2000.

Aggregate Functions

ANSI SQL-92 defines five aggregate functions. For all set functions that specify a column name, the following process takes place behind the scenes:

  1. The search expression (the WHERE clause) is applied.
  2. The value expression (scalar functions on the column like COALESCE())

    is applied to each row of the results of Step 1.

  3. NULL values are eliminated from our results*.

*NOTE:  COUNT(*) is a special case, which we describe in detail below.

ANSI SQL-92 specifies that a warning message will be returned when NULLs are eliminated by set functions. You can turn on

this option by using the SET ANSI_WARNINGS ON command. The SET ANSI_WARNINGS ON option returns the following error message

"Warning: Null value is eliminated by an aggregate or other set operation."

The SET ANSI_WARNINGS ON also affects

how SQL Server handles division by zero, arithmetic overflow and character/binary data truncation behavior. Before changing

your SET ANSI_WARNINGS option, be sure to reference the following MSDN article:

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/tsqlref/ts_set-set_6d2r.asp

Gotcha:  SET ANSI_WARNINGS

ON affects several aspects of SQL Server behavior.  Be sure to

check

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/tsqlref/ts_set-set_6d2r.asp

before changing this option.

COUNT()

The COUNT() set function has two forms:

  • COUNT(column) – Returns the count of rows in a table, less NULLs
  • COUNT(*) – Returns the cardinality (total number of rows) of a table

The following code demonstrates the difference between COUNT(column) and COUNT(*):

SELECT COUNT(*) AS TotalRows, COUNT(region) AS NonNULLRows, COUNT(*) - COUNT(region) AS NULLRows
FROM [Northwind].[dbo].[suppliers]

This query returns the following result:

TotalRows NonNULLRows NULLRows
29 9 20
  1. TotalRows is COUNT(*), the cardinality (total number of rows) of the suppliers table
  2. NonNULLRows is COUNT(region), the total number of rows of the suppliers table in which region is not NULL
  3. NULLRows is COUNT(*) – COUNT(region), the difference between the cardinality of the suppliers table and the number

    of rows in which region is not NULL; basically this returns the number of rows

    in which region *is* NULL.

The fact that COUNT(*) and COUNT(column) treat NULLs differently can be used to your advantage in

many situations such as calculating the NULLRows column above.

Notice that we can get the NULLRows count from above with the following query as well:

SELECT COUNT(*) AS NULLRows
FROM [Northwind].[dbo].[suppliers]
WHERE region IS NULL

This will not work if you specify the region column in your count, however.

Here we'll take a look at why it is important to understand the order in which

SQL performs the steps in a SELECT statement that contains an aggregate

function. This query returns zero as the NULLRows count:

SELECT COUNT(region) AS NULLRows
FROM [Northwind].[dbo].[suppliers]
WHERE region IS NULL

Why does this query return zero? Remember, by definition COUNT(column) applies the

value expression, the search expression and

then eliminates NULLs before giving you a final count! Here's a play-by-play of what happens in our example:

  1. The WHERE clause eliminates every row in which region is not NULL.
  2. The value expression in this example is simply a column name - region.
  3. The COUNT(region) function eliminates every row in which region is NULL.
  4. Since we've explicitly eliminated all non-NULL columns with the WHERE clause and implicitly eliminated all NULL

    columns with the COUNT(region) statement, we end up with a zero rows.

Gotcha:  COUNT(column)

eliminates NULL values; COUNT(*) returns the cardinality (total

number of rows) for an entire table.  The order in which SQL performs

the steps in aggregate calculations can have implications for your

results. 

SUM()

SUM(), like the other set functions, applies the search expression,

then the value expression, eliminates NULLs and finally returns a sum of the

remaining rows. It is important to note that SUM() excludes NULLs, as NULL scalar arithmetic normally returns NULL. For

instance, let's do a scalar addition of all the values in the reportsto column of the employees table:

SELECT 2 + NULL + 2 + 2 + 2 + 5 + 5 + 2 + 5

This SELECT statement will return NULL, since NULL plus anything always returns NULL. If we use the SUM() function on

the table, however, we can rest assured that NULLs are eliminated from our

result:

SELECT SUM(reportsto) AS RowSum
FROM [Northwind].[dbo].[employees]

The result of this query:

RowSum
25

 

Gotcha:  NULLs are eliminated

by the SUM() function; in scalar arithmetic, however, NULL added to

anything equals NULL.

AVG() NULL-handling

The AVG() function returns the average of the column, excluding NULLs. There are a couple of "gotcha's"

concerning the AVG() function. The first "gotcha" is that NULLs are excluded, as with other set functions,

so your answer may not be what you expect. The second "gotcha" is that, by definition, AVG() returns a number

of the same scale as the column it is performed on. Again, this could throw your answer off a bit.

We'll tackle the NULL issue first. AVG() specifically excludes NULLs when it calculates averages. The following

query divides the sum of the reportsto column by the count of non-NULL rows. In our example, one row is excluded from the

calculation:

SELECT AVG(reportsto) AS ColAvg
FROM [Northwind].[dbo].[employees]

Our result is:

ColAvg
3

This is the same result as if we performed this query:

SELECT SUM(reportsto)/COUNT(reportsto)
FROM [Northwind].[dbo].[employees]

So how does excluding NULLs affect our average? We can use COALESCE() function to see.

SELECT AVG(COALESCE(reportsto, 0)) AS ColAvg
FROM [Northwind].[dbo].[employees]

The COALESCE(reportsto, 0) statement forces AVG() to count NULL rows as zeroes; forcing every single row to be

counted. This gives us the following result:

ColAvg
2

We can get the same result using the SUM() and COUNT() set functions:

SELECT SUM(reportsto)/COUNT(*) AS ColAvg
FROM [Northwind].[dbo].[employees]
ColAvg
2

Here we are performing a SUM() on the rows that are not NULL, then dividing by the total number of rows, effectively

treating the NULL rows as zeroes in our average. In this instance we're dealing with smaller numbers and only nine rows. With

larger numbers and more rows the difference is more pronounced.

So which method should you use? That depends on the result you're looking for. Consider an example of calculating the

average payroll for a department. Suppose you have the following table:

Name Weekly Pay
Joe M. 400.00
Lisa D. 500.00

Our average is $450.00 per week. Now let's say that we hire a new guy, but his weekly pay has not been put into the system

yet. Now we have:

Name Weekly Pay
Joe M. 400.00
Lisa D. 500.00
Tom J. NULL

If we include Tom's missing salary in our average (treat it like 0.00), our average goes down to $300.00 per week. If we

leave Tom out of it for now – the default for AVG() – we get $450.00 per week again.

So which method is correct? Ignore NULLs or treat them as zeroes in your averages? That depends entirely on your

objectives and corporate policies. The main thing is to be consistent in your calculations. If you ignore NULLs in calculating

averages, use that same methodology every time; otherwise your results will not be comparable from

year to year, quarter to quarter or even day to day.

Gotcha:  AVG()

eliminates NULLs when calculating your average. 

AVG() Scale

The second issue is scale. In our examples, the averages we received were calculated and returned with the same scale

(digits after the decimal point) as the column; in this case the column is an INT data type, so no decimals were returned.

We'll apply the CAST operator to see what would have been returned if we were using a floating-point data type:

SELECT CAST(AVG(reportsto) AS FLOAT) AS ColAvg
FROM [Northwind].[dbo].[employees]
ColAvg
3.0

So far, not much has changed. We received 3 previously and 3.0 this time. So let's look at the equivalent average using

SUM() and COUNT():

SELECT CAST(SUM(reportsto) AS FLOAT)/COUNT(reportsto) AS ColAvg
FROM [Northwind].[dbo].[employees]
ColAvg
3.125

Notice that we are forcing floating point precision now by casting the

reportsto column as a FLOAT.  We can get this same result by moving the CAST inside the AVG() function, like this:

SELECT (AVG(CAST(reportsto AS FLOAT))) AS ColAvg
FROM [Northwind].[dbo].[employees]

So how did we end up with a different value the second time? Since AVG() calculates using the same scale as the

column data type, our INT column average is calculated with no decimal places. In the first example, we CAST the result

of the AVG() function to a FLOAT type, but the AVG() was calculated as an INT first.

In the second example, we CAST the SUM() of the reportsto column to a floating point type, and then

divided; so we

didn't lose any digits after the decimal point in the calculation. As shown, we can get the same result by using CAST on the column inside the

AVG() function.  This is particularly important to note in financial

and scientific calculations, where the scale is particularly important to our

calculations.

If your scale requirements are greater than the scale of your column, you might need to use the

SUM(column)/COUNT(column) method of calculating your average and CAST it to a floating or fixed-point type,

or CAST inside of the AVG() function.

Gotcha:  AVG() also

uses the precision (total number of digits) and scale (number of digits

after the decimal point) of the column you are performing the AVG()

on.

MIN(), MAX()

The MIN() and MAX() functions return the minimum and maximum values of a column, respectively. Like the other

set functions, these also eliminate NULLs before returning a result.

For numeric data types, MIN() and MAX() simply return the minimum or maximum number in the column. For

character data types, the values are compared using standard SQL form: the shorter of the values is right-padded with

'pad characters' (usually spaces). This means that character columns with trailing spaces are mixed in with character columns

without trailing spaces. As an example:

IF 'hi' = 'hi '
  PRINT 'Equal'
ELSE
  PRINT 'Not Equal'

Depending on your collation settings, string comparisons might be case-sensitive or case-insensitive.

Gotcha:  NULLs are eliminated

by MIN() and MAX() functions.  For character data

types, shorter values are right-padded with 'pad characters' (often

spaces) before comparing them.

Conclusion

SQL set functions are very useful in calculating basic statistical values. By knowing how SQL set functions handle NULL

values you can ensure that you always get accurate, high-quality results.

Further Reading

Rate

4.9 (58)

Share

Share

Rate

4.9 (58)