Click here to monitor SSC
SQLServerCentral is supported by Red Gate Software Ltd.
 
Log in  ::  Register  ::  Not logged in
 
 
 

Gotcha! SQL Aggregate Functions and NULL

By Michael Coles, (first published: 2005/07/05)

Introduction

Recently, I noticed a lot of questions about how SQL aggregate functions handle NULL values on a discussion board for an article here at SQL Server Central Understanding the difference between IS NULL and =NULL discussion board. This article was written to help answer some of those questions by explaining how SQL aggregate functions handle NULLs, and to describe some other "gotcha's" when dealing with SQL aggregate functions. Each section of the article ends with a summary of "gotcha's" to look out for when using these function.

The ANSI SQL-92 Standard defines five aggregate functions (which are also referred to in the SQL-92 standard as set functions.) SQL NULLs receive special internal handling that the database designer/programmer needs to be aware of when using these functions. This article describes how ANSI SQL-92 aggregate functions handle NULLs, per the ANSI SQL-92 standard.

I reference the ANSI SQL-92 Standard throughout the article, and the sample code requires the Microsoft SQL Server Northwind Sample Database to be run. All samples were run on SQL Server 2000.

Aggregate Functions

ANSI SQL-92 defines five aggregate functions. For all set functions that specify a column name, the following process takes place behind the scenes:

  1. The search expression (the WHERE clause) is applied.
  2. The value expression (scalar functions on the column like COALESCE()) is applied to each row of the results of Step 1.
  3. NULL values are eliminated from our results*.

*NOTE:  COUNT(*) is a special case, which we describe in detail below.

ANSI SQL-92 specifies that a warning message will be returned when NULLs are eliminated by set functions. You can turn on this option by using the SET ANSI_WARNINGS ON command. The SET ANSI_WARNINGS ON option returns the following error message "Warning: Null value is eliminated by an aggregate or other set operation."

The SET ANSI_WARNINGS ON also affects how SQL Server handles division by zero, arithmetic overflow and character/binary data truncation behavior. Before changing your SET ANSI_WARNINGS option, be sure to reference the following MSDN article: http://msdn.microsoft.com/library/default.asp?url=/library/en-us/tsqlref/ts_set-set_6d2r.asp

Gotcha:  SET ANSI_WARNINGS ON affects several aspects of SQL Server behavior.  Be sure to check http://msdn.microsoft.com/library/default.asp?url=/library/en-us/tsqlref/ts_set-set_6d2r.asp before changing this option.

COUNT()

The COUNT() set function has two forms:

  • COUNT(column) – Returns the count of rows in a table, less NULLs
  • COUNT(*) – Returns the cardinality (total number of rows) of a table

The following code demonstrates the difference between COUNT(column) and COUNT(*):

SELECT COUNT(*) AS TotalRows, COUNT(region) AS NonNULLRows, COUNT(*) - COUNT(region) AS NULLRows
FROM [Northwind].[dbo].[suppliers]

This query returns the following result:

TotalRowsNonNULLRowsNULLRows
29920
  1. TotalRows is COUNT(*), the cardinality (total number of rows) of the suppliers table
  2. NonNULLRows is COUNT(region), the total number of rows of the suppliers table in which region is not NULL
  3. NULLRows is COUNT(*) – COUNT(region), the difference between the cardinality of the suppliers table and the number of rows in which region is not NULL; basically this returns the number of rows in which region *is* NULL.

The fact that COUNT(*) and COUNT(column) treat NULLs differently can be used to your advantage in many situations such as calculating the NULLRows column above.

Notice that we can get the NULLRows count from above with the following query as well:

SELECT COUNT(*) AS NULLRows
FROM [Northwind].[dbo].[suppliers]
WHERE region IS NULL

This will not work if you specify the region column in your count, however. Here we'll take a look at why it is important to understand the order in which SQL performs the steps in a SELECT statement that contains an aggregate function. This query returns zero as the NULLRows count:

SELECT COUNT(region) AS NULLRows
FROM [Northwind].[dbo].[suppliers]
WHERE region IS NULL

Why does this query return zero? Remember, by definition COUNT(column) applies the value expression, the search expression and then eliminates NULLs before giving you a final count! Here's a play-by-play of what happens in our example:

  1. The WHERE clause eliminates every row in which region is not NULL.
  2. The value expression in this example is simply a column name - region.
  3. The COUNT(region) function eliminates every row in which region is NULL.
  4. Since we've explicitly eliminated all non-NULL columns with the WHERE clause and implicitly eliminated all NULL columns with the COUNT(region) statement, we end up with a zero rows.
Gotcha:  COUNT(column) eliminates NULL values; COUNT(*) returns the cardinality (total number of rows) for an entire table.  The order in which SQL performs the steps in aggregate calculations can have implications for your results. 

SUM()

SUM(), like the other set functions, applies the search expression, then the value expression, eliminates NULLs and finally returns a sum of the remaining rows. It is important to note that SUM() excludes NULLs, as NULL scalar arithmetic normally returns NULL. For instance, let's do a scalar addition of all the values in the reportsto column of the employees table:

SELECT 2 + NULL + 2 + 2 + 2 + 5 + 5 + 2 + 5

This SELECT statement will return NULL, since NULL plus anything always returns NULL. If we use the SUM() function on the table, however, we can rest assured that NULLs are eliminated from our result:

SELECT SUM(reportsto) AS RowSum
FROM [Northwind].[dbo].[employees]

The result of this query:

RowSum
25
 
Gotcha:  NULLs are eliminated by the SUM() function; in scalar arithmetic, however, NULL added to anything equals NULL.

AVG() NULL-handling

The AVG() function returns the average of the column, excluding NULLs. There are a couple of "gotcha's" concerning the AVG() function. The first "gotcha" is that NULLs are excluded, as with other set functions, so your answer may not be what you expect. The second "gotcha" is that, by definition, AVG() returns a number of the same scale as the column it is performed on. Again, this could throw your answer off a bit.

We'll tackle the NULL issue first. AVG() specifically excludes NULLs when it calculates averages. The following query divides the sum of the reportsto column by the count of non-NULL rows. In our example, one row is excluded from the calculation:

SELECT AVG(reportsto) AS ColAvg
FROM [Northwind].[dbo].[employees]

Our result is:

ColAvg
3

This is the same result as if we performed this query:

SELECT SUM(reportsto)/COUNT(reportsto)
FROM [Northwind].[dbo].[employees]

So how does excluding NULLs affect our average? We can use COALESCE() function to see.

SELECT AVG(COALESCE(reportsto, 0)) AS ColAvg
FROM [Northwind].[dbo].[employees]

The COALESCE(reportsto, 0) statement forces AVG() to count NULL rows as zeroes; forcing every single row to be counted. This gives us the following result:

ColAvg
2

We can get the same result using the SUM() and COUNT() set functions:

SELECT SUM(reportsto)/COUNT(*) AS ColAvg
FROM [Northwind].[dbo].[employees]
ColAvg
2

Here we are performing a SUM() on the rows that are not NULL, then dividing by the total number of rows, effectively treating the NULL rows as zeroes in our average. In this instance we're dealing with smaller numbers and only nine rows. With larger numbers and more rows the difference is more pronounced.

So which method should you use? That depends on the result you're looking for. Consider an example of calculating the average payroll for a department. Suppose you have the following table:

NameWeekly Pay
Joe M.400.00
Lisa D.500.00

Our average is $450.00 per week. Now let's say that we hire a new guy, but his weekly pay has not been put into the system yet. Now we have:

NameWeekly Pay
Joe M.400.00
Lisa D.500.00
Tom J.NULL

If we include Tom's missing salary in our average (treat it like 0.00), our average goes down to $300.00 per week. If we leave Tom out of it for now – the default for AVG() – we get $450.00 per week again.

So which method is correct? Ignore NULLs or treat them as zeroes in your averages? That depends entirely on your objectives and corporate policies. The main thing is to be consistent in your calculations. If you ignore NULLs in calculating averages, use that same methodology every time; otherwise your results will not be comparable from year to year, quarter to quarter or even day to day.

Gotcha:  AVG() eliminates NULLs when calculating your average. 

AVG() Scale

The second issue is scale. In our examples, the averages we received were calculated and returned with the same scale (digits after the decimal point) as the column; in this case the column is an INT data type, so no decimals were returned. We'll apply the CAST operator to see what would have been returned if we were using a floating-point data type:

SELECT CAST(AVG(reportsto) AS FLOAT) AS ColAvg
FROM [Northwind].[dbo].[employees]
ColAvg
3.0

So far, not much has changed. We received 3 previously and 3.0 this time. So let's look at the equivalent average using SUM() and COUNT():

SELECT CAST(SUM(reportsto) AS FLOAT)/COUNT(reportsto) AS ColAvg
FROM [Northwind].[dbo].[employees]
ColAvg
3.125

Notice that we are forcing floating point precision now by casting the reportsto column as a FLOAT.  We can get this same result by moving the CAST inside the AVG() function, like this:

SELECT (AVG(CAST(reportsto AS FLOAT))) AS ColAvg
FROM [Northwind].[dbo].[employees]

So how did we end up with a different value the second time? Since AVG() calculates using the same scale as the column data type, our INT column average is calculated with no decimal places. In the first example, we CAST the result of the AVG() function to a FLOAT type, but the AVG() was calculated as an INT first.

In the second example, we CAST the SUM() of the reportsto column to a floating point type, and then divided; so we didn't lose any digits after the decimal point in the calculation. As shown, we can get the same result by using CAST on the column inside the AVG() function.  This is particularly important to note in financial and scientific calculations, where the scale is particularly important to our calculations.

If your scale requirements are greater than the scale of your column, you might need to use the SUM(column)/COUNT(column) method of calculating your average and CAST it to a floating or fixed-point type, or CAST inside of the AVG() function.

Gotcha:  AVG() also uses the precision (total number of digits) and scale (number of digits after the decimal point) of the column you are performing the AVG() on.

MIN(), MAX()

The MIN() and MAX() functions return the minimum and maximum values of a column, respectively. Like the other set functions, these also eliminate NULLs before returning a result.

For numeric data types, MIN() and MAX() simply return the minimum or maximum number in the column. For character data types, the values are compared using standard SQL form: the shorter of the values is right-padded with 'pad characters' (usually spaces). This means that character columns with trailing spaces are mixed in with character columns without trailing spaces. As an example:

IF 'hi' = 'hi '
  PRINT 'Equal'
ELSE
  PRINT 'Not Equal'

Depending on your collation settings, string comparisons might be case-sensitive or case-insensitive.

Gotcha:  NULLs are eliminated by MIN() and MAX() functions.  For character data types, shorter values are right-padded with 'pad characters' (often spaces) before comparing them.

Conclusion

SQL set functions are very useful in calculating basic statistical values. By knowing how SQL set functions handle NULL values you can ensure that you always get accurate, high-quality results.

Further Reading

Total article views: 86592 | Views in the last 30 days: 76
 
Related Articles
FORUM

Adding a column header on count function

Adding a column header on count function

FORUM

Count Function

Count Function with condition

ARTICLE

Calculating Stock Returns Using the ROW_NUMBER function and CTEs

In finance, when pricing data isn't available on last day of week, month or year, calculating return...

ARTICLE

Using a column name in a COUNT function

Putting a column name in a COUNT() function may not always give you what you are looking for.

FORUM

get Count(*) to return 0

get Count(*) to return 0

Tags
advanced querying    
t-sql    
 
Contribute

Join the most active online SQL Server Community

SQL knowledge, delivered daily, free:

Email address:  

You make SSC a better place

As a member of SQLServerCentral, you get free access to loads of fresh content: thousands of articles and SQL scripts, a library of free eBooks, a weekly database news roundup, a great Q & A platform… And it’s our huge, buzzing community of SQL Server Professionals that makes it such a success.

Join us!

Steve Jones
Editor, SQLServerCentral.com

Already a member? Jump in:

Email address:   Password:   Remember me: Forgotten your password?
Steve Jones