### Introduction

Recently, I noticed a lot of questions about how SQL aggregate functions

handle NULL values on a discussion board for an article here at SQL Server

Central Understanding the difference between IS NULL and =NULL discussion board. This article was written to help answer some of those questions by explaining how SQL aggregate functions handle NULLs, and to describe some other "gotcha's" when dealing with SQL aggregate functions. Each section of the article ends with a summary of "gotcha's" to look out for when using these function.

The ANSI SQL-92 Standard defines five *aggregate functions* (which are also referred to in the SQL-92 standard as *set functions*.) SQL NULLs receive special internal handling that the database designer/programmer needs to be aware of when using these functions. This article describes how ANSI SQL-92 aggregate functions handle NULLs, per the ANSI SQL-92 standard.

I reference the ANSI SQL-92 Standard throughout the article, and the sample code requires the Microsoft SQL Server Northwind Sample Database to be run. All samples were run on SQL Server 2000.

### Aggregate Functions

ANSI SQL-92 defines five *aggregate functions*. For all set functions that specify a column name, the following process takes place behind the scenes:

- The search expression (the
**WHERE**clause) is applied. - The
*value expression*(scalar functions on the column like**COALESCE()**)is applied to each row of the results of Step 1.

- NULL values are eliminated from our results*.

*NOTE: **COUNT(*)** is a special case, which we describe in detail below.

ANSI SQL-92 specifies that a warning message will be returned when NULLs are eliminated by set functions. You can turn on

this option by using the **SET ANSI_WARNINGS ON** command. The **SET ANSI_WARNINGS ON** option returns the following error message

*"Warning: Null value is eliminated by an aggregate or other set operation." *

The **SET ANSI_WARNINGS ON** also affects

how SQL Server handles division by zero, arithmetic overflow and character/binary data truncation behavior. Before changing

your **SET ANSI_WARNINGS** option, be sure to reference the following MSDN article:

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/tsqlref/ts_set-set_6d2r.asp

Gotcha: SET ANSI_WARNINGS
check http://msdn.microsoft.com/library/default.asp?url=/library/en-us/tsqlref/ts_set-set_6d2r.asp before changing this option. |

### COUNT()

The **COUNT()** set function has two forms:

**COUNT(**– Returns the count of rows in a table, less NULLs*column*)**COUNT(*)**– Returns the cardinality (total number of rows) of a table

The following code demonstrates the difference between **COUNT( column)** and

**COUNT(*)**:

SELECT COUNT(*) AS TotalRows, COUNT(region) AS NonNULLRows, COUNT(*) - COUNT(region) AS NULLRows FROM [Northwind].[dbo].[suppliers]

This query returns the following result:

TotalRows | NonNULLRows | NULLRows |
---|---|---|

29 | 9 | 20 |

- TotalRows is
**COUNT(*)**, the cardinality (total number of rows) of the suppliers table - NonNULLRows is
**COUNT(region)**, the total number of rows of the suppliers table in which region is not NULL - NULLRows is
**COUNT(*) – COUNT(region)**, the difference between the cardinality of the suppliers table and the numberof rows in which region is not NULL; basically this returns the number of rows

in which region *is* NULL.

The fact that **COUNT(*)** and **COUNT( column)** treat NULLs differently can be used to your advantage in

many situations such as calculating the NULLRows column above.

Notice that we can get the NULLRows count from above with the following query as well:

SELECT COUNT(*) AS NULLRows FROM [Northwind].[dbo].[suppliers] WHERE region IS NULL

This will not work if you specify the region column in your count, however.

Here we'll take a look at why it is important to understand the order in which

SQL performs the steps in a **SELECT** statement that contains an aggregate

function. This query returns zero as the NULLRows count:

SELECT COUNT(region) AS NULLRows FROM [Northwind].[dbo].[suppliers] WHERE region IS NULL

Why does this query return zero? Remember, by definition **COUNT( column)** applies the

*value expression*, the *search expression * and

then eliminates NULLs before giving you a final count! Here's a play-by-play of what happens in our example:

- The
**WHERE**clause eliminates every row in which region is not NULL. - The
*value expression*in this example is simply a column name -*region*. - The
**COUNT(region)**function eliminates every row in which region is NULL. - Since we've explicitly eliminated all non-NULL columns with the
**WHERE**clause and implicitly eliminated all NULLcolumns with the

**COUNT(region)**statement, we end up with a zero rows.

Gotcha: COUNT(
column) eliminates NULL values; number of rows) for an entire table. The order in which SQL performs the steps in aggregate calculations can have implications for your results. |

### SUM()

**SUM()**, like the other set functions, applies the *search expression*,

then the *value expression*, eliminates NULLs and finally returns a sum of the

remaining rows. It is important to note that **SUM()** excludes NULLs, as NULL scalar arithmetic normally returns NULL. For

instance, let's do a scalar addition of all the values in the reportsto column of the employees table:

SELECT 2 + NULL + 2 + 2 + 2 + 5 + 5 + 2 + 5

This **SELECT **statement will return NULL, since NULL plus anything always returns NULL. If we use the **SUM()** function on

the table, however, we can rest assured that NULLs are eliminated from our

result:

SELECT SUM(reportsto) AS RowSum FROM [Northwind].[dbo].[employees]

The result of this query:

RowSum |
---|

25 |

Gotcha: NULLs are eliminated
by the anything equals NULL. |

### AVG() NULL-handling

The **AVG()** function returns the average of the column, excluding NULLs. There are a couple of "gotcha's"

concerning the **AVG()** function. The first "gotcha" is that NULLs are excluded, as with other set functions,

so your answer may not be what you expect. The second "gotcha" is that, by definition, **AVG()** returns a number

of the same scale as the column it is performed on. Again, this could throw your answer off a bit.

We'll tackle the NULL issue first. **AVG()** specifically excludes NULLs when it calculates averages. The following

query divides the sum of the reportsto column by the count of non-NULL rows. In our example, one row is excluded from the

calculation:

SELECT AVG(reportsto) AS ColAvg FROM [Northwind].[dbo].[employees]

Our result is:

ColAvg |
---|

3 |

This is the same result as if we performed this query:

SELECT SUM(reportsto)/COUNT(reportsto) FROM [Northwind].[dbo].[employees]

So how does excluding NULLs affect our average? We can use **COALESCE()** function to see.

SELECT AVG(COALESCE(reportsto, 0)) AS ColAvg FROM [Northwind].[dbo].[employees]

The **COALESCE(reportsto, 0)** statement forces **AVG()** to count NULL rows as zeroes; forcing every single row to be

counted. This gives us the following result:

ColAvg |
---|

2 |

We can get the same result using the **SUM()** and **COUNT()** set functions:

SELECT SUM(reportsto)/COUNT(*) AS ColAvg FROM [Northwind].[dbo].[employees]

ColAvg |
---|

2 |

Here we are performing a **SUM()** on the rows that are not NULL, then dividing by the total number of rows, effectively

treating the NULL rows as zeroes in our average. In this instance we're dealing with smaller numbers and only nine rows. With

larger numbers and more rows the difference is more pronounced.

So which method should you use? That depends on the result you're looking for. Consider an example of calculating the

average payroll for a department. Suppose you have the following table:

Name | Weekly Pay |
---|---|

Joe M. | 400.00 |

Lisa D. | 500.00 |

Our average is $450.00 per week. Now let's say that we hire a new guy, but his weekly pay has not been put into the system

yet. Now we have:

Name | Weekly Pay |
---|---|

Joe M. | 400.00 |

Lisa D. | 500.00 |

Tom J. | NULL |

If we include Tom's missing salary in our average (treat it like 0.00), our average goes down to $300.00 per week. If we

leave Tom out of it for now – the default for **AVG()** – we get $450.00 per week again.

So which method is correct? Ignore NULLs or treat them as zeroes in your averages? That depends entirely on your

objectives and corporate policies. The main thing is to be consistent in your calculations. If you ignore NULLs in calculating

averages, use that same methodology every time; otherwise your results will not be comparable from

year to year, quarter to quarter or even day to day.

Gotcha: AVG()
eliminates NULLs when calculating your average. |

### AVG() Scale

The second issue is scale. In our examples, the averages we received were calculated and returned with the same scale

(digits after the decimal point) as the column; in this case the column is an INT data type, so no decimals were returned.

We'll apply the ** CAST** operator to see what would have been returned if we were using a floating-point data type:

SELECT CAST(AVG(reportsto) AS FLOAT) AS ColAvg FROM [Northwind].[dbo].[employees]

ColAvg |
---|

3.0 |

So far, not much has changed. We received 3 previously and 3.0 this time. So let's look at the equivalent average using

**SUM()** and **COUNT()**:

SELECT CAST(SUM(reportsto) AS FLOAT)/COUNT(reportsto) AS ColAvg FROM [Northwind].[dbo].[employees]

ColAvg |
---|

3.125 |

Notice that we are forcing floating point precision now by casting the

reportsto column as a FLOAT. We can get this same result by moving the **CAST** inside the **AVG()** function, like this:

SELECT (AVG(CAST(reportsto AS FLOAT))) AS ColAvg FROM [Northwind].[dbo].[employees]

So how did we end up with a different value the second time? Since **AVG()** calculates using the same scale as the

column data type, our INT column average is calculated with no decimal places. In the first example, we **CAST** the result

of the **AVG()** function to a FLOAT type, but the **AVG()** was calculated as an INT first.

In the second example, we **CAST** the ** SUM()** of the reportsto column to a floating point type, and then

divided; so we

didn't lose any digits after the decimal point in the calculation. As shown, we can get the same result by using **CAST **on the column inside the

**AVG()** function. This is particularly important to note in financial

and scientific calculations, where the scale is particularly important to our

calculations.

If your scale requirements are greater than the scale of your column, you might need to use the

**SUM( column)/COUNT(column)** method of calculating your average and

**CAST**it to a floating or fixed-point type,

or **CAST **inside of the **AVG()** function.

Gotcha: AVG() also
uses the precision (total number of digits) and scale (number of digits after the decimal point) of the column you are performing the on. |

### MIN(), MAX()

The **MIN() **and **MAX()** functions return the minimum and maximum values of a column, respectively. Like the other

set functions, these also eliminate NULLs before returning a result.

For numeric data types, **MIN()** and **MAX()** simply return the minimum or maximum number in the column. For

character data types, the values are compared using standard SQL form: the shorter of the values is right-padded with

'pad characters' (usually spaces). This means that character columns with trailing spaces are mixed in with character columns

without trailing spaces. As an example:

IF 'hi' = 'hi ' PRINT 'Equal' ELSE PRINT 'Not Equal'

Depending on your collation settings, string comparisons might be case-sensitive or case-insensitive.

Gotcha: NULLs are eliminated
by types, shorter values are right-padded with 'pad characters' (often spaces) before comparing them. |

### Conclusion

SQL set functions are very useful in calculating basic statistical values. By knowing how SQL set functions handle NULL

values you can ensure that you always get accurate, high-quality results.

### Further Reading

- MSDN, "SET ANSI_WARNINGS". http://msdn.microsoft.com/library/default.asp?url=/library/en-us/tsqlref/ts_set-set_6d2r.asp
- ANSI SQL-92 Standard (Draft). http://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt