Advice on using COUNT( )

In the SQL Server community, one thing that I sometimes encounter is the question on whether you should use COUNT(*) or COUNT(columnname), where columnname is a column in the table that you want to count the rows for. Often the advice given to people in forums and mailing lists is that COUNT(columnname) will perform better than COUNT(*). This is not always the correct advice though, in many cases it is even entirely wrong. Although there are situations where you can (or even should) use COUNT(columnname), you definitely shouldn't always use it. This incorrect piece of advice is probably based on a lack of understanding of how SQL Server handles data internally.

Description of COUNT( )

The first thing that you need to know is that there is a difference between the alternative ways of using COUNT( ), and what this difference is. The complete syntax for COUNT( ) is this:

COUNT ( { [ ALL | DISTINCT ] expression ] | * } )

The word expression means any expression except for uniqueidentifier, text, ntext or image data, and it may not use aggregated functions or subqueries. Most often though, expression is just a column in the table. ALL is the default, which means that writing COUNT(expression) is equal to writing COUNT(ALL expression).

COUNT(*) returns the total number of rows in the table, while COUNT(expression) returns the number of rows where the result of the expression is not NULL. Naturally, COUNT(DISTINCT expression) means that duplicates are only counted once. This means that COUNT( ) can return different results depending on how you write it.

Myths and facts

As I said earlier, many people believe COUNT(columnname) is faster than using COUNT(*), because COUNT(*) would have to read all columns of each row (just like executing a SELECT * FROM MYTABLE statement), while COUNT(columnname) only need to read the specified column. This is not true though, for several reasons.

First of all, SQL Server can't read just the contents of a single column without reading the entire row. SQL Server stores the rows with the data on 8 KB data pages on disk. These pages contain one or more rows (depending on the size of each individual row, which may be up to 8060 bytes, with some exceptions), and these pages are placed in the internal memory (RAM) when SQL Server needs to access them for any reason. To check the value of a single column (or several of course), an entire page has to be read from disk and placed in memory. The pages may of course already be cached in memory, in which case the read will be much faster, but SQL still needs to read an entire page from memory just to check a single column of a row.

Now, to avoid having to read these entire data pages when all you are really interested in is how many rows there are in a table, SQL Server will use an index instead, if one exists. Indexes are stored in the same way as data, on 8 KB index pages. Since an index is probably not as wide as a data row (the index only consists of some or even one of the columns in the row), an index page can usually fit a lot more rows per page than the data pages can. This means that SQL Server doesn't have to read as many pages to check the number of rows in the index as it does with the data pages, which is of course a good thing.

This does not only apply to COUNT(columnname_with_an_index_defined_on_it), COUNT(*) will of course also use the index to count the rows. In some cases there may not be an index that covers the specified column in a COUNT(columnname) query, but there is an index defined on another column of the table. In this case COUNT(*) would use this other index to count the number of rows, but COUNT(columnname_without_an_index) would have to read the data pages to check the column for NULL values and count the rows.

To try this for yourself, run the following script in SQL Query Analyzer (if it is not already set to show the results in text mode, use Ctrl-T to set it that way):

USE Northwind

SET STATISTICS IO ON

SELECT COUNT(*) FROM Orders

SELECT COUNT(CustomerId) FROM Orders

SELECT * FROM Orders

SET STATISTICS IO OFF

The statement SET STATISTICS IO ON configures SQL Server to output statistics showing the amount of I/O that was required to execute the query, and you can use it to compare the amount of resources used by different queries to decide which one to use. You can find this output directly after the results of the statement executed. The statistics we are interested in here is the number of logical and/or physical page reads. Logical page reads is the amount of pages (data and/or index pages) that was read from memory, and physical page reads is the number of pages read from disk. On my computer the result of COUNT( ) shows 830 rows for both alternatives, which is probably also what you got if you haven't added or deleted any rows from the Orders table. Now note the number of logical page reads for these statements (run the script a couple of times if you're getting physical page reads to cache the data in memory). I have 3 logical page reads for the first alternative, and 21 logical page reads for the second one! Also note that the third statement that SELECTs all of the rows from the table also resulted in 21 logical page reads. This shows us that the second statement had to read all of the data pages just to count the number of rows in Orders because there is no index on the CustomerId column, but the first statement is able to use an index (on my computer the index ShippersOrders was used; I checked the execution plan for the query to find that out) to count the rows.

Which one to use?

As I have shown, using COUNT(*) does certainly not mean poor performance. On the contrary, in some cases you may instead get poor performance from using COUNT(expression). Normally you probably won't encounter the problem in the example above, as you will probably have an index on the column you specified. What is worse though is that you may receive a different result from what you were expecting! Let's say that you have a legacy application that uses COUNT(columnname) to count the number of rows of a table, where columnname represents a column that does not allow NULL values. Now, sometime later, the definition for the column is changed to allow NULL values. As soon as someone enters a NULL value in the column, your application will no longer show the number of rows in the table but instead the number of rows with non-NULL values in the specified column! That may not be what the designers of the application intended and expected, and could possibly cause major problems.

But...

So, normally there is no reason not to use COUNT(*). But as I mentioned in the beginning of the article there are situations where you want to (or rather should) use COUNT(expression). One obvious example is of course if you are really only interested in the number of rows where the column value is not NULL. A typical example of a situation like that is when you use COUNT( ) together with another aggregated function. Let's say we have a table with some sort of measure data, with NULL values in some rows. Now we're looking for an average of these values. Normally, we would use AVG( ) for this, but to see the point we'll say we're not allowed to use it. Compare these two statements and see if you spot the problem:

SELECT SUM(column) / COUNT(*) FROM table

SELECT SUM(column) / COUNT(column) FROM table

These statements will return different average results, since SUM( ) ignores NULL values (they are not counted as 0). If the sum is 1500, and the number of rows is 150, of which 50 have NULL in the specified column, the result of the first query will be 10 (1500/150) and the result of the second query will be 15 (1500/100). This is actually a problem that I encounter quite often in my work as a database consultant, and most often it exists due to the fact that the person who wrote the SQL statement where not aware of how NULL values are handled differently in different aggregated functions (SUM( ) and COUNT( ) in the example above).

Rate

Share

Categories

Share

Rate

Advice on using COUNT( )

Rate

Share

Categories

Share

Rate

Related content

Uses For Derived Tables

Making Dynamic Queries Static

Optimizing Stored Procedures To Avoid Recompiles

Tip: Testing for the Existence of Data

Introduction to ADO Part 4 - Combining It All