Click here to monitor SSC
SQLServerCentral is supported by Red Gate Software Ltd.
 
Log in  ::  Register  ::  Not logged in
 
 
 

Aggregate Queries

By Kathi Kellenberger, (first published: 2005/12/19)

Aggregate Queries

The use of aggregate functions in queries is a popular and powerful feature of the SQL language. It takes some practice to become proficient in their use. Query writers are frequently confused about the difference between the WHERE clause and the HAVING clause or have trouble figuring out which columns to include in the GROUP BY clause. This article will explain how to write queries using aggregates and, hopefully, clear up the confusion. It is intended to cover the basics of using aggregate expressions and does not cover the advanced features.

What are aggregate functions?

Aggregate functions are used to summarize data in queries. They usually work on groups of data, however, in some cases they will work on the entire table. The most commonly used aggregate functions are AVG, COUNT, MIN, MAX and SUM. Keep the following in mind when working with these aggregate functions.

  • The functions AVG and SUM will only operate on numeric columns.
  • The functions MIN and MAX will work on numeric, character and date columns.
  • The COUNT function can operate on any column except for text, ntext or image columns.
  • COUNT can be used with an asterisk (*) to give the count of the rows by group or entire set of results.
  • The aggregate functions ignore NULL values.
  • The DISTINCT argument, when used within an aggregate expression, will cause it to operate on non-null unique values.

The following query from the Northwind database finds the count of orders and the maximum freight grouped by EmployeeID. Rows are filtered out of the results if the maximum freight of a group is less than 800 or the ShippedDate of an order IS NULL.

SELECT COUNT(*), MAX(Freight), EmployeeID
FROM Orders
WHERE ShippedDate IS NOT NULL
GROUP BY  EmployeeID
HAVING MAX(Freight) >= 800
ORDER BY EmployeeID

The building blocks

Let’s take a look at the parts, known as clauses, of an aggregate query. I have always found it is easier to write the query if I figure out each clause before moving on to the next one. This step-by-step approach may work for you, too.

The first part of the statement is the SELECT clause. It lists the columns and aggregate expressions that are returned to the client application. Make sure you only include columns that you really need in the results, because the SELECT clause directly affects which columns will be required in the GROUP BY clause. Keep these facts in mind when building the SELECT clause.

  • You can use columns, literal values, aggregate expressions and other expressions in the SELECT clause.
  • Any columns listed that are not part of an aggregate expression will be used to group the results when you get to the GROUP BY clause.

The next part of the statement is the FROM clause. This clause specifies the tables and or views from which the data is queried. The tables and views are joined using the usual JOIN syntax that is used for any other T-SQL statement. Refer to Books Online if you need to learn more about how to join tables.

The WHERE clause is a very important part of an aggregate query. By using the WHERE clause you can eliminate some of the rows before the rows are grouped and the aggregate functions do their work. You can specify any of the available columns from the tables or views in the WHERE clause even if they haven't been used. Notice that the sample query above is filtering on ShippedDate which is not used anywhere else in the query. You are, however, restricted from including any aggregates in the WHERE clause. This makes sense when you consider that the WHERE clause is processed by SQL before the aggregate functions. Remember these rules when building the WHERE clause:

  • Any column can be used in the WHERE clause.
  • Nonaggregate expressions can be used in the WHERE clause.

The GROUP BY clause is one that query writers struggle with quite often. As its name suggest, it is used to group the data. The aggregate functions are applied to the groups. If a GROUP BY is not used in the query, then the aggregate functions are applied to the entire set of rows returned. Remember, the WHERE clause may be used to filter out some of the rows from the table first. You must include all columns in the GROUP BY clause that are used in the SELECT clause unless the column is part of an aggregate expression. One of my college professors insisted that the opposite was true, that you have to include in the SELECT clause any columns listed in the GROUP BY clause. While it is true that the columns in the two clauses usually match, it is possible to list a column in the GROUP BY clause without listing it in the SELECT clause. I can’t think of any good reason to do this, however. I will include an example query demonstrating this technique and a better way to write the same query later in the article.

If you don’t want to group on a column, don’t list it in the SELECT clause. In the example query above, you can not list information about individual orders since you are grouping the orders by EmployeeID. If you included the OrderID column in the SELECT clause, for example, you would then have to list the column in the GROUP BY clause. Your summary information would no longer apply to EmployeeID but to the individual orders, which wouldn’t be a summary at all. There are cases where you will need to list the details about the row that satisfies certain criteria based on an aggregate function, for example, the last order placed by each customer. An sample query later in the article will show how to do this using a derived table.

Remember these rules when writing the GROUP BY clause:

  • The results will be grouped by the columns listed in the GROUP BY clause.
  • All columns found in the SELECT clause, including those used in nonaggregate expressions, must be listed in the GROUP BY clause.

The GROUP BY clause also has some optional arguments: ALL, WITH ROLLUP and WITH CUBE. These will not be covered in this article.

The HAVING clause is used to filter rows after the grouping has been applied, but before the results are returned to the client. You can only include aggregate expressions and columns that are listed in the GROUP BY clause. It is possible to include different aggregate functions than those in the SELECT clause. In the example query above, you could write a query using the HAVING clause to include only employees whose orders' average freight was above a certain amount. You could also include the EmployeeID column in the HAVING clause since it is listed in the GROUP BY clause. Chances are, though, that it would make more sense to filter the EmployeeID column in the WHERE clause instead.

  • Any column or expression listed in the GROUP BY clause may be included in the HAVING clause.
  • Filter aggregate results using the HAVING clause, not the WHERE clause.

The last clause is the ORDER BY clause. You might think that the ORDER BY clause would not be necessary -- that SQL would order the results according to the GROUP BY columns, but this is not the case. If you want the results to sort according to the groups, the columns must be specified in the ORDER BY clause. The ORDER BY clause has the same rules as the SELECT clause, only aggregates and columns listed in the GROUP BY are allowed.

  • Any column or expression listed in the GROUP BY clause may be included in the ORDER BY clause.
  • The ORDER BY is required if you want ordered results.

Sample queries using Northwind

This query joins two tables, orders and order details. It returns the OrderID column along with the count of line items in each order if they were shipped via method 3 and if they have at least two line items. Finally, the results are ordered by the OrderID column:

SELECT o.OrderID, COUNT(*) AS Lines
FROM orders o JOIN [order details] od
ON o.OrderID = od.OrderID
WHERE ShipVia = 3
GROUP BY o.OrderID
HAVING COUNT(*) > 1
ORDER BY o.OrderID

You can’t update a column using an aggregate function directly, but you can use a derived table to product the desired results.

First, run this statement to add a column to the Employees table in Northwind:

ALTER TABLE Employees ADD ORDER_COUNT INT NULL

The following query will generate an error:


UPDATE e
SET ORDER_COUNT = COUNT(*)
FROM Employees e JOIN Orders o
ON e.EmployeeID = o.EmployeeID

This update query using a derived table shown in blue works fine:

UPDATE e
SET ORDER_COUNT = ORDERCOUNT
FROM Employees e JOIN 
(SELECT COUNT(*) AS ORDERCOUNT, EmployeeID 
FROM ORDERS 
GROUP BY EmployeeID) o
ON e.EmployeeID = o.EmployeeID

This example uses a couple of expressions. Notice that they do not need to be included in the GROUP BY clause because they do not use any of the table's columns:

SELECT getDate() as CurrentDate, 1 + 2 as Add1Plus2, 
EmployeeID, count(*) as ORDERCOUNT
FROM Orders
GROUP BY EmployeeID

The query below will run without error because the column, OrderDate, used in the expressions is listed in the GROUP BY clause. It does not, however, give the expected results:


SELECT COUNT(*) AS OrderCount, 
   MONTH(OrderDate) AS OrderMonth,
   YEAR(OrderDate) AS OrderYear
FROM Orders
GROUP BY OrderDate
ORDER BY OrderDate

To get the correct results, a summary by year and month, the query must be grouped by the actual expressions, not just the column:

SELECT COUNT(*) AS OrderCount, 
   MONTH(OrderDate) AS OrderMonth, 
   YEAR(OrderDate) AS OrderYear
FROM Orders GROUP BY YEAR(OrderDate), MONTH(OrderDate)
ORDER BY YEAR(OrderDate), MONTH(OrderDate)

It is possible to list a column in the GROUP BY clause and not in the SELECT clause. Here is a query that runs but doesn't really return any useful information:

SELECT MAX(OrderID) AS lastOrder
FROM Orders
GROUP BY CustomerID

There may be times when this technique can be used with a sub-query to solve a particular problem. In this example, I am listing more information about the latest order for each customer:

SELECT OrderID, CustomerID, OrderDate, EmployeeID, 
   ShipVia, Freight, ShipName
FROM Orders
WHERE OrderID in (
   SELECT MAX(OrderID) 
   FROM Orders
   GROUP BY CustomerID)

The query could be better written with a derived table:

SELECT OrderID, o.CustomerID, EmployeeID, OrderDate, ShipVia, Freight, ShipName
FROM Orders o JOIN 
   (SELECT MAX(OrderID) as MaxOrder, CustomerID FROM orders GROUP BY CustomerID) m
ON o.CustomeriD = m.CustomerID
WHERE OrderID = MaxOrder

Conclusion

Writing aggregate queries just takes a little practice once you understand the rules. Remember to first figure out what you need to see in the results (SELECT) and which tables or views the data will come from (FROM). Filter out any rows you don’t want included at all (WHERE). Determine how the results will be grouped (GROUP BY) by looking at the SELECT clause. Decide if you need to filter based on the groups using aggregate functions (HAVING). Finally, sort the results (ORDER BY). Often a derived table will help solve more complex problems.

Total article views: 35389 | Views in the last 30 days: 39
 
Related Articles
FORUM

Group by clause query

Group by clause query

FORUM

Sql query with group by clause

Sql query with group by clause

FORUM

Order by clause in openrowset

Order by clause in openrowset

BLOG

Group by.....Having Clause

Group By:- Group By clauses is used to groups rows based on the distinct values of the specified col...

FORUM

Multi Query Group BY

Need to combine two queries with same GROUP BY clause

Tags
advanced querying    
t-sql    
 
Contribute

Join the most active online SQL Server Community

SQL knowledge, delivered daily, free:

Email address:  

You make SSC a better place

As a member of SQLServerCentral, you get free access to loads of fresh content: thousands of articles and SQL scripts, a library of free eBooks, a weekly database news roundup, a great Q & A platform… And it’s our huge, buzzing community of SQL Server Professionals that makes it such a success.

Join us!

Steve Jones
Editor, SQLServerCentral.com

Already a member? Jump in:

Email address:   Password:   Remember me: Forgotten your password?
Steve Jones