Stairway to T-SQL DML

Stairway to T-SQL DML Level 7: Summarizing Data Using a Simple GROUP BY Clause

,

In Stairway 6 I showed you how to sort your data using the ORDER BY clause. This allowed you to organize your detailed records in sort order based on single, or multiple columns. Detailed data is great if you want to see data in specific records, but sometimes you need to roll up the detailed data into summarized values. Summarizing your data you can done using the GROUP BY clause.

There are two types of GROUP BY clauses. One that is known as the simple GROUP BY clause and another that provides a little more than simple summarization that is called the general GROUP BY clause. The main difference between these two type is the simple GROUP BY contains just the GROUP BY clause, whereas the general GROUP BY clause contains other operators like ROLLUP and CUBE.

In this article I will be covering how to group data using the simple GROUP BY clause. In a follow-up article I will be covering the more complex general GROUP BY clause.

Simple GROUP BY clause

Using the simple GROUP BY clause allows you to aggregate your data based on a single column, multiple columns or expressions. Only one summarized row will be returned for each unique value based on the columns and/or expressions specified in the GROUP BY clause. When SQL Server processes a GROUP BY clause it groups the detailed records by the unique column or expression values, and then summarizes each set based on the aggregation functions included in the select list.

To better grasp how to use the GROUP BY, let’s assume you have a table that contains detailed sales information for different stores and you want to summarize total sales amount by store. You can use the GROUP BY clause to aggregate the total sales amount by each store. In this example the unique column you would group on would be store name, and the column to be aggregated would be the sales amount. Your results would show one row for each unique store name, and the row for each store would contain the sum of the sales amounts for that store.

SQL Server has some limitation on what columns can be included in the SELECT list of a GROUP BY query. Each column specified in the SELECT list of a GROUP BY query needs to fall into one of the following categories:

  • A column specified in the GROUP BY clause
  • An expression specified in the GROUP BY clause
  • The value returned from an aggregate function

If a column doesn’t fall into one of these categories you will get an error when trying to run your GROUP BY query. Note a column or expression contained in the GROUP BY clause is not required to be in the select list.

Let me go through a couple of examples to help demonstrate how to obtain summarized values using the simple GROUP BY clause.

Sample Data for Exploring the Simple GROUP BY Clause

In order to demonstrate how to use a simple GROUP BY clause I need to build some sample data. I am providing a script to create my sample data so you can run the sample code provided in this article. Use the script in Listing 1 to build and populate the sample tables.

USE tempdb;
GO
SET NOCOUNT ON;
-- Create Sales Table   
CREATE TABLE dbo.SalesTransaction 
   (Id INT IDENTITY PRIMARY KEY
   ,CustomerName VARCHAR(65)
   ,TotalSalesAmount money
   ,SalesTypeDesc VARCHAR(200)
   ,SalesDateTime DATETIME
   ,StoreName VARCHAR(100));
   
-- Add data to Sales Table
INSERT INTO dbo.SalesTransaction 
      VALUES ('John Smith', 124.23,'Software','09/22/2011 11:51:12 AM','The Software Outlet');
INSERT INTO dbo.SalesTransaction 
      VALUES ('Jack Thomas', 29.56,'Computer Supplies','09/23/2011 10:21:49 AM','The Software Outlet');
INSERT INTO dbo.SalesTransaction 
      VALUES ('Sue Hunter', 89.45,'Computer Supplies','09/23/2011 2:51:56 AM','The Software Outlet');
INSERT INTO dbo.SalesTransaction 
      VALUES ('Karla Johnson', 759.12,'Software','09/23/2011 2:54:37 PM','The Software Outlet');
      INSERT INTO dbo.SalesTransaction 
      VALUES ('Gary Clark', 81.51,'Software','09/22/2011 11:08:52 AM','Discount Software');
INSERT INTO dbo.SalesTransaction 
      VALUES ('Scott Crochet', 12345.78,'Computer Supplies','09/23/2011 3:12:37 PM','Discount Software');
INSERT INTO dbo.SalesTransaction 
      VALUES ('Sheri Holtz', 12.34,'Software','09/23/2011 10:51:42 AM','Discount Software');
INSERT INTO dbo.SalesTransaction 
      VALUES ('Mary Lee', 101.34,'Software','09/23/2011 09:37:19 AM','Discount Software');
      INSERT INTO dbo.SalesTransaction 
      VALUES ('Sally Davisson', 871.12,'Software','09/22/2011 05:21:28 PM','Discount Software');
INSERT INTO dbo.SalesTransaction 
      VALUES ('Rod Kaplan', 2345.19,'Computer Supplies','09/23/2011 5:01:11 PM','Discount Software');
INSERT INTO dbo.SalesTransaction 
      VALUES ('Sandy Roberts', 76.38,'Books','09/23/2011 4:51:57 PM','Computer Books and Software');
INSERT INTO dbo.SalesTransaction 
      VALUES ('Marc Trotter', 562.94,'Software','09/23/2011 6:51:43 PM','Computer Books and Software');

Listing 1: Script to create Sample Data

If you look through the script in Listing 1 you will find I created the dbo.SalesTransaction table. I then inserted a number of records into this table. I will use this table to demonstrate how to use a simple GROUP BY clause to aggregate data.

Grouping by a Single Column

Using the sample table created using Listing 1 , this first example will use the GROUP BY clause to summarize data based on a single column. My example in Listing 2 summarizes my sample data based on the StoreName column.

USE tempdb;
GO
SELECT StoreName 
     ,SUM(TotalSalesAmount) AS StoreSalesAmount
FROM dbo.SalesTransaction  
GROUP BY StoreName;

Listing 2: GROUP BY based on Single columns

When the code in Listing 2 is executed against my sample table the following aggregated rows in Report 1 are returned.

StoreName                                  StoreSalesAmount

------------------------------------------ ----------------

Computer Books and Software                639.32

Discount Software                          15757.28

The Software Outlet                        1002.36

Report 1: Summarizing sample data based on a single column

If you review the output in Report 1 you can see that only one aggregated row is returned for each unique value of StoreName. The StoreSalesAmount on each record is calculated by summing up the TotalSalesAmount column for each store’s Sales records using the SUM function.

Grouping by Multiple Columns

There are times when you need to develop a report in which your data needs to be grouped by multiple columns. In order to accomplish this all you need to do is add additional columns to the GROUP BY clause. When multiple columns are specified in the GROUP BY clause SQL Server aggregates the detailed rows based on each unique combination of values from the columns in the GROUP BY clause. In Listing 3 I have expanded the query in Listing 2 by adding a second column to the GROUP BY clause.

USE tempdb;
GO
SELECT StoreName, SalesTypeDesc
     ,SUM(TotalSalesAmount) AS StoreSalesAmount
FROM dbo.SalesTransaction  
GROUP BY StoreName, SalesTypeDesc;

Listing 3: GROUP BY based on Single columns

When I run the code in Listing 3 against my sample data I get the results in Report 2.

StoreName                    SalesTypeDesc      StoreSalesAmount

---------------------------- ------------------ ----------------

Computer Books and Software  Books              76.38

Discount Software            Computer Supplies  14690.97

The Software Outlet          Computer Supplies  119.01

Computer Books and Software  Software           562.94

Discount Software            Software           1066.31

The Software Outlet          Software           883.35

Report 2: Output from Running Listing 1

In Report 2 you can see that StoreSalesAmount is now summarized at the StoreName and SalesTypeDesc level. Also note that the aggregated rows returned are not in sorted order based on the columns in the GROUP BY clause. If I wanted the summarized data to appear in StoreName order then I would have needed to include an ORDER BY clause on the SELECT statement. I will leave it up to you to add the ORDER BY to the code in Listing 3 to return the summarized data in StoreName order.

Using an Expression in the GROUP BY Clause

There are times that you may want to group your data by something other than a specific column or set of columns. For example you might want to summarize your data based on the first few characters of some VARCHAR column, or maybe just the date, or month of a DATETIME column. SQL Server allows you to specify expressions in the GROUP BY clause to accomplish this. An expression could be any valid expression that is based on a column in the detailed record set that is being aggregated. To demonstrate how to use an expression in the GROUP BY clause look at the code in Listing 4.

USE tempdb;
GO
SELECT CONVERT(CHAR(10),SalesDateTime,101) AS SalesDate   
     ,SUM(TotalSalesAmount) AS TotalSalesAmount
FROM dbo.SalesTransaction
GROUP BY CONVERT(CHAR(10),SalesDateTime,101);

Listing 4: GROUP BY based on Single columns

In listing 4, the SELECT statement is grouping the data based on an expression, in this case a CONVERT function. If you use an expression in the GROUP BY clause the same exact expression must be used in the SELECT list. The CONVERT function is parsing the SalesDateTime column, and returning only the date portion of this column. Using the CONVERT function in the GROUP BY clause allows me to summarize the Sales data based on the actual dates of the different Sales records. By doing this I was able to summarize my sample to get the TotalSalesAmount by date for all stores, as shown in Report 3.

SalesDate  TotalSalesAmount

---------- ----------------

09/22/2011 1076.86

09/23/2011 16322.10

Report 3: Output when summarizing data based on expression

Using expressions allows to you to programmatically identify which parts of your detailed data will be used to aggregate your data.

HAVING Clause

If you are aggregating data with the GROUP BY clause you might want to not return all the aggregated values. Instead you may want to only return a subset of the aggregated values. The HAVING clause can be used to selectively identify the aggregated values you want to return from the GROUP BY summarization.

Normally when we SELECT data we use the WHERE clause to restrict the rows that are returned. The only problem with that is the WHERE clause operates on row values, and not aggregated values. Therefore the WHERE clause is unable to use aggregated values created by the GROUP BY clause. However, adding a HAVING clause after your GROUP BY clause provides you a way to specify conditions to identify the specific summarized values that you want returned. To better understand this let me provide you with a couple of examples.

One of the common things the HAVING clause might be used for when looking at store sales data is to determine the stores that are not meeting a specific sales quota. If you wanted to find all the stores that didn’t meet a minimum sales amount you could do that with the code in Listing 5.

USE tempdb;
GO
SELECT StoreName 
     ,SUM(TotalSalesAmount) AS StoreSalesAmount
FROM dbo.SalesTransaction 
GROUP BY StoreName
HAVING SUM(TotalSalesAmount) < 1000.00;

Listing 5: Restricting result set by using HAVING clause

In Listing 5 I restricted the result set to those stores that had an aggregated TotalSalesAmount of less than 1000.00. In my trivial example here, you will find the StoreName of “Computer Books and Software” is the only store that didn’t meet the $1000.00 sales quota amount.

The HAVING clause can be used on columns that are also not aggregated. If you want to restrict the rows returned based on specific value of any one of the columns used in the GROUP BY clause you can also do that, and Listing 6 demonstrates this.

USE tempdb;
GO
SELECT StoreName 
     ,SUM(TotalSalesAmount) AS StoreSalesAmount
FROM dbo.SalesTransaction 
GROUP BY StoreName 
HAVING StoreName LIKE '%Outlet%' 
    OR StoreName LIKE '%Books%';

Listing 6: Restricting result set based on GROUP BY column

In Listing 6 I only wanted to see summarized data for stores that have either “Outlet” or “Books” in their store name This example also demonstrates that you can have multiple conditions in the HAVING clause. Another way to think of the difference between WHERE and HAVING is that the WHERE clause filters out data rows before the data is aggregated, and the HAVING clause filters out aggregated rows after the GROUP BY is applied.

Summarizing Data with the Simple GROUP BY Clause

In this article I showed you how to use the simple GROUP BY clause to summarize your data. I discussed how to use a single column, multiple columns, as well as expressions in the GROUP BY clause to summarize detailed data. By using what I have demonstrated, you should now be able to build a simple GROUP BY clause to summarize your data, and optionally filter the summarized data using HAVING.

In my next article I will be expanding my discussion of the GROUP BY clause. In this follow-up article I will show you how to use the CUBE and ROLLUP operators to produce additional summarized data, such as sub-total and grand total values.

This article is part of the parent stairway Stairway to T-SQL DML

Rate

5 (1)

You rated this post out of 5. Change rating

Share

Share

Rate

5 (1)

You rated this post out of 5. Change rating