Questions About CUBE, ROLLUP and GROUPING SETs That You Were Too Shy to Ask

There are few parts of SQL Syntax as familiar as the GROUP BY clause of the SELECT statement. On the other hand, CUBE and ROLLUP remain mysterious despite their usefulness and GROUPING SET is positively arcane, especially if you are too shy to reveal your ignorance of the subject by asking!

  1. Eh? What are GROUPING SET, CUBE and ROLLUP in SQL?
  2. Why would ROLLUP or CUBE be useful to me?
  3. Are these standard SQL or are they a Microsoft-only thing?
  4. Can I exclude one or more columns from the ROLLUP?
  5. What are GROUPING SETs then? Should I know about them?
  6. Why would we want to combine columns in any aggregation?
  7. Is there more to GROUPING SETS than a way of doing ‘à la carte’ CUBEs?
  8. Why are the functions Grouping() and Grouping_ID() provided?

1. Eh? What are GROUPING SET, CUBE and ROLLUP in SQL?

CUBE, ROLLUP and GROUPING SET are optional operators of the GROUP BY clause of the SELECT statement for doing reports with large amounts of information. They allow you to do several GROUP BY operations in one statement, potentially saving a lot of time and computational effort. They can provide all the information needed for reporting, including totals, whilst giving good performance over large tables, and helping the Query Optimiser devise a good execution plan.

The extra ‘super-aggregate’ rows provide summary values, thereby allowing you to have several ‘aggregations’ such as SUM() or MAX() within the one result. The NULLs within these rows in the result are intended to mean ‘all’ rather than ‘unknown’. It allows you to get all the aggregations you need in one pass through the table. Because of the presence of extra rows in the results, extra functions GROUPING() and GROUPING_ID() are provided to indicate these extra ‘super-aggregate’ rows, and which columns are being aggregated.

This makes a great deal of sense if you have an application that needs to run several reports without extra computation or without going back to the database: You have everything you need in one result.

Take this standard example of a ROLLUP (I’m using AdventureWorks 2012 here)..

As well as the simple GROUP BY aggregate rows, with the total due for each month, that you’d get with a simple grouping, you also get subtotal or super-aggregate rows, and also a grand total row. (here is the beginning of the result)

2303-clip_image001.png

That NULL I’ve highlit means that the row is an aggregate for ‘all’ months of 2005 in France (part of Europe region)

As well as all this, you get the total due for each year, for each territory and territorial group, as well as the full total due. (from the end)   

2303-clip_image002.png

Those NULLs mean ‘All’, remember. The last row is the grand total, and above it is the total for the pacific region. Above that is Australia’s contribution to the pacific region. The fourth row from the bottom is Australia’s 2008 contribution. The number of groupings that is returned is one more than the number of expressions in the composite element list provided to the GROUP BY statement. 

To get the same effect without using a rollup, you’d need to do something like this (AdventureWorks2012)

Which is a lot more expensive in CPU and I/O. Note that the standard syntax of the GROUP BY clause in recent versions  is

This new syntax allows you some extra functionality. Remember too that the column order affects the output groupings of ROLLUP and can affect the number of rows in the result set.

 The CUBE does the same general thing but, instead of providing a hierarchy of totals in ordered super-aggregate rows, it provides all the ‘super-aggregate’ permutations (‘symmetric super-aggregate’ rows), the so-called cross-tabulation rows. If you wanted to know which territory gave the most orders in march, or which territory performed least well in 2006, then you’d need a CUBE. You are providing all the possible summations in the result.

GROUPING SET allows you to fine-tune your result to provide more specialised information above and beyond CUBE. It can provide summary information on combinations of dimensions. You could get exactly the same result as in our ROLLUP example by using GROUPING SETS, but with a lot more typing.

This is just to show how they relate. In reality, you’d resort to GROUPING SETS to get results that are impossible with ROLLUP or CUBE.

Almost all these summaries can be gained from using just GROUP BY, but only through repeatedly GROUPing the result of a GROUP BY, or by making more than one pass through the data.

When you are using CUBE, ROLLUP or GROUPING SETS, you can’t use the DISTINCT keyword in your aggregate expressions, such as  AVG (DISTINCT column_name), COUNT (DISTINCT column_name), and SUM (DISTINCT column_name)

2. Why would ROLLUP or CUBE be useful to me?

ROLLUP and CUBE had their heyday before SSAS. They were useful for providing the same sort of facilities offered by the cube in OLAP. It still has its uses though. In AdventureWorks, it is overkill, but if you are handling large volumes of data you need to pass over your data only once, and do as much as possible on data that has been aggregated. Events that happened in the past can’t be changed, so it is seldom necessary to retain historic data on an active OLTP system. Instead, you only need to retain the aggregated data at the level of detail (‘granularity’) required for all foreseeable reports.

Imagine you are responsible for reporting on a telephone switch that has two million or so calls a day. If you retain all these calls on your OLTP server, you are soon going to find the SQL Server labouring over usage reports. You have to retain the original call information for a statutory time period, but you determine from the business that they are, at most, only interested in the number of calls in a minute. Then you have reduced your storage requirement on the OLTP server to 1.4% of what it was, and the call records can be archived off to another SQL Server for ad-hoc queries and customer statements. That’s likely to be a saving worth making. The CUBE and ROLLUP clauses allow you to even store the row totals, column totals and grand totals without having to do a table, or clustered index, scan of the summary table.

As long as changes aren’t made retrospectively to this data, and all time periods are complete, you never have to repeat or alter the aggregations based on past time-periods, though grand totals will need to be over-written!.

Let’s pretend, but using AdventureWorks2012  so you can play along.

Firstly, we’ll create a temporary summary table.

Notice that we are adding extra ‘bit’ columns that tell us which rows contain the summary rows. If you mistakenly add them to any further aggregations you’ll get some seriously inflated results. You can’t use Grouping() or Grouping_ID on the saved result, obviously, so you ought to provide something in its stead.

Now we can produce a pivot table very fast

2303-clip_image004.jpg

So there are brief smiles from the managers on seeing this, but then they brightly say ‘I’m sure I also asked for a breakdown by territory per month

With a brief chuckle, you do this.

2303-clip_image005.png

But if you’d used CUBE instead of Rollup,  that last ‘total’ row would already be calculated. In a real example that would cost time doing the report. You can do a CUBE on up to ten dimensions; although they tend to  bulk up the aggregation, they aren’t too costly.

3. Are these standard SQL or are they a Microsoft-only thing?

These are now standard ANSI SQL from 1999, though WITH CUBE and WITH ROLLUP were first introduced by Microsoft. This inclusion is somewhat surprising in that they introduce a second meaning, ‘all’, for the NULL value besides ‘unknown’. When Microsoft first introduced CUBE and ROLLUP, the syntax was slightly different, but both forms are allowed in SQL Server. Only one syntax style can be used in a single SELECT statement, and you should use the ISO compliant syntax for all new work.

4. Can I exclude one or more columns from the ROLLUP?

If you want to! Imagine that I didn’t want a super-aggregate total for all the regions (t.[group])

Here we are using the ANSI SQL 2006 compliant syntax. You can do the same thing with a cube. I’ve never found a practical use for this but you might come across it

5. What are GROUPING SETs then? Should I know about them?

GROUPING SET means you are asking SQL to group the result several times. You can use the GROUPING SETS syntax to specify precisely which aggregations to compute.  Here’s an example.

 Here, you are asking for the breakdown by territory group for every month of every year with month and year totals, followed by a summary total by territory name, but without a grand total. Unlike the ROLLUP, you get the same result whatever the order of the columns within each GROUPING SET and the order of the GROUPING SETS.

 GROUPING SETs can give you precisely what CUBE and ROLLUP gives you and a lot more besides. As you can see with this last example, you can use standard ‘table d’hôte’ CUBE and ROLLUP mixed together with directly-expressed ‘à la carte ‘GROUPING SETs.

6. Why would we want to combine columns in any aggregation?

Where two columns should be combined in some reports, it is useful to declare an aggregation that combines two columns. In the first example we combine year and month for the rollup, having the effect of restricting the totals to just each territory,  

That extra bracket in the ROLLUP clause has had the effect of restricting the aggregations to just the territory and the month/year. Leave them out, and you get totals for each year.

This can be very useful for certain data. We have avoided needing to combine columns here. If you were to do a CUBE, and the terms for territories used the words like ‘Northern’ or ‘Southern’ to describe a territory in more than one region, you would have some bizarre aggregations that apply to ‘northern’ territories that are unrelated. By combining columns, you would avoid this.

7. Is there more to GROUPING SETS than a way of doing ‘à la carte’ CUBEs?

I’m not sure that I’d be shy about asking this question. SQL:1999’s GROUPING SETS provide a rich recursive syntax that allows you to aggregate combinations of columns and define all sorts of esoteric reports providing up to ten dimensions. The aggregations can be nested and you can nest CUBEs within ROLLUPs and  nest ROLLUPs within CUBEs. You will need to read a specialist publication to find out more about this.

8. Why are the functions Grouping() and Grouping_ID() provided?

It isn’t really a good idea to use NULL to signify that a column is an aggregation. The problem is that, if a grouping column contains null values, all null values are considered equal, and put into a single NULL group which masquerades as a summary.  To get around the obvious difficulty of NULL values in the original data, two functions are provided: Grouping() and Grouping_ID().

The Grouping() function is passed the name of a column that participated in the ROLLUP, CUBE or GROUPING  SET. It returns zero if this row is a summary for this column with a NULL value meaning ‘all’ or whether it contains a value.

The GROUPING_ID function is passed a list that must exactly match the expression in the GROUP BY list. GROUPING_ID is created as a bitmap of the respective summary columns. If, for example the territory column has a NULL meaning ‘all’ territories rather than a name of a territory, and it is listed as the second column, then the second bit from the left is set. This integer is then returned.

Grouping_ID() is generally used to indicate whether the row is a primary or secondary aggregation (0 or >0) and, if secondary, then excluded from any further  GROUP BY manipulation.

It is usually considered good practice to include a bit column for every dimension (such as ‘Territory’ or ‘Region’ in our example) that is set if the row is a summary for that dimension, together with a Grouping_ID() value to assist any further grouping  of the result.

To illustrate the way that Grouping_ID actually works, here we get to look at the way the bits in the Grouping_ID are set according to the type of summary. We’ll use Phil Factor’s function ToBinaryString to show the bits.

This gives (just a sample of course)…

2303-clip_image006.png

 The interesting thing here is that each ‘1’ in the grouping bitmap represents the corresponding group-by element. The top row has pacific region (0) all territories (1), all years, (1), 12th month (0). That last row has All regions (1), Australia (0), 2008(0) and all months (1). With some simple bit-manipulation you can avoid dealing with nulls!