03 November 2015

Questions About CUBE, ROLLUP and GROUPING SETs That You Were Too Shy to Ask

There are few parts of SQL syntax as familiar as the GROUP BY clause of the SELECT statement. On the other hand, CUBE and ROLLUP remain mysterious despite their usefulness, and GROUPING SETS is positively arcane, especially if you are too shy to reveal your ignorance of the subject by asking!

Eh? What are GROUPING SETS, CUBE and ROLLUP in SQL?
Why would ROLLUP or CUBE be useful to me?
Are these standard SQL or are they a Microsoft-only thing?
Can I exclude one or more columns from the ROLLUP?
What are GROUPING SETs then? Should I know about them?
Why would we want to combine columns in any aggregation?
Is there more to GROUPING SETS than a way of doing ‘à la carte’ CUBEs?
Why are the functions Grouping() and Grouping_ID() provided?

1. Eh? What are GROUPING SETS, CUBE and ROLLUP in SQL?

CUBE, ROLLUP and GROUPING SETS are optional operators of the GROUP BY clause of the SELECT statement, intended for reports over large amounts of data. They allow you to do several GROUP BY operations in one statement, potentially saving a lot of time and computational effort. They can provide all the information needed for reporting, including totals, whilst giving good performance over large tables, and helping the Query Optimiser devise a good execution plan.

The extra ‘super-aggregate’ rows provide summary values, thereby allowing you to have several levels of an aggregation such as SUM() or MAX() within the one result. The NULLs within these rows in the result are intended to mean ‘all’ rather than ‘unknown’. These operators allow you to get all the aggregations you need in one pass through the table. Because of the presence of extra rows in the results, the functions GROUPING() and GROUPING_ID() are provided to identify these extra ‘super-aggregate’ rows, and to show which columns have been aggregated.
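To see why GROUPING() is needed, here is a minimal sketch. It assumes a made-up table variable, @Sales, with invented values rather than AdventureWorks data, and the GROUP BY ROLLUP (...) syntax of SQL Server 2008 or later. One sale has no territory recorded, so the result contains two rows with NULL in the territory column.

```sql
-- Hypothetical sample data: the last sale has no territory (a genuine NULL)
DECLARE @Sales TABLE (territory varchar(20) NULL, amount money NOT NULL);

INSERT INTO @Sales (territory, amount)
VALUES ('France', 100), ('France', 50), ('Germany', 70), (NULL, 30);

SELECT territory,
       SUM(amount)         AS revenue,
       GROUPING(territory) AS isAllTerritories -- 1 only on the super-aggregate row
FROM @Sales
GROUP BY ROLLUP (territory);
```

The row for the unrecorded territory has a revenue of 30 and a flag of 0, whereas the grand-total row has a revenue of 250 and a flag of 1. Without the flag, the two NULLs would be indistinguishable.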

This makes a great deal of sense if you have an application that needs to run several reports without extra computation or without going back to the database: You have everything you need in one result.

Take this standard example of a ROLLUP (I’m using AdventureWorks 2012 here):

SELECT t.[Group] AS region, t.name AS territory, sum(TotalDue) AS revenue,
  datepart(yyyy, OrderDate) AS [year], datepart(mm, OrderDate) AS [month]
FROM Sales.SalesOrderHeader s
  INNER JOIN Sales.SalesTerritory t ON s.TerritoryID = t.TerritoryID
GROUP BY t.[Group], t.name, datepart(yyyy, OrderDate), datepart(mm, OrderDate)
WITH ROLLUP

As well as the aggregate rows, with the total due for each month, that you’d get with a simple grouping, you also get subtotal or super-aggregate rows, and a grand total row. (Here is the beginning of the result.)

That NULL I’ve highlighted means that the row is an aggregate for ‘all’ months of 2005 in France (part of the Europe region).

As well as all this, you get the total due for each year, for each territory and territorial group, as well as the full total due. (from the end)

Those NULLs mean ‘All’, remember. The last row is the grand total, and above it is the total for the Pacific region. Above that is Australia’s contribution to the Pacific region. The fourth row from the bottom is Australia’s 2008 contribution. The number of grouping levels returned is one more than the number of expressions in the list provided to the ROLLUP: there are four expressions here, so five levels, from the monthly detail up to the grand total.

To get the same effect without using a rollup, you’d need to do something like this (AdventureWorks2012):

;WITH myGrouping (region, territory, totalDue, [year], [month])
AS (SELECT t.[Group], t.name, sum(TotalDue) AS revenue,
      datepart(yyyy, OrderDate) AS [year], datepart(mm, OrderDate) AS [month]
    FROM Sales.SalesOrderHeader s
      INNER JOIN Sales.SalesTerritory t ON s.TerritoryID = t.TerritoryID
    GROUP BY t.name, t.[Group], datepart(yyyy, OrderDate), datepart(mm, OrderDate))
SELECT region, territory, totalDue, [year], [month] --the monthly detail rows
FROM myGrouping
UNION ALL --subtotals for each territory and year
SELECT region, territory, sum(totalDue), [year], NULL
FROM myGrouping GROUP BY region, territory, [year]
UNION ALL --subtotals for each territory
SELECT region, territory, sum(totalDue), NULL, NULL
FROM myGrouping GROUP BY region, territory
UNION ALL --subtotals for each region
SELECT region, NULL, sum(totalDue), NULL, NULL
FROM myGrouping GROUP BY region
UNION ALL --the grand total
SELECT NULL, NULL, sum(totalDue), NULL, NULL
FROM myGrouping

This is a lot more expensive in CPU and I/O. Note that the standard syntax of the GROUP BY clause in recent versions is:

... GROUP BY ROLLUP (t.[Group], t.name, datepart(yyyy, OrderDate), datepart(mm, OrderDate))

This new syntax allows you some extra functionality. Remember too that the column order affects the output groupings of ROLLUP and can affect the number of rows in the result set.
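As a rough illustration of how column order changes the outcome, the following sketch uses an invented table variable, @Sales, holding two regions and three years. The table, its columns and its values are assumptions made for the example and are not part of AdventureWorks.

```sql
-- Hypothetical sample data: 2 regions x 3 years
DECLARE @Sales TABLE (region varchar(20) NOT NULL, yr int NOT NULL, amount money NOT NULL);

INSERT INTO @Sales (region, yr, amount)
VALUES ('Europe', 2005, 100), ('Europe', 2006, 150), ('Europe', 2007, 120),
       ('Pacific', 2005, 40), ('Pacific', 2006, 60), ('Pacific', 2007, 80);

-- Region first: detail rows, one subtotal per region, one grand total
SELECT region, yr, SUM(amount) AS revenue
FROM @Sales
GROUP BY ROLLUP (region, yr);

-- Year first: detail rows, one subtotal per year, one grand total
SELECT region, yr, SUM(amount) AS revenue
FROM @Sales
GROUP BY ROLLUP (yr, region);
```

With this data the first query returns 9 rows (6 detail, 2 region subtotals, 1 grand total) and the second returns 10 (6 detail, 3 year subtotals, 1 grand total).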

The CUBE does the same general thing but, instead of providing a hierarchy of totals in ordered super-aggregate rows, it provides super-aggregate rows for every combination of the grouping columns (‘symmetric super-aggregate’ rows), the so-called cross-tabulation rows. If you wanted to know which territory gave the most orders in March, or which territory performed least well in 2006, then you’d need a CUBE. You are providing all the possible summations in the result.
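A small sketch may make the difference concrete. It assumes an invented table variable, @Sales, with two regions and three years; none of these names or values come from AdventureWorks.

```sql
-- Hypothetical sample data: 2 regions x 3 years
DECLARE @Sales TABLE (region varchar(20) NOT NULL, yr int NOT NULL, amount money NOT NULL);

INSERT INTO @Sales (region, yr, amount)
VALUES ('Europe', 2005, 100), ('Europe', 2006, 150), ('Europe', 2007, 120),
       ('Pacific', 2005, 40), ('Pacific', 2006, 60), ('Pacific', 2007, 80);

-- CUBE aggregates over every combination of the listed columns:
-- (region, yr), (region), (yr) and the empty set ()
SELECT region, yr, SUM(amount) AS revenue
FROM @Sales
GROUP BY CUBE (region, yr);
```

With this data the result has 6 + 2 + 3 + 1 = 12 rows. The three rows that give a total for each year across all regions are the ones a ROLLUP (region, yr) would not supply.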

GROUPING SETS allows you to fine-tune your result to provide more specialised information above and beyond CUBE. It can provide summary information on combinations of dimensions. You could get exactly the same result as in our ROLLUP example by using GROUPING SETS, but with a lot more typing.

SELECT t.[Group] AS region, t.name AS territory, sum(TotalDue) AS revenue,
  datepart(yyyy, OrderDate) AS [year], datepart(mm, OrderDate) AS [month]
FROM Sales.SalesOrderHeader s
  INNER JOIN Sales.SalesTerritory t ON s.TerritoryID = t.TerritoryID
GROUP BY GROUPING SETS(
  (t.[Group], t.name, datepart(yyyy, OrderDate), datepart(mm, OrderDate)),
  (t.[Group], t.name, datepart(yyyy, OrderDate)),
  (t.[Group], t.name),
  (t.[Group]),
  ())

This is just to show how they relate. In reality, you’d resort to GROUPING SETS to get results that are impossible with ROLLUP or CUBE.
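As one example of such a result, the sketch below asks for totals by region and totals by year and nothing else. It assumes an invented table variable, @Sales, whose names and values are made up for illustration.

```sql
-- Hypothetical sample data: 2 regions x 3 years
DECLARE @Sales TABLE (region varchar(20) NOT NULL, yr int NOT NULL, amount money NOT NULL);

INSERT INTO @Sales (region, yr, amount)
VALUES ('Europe', 2005, 100), ('Europe', 2006, 150), ('Europe', 2007, 120),
       ('Pacific', 2005, 40), ('Pacific', 2006, 60), ('Pacific', 2007, 80);

-- Only two grouping sets: no detail rows and no grand total
SELECT region, yr, SUM(amount) AS revenue
FROM @Sales
GROUP BY GROUPING SETS ((region), (yr));
```

This returns 5 rows: 2 region totals and 3 year totals. A single ROLLUP or CUBE over the same two columns could not produce exactly this, because each would also return the detail rows and the grand total.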

Almost all these summaries can be gained from using just GROUP BY, but only through repeatedly GROUPing the result of a GROUP BY, or by making more than one pass through the data.

When you are using CUBE, ROLLUP or GROUPING SETS, you can’t use the DISTINCT keyword in your aggregate expressions, such as AVG (DISTINCT column_name), COUNT (DISTINCT column_name), and SUM (DISTINCT column_name).

2. Why would ROLLUP or CUBE be useful to me?

ROLLUP and CUBE had their heyday before SSAS. They were useful for providing the same sort of facilities offered by the cube in OLAP. They still have their uses, though. In AdventureWorks they are overkill, but if you are handling large volumes of data you need to pass over your data only once, and do as much as possible on data that has already been aggregated. Events that happened in the past can’t be changed, so it is seldom necessary to retain historic data on an active OLTP system. Instead, you only need to retain the aggregated data at the level of detail (‘granularity’) required for all foreseeable reports.

Imagine you are responsible for reporting on a telephone switch that has two million or so calls a day. If you retain all these calls on your OLTP server, you are soon going to find the SQL Server labouring over usage reports. You have to retain the original call information for a statutory time period, but you determine from the business that they are, at most, only interested in the number of calls in a minute. Then you have reduced your storage requirement on the OLTP server to 1.4% of what it was, and the call records can be archived off to another SQL Server for ad-hoc queries and customer statements. That’s likely to be a saving worth making. The CUBE and ROLLUP clauses allow you to even store the row totals, column totals and grand totals without having to do a table, or clustered index, scan of the summary table.

As long as changes aren’t made retrospectively to this data, and all time periods are complete, you never have to repeat or alter the aggregations based on past time periods, though grand totals will need to be overwritten!

Let’s pretend, but using AdventureWorks2012 so you can play along.

Firstly, we’ll create a temporary summary table.

IF object_id('tempdb..#AggregationTable') IS NOT NULL
  DROP TABLE #AggregationTable --delete this session's temporary table if it exists
SELECT
  identity(INT,1,1) AS [surrogate], --so we can have a unique column
  t.[Group] AS region, t.name AS territory, sum(TotalDue) AS revenue,
  datepart(yyyy, OrderDate) AS [year], datepart(mm, OrderDate) AS [month],
  grouping(t.name) AS isNameGroup, --Does this relate to ALL territories
  grouping(t.[Group]) AS isGroupGroup, --Does this relate to ALL continents
  grouping(datepart(yyyy, OrderDate)) AS isYearGroup, --Does this relate to ALL years
  grouping(datepart(mm, OrderDate)) AS isMonthGroup, --Does this relate to ALL months
  grouping_id(t.name, t.[Group],
    datepart(yyyy, OrderDate), datepart(mm, OrderDate)) AS isGroupingRow
  --non-zero if this is an extra non-data row containing aggregate data
INTO #AggregationTable
FROM Sales.SalesOrderHeader s
  INNER JOIN Sales.SalesTerritory t ON s.TerritoryID = t.TerritoryID
GROUP BY t.name, t.[Group], datepart(yyyy, OrderDate), datepart(mm, OrderDate)
WITH ROLLUP

Notice that we are adding extra ‘bit’ columns that tell us which rows are summary rows. If you mistakenly include those summary rows in any further aggregations, you’ll get some seriously inflated results. You can’t use GROUPING() or GROUPING_ID() on the saved result, obviously, so you ought to provide something in their stead.
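The relationship between the individual GROUPING() flags and the single GROUPING_ID() value can be seen in a small sketch. It assumes an invented table variable, @Sales, with made-up names and values rather than the AdventureWorks tables.

```sql
-- Hypothetical sample data: 2 regions x 2 years
DECLARE @Sales TABLE (region varchar(20) NOT NULL, yr int NOT NULL, amount money NOT NULL);

INSERT INTO @Sales (region, yr, amount)
VALUES ('Europe', 2005, 100), ('Europe', 2006, 150),
       ('Pacific', 2005, 40), ('Pacific', 2006, 60);

SELECT region, yr, SUM(amount) AS revenue,
       GROUPING(region)        AS isRegionGroup,
       GROUPING(yr)            AS isYearGroup,
       -- the flags read as a binary number, with the last argument as the lowest bit:
       -- GROUPING_ID(region, yr) = 2 * GROUPING(region) + GROUPING(yr)
       GROUPING_ID(region, yr) AS groupingId
FROM @Sales
GROUP BY ROLLUP (region, yr);
```

Detail rows get a groupingId of 0, the per-region subtotals get 1, and the grand total gets 3. The value 2 never appears here, because a ROLLUP does not aggregate over region whilst keeping yr. Filtering on a value of 0 is therefore enough to pick out the plain data rows.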

Now we can produce a pivot table very fast

-- now we can create a simple pivot table with row and
-- column totals
SELECT territory,
  sum(CASE [year] WHEN 2005 THEN revenue ELSE 0 END) AS [2005],
  sum(CASE [year] WHEN 2006 THEN revenue ELSE 0 END) AS [2006],
  sum(CASE [year] WHEN 2007 THEN revenue ELSE 0 END) AS [2007],
  sum(CASE [year] WHEN 2008 THEN revenue ELSE 0 END) AS [2008],
  sum(revenue) AS [territory total]
FROM #AggregationTable
WHERE isGroupingRow = 0 --just the data rows
GROUP BY territory
UNION ALL
SELECT 'Total',
  sum(CASE [year] WHEN 2005 THEN revenue ELSE 0 END) AS [2005],
  sum(CASE [year] WHEN 2006 THEN revenue ELSE 0 END) AS [2006],
  sum(CASE [year] WHEN 2007 THEN revenue ELSE 0 END) AS [2007],
  sum(CASE [year] WHEN 2008 THEN revenue ELSE 0 END) AS [2008],
  sum(revenue) AS [territory total]
FROM #AggregationTable
WHERE isYearGroup = 0 AND isMonthGroup = 1 --just the yearly subtotal rows

So there are brief smiles from the managers on seeing this, but then they brightly say ‘I’m sure I also asked for a breakdown by territory per month’.

With a brief chuckle, you do this.

SELECT

datename(MONTH,dateadd(MONTH, [MONTH],'01 dec 2000')) AS [month],

sum(CASE territory WHEN 'Australia' THEN revenue ELSE 0 END) AS [Australia],

sum(CASE territory WHEN 'Canada' THEN revenue ELSE 0 END) AS [Canada],

sum(CASE territory WHEN 'Central' THEN revenue ELSE 0 END) AS [Central],

sum(CASE territory WHEN 'France' THEN revenue ELSE 0 END) AS [France],

sum(CASE territory WHEN 'Germany' THEN revenue ELSE 0 END) AS [Germany],

sum(CASE territory WHEN 'Northeast' THEN revenue ELSE 0 END) AS [Northeast],

sum(CASE territory WHEN 'Northwest' THEN revenue ELSE 0 END) AS [Northwest],

sum(CASE territory WHEN 'Southeast' THEN revenue ELSE 0 END) AS [Southeast],

sum(CASE territory WHEN 'Southwest' THEN revenue ELSE 0 END) AS [Southwest],

sum(CASE territory WHEN 'United Kingdom' THEN revenue ELSE 0 END) AS [United Kingdom],

sum(revenue) AS [Month total]

FROM #AggregationTable

WHERE isGroupingrow =0

GROUP BY month

UNION ALL

SELECT

'Total',