This post is part of a series on this blog that will help me, and hopefully you, pass exam 70-463: Implementing a Data Warehouse with Microsoft SQL Server 2012. So far, we’ve covered:
- Tables and Schemas in the data warehouse
- Dimensions and Slowly Changing Dimensions
- Fact tables and measures
SQL Server 2012 introduced a new way of storing nonclustered indexes, called the columnstore index, which has been shown to improve performance in a data warehouse environment many times over. How, you might ask? Well, the magic of the columnstore index lies in how the data is stored.
If you look at a traditional nonclustered index, you’ll see that the data is stored in a row-wise format. For example, let’s assume we have a Customers table with columns for CustomerNum, FirstName, LastName, and BirthDate. In a row-wise storage format, data pages contain rows of data, as illustrated below.
Let’s say I want to run the following query:
SELECT LastName FROM dbo.Customers
Even though I only want the LastNames, SQL Server still has to read all pages in their entirely into memory to satisfy my query (assuming I don’t have an index on last name). That makes for a lot of wasted space in memory holding information that I don’t really need (i.e. first names, birth dates, etc.). It also means a lot of extra IO operations if the data isn’t already in memory.
With columnstore indexes, data is stored in a columnar format, meaning pages contain the data from a single column, as shown below.
By storing data for a single column on a page, SQL Server can now read in only the data really needed by a particular query. If my query only needs LastName, SQL Server only needs to retrieve the pages for LastName. This leads to fewer IO operations and less wasted memory space. SQL Server also uses a special compression algorithm on columnstore indexes, leading to even greater IO savings.
Batch Mode Processing
But wait, there’s more! In addition to highly compressed columnstore indexes, SQL Server 2012 also introduced batch mode processing. In the data warehouse environment, we’re typically scanning and processing large sets of data. Working with individual rows would be very inefficient, so SQL Server can now choose to process data in batches, rather than one row at a time.
Not all operators will run in batch mode. In SQL Server 2012, batch mode processing is available for filter, sequence project, scan, hash match (partial) aggregation, hash match (inner join), and hash table build operations. You can use the Actual and Estimated Execution Mode properties of the Actual Execution Plan to determine what mode SQL Server is using in a particular plan. Keep in mind that batch processing is really only used when joining, filtering or aggregating large data sets.
The cost of columnstore indexes
Unfortunately, there’s no such thing as a free lunch. SQL Server still needs to ultimately return rows of data. Therefore, it needs to take the columnstore data and reconstruct the rows that satisfy a query. This reconstruction takes time and CPU resources. This also makes nonclustered columnstore indexes read-only and their underlying tables also read-only, so you won’t be using this in your OLTP databases. (SQL Server 2014 introduced updateable clustered columnstore indexes, but we won’t go into those here.) If your table is partitioned, the columnstore index must be partition aligned.
There can also only be one columnstore index on a table. A common recommendation for the data warehouse is to create a columnstore index on fact tables, putting all of the columns of the fact table in the columnstore index. Larger dimension tables can also benefit from columnstore indexes. Just keep in mind that you don’t want to use this feature on tables that are updated frequently.
That’s a brief introduction on columnstore indexes and their benefits, as well as their inherent costs. Next time we’ll look at how to implement columnstore indexes in our data warehouse.