Stairway to Columnstore Indexes

Stairway to Columnstore Indexes Level 4: Catalog Views

,

If you need to find information about columnstore indexes in your database, you can use the tools and options in SQL Server Management Studio. But it is generally easier (once you’ve climbed the learning curve) to get the information you need by directly querying the catalog views – and as a bonus, this metadata exposes additional information that cannot be accessed from the Management Studio GUI.

In this level, we will first take a look at how to recognize columnstore indexes in some of the generic catalog views. After that we will investigate the new catalog views that have been added just for columnstore indexes.

Preparation

In the previous levels, I have used only a nonclustered columnstore index in the sample code, so that readers who have not yet upgraded to SQL Server 2014 can still follow along. But in this level I want to show how both types of columnstore index are shown in the catalog views, so I will now create a clustered columnstore index as well. Readers using SQL Server 2012 can still follow along, but they will have to skip the first listing, and they will obviously not get the output related to the clustered columnstore index.

The code in this level is all based on the ContosoRetailDW database as it was after the first level, so after restoring the original database and creating the nonclustered columnstore index. The code in listing 4-1 creates a copy of the FactOnlineSales table, called FactOnlineSales2, and creates a clustered columnstore index on it. Note that creating the index this way will probably not result in the best possible performance; we will look at better ways in a later level.

USE ContosoRetailDW;
GO
SELECT *
INTO   dbo.FactOnlineSales2
FROM   dbo.FactOnlineSales;
CREATE CLUSTERED COLUMNSTORE INDEX CCI_FactOnlineSales2
ON dbo.FactOnlineSales2;

Listing 4-1: Creating a table with a clustered columnstore index

After running this code, you will now have two tables: FactOnlineSales, with a nonclustered columnstore index NCI_FactOnlineSales; and FactOnlineSales2, with a clustered columnstore index CCI_FactOnlineSales2. Readers who are using SQL Server 2012 will have only the first of these two tables, and hence only the nonclustered columnstore index.

Generic catalog views

You can find information about indexes in two generic catalog views: sys.indexes and sys.index_columns. The code in listing 4-2 queries these two views for just the two existing columnstore indexes, and figure 4-1 shows partial results on my system. The full output of the second query is 42 rows: 2 tables, 21 columns in each, and all of them are included in the indexes.

USE ContosoRetailDW;
GO
SELECT     i.*
FROM       sys.indexes       AS i
WHERE      i.name            IN (N'NCI_FactOnlineSales',
                                 N'CCI_FactOnlineSales2');
--WHERE      i.object_id       IN (OBJECT_ID(N'FactOnlineSales'),
--                                 OBJECT_ID(N'FactOnlineSales'))
--AND        i.type            IN (5, 6);
SELECT     ic.*
FROM       sys.index_columns AS ic
INNER JOIN sys.indexes       AS i
      ON   i.object_id        = ic.object_id
      AND  i.index_id         = ic.index_id
WHERE      i.name            IN (N'NCI_FactOnlineSales',
                                 N'CCI_FactOnlineSales2');

Listing 4-2: Querying index and index columns information

Figure 4-1: Index and index columns information

Both these views existed before columnstore indexes were added in SQL Server 2012, and no new columns have been added to them. The range of possible values for the columns type and type_desc in sys.indexes has been expanded, however: type 6 with description NONCLUSTERED COLUMNSTORE was introduced in SQL Server 2012; in SQL Server 2014, type 5 with description CLUSTERED COLUMNSTORE was added. Many of the columns in this table are not used for columnstore indexes and are therefore always 0 when the index is a columnstore.

The data returned from sys.index_columns is pretty useless by itself; you will usually want to join it to other catalog views such as sys.objects, sys.indexes, and sys.columns to translate the object_id, index_id, and column_id values in the names of the table, index, and column. But even then, this data is still rather uninteresting for columnstore indexes. There will always be one row for each column in the index, and each column will be marked as an included column. This is strictly speaking not entirely true, as the concept of “included” columns is very much tied to the design of rowstore indexes. But within the design of this catalog view, this was probably the best Microsoft could have done without changing the view and potentially breaking existing applications that rely on this view.

Filter on index name or table name?

The queries in listing 4-2, as well as the queries in the rest of this article, all filter based on the name of the columnstore index. Unless your company has a very strict naming policy, chances are that you do know the table name, but do not know the index name. In that case, you can use the alternative form of the filter shown as a comment in the first query in listing 4-2. Keep in mind that this second form of the query will return information on all indexes, not just the columnstore indexes, if you remove the additional filter on the type column.If you use the same alternative for the sys.index_columns table, you can use the object_id column from that table instead of the object_id from sys.indexes – and if you also decide not to filter on type, you can then even remove the join to sys.indexes.

Information on segments

When Microsoft implemented columnstore indexes, they did not just add new values for some columns in existing catalog views; they implemented several completely new catalog views as well, designed to give insight in the specific columnstore features. Two of these were released in SQL Server 2012; one more was added in SQL Server 2014. One of the two views available since the first release of columnstore indexes is sys.column_store_segments, which (as the name implies) gives detailed information about the individual segments within the columnstore index.

As with most catalog views, you need to join this view to others to get useful information in a readable format. This is actually quite tricky for this view. As you can see in listing 4-3, a lot of additional catalog views need to be joined in to get useful results. This listing also includes a filter on the specific indexes that we are interested in. In this case that WHERE clause doesn’t have much effect because we have no other columnstore indexes in the demo database. But I wanted to show how you can easily focus on specific tables even in a database that has more tables with columnstore indexes.

USE ContosoRetailDW;
GO
SELECT     OBJECT_NAME(i.object_id)          AS TableName,
           i.name                            AS IndexName,
           i.type_desc                       AS IndexType,
           COALESCE(c.name, '* Internal *')  AS ColumnName,
           p.partition_number,
           s.segment_id,
           s.row_count,
           s.on_disk_size,
           s.min_data_id,
           s.max_data_id
FROM       sys.column_store_segments         AS s
INNER JOIN sys.partitions                    AS p 
      ON   p.hobt_id                          = s.hobt_id
INNER JOIN sys.indexes                       AS i 
      ON   i.object_id                        = p.object_id
      AND  i.index_id                         = p.index_id
LEFT  JOIN sys.index_columns                 AS ic
      ON   ic.object_id                       = i.object_id
      AND  ic.index_id                        = i.index_id
      AND  ic.index_column_id                 = s.column_id
LEFT  JOIN sys.columns                       AS c
      ON   c.object_id                        = ic.object_id
      AND  c.column_id                        = ic.column_id
WHERE      i.name                            IN (N'NCI_FactOnlineSales',
                                                 N'CCI_FactOnlineSales2')
ORDER BY   TableName, IndexName,
           s.column_id, p.partition_number, s.segment_id;

Listing 4-3: Querying segment information

Figure 4-2: Segment information

Remember that when the index was built, the data was first divided into rowgroups of about a million rows each. On my SQL Server 2014 system, the nonclustered columnstore index uses 15 rowgroups and the clustered columnstore index is 16 rowgroups (results may vary on your system!) These rowgroups were then further divided into segments for each of the 21 columns. Which is why on my system, the full results of listing 4-3 is 651 rows: 15 * 21 for the nonclustered columnstore index, plus 16*21 for the clustered columnstore.

To identify a single segment, you need to know the rowgroup and the column name. Rowgroups are identified by a zero-based counter, which is stored in this view in the badly named column segment_id. The row_count column should not actually have been in this view, because it stores the number of rows in the rowgroup. So it will be the same for each segment of the rowgroup. However, for monitoring and troubleshooting it can be very convenient to have this number available directly in this view.

The on_disk_size column, when used in conjunction with row_count, can be used to quickly see how well (or how bad) the data in a segment was compressed. This obviously also depends on the data type, so if you don’t know the design of the table you might want to add a few more columns from the sys.columns catalog view to the query.

The values in min_data_id and max_data_id are, if the data type of the column supports it, used for segment elimination (as described in level 2 of this series). If you want to get the best possible performance, you should definitely monitor these values whenever one of your columnstore indexes has been (re)built. The data returned from the query can be hard to interpret, because non-integer columns are represented in an internal format. But you can look at the values for different segments of the same column. For instance, the data in figure 4-2 shows that minimum and maximum data id for the ProductKey column are the same for rowgroups 10, 11, and 12 (and, not visible in the screenshot, for rowgroups 0 – 9 as well). In rowgroups 13, 14, and 15, the minimum is higher and the maximum is lower. Regardless of what actual values are represented, it is clear that the interval in these last rowgroups is much smaller. That means that these segments have some potential for segment elimination based on the ProductKey column. But these are very small rowgroups so there will be no gain; all of the full rowgroups include the full range of ProductKey values and will not be able to use segment elimination. In a later level I will show how you can remedy this situation.

There are more columns in the sys.column_store_segments view, but they are not really useful for monitoring or troubleshooting. They do expose some of the internal details of how SQL Server encodes the data in the columnstore index, so you can add them to the query if you want to look at these values. A deep coverage of these columns is beyond the scope of this article, so I will only briefly describe them. The has_nulls columns is 1 if at least one actual NULL value was encountered in the segment when building the index. If that is the case, then null_value represents the magic value that SQL Server has chosen to use to represent NULL values in this segment.

The encoding_type is, as the name implies, the type of encoding used in the segment. If you add this column to the query, you will see that different encoding types can be used for different segments of the same column. The possible values in this column and their meaning are as follows:

ValueType of encoding
1Value based
2Dictionary encoding of non-strings
3Dictionary encoding of strings
4No encoding

The base_id and magnitude columns are only relevant when value based encoding is used. They are used to scale down numbers, so that fewer bits are needed to store the individual values. For instance, when all values are multiples of 100, magnitude is set to 100 and all values are divided by 100 before being stored. And when all values are between 12345 and 13579, base_id is set to 12345 (the actual value might be slightly less or more) and that number is subtracted from the value, so that values between 0 and 1234 are stored.

The primary_dictionary_id and secondary_dictionary_id columns apply when dictionary encoding is used. They are references to the global and local dictionaries (originally called primary and secondary dictionaries; the column names have not been changed for backwards compatibility reasons).

Finally, the version column is always 1 on both SQL Server 2012 and SQL Server 2014. The description in Books Online suggests that this is included for future use, to enable Microsoft to add alternative storage formats for columnstore segments in a future release.

Some background to the query

When looking at the query in listing 4-3, you might wonder why the sys.index_columns and sys.columns views are joined with an outer join. The reason for this is that a nonclustered columnstore index will sometimes contain additional columns that were not included in the CREATE INDEX statement. If you have not explicitly included all the columns of the table’s clustered (row-store) index, SQL Server will still add all of them to the nonclustered columnstore index. Additionally, if the table’s clustered index is a non-unique index, SQL Server will have created a hidden uniqueifier that you cannot explicitly specify, but will still be added to the columnstore index. And if the table has no clustered index at all, SQL Server will add a hidden column for the RID (or Row ID, which is the location of the row in the heap). Data for each of these columns is included in sys.column_store_segments, but not in sys.index_columns. The outer join to sys.index_columns ensures that these “hidden columns” in sys.column_store_segments are still exposed; the join to sys.indexes must then also be an outer join because otherwise this data would still be lost.

Information on dictionaries

An encoding_type of 2 or 3 in sys.column_store_segments means that dictionary encoding is used. SQL Server uses a global dictionary or a local dictionary (or both) to store a list of values that can then be replaced by a reference to the dictionary entry in the actual data. You can query the sys.column_store_dictionaries view, using the primary_dictionary_id and secondary_dictionary_id entries found in sys.column_store_segments, to retrieve additional information about these dictionaries. One possible way to do so is by using the query shown in listing 4-4.

USE ContosoRetailDW;
GO
SELECT     OBJECT_NAME(i.object_id)          AS TableName,
           i.name                            AS IndexName,
           i.type_desc                       AS IndexType,
           COALESCE(c.name, '* Internal *')  AS ColumnName,
           p.partition_number,
           s.segment_id,
           s.encoding_type,
           dG.type                           AS GlDictType,
           dG.entry_count                    AS GlDictEntryCount,
           dG.on_disk_size                   AS GlDictOnDiskSize,
           dL.type                           AS LcDictType,
           dL.entry_count                    AS LcDictEntryCount,
           dL.on_disk_size                   AS LcDictOnDiskSize
FROM       sys.column_store_segments         AS s
INNER JOIN sys.partitions                    AS p 
      ON   p.hobt_id                          = s.hobt_id
INNER JOIN sys.indexes                       AS i 
      ON   i.object_id                        = p.object_id
      AND  i.index_id                         = p.index_id
LEFT  JOIN sys.index_columns                 AS ic
      ON   ic.object_id                       = i.object_id
      AND  ic.index_id                        = i.index_id
      AND  ic.index_column_id                 = s.column_id
LEFT  JOIN sys.columns                       AS c
      ON   c.object_id                        = ic.object_id
      AND  c.column_id                        = ic.column_id
LEFT  JOIN sys.column_store_dictionaries     AS dG         -- Global dictionary
      ON   dG.hobt_id                         = s.hobt_id
      AND  dG.column_id                       = s.column_id
      AND  dG.dictionary_id                   = s.primary_dictionary_id
LEFT  JOIN sys.column_store_dictionaries     AS dL         -- Local dictionary
      ON   dL.hobt_id                         = s.hobt_id
      AND  dL.column_id                       = s.column_id
      AND  dL.dictionary_id                   = s.secondary_dictionary_id
WHERE      i.name                            IN (N'NCI_FactOnlineSales',
                                                 N'CCI_FactOnlineSales2')
AND        s.encoding_type                   IN (2, 3)
ORDER BY   TableName, IndexName,
           s.column_id, p.partition_number, s.segment_id;

Listing 4-4: Querying dictionary information

Figure 4-3: Dictionary information

The query in listing 4-4 adds two joins to the logic of listing 4-3, to add information about both the global dictionary and the local dictionary. Both joins are outer joins, because segments that use dictionary encoding may use only the global dictionary, only a local dictionary, or both. The number of rows returned by this query will never exceed the number of rows returned by listing 4-3, because the additional filter on the encoding_type column restricts this query to only segments that use a dictionary. On my system, 587 rows were returned.

Three of the columns in the sys.column_store_dictionaries view are exposed in this query. Two of them, entry_count and on_disc_size, have self-descriptive names. You can divide them to find out the length of the average entry in that particular dictionary. The last one, type, tells you what type of values are stored in the dictionary; this column shows 1 for a dictionary of integer values, 3 for a dictionary of string values, or 4 for a dictionary of floating point values.

There are a few other columns in sys.column_store_dictionaries as well, that I did not include in the query. The version column is documented as the “version of the dictionary format”; the valid values and their meaning are not documented but I have never observed a value other than 1 so I guess that this column is included for future plans. The last_id column contains the last data id in the dictionary – but since there is no way for us to access the individual entries by id, this is meaningless for us. On SQL Server 2012 only, this view also includes a flags column, which is not documented anywhere. It has been removed in SQL Server 2014, so I guess that whatever plans the development team had for this column have been revised since.

Combining the queries

Even though I have shown sys.column_store_segments and sys.column_store_dictionaries view separately, you will probably find it more convenient to build and use a single query that combines the two. That is in fact very easy. If you compare the code in listings 4-3 and 4-4, you will see that I only added two extra joins to the FROM clause, changed the column list, and changed the WHERE clause to return only data for segments that use dictionary encoding. If you undo the change to the WHERE clause and add the columns from sys.column_store_segments that you want to see, you will have a single query that returns all relevant information at once.

Information on rowgroups

Whereas the two views above give a lot of interesting information about segments and dictionaries, in SQL Server 2012 there was no way to get information about rowgroups – other than by aggregating the information for all segments in a rowgroup. That changed in SQL Server 2014, when the sys.column_store_row_groups view was introduced.

The sys.column_store_row_groups view was not only introduced later than the other views, it also links to other system views in a new way. The previous views only linked to the index indirectly, through the sys.partitions view; this new view now has a direct link to the index. I personally think that this method is much simpler and cleaner and would have preferred this method right from the start. But introducing it now for a new view while leaving the existing views unchanged does introduce an annoying inconsistency.

The code in listing 4-5 shows one way to query the information on rowgroups. As in the previous queries, I have included an optional filter to restrict the results to specific indexes. The query returns 31 rows on my system, 15 for the nonclustered columnstore index and 16 for the clustered one. But again, the exact number of rows on your system can vary due to the many factors that influence the process of building the columnstore index.

-- SQL Server 2014 only!!
USE ContosoRetailDW;
GO
SELECT     OBJECT_NAME(rg.object_id)   AS TableName,
           i.name                      AS IndexName,
           i.type_desc                 AS IndexType,
           rg.partition_number,
           rg.row_group_id,
           rg.total_rows,
           rg.size_in_bytes
FROM       sys.column_store_row_groups AS rg
INNER JOIN sys.indexes                 AS i
      ON   i.object_id                  = rg.object_id
      AND  i.index_id                   = rg.index_id
WHERE      i.name IN (N'NCI_FactOnlineSales',
                      N'CCI_FactOnlineSales2')
ORDER BY   TableName, IndexName,
           rg.partition_number, rg.row_group_id;

Listing 4-5: Querying rowgroup information

Figure 4-4: Rowgroup information

In this query, I return just four of the columns in this catalog view. There are more, but they are all related to how SQL Server implements modifications to the clustered columnstore index in SQL Server 2014. I will get back to this view and describe those columns in a later level,.

The columns I did include are pretty self-explanatory. The partition_number is always 1 in the output shown in figure 4-2, because I have not partitioned the tables in my sample database. Most production databases that are large enough to warrant using columnstore indexes are also large enough to use partitioned tables; in those cases the partition_number column is a one-based counter for the partition.

The row_group_id is also a counter, but in this case for the rowgroup. Every row group created is assigned a number. For reasons only known inside Microsoft, this number is zero-based. It also always corresponds to the segment_id values used in the other two views.

The total_rows column shows the number of rows in the rowgroup. On my laptop, this number is equal to the maximum (1,048,576) for most rowgroups, except for the last few of each partition within the columnstore index. That means that SQL Server had sufficient resources available to process full rowgroups while the index was built. I have four smaller rowgroups because SQL Server used four threads to build the index, and it simply divided the last rows over all threads when the build process wrapped up. If the total_rows column shows lower numbers across the board after building or rebuilding a columnstore index, your server ran out of resources when the index was created. This is most likely caused by memory pressure though there are a few other possibilities that are beyond the scope of this article.

For updatable columnstore indexes, smaller rowgroups will appear over time as a result of how modifications are processed. Also note that total_rows in this view will always be the same as row_count in sys.column_store_segments.

Finally, the size_in_bytes column shows how much space the data in that rowgroup takes in the columnstore index, after it has been ordered, compressed and stored as described in the previous level. Comparing this data with the data size of the uncompressed data can give you a good indication of how well your data compresses. However, since most applications do not allow you to change the data stored and SQL Server has no way to change the compression process used, there is not much you can do if you see lower compression than you were hoping for.

Conclusion

SQL Server makes a lot of information about columnstore indexes available through the system views. But to get understandable information, you need to know how to join these views to each other, and how to interpret the data returned.

Not all information that the system views return is directly useful. Some of it is informational only, and some information can point at issues that are hard to change. But by looking at this information, you can also find signals that you can act upon. Seeing lots of small rowgroups is a red flag for memory pressure during the columnstore index build process, as described in the previous level. The minimum and maximum values of a column in each segment are used for segment elimination, and a future level will explain how you can change the build process of your columnstore indexes to ensure that the columns most frequently filtered on will contribute the most towards segment elimination.

This article is part of the parent stairway Stairway to Columnstore Indexes

Rate

You rated this post out of 5. Change rating

Share

Share

Rate

You rated this post out of 5. Change rating