NULL-Friendly: Using Sparse Columns and Column Sets in SQL Server

Sparse columns and column sets in SQL Server 2012 provide a way of accomodating the less-structured data that has always proved to be tricky for the relational model. They can be used very effectively where the attributes are sparse for any given entity and very numerous across all entities. Seth Delconte shows how to use them.

Sparse columns and column sets can be used in conjunction, and are ideal for fields that contain mostly NULL values. Sparse columns handle NULL values extremely efficiently; column sets combine all sparse columns into an XML representation as a new column.

Sparse columns are designed to allow a relational database to store and process relatively unstructured data, where any individual entity may have a modest selection from a very wide set of attributes. This sort of data can be found frequently in scientific, pharmaceutical, and medical applications. EAV (Entity/Attribute/Value) storage techniques were engineered for such data; however, EAV data storage has proved to be somewhat inefficient in such areas as query speed and storage cost.

Sparse columns work well where a high proportion of the data in a column is NULL, because negligible storage space is used for the NULL representation. A table that is wide in theory can be made narrow in practice by implementing sparse columns, and the addition of filtered indexes provides effective searches on the sparse data. Sparse columns support the use of check constraints, as long as the constraints allow NULL values (a sparse column requirement). Constraints introduce basic data type checking, and thus allow for a good range of validation and integrity-checking techniques.

Column sets provide a solution to the difficulties of efficient imports and exports of sparse data. The untyped XML representation that is generated for each row interacts directly with each sparse field. The XML field accepts inserts, and its data can be extracted and modified easily. This feature effectively supports certain types of serialized object data from various applications.

Sparse Columns

The SPARSE column property is a special, NULL-friendly column option – introduced with SQL Server 2008.  Sparse column data storage is extremely efficient for NULL values. In fact, a NULL value requires no space at all in a sparse column – MSDN states that “…when the column value is NULL for any row in the table, the values require no storage.” However, the storage requirement for a non-NULL value is increased by up to 4 bytes when the SPARSE column property is used. Given that trade-off, Microsoft recommends not using sparse columns unless the percentage of NULL values in a column is high enough that a 20 percent to 40 percent storage savings gain would result. The ratio of NULLs to real values that would warrant implementing a sparse column differs for each data type. For example, when using the datetime type, 52% of the values must be NULL in order to save 40% in storage, but for the bit data type, 98% must be NULL. Fortunately, Microsoft has compiled a chart to easily find the percentage information for each data type.

Sparse columns are ordinary columns, with the addition of the SPARSE property. To create a table with sparse columns, simply use the SPARSE keyword:

In order to make a column sparse after-the-fact, use an ALTER TABLE statement:

This can also be accomplished by changing the ‘Is Sparse‘ column property to ‘Yes‘ in table design view in the SQL Server Management Studio GUI:

1519-image001.png

A sparse column must be NULLABLE, so the NULL keyword is optional. A sparse column is populated in the same manner as a normal column:

1519-image002.png

Comparing NULL Storage

To demonstrate the space savings gained when using sparse columns, we will create two new tables that contain many NULL values, and then examine the amount of space used for storage in each table. First, we’ll create an Employees_sparse table with three sparse columns:

Next, an identical table without the SPARSE property enabled:

Before we populate these tables, let’s add a CHECK constraint to the EMAIL field. Sparse columns DO support CHECK constraints. We want to ensure that each email address in the table will contain the ‘@’ symbol, and that the ‘@’ symbol appears only once in each email address:

To save time, we’ll use RedGate’s SQL Data Generator to insert 50000 Employee records into the Employees_sparse table.

Now that we have test data in the table, let’s take a look at portion of it to verify that the distribution of data in the sparse columns is realistic:

1519-image003.png

We see a good sampling of the data, where there are very few values in the columns that have the SPARSE property enabled. The Data Generator recognized that TITLE, MIDDLEINIT, and EMAIL are sparse columns, and should contain very few values.

Because we want to compare data storage size between the two tables, we need to insert identical data into the non-sparse table (Employees). We’ll copy over all 50000 records from the sparse table:

To view the sparse table’s storage savings, we’ll run the sp_spaceused system stored procedure on both tables:

1519-image004.png

We see that the storage difference between the two tables is quite notable, including index sizes, when the data is very sparse. However, the space savings that were gained can be lost quickly in the event that many NULL values are replaced with real values.

Sparse Column Limitations

A few things to keep in mind when considering using sparse columns:

  • Sparse columns cannot have default values, and must accept NULL values
  • A computed column cannot be SPARSE
  • Sparse columns do not support data compression
  • A sparse column cannot be a primary key

Locating Sparse Columns

A list of all columns that use the SPARSE property can be obtained via the is_sparse field in sys.columns:

1519-image005.png

Column Sets

An interesting enhancement to a table that uses sparse columns is a column set. A sparse column set gathers all sparse columns into a new column that is similar to a derived or computed column, but with additional functionality – its data can be updated and selected from directly. A column set is calculated based on the sparse columns in a table, and it generates an untyped XML representation of all sparse columns and values (NULL or otherwise). Keep in mind that a column set is optional. Sparse columns may be used without the implementation of a sparse column set.

A column set may be included in a table definition at the time of creation, or added later, if no sparse columns exist yet. A column set may not be added to a table that contains existing sparse columns:

To include the sparse column set in the Royalty table, we must first drop and then recreate the table:

We are now able to populate the table in order to demonstrate the behavior of the column set:

1519-replace1.png

We can see that the results of a SELECT * FROM Royalty are completely different after the introduction of the column set. Every one of the sparse columns is now gathered into an XML string. Only the non-sparse Chronology field is not included in the column set. However, each and every sparse column is still accessible individually when invoked by name:

1519-replace2.png

Almost every column in the Royalty table makes use of the SPARSE property. Keep in mind that for a column to make good use of the SPARSE property, it is expected that most of the values will be NULL.

Column Set Advantages

What are the benefits of using a sparse column set? For one, the column set itself is directly updatable. Because of this, operations performed directly on the column set may be more efficient than when they are performed on individual columns. After an update or an insert, the values that were added to the column set are immediately accessible via the singular sparse column(s):

1519-image008.jpg

To demonstrate the efficiency of a column set operation, consider the following situation: the Royalty table is empty, and we have an XML string that contains all of the records that we want in the table. We can insert the data directly to the DetailSet (xml column set) column all at once, with the help of a bit of XQuery. Let’s remove the existing records in the Royalty table to prepare for the XML import:

We have an XML string that contains all the data that we want to put in the empty table:

Because we have a built-in identifier for each Person node (the Chronology attribute), we can use a SQL variable in a small XQuery script that will serve as a counter for a loop, and will also delineate each Person record for the insert operation:

1519-image009.png

This can streamline otherwise time-consuming data imports. Even if the XML schema changes frequently, or is structured poorly, directly importing it to a customized column set can be relatively easy – only one representative record of the XML data would have to be examined. A table structure that included appropriate sparse columns could then be constructed, based on the XML schema.

Additionally, the maximum sparse column count per table far outdistances the maximum column count limit for traditional columns – but only when used in conjunction with a column set. The column count limit without the use of a column set is the same for sparse columns as it is for non-sparse columns – 1024. If a script were generated to create an all-sparse column table of 2000 columns, but without a column set defined, the following error would result:

Although only one column set per table is permitted, a sparse column set can contain up to 30,000 sparse columns – a wide table solution for a table that has reached the normal column limit of 1024 columns, and still needs to grow in width. Even though no more than 1024 columns may be returned in a result set at once, the entire XML column set may be returned together. In addition, all columns, sparse or non-sparse, may be reverted to an XML representation (without a root element) by using the FOR XML AUTO clause:

The first two elements of the XML results of the above query are as follows:

1519-image010.png

Notice that the column set is represented as a node containing child nodes for all sparse columns, while any non-sparse columns (the Chronology column) are depicted as node attributes.

Column Set Disadvantages

There are a number of disadvantages to using a column set. As mentioned earlier, only one column set may be used per table, and a column set cannot be added to a table that already contains sparse columns. Replication, distributed queries, and change data capture (CDC) do not support using column sets. Also, a column set cannot be indexed in any way.

What’s next? Introducing Filtered Indexes

Another option to consider for sparse columns is a filtered index. A filtered index, when designed properly, can be much smaller and faster than a normal index, because it only stores data that meet certain criteria. This is accomplished by using a WHERE clause in the index definition. The WHERE clause feature makes filtered indexes ideal for sparse columns, as an index can be designed to store only the non-NULL values from a column. A filtered index must be created as a nonclustered index:

The above index will be small in size, and should return faster, more efficient results for queries similar to the following:

Filtered indexes will not store the NULL values on the heap when the WHERE criteria is configured correctly, unlike unfiltered nonclustered indexes.

Summary

Sparse columns store NULL values very efficiently for tables that contain a high percentage of NULLs. Sparse columns accommodate drastically improved query speeds when used in conjunction with filtered indexes. When used with column sets, sparse columns extend the traditional column count limit for tables, providing a much better solution for storing unstructured data than older, inefficient EAV methods. Column sets also represent sparse data in an automatically generated XML format that allows easy data imports, exports, and manipulation.

Sparse columns do not support primary keys, data compression, NOT NULL constraints, or default values. Non-sparse columns may not be altered to have the SPARSE property, but instead must be created with it. Sparse columns implicitly allow NULLs upon table creation. Column sets are not compatible with data replication, distributed queries, or indexes. Sparse columns are compatible with CHECK constraints and nonclustered indexes.

We’ve looked at some detailed examples that use sparse columns and column sets. Sparse column solutions offer many benefits when applied correctly, but may be unnecessary if the ratio of NULL values to non-NULL values is not high enough, or if the table is not wide enough to warrant the use of a column set.