The Real World: Implementing Data Compression in SQL Server 2008 R2

Introduction

Data Compression is a feature of SQL Server 2008 and higher that can compress tables and indexes, thus reducing the size of the database and saving storage, depending on the nature of the table/index or partition being compressed.

Data Compression not only saves space, but it will also improve performance for read-intensive workloads. This is because the engine has to scan much fewer data pages depeding of the compression gains.

A possible reaction to the above is to go ahead and compress ALL tables and impress your company with the savings made on storage but there are some caveats with Data Compression:

Only IN_ROW_DATA allocation units can be compressed
Data Compression does have performance impact on update operations because decompressing pages for update is CPU intensive
Performing the act itself requires careful planning for workspace (there can be significant growth of the log file)
The degree of compression depends on factors such as number of NULL and 0 fields in the table and the degree of repetition in the data

This is a good place to mention there are two levels of Data Compression:

ROW - optimizes data types and NULL/0 values within rows
PAGE - minimizes data redundacy at page level. A superset of ROW compression

Scenario

In our environment we considered a certain application which keeps historical data in separate tables. The assumption was that older tables would be frequently accessed and even less frequently accessed thus would be good candidates for Data Compression. For us, the space savings was primary, performance improvement for read operations was a bonus.

The first thing we did was to determine the space used by all tables in the affected databases and focus on the top n largest tables using the query in Listing 1. You will also notice we generated the required compression statement in the same effort:

  ###################################
  # Listing 1: Largest Tables       #
  ###################################
USE DB_NAME()
GO
create table tablesize 
( name varchar(100),
  rows int,
  reserved varchar(100),
  data varchar(100), 
  index_size varchar(100),
  unused varchar(100)
)
GO
insert into tablesize
 exec sp_MSforeachtable 'exec sp_spaceused ''?'''
GO
select 'ALTER TABLE [dbo].[' + name + ' ] REBUILD PARTITION = ALL WITH (DATA_COMPRESSION = PAGE, SORT_IN_TEMPDB=ON, MAXDOP=8);', name [Table Name], CAST(REPLACE(data,'KB','') AS int) [Data Size (KB)], CAST(REPLACE(index_size,'KB','') AS int) [Index Size (KB)] from tablesize order by [Data Size (KB)] desc
GO
drop table tablesize
GO

The above listed can be modified by filtering for the specific tables you want to work on as in Listing 2:

#######################################
# Listing 2: Largest Tables Selected  #
#######################################
USE DB_NAME()
GO
create table tablesize (name varchar(100),rows int,reserved varchar(100),data varchar(100), index_size varchar(100),
  unused varchar(100))
GO
insert into tablesize
  exec sp_MSforeachtable 'exec sp_spaceused ''?'''
GO
select 'ALTER TABLE [dbo].[' + name + ' ] REBUILD PARTITION = ALL WITH (DATA_COMPRESSION = PAGE, SORT_IN_TEMPDB=ON, MAXDOP=8);' [Compress Statement], name [Table Name], CAST(REPLACE(data,'KB','') AS int) [Data Size (KB)], CAST(REPLACE(index_size,'KB','') AS int) [Index Size (KB)] from tablesize 
where name in 
  ('T1'
  ,'T2'
  ,'T3'
  ,'T4'
  ,'T5'
  )
  order by [Data Size (KB)] desc
GO
drop table tablesize
GO

The next step was to determine how much space we were actually going to save. SQL Server provides the function sp_estimate_data_compression_savings to determine the possible space gains that may result from Data Compression. In our case we used the following script to generate this information for several tables at once:

####################################
# Listing 3: Space Savings         #
####################################
USE DB_NAME
GO
create table space_savings 
([object_name] varchar(50)
,[schema_name] varchar(30)
,[index_id] int
,[partition_number] int
,[size_with_current_compression_setting(KB)] int
,[size_with_requested_compression_setting(KB)] int
,[sample_size_with_current_compression_setting(KB)] int
,[sample_size_with_requested_compression_setting(KB)] int)
GO
insert into space_savings EXEC sp_estimate_data_compression_savings 'dbo','T1', NULL, NULL, 'PAGE'
insert into space_savings EXEC sp_estimate_data_compression_savings 'dbo','T2', NULL, NULL, 'PAGE'
insert into space_savings EXEC sp_estimate_data_compression_savings 'dbo','T3', NULL, NULL, 'PAGE'
insert into space_savings EXEC sp_estimate_data_compression_savings 'dbo','T4', NULL, NULL, 'PAGE'
GO
select * from space_savings
GO
drop table space_savings
GO

I mentioned earlier our assumption that older tables were infrequently updated. We tested this theory using the following variant of a scripts published by Sanjay Mishra:

####################################
# Listing 4: Percent Update #
####################################
USE DB_NAME()
GO
SELECT db_name() database_name, o.name AS [Table_Name], x.name AS [Index_Name],
       i.partition_number AS [Partition],
       i.index_id AS [Index_ID], x.type_desc AS [Index_Type],
       i.leaf_update_count * 100.0 /
           (i.range_scan_count + i.leaf_insert_count
            + i.leaf_delete_count + i.leaf_update_count
            + i.leaf_page_merge_count + i.singleton_lookup_count
           ) AS [Percent_Update]
FROM sys.dm_db_index_operational_stats (db_id(), NULL, NULL, NULL) i
JOIN sys.objects o ON o.object_id = i.object_id
JOIN sys.indexes x ON x.object_id = i.object_id AND x.index_id = i.index_id
WHERE (i.range_scan_count + i.leaf_insert_count
       + i.leaf_delete_count + leaf_update_count
       + i.leaf_page_merge_count + i.singleton_lookup_count) != 0
AND objectproperty(i.object_id,'IsUserTable') = 1
AND object_name(i.object_id) in 

('T1'
,'T2'
,'T3'
,'T4'
,'T5'
)
ORDER BY [Percent_Update] ASC

###################################
# Listing 5: Percent Scan         #
###################################
USE DB_NAME()
GO
SELECT db_name() database_name, o.name AS [Table_Name], x.name AS [Index_Name],
       i.partition_number AS [Partition],
       i.index_id AS [Index_ID], x.type_desc AS [Index_Type],
       i.range_scan_count * 100.0 /
           (i.range_scan_count + i.leaf_insert_count
            + i.leaf_delete_count + i.leaf_update_count
            + i.leaf_page_merge_count + i.singleton_lookup_count
           ) AS [Percent_Scan]
FROM sys.dm_db_index_operational_stats (db_id(), NULL, NULL, NULL) i
JOIN sys.objects o ON o.object_id = i.object_id
JOIN sys.indexes x ON x.object_id = i.object_id AND x.index_id = i.index_id
WHERE (i.range_scan_count + i.leaf_insert_count
       + i.leaf_delete_count + leaf_update_count
       + i.leaf_page_merge_count + i.singleton_lookup_count) != 0
AND objectproperty(i.object_id,'IsUserTable') = 1
AND object_name(i.object_id) in 
('T1'
,'T2'
,'T3'
,'T4'
,'T5'
)
ORDER BY [Percent_Scan] DESC;

Once we were confortable that we had the right set of tables we went on to apply the ALTER TABLE ... REBUILD scripts generated in Listing 2. Notice the SORT_IN_TEMPDB and MAXDOP options which are intended to improve performance. I should mention here that to get the maximum benefit from SORT_IN_TEMPDB, the tables involved must have clustered indexes otherwise the sort will only have in TempDB for non-clustered indexes if they do exist.

Result

Recall that our primary objective was space savings and we mention the possibility of impact on CPU usage especially if the tables chosen are not as idle or as devoid of updates as thought. Across two databases on which compression was applied we saved over 150GB and were able to release a significant portion of that to the operating system using DBCC SHRINKFILE. The table below is a subset of the actual recorded data though not with the real table names:

Table Name	Orig Data Size (KB)	Orig Index Size (KB)	Curr Data Size (KB)	Curr Index Size (KB)	Space Savings Data	Space Savings Index	SSD (GB)	SSI (GB)
T1	29241216	280	14152216	104	15089000	176	14.39	0.00
T2	16886960	3411664	10178040	6854120	6708920	-3442456	6.40	-3.28
T3	10758056	128	6629120	88	4128936	40	3.94	0.00
T4	5939608	1743936	5868296	1104816	71312	639120	0.07	0.61
T5	5768320	1483472	5815424	1290048	-47104	193424	-0.04	0.18
T6	5616552	1412600	5630528	1043632	-13976	368968	-0.01	0.35

							24.74	-2.14

Table 1: Space Savings

One thing I did find interesting in the result is the tables and indexes that actually increased in size after compression. One explanation for th tables is that the size was not immediately captured after the change was completed. For the indexes, some had not been rebuilt in a while (having not met the fragmentation threshold requirement).Other thoughts on this are welcome.

As per CPU usage, we extracted the trend of wait events from our daily checks for a period of twenty-six days and tried to identify any change in pattern of CPU usage using. There was no significant change. Since most of the tables identified for compression were hardly ever used it made sense that compressing them would not have much impact on CPU subsequently. This was already pretty obvious from the outputs of Listings 4 and 5.

Database_Name	Table_Name	Index_Name	Partition	Index_ID	Index_Type	Percent_Scan
DB1	T5	IX_1	1	6	NONCLUSTERED	46.56
DB1	T4	IX_1	1	6	NONCLUSTERED	12.77
DB1	T4	IX_2	1	3	NONCLUSTERED	1.90
DB1	T4	IX_3	1	10	NONCLUSTERED	0.37
DB1	T4	NULL	1	0	HEAP	0.00
DB1	T5	NULL	1	0	HEAP	0.00
DB1	T4	IX_4	1	4	NONCLUSTERED	0.00
DB1	T4	IX_5	1	5	NONCLUSTERED	0.00

Table 2: Percent Scan on the First Five Tables

Database_Name	Table_Name	Index_Name	Partition	Index_ID	Index_Type	Percent_Update
DB1	T4	IX_2	1	3	NONCLUSTERED	0
DB1	T4	IX_4	1	4	NONCLUSTERED	0
DB1	T4	IX_5	1	5	NONCLUSTERED	0
DB1	T4	IX_1	1	6	NONCLUSTERED	0
DB1	T4	IX_3	1	10	NONCLUSTERED	0
DB1	T5	IX_1	1	6	NONCLUSTERED	0
DB1	T5	IX_2	1	10	NONCLUSTERED	0
DB1	T5	NULL	1	0	HEAP	0.008622272
DB1	T4	NULL	1	0	HEAP	0.079325947

Table 3: Percent Update on the First Five Tables

Another interesting thing you might have noticed from the data on the tables is that only two tables were captured using the scripts in Listings 4 and 5 for Percent Scan and Percent Update checks respectively. This is because the statistics are pulled from the DMF sys.dm_db_index_operational_stats which may not have any entries for a table or index if that table of index in not very active.

The graphs below show the pattern of CPU usage over the period. The compression was applied on the 21st of November and we see that except for spikes due to other operations, the pattern of key CPU related wait events is essentially the same. Actually a more accurate way to have checked this would be to create a User Defined Data Collector Set using Perfmon and specifically track counters in the Processor object.

Fig 1: Top Five Wait Events

Fig 2: SOS Scheduler Wait Event

Finally we also checked for any performance improvements in terms of IO. A good way to do this is to use the SET STATISTICS IO ON statement and issue a specific SELECT statement on the tables. The SET option will return number of logical read and physical reads required. It is expected that the number of IOs will reduce after compression for a particular statement.

In our case, being pressed for time and in view of the fact that we were more concerned with space saving anyway as these tables were hardly used, we merely captured the estimated execution plan for the statements

###################################
# Listing 6: Test Statements      #
###################################
USE DB2
GO 
SET STATISTICS IO ON 
PRINT GETDATE() 
SELECT * FROM T4; 
SELECT * FROM T4; 
PRINT GETDATE()

Here are outputs from tables T4 and T5: