The Real World: Implementing Data Compression in SQL Server 2008 R2

,

Introduction

Data Compression is a feature of SQL Server 2008 and higher that can compress tables and indexes, thus reducing the size of the database and saving storage, depending on the nature of the table/index or partition being compressed.

Data Compression not only saves space, but it will also improve performance for read-intensive workloads. This is because the engine has to scan much fewer data pages depeding of the compression gains.

A possible reaction to the above is to go ahead and compress ALL tables and impress your company with the savings made on storage but there are some caveats with Data Compression:

  1. Only IN_ROW_DATA allocation units can be compressed
  2. Data Compression does have performance impact on update operations because decompressing pages for update is CPU intensive
  3. Performing the act itself requires careful planning for workspace (there can be significant growth of the log file)
  4. The degree of compression depends on factors such as number of NULL and 0 fields in the table and the degree of repetition in the data

This is a good place to mention there are two levels of Data Compression:

  1. ROW - optimizes data types and NULL/0 values within rows
  2. PAGE - minimizes data redundacy at page level. A superset of ROW compression

Scenario

In our environment we considered a certain application which keeps historical data in separate tables. The assumption was that older tables would be frequently accessed and even less frequently accessed thus would be good candidates for Data Compression. For us, the space savings was primary, performance improvement for read operations was a bonus.  

The first thing we did was to determine the space used by all tables in the affected databases and focus on the top n largest tables using the query in Listing 1. You will also notice we generated the required compression statement in the same effort:

  ###################################
  # Listing 1: Largest Tables       #
  ###################################
USE DB_NAME()
GO
create table tablesize 
( name varchar(100),
  rows int,
  reserved varchar(100),
  data varchar(100), 
  index_size varchar(100),
  unused varchar(100)
)
GO
insert into tablesize
 exec sp_MSforeachtable 'exec sp_spaceused ''?'''
GO
select 'ALTER TABLE [dbo].[' + name + ' ] REBUILD PARTITION = ALL WITH (DATA_COMPRESSION = PAGE, SORT_IN_TEMPDB=ON, MAXDOP=8);', name [Table Name], CAST(REPLACE(data,'KB','') AS int) [Data Size (KB)], CAST(REPLACE(index_size,'KB','') AS int) [Index Size (KB)] from tablesize order by [Data Size (KB)] desc
GO
drop table tablesize
GO

The above listed can be modified by filtering for the specific tables you want to work on as in Listing 2:

#######################################
# Listing 2: Largest Tables Selected  #
#######################################
USE DB_NAME()
GO
create table tablesize (name varchar(100),rows int,reserved varchar(100),data varchar(100), index_size varchar(100),
  unused varchar(100))
GO
insert into tablesize
  exec sp_MSforeachtable 'exec sp_spaceused ''?'''
GO
select 'ALTER TABLE [dbo].[' + name + ' ] REBUILD PARTITION = ALL WITH (DATA_COMPRESSION = PAGE, SORT_IN_TEMPDB=ON, MAXDOP=8);' [Compress Statement], name [Table Name], CAST(REPLACE(data,'KB','') AS int) [Data Size (KB)], CAST(REPLACE(index_size,'KB','') AS int) [Index Size (KB)] from tablesize 
where name in 
  ('T1'
  ,'T2'
  ,'T3'
  ,'T4'
  ,'T5'
  )
  order by [Data Size (KB)] desc
GO
drop table tablesize
GO

The next step was to determine how much space we were actually going to save. SQL Server provides the function sp_estimate_data_compression_savings to determine the possible space gains that may result from Data Compression. In our case we used the following script to generate this information for several tables at once:

####################################
# Listing 3: Space Savings         #
####################################
USE DB_NAME
GO
create table space_savings 
([object_name] varchar(50)
,[schema_name] varchar(30)
,[index_id] int
,[partition_number] int
,[size_with_current_compression_setting(KB)] int
,[size_with_requested_compression_setting(KB)] int
,[sample_size_with_current_compression_setting(KB)] int
,[sample_size_with_requested_compression_setting(KB)] int)
GO
insert into space_savings EXEC sp_estimate_data_compression_savings 'dbo','T1', NULL, NULL, 'PAGE'
insert into space_savings EXEC sp_estimate_data_compression_savings 'dbo','T2', NULL, NULL, 'PAGE'
insert into space_savings EXEC sp_estimate_data_compression_savings 'dbo','T3', NULL, NULL, 'PAGE'
insert into space_savings EXEC sp_estimate_data_compression_savings 'dbo','T4', NULL, NULL, 'PAGE'
GO
select * from space_savings
GO
drop table space_savings
GO

I mentioned earlier our assumption that older tables were infrequently updated. We tested this theory using the following variant of a scripts published by Sanjay Mishra:

####################################
# Listing 4: Percent Update #
####################################
USE DB_NAME()
GO
SELECT db_name() database_name, o.name AS [Table_Name], x.name AS [Index_Name],
       i.partition_number AS [Partition],
       i.index_id AS [Index_ID], x.type_desc AS [Index_Type],
       i.leaf_update_count * 100.0 /
           (i.range_scan_count + i.leaf_insert_count
            + i.leaf_delete_count + i.leaf_update_count
            + i.leaf_page_merge_count + i.singleton_lookup_count
           ) AS [Percent_Update]
FROM sys.dm_db_index_operational_stats (db_id(), NULL, NULL, NULL) i
JOIN sys.objects o ON o.object_id = i.object_id
JOIN sys.indexes x ON x.object_id = i.object_id AND x.index_id = i.index_id
WHERE (i.range_scan_count + i.leaf_insert_count
       + i.leaf_delete_count + leaf_update_count
       + i.leaf_page_merge_count + i.singleton_lookup_count) != 0
AND objectproperty(i.object_id,'IsUserTable') = 1
AND object_name(i.object_id) in 
('T1'
,'T2'
,'T3'
,'T4'
,'T5'
)
ORDER BY [Percent_Update] ASC

###################################
# Listing 5: Percent Scan         #
###################################
USE DB_NAME()
GO
SELECT db_name() database_name, o.name AS [Table_Name], x.name AS [Index_Name],
       i.partition_number AS [Partition],
       i.index_id AS [Index_ID], x.type_desc AS [Index_Type],
       i.range_scan_count * 100.0 /
           (i.range_scan_count + i.leaf_insert_count
            + i.leaf_delete_count + i.leaf_update_count
            + i.leaf_page_merge_count + i.singleton_lookup_count
           ) AS [Percent_Scan]
FROM sys.dm_db_index_operational_stats (db_id(), NULL, NULL, NULL) i
JOIN sys.objects o ON o.object_id = i.object_id
JOIN sys.indexes x ON x.object_id = i.object_id AND x.index_id = i.index_id
WHERE (i.range_scan_count + i.leaf_insert_count
       + i.leaf_delete_count + leaf_update_count
       + i.leaf_page_merge_count + i.singleton_lookup_count) != 0
AND objectproperty(i.object_id,'IsUserTable') = 1
AND object_name(i.object_id) in 
('T1'
,'T2'
,'T3'
,'T4'
,'T5'
)
ORDER BY [Percent_Scan] DESC;

Once we were confortable that we had the right set of tables we went on to apply the ALTER TABLE ... REBUILD scripts generated in Listing 2. Notice the SORT_IN_TEMPDB and MAXDOP options which are intended to improve performance. I should mention here that to get the maximum benefit from SORT_IN_TEMPDB, the tables involved must have clustered indexes otherwise the sort will only have in TempDB for non-clustered indexes if they do exist.

Result

Recall that our primary objective was space savings and we mention the possibility of impact on CPU usage especially if the tables chosen are not as idle or as devoid of updates as thought. Across two databases on which compression was applied we saved over 150GB and were able to release a significant portion of that to the operating system using DBCC SHRINKFILE. The table below is a subset of the actual recorded data though not with the real table names:

Table Name Orig Data Size (KB) Orig Index Size (KB) Curr Data Size (KB) Curr Index Size (KB) Space Savings Data Space Savings Index SSD (GB) SSI (GB)
T1 29241216 280 14152216 104 15089000 176 14.39 0.00
T2 16886960 3411664 10178040 6854120 6708920 -3442456 6.40 -3.28
T3 10758056 128 6629120 88 4128936 40 3.94 0.00
T4 5939608 1743936 5868296 1104816 71312 639120 0.07 0.61
T5 5768320 1483472 5815424 1290048 -47104 193424 -0.04 0.18
T6 5616552 1412600 5630528 1043632 -13976 368968 -0.01 0.35
24.74 -2.14

Table 1: Space Savings

One thing I did find interesting in the result is the tables and indexes that actually increased in size after compression. One explanation for th tables is that the size was not immediately captured after the change was completed. For the indexes, some had not been rebuilt in a while (having not met the fragmentation threshold requirement).Other thoughts on this are welcome.

As per CPU usage, we extracted the trend of wait events from our daily checks for a period of twenty-six days and tried to identify any change in pattern of CPU usage using. There was no significant change. Since most of the tables identified for compression were hardly ever used it made sense that compressing them would not have much impact on CPU subsequently. This was already pretty obvious from the outputs of Listings 4 and 5.

Database_Name Table_Name Index_Name Partition Index_ID Index_Type Percent_Scan
DB1 T5 IX_1 1 6 NONCLUSTERED 46.56
DB1 T4 IX_1 1 6 NONCLUSTERED 12.77
DB1 T4 IX_2 1 3 NONCLUSTERED 1.90
DB1 T4 IX_3 1 10 NONCLUSTERED 0.37
DB1 T4 NULL 1 0 HEAP 0.00
DB1 T5 NULL 1 0 HEAP 0.00
DB1 T4 IX_4 1 4 NONCLUSTERED 0.00
DB1 T4 IX_5 1 5 NONCLUSTERED 0.00

Table 2: Percent Scan on the First Five Tables

Database_Name Table_Name Index_Name Partition Index_ID Index_Type Percent_Update
DB1 T4 IX_2 1 3 NONCLUSTERED 0
DB1 T4 IX_4 1 4 NONCLUSTERED 0
DB1 T4 IX_5 1 5 NONCLUSTERED 0
DB1 T4 IX_1 1 6 NONCLUSTERED 0
DB1 T4 IX_3 1 10 NONCLUSTERED 0
DB1 T5 IX_1 1 6 NONCLUSTERED 0
DB1 T5 IX_2 1 10 NONCLUSTERED 0
DB1 T5 NULL 1 0 HEAP 0.008622272
DB1 T4 NULL 1 0 HEAP 0.079325947

Table 3: Percent Update on the First Five Tables

Another interesting thing you might have noticed from the data on the tables is that only two tables were captured using the scripts in Listings 4 and 5 for Percent Scan and Percent Update checks respectively. This is because the statistics are pulled from the DMF sys.dm_db_index_operational_stats which may not have any entries for a table or index if that table of index in not very active.

The graphs below show the pattern of CPU usage over the period. The compression was applied on the 21st of November and we see that except for spikes due to other operations, the pattern of key CPU related wait events is essentially the same. Actually a more accurate way to have checked this would be to create a User Defined Data Collector Set using Perfmon and specifically track counters in the Processor object.

Fig 1: Top Five Wait Events

Fig 2: SOS Scheduler Wait Event

Finally we also checked for any performance improvements in terms of IO. A good way to do this is to use the SET STATISTICS IO ON statement and issue a specific SELECT statement on the tables. The SET option will return number of logical read and physical reads required. It is expected that the number of IOs will reduce after compression for a particular statement.

In our case, being pressed for time and in view of the fact that we were more concerned with space saving anyway as these tables were hardly used, we merely captured the estimated execution plan for the statements 

###################################
# Listing 6: Test Statements      #
###################################
USE DB2
GO 
SET STATISTICS IO ON 
PRINT GETDATE() 
SELECT * FROM T4; 
SELECT * FROM T4; 
PRINT GETDATE()

Here are outputs from tables T4 and T5:

Fig 3: T4 Before Compression

Fig 4: T4 After Compression

Fig 5: T5 Before Compression

Fig 5: T5 After Compression

Notice the drop in Estimated IO Cost and the subtle rise in Estimated CPU Cost.

Rate

4.67 (3)

Share

Share

Rate

4.67 (3)