One of the large advantages of migrating up to SQL Server 2008 is the ability to compress your data, reducing your disk overhead. It is a relatively simple process to compress your data.
Unfortunately from a production standpoint it is only available on the SQL Server 2008 Enterprise Edition (as per http://msdn.microsoft.com/en-us/library/cc645993.aspx), it can however also be used on the Developer Edition (http://www.microsoft.com/sqlserver/2008/en/us/developer.aspx)
The reduction in I/O required to pull data can lead to significant performance improvements, particularly in data warehouse and data mart environments where you are working with extremely large datasets.
What is Data Compression
Data compression is exactly what is says it is. It is a way of compressing the data within your database so that you can reduce greatly the amount of storage space required to host the data. There is a caveat with this, depending upon the amount of stored data within a table, the allocation unit size of your disk and the datatypes you could in fact end up using MORE storage
Note: Allocation Unit Size (AUS) is also known as the cluster or blocksize of your disk. This size is set when you format your disk and can range in size from 512 Bytes to 64 KB. The default AUS is based upon the size of the disk, see http://support.microsoft.com/kb/140365 for more information. Larger file allocations will provide performance improvements for applications such as SQL Server, particularly the 64 KB AUS. The size of a SQL Server extent (8 pages) is 64 KB, and so optimizes performance. The downside of a larger AUS is that is takes a great deal more space to hold a file on the disk, so if you have a lot of smaller files you could end up using a far more disk space than you need to as the disk has to allocate 64 KB of space for even a 2 KB file. If you are curious about the existing AUS on your partitions Microsoft have published a technet article with a script to show the partition properties http://www.microsoft.com/technet/scriptcenter/guide/sas_fsd_grsi.mspx?mfr=true
There are two types of data compression:
- Row compression
- Page compression
Row level compression will provide savings by not storing blank characters within fixed character strings (such as a char(10) with a 5 character value). Null and 0 values are not stored and so do not incur additional storage overhead. For a list of data types that support row level compression see http://msdn.microsoft.com/en-us/library/cc280576.aspx
Page compression uses a far more complex algorithm to minimize the storage space of data, known as dictionary compression. SQL Server looks at all of the data stored on a page and builds a dictionary based upon that data which can be referenced for repeated values, and only the dictionary id and changes of the dictionary value are stored. This provides great savings for similar patterned data. For full details on page level compression and how it works visit http://msdn.microsoft.com/en-us/library/cc280464.aspx Page compression includesrow compression, so you get the benefit of both.
Potential Issues with Data Compression
Data compression is not for everybody. Depending upon the workload of your system, the performance requirements, and whether or not you use encryption this might not be the thing for you. There is a CPU overhead associated with using data compression, and this may adversely impact your system. High volume OLTP systems could be significantly impacted by attempting to implement data compression.
How Do I know if Data Compression is Right for Me?
- Estimate the potential storage savings that you could get by implementing (you could actually end up using MORE storage within certain circumstances)
- Complete a baseline performance analysis of your database server and reproduce this in a development or staging environment. Enable compression and evaluate the performance against that baseline.
- Look at which tables can provide you the biggest benefit. I have found that an audit table I have will actually use 75% savings in disk space with no impact to the application on top of the database. Tables which have a lot of repetitive or numerical data or CHAR columns that are not fully populated are usually excellent candidates for compression.
- Check and recheck your analysis against the baseline, and seek feedback from the users if you do implement it in a production environment.
How Do I know if a Table is Worth Compressing?
There is a stored procedure in SQL Server 2008 called sp_estimate_data_compression_savings. This procedure accepts 5 parameters.
The two critical parameters are @object_name and @data_compression.
The object name refers to the table that you wish to evaluate and @data_compression can have one of three values (NONE, ROW, PAGE). Depending upon the value passed this will perform estimations for the two compression types and strangely for no compression.
The additional parameters provide some more advanced estimation options. @schema_name allows you to estimate the savings against tables on a different schema (by default the procedure only looks at tables within your own schema). @index_id will allow you to specify the estimated savings for a single index on a table based upon the index id, left with the default value of NULL it will assess all of the indexes on that table. @partition_number will allow you to define particular partitions on a table to evaluate potential savings. This can be very useful for estimating the savings on older data, rarely changed living on a partition, which you may want to compress as opposed to more volatile data within a different partition which may not be a good candidate for compression. Worth noting, you have to define an index id in order to evaluate a partition.
Estimating Savings for Entire Databases
It is obviously a time consuming task to go through each and every table an get an estimation of the potential savings you could find, which I why I created a procedure USP_Compression_Savings_By_DB (atttached to this article), which will go out and provide the estimated compression savings for all tables within a single database, or within all databases and load that information into a table for later analysis.
The procedure can be created in the database of your choice and the final analysis table located in another database of your choice. It accepts two parameters, both of which are optional.
- @checkdb – this is the name of the database that we will be checking, if left null then all databases will be checked and the savings estimated
- @admindbname – should you wish to put the results into a particular database you can put a value here, by default it will use the database in which the procedure resides
PLEASE NOTE: The sp_estimate_data_compression_savings procedure is a bit of a heavy hitter, and obviously with checking all tables in a database (or on an instance) using USP_Compression_Savings_By_DB, your system will be hit all the harder. I would recommend running this against a recent dump of your production databases restored on to another system. At the very least run it during the period of least activity on your system.
Compression can be a very useful tool and can provide huge benefits in not only storage but performance in heavy read applications.
Ensure to perform all of the analysis required to prevent your production systems from being negatively impacted by enabling data compression.
The code for usp_compression_savings_by_db is included in the Resource section below.