Click here to monitor SSC
SQLServerCentral is supported by Redgate
Log in  ::  Register  ::  Not logged in

Data Compression Double Take

By Tony Davis,

Most people have a broad idea of how data compression works; to oversimplify, a lossless compression algorithm analyses the binary-encoded data for repeating units, and their frequency, and each repeating unit is represented by a single, highly-efficient encoding, using a very small number of bits, so reducing redundant data storage and the size of the file.

A few years back, it was unthinkable to send a Word .doc file over email without first zipping it up. With the advent of .docx, this practice is redundant; since it is already stored in a compressed format there is no benefit in applying double compression (same argument for PDF and other compressed file formats). In fact, it's even common in such cases to see compressed file sizes slightly bigger than the original, since additional header information and so on is added to the compressed file.

The appliance of compression technology to database files is a relative innovation in the SQL Server world, with SQL 2008 Enterprise. Here, we don't compress the whole file, but individual tables and indexes within. Row compression frees up empty space mainly by compressing variable length data types. Page compression adds in prefix and dictionary compression which essentially, as described above, eliminates space by removing redundant storage of repeating units of data.

Clearly, the process of compressing (and decompressing) data will burn additional CPU cycles, but the benefits for query performance lie in the potential for greatly reduced I/O, due to the smaller footprint both on disk and in memory, since the data remains compressed in the buffer cache. Having said this, I confess to being somewhat surprised by the relative lack of "real", published performance data that has emerged over four years, to back this up.

The ever-reliable Linchi Shea took an early look, and proved both the huge potential performance boost it could offer for table scans, the size of the negative impact on data modifications, and the vastly differing compression ratios that could be seen for the same data, depending on its distribution. The latter case, unless you have CPU cycles to burn (which some modern, multi-core processors do!), proves that you should choose carefully which objects to compress; if your index happens to organize data in such a way that there is little repetition on many of the pages for that object, then there really isn't much to compress.

However, in particular, I still find myself wondering about the impact of this data compression on, for example, multi-table JOIN queries. Research exists to suggest that most types of join will proceed quite happily on compressed data and even with some performance benefit, but it would be reassuring to have people share some real world data.

Finally, with a Red Gate hat on for a moment, I wonder if databases could be one exception to the rule that "double compression does no good"; in other words, whether we might benefit from compressing at the file as well as the object level. A tool like SQL Storage Compress (based on Hyperbac technology) works at the file driver level and compresses the whole database file; it compresses and decompresses the data as it passes data to and from SQL Server (so the data is not compressed in the cache in this case). In his initial study, Brad McGehee saw a huge space saving, over native page compression, mainly due to the fact that Storage Compress can also remove all unused space in the data file. It will also compress LOB data (varchar(max) and such), which native data compression will not.

I want to hear from anyone who has direct experience with any of these data compression technologies, good or bad; if you have performance data to share, even better!



Total article views: 175 | Views in the last 30 days: 1
Related Articles

SQL Server Central Webinar Series #20: Using SQL Storage Compress in the Real World

Join us on July 10 at 4:00pm GMT/11:00am EDT for an in-depth look at SQL Storage Compress and learn ...


Improving Database Design and Performance Using Compression

This article attempts to show a practical scenario on improving DB design and performance through ro...


Data Compression and Snapshot Isolation

Data Compression and Snapshot Isolation don't play well together, you may not see a performance bene...


Webinar: Using SQL Storage Compress by Steve Jones and Brad McGehee

Steve Jones and Brad McGehee lead you through real-world examples of working with SQL Storage Compre...


Comparing Space Usage for Different Compression Types

This is an investigation, to get an impression on the impact of the used disk space before and after...