Data Compression Double Take

Most people have a broad idea of how data compression works; to oversimplify, a lossless compression algorithm analyses the binary-encoded data for repeating units, and their frequency, and each repeating unit is represented by a single, highly-efficient encoding, using a very small number of bits, so reducing redundant data storage and the size of the file.

A few years back, it was unthinkable to send a Word .doc file over email without first zipping it up. With the advent of .docx, this practice is redundant; since it is already stored in a compressed format there is no benefit in applying double compression (same argument for PDF and other compressed file formats). In fact, it's even common in such cases to see compressed file sizes slightly bigger than the original, since additional header information and so on is added to the compressed file.

The appliance of compression technology to database files is a relative innovation in the SQL Server world, with SQL 2008 Enterprise. Here, we don't compress the whole file, but individual tables and indexes within. Row compression frees up empty space mainly by compressing variable length data types. Page compression adds in prefix and dictionary compression which essentially, as described above, eliminates space by removing redundant storage of repeating units of data.

Clearly, the process of compressing (and decompressing) data will burn additional CPU cycles, but the benefits for query performance lie in the potential for greatly reduced I/O, due to the smaller footprint both on disk and in memory, since the data remains compressed in the buffer cache. Having said this, I confess to being somewhat surprised by the relative lack of "real", published performance data that has emerged over four years, to back this up.

The ever-reliable Linchi Shea took an early look, and proved both the huge potential performance boost it could offer for table scans, the size of the negative impact on data modifications, and the vastly differing compression ratios that could be seen for the same data, depending on its distribution. The latter case, unless you have CPU cycles to burn (which some modern, multi-core processors do!), proves that you should choose carefully which objects to compress; if your index happens to organize data in such a way that there is little repetition on many of the pages for that object, then there really isn't much to compress.

However, in particular, I still find myself wondering about the impact of this data compression on, for example, multi-table JOIN queries. Research exists to suggest that most types of join will proceed quite happily on compressed data and even with some performance benefit, but it would be reassuring to have people share some real world data.

Finally, with a Red Gate hat on for a moment, I wonder if databases could be one exception to the rule that "double compression does no good"; in other words, whether we might benefit from compressing at the file as well as the object level. A tool like SQL Storage Compress (based on Hyperbac technology) works at the file driver level and compresses the whole database file; it compresses and decompresses the data as it passes data to and from SQL Server (so the data is not compressed in the cache in this case). In his initial study, Brad McGehee saw a huge space saving, over native page compression, mainly due to the fact that Storage Compress can also remove all unused space in the data file. It will also compress LOB data (varchar(max) and such), which native data compression will not.

I want to hear from anyone who has direct experience with any of these data compression technologies, good or bad; if you have performance data to share, even better!

Cheers,

Tony.