• t.franz - Monday, April 24, 2017 8:21 AM

    any idea, if (and if yes then how) small datatypes as TINYINT (= 1 byte) could be compressed in columnstored indexes?

    Its clear, that I will save space when I replace "this is a long string" with a simple tinyint (at least, when it occurse more than once), but when the column data type is already as short as possible, it could only work when you save "value 1 (which is the dictonary entry for the original tinyint CountryID 20) will used in rows 1-10, 12-50 and 90-200", but this would it make much harder to recombine rows to answer an SELECT *

    Yes, tinyint data is also compressed. The average space gain is less than for other data types, but it can still be relevant.

    The primary compression method for tinyint will be run-length encoding (RLE). So if the data in rows 1 to 20 reads 1 / 1 / 1 / 1 / 2 / 2 / 2 / 2 / 3 / 3 / 1 / 1 / 1 / 1 / 4 / 1 / 3 / 3 / 3 / 3, then this will be stored as 1 (4) / 2 (4) / 3 (2) / 1 (4) / 4 (1) / 1 (1) / 3 (4). The algorithm that builds the columnstore index includes a step (after dividing the table into rowgroups) where the rows within a rowgroup are sorted to optimize the benefit of RLE. The "optimal" sort order for this column would after RLE result in 1 (9) / 2 (4) / 3 (6) / 4 (1). However, the sort algorithm that is used for columnstore indexes has to look at ALL columns, not just a single one.

    A secondary compression method for tinyint tries to reduce the size of each value to less than 1 full byte. Let's for example say that, within a specific rowgroup, all values in your tinyint columns are NULL or between 23 and 48. The algorithm will replace NULLs with a value not in the normal range (probably 22), and then subtract 22 from each value changing the range to 0 (for NULL) and 1 - 26 (for 23-48); this can then be encoded in 5 bits instead of 8 for the value + 1 for the NULL marker. The "magic value" to represent NULL, and the offset of 22, are stored only once, in the rowgroup metadata. (You can find it in the DMV sys.column_store_segments. This DMV is introduced in level 4, though I have not shown all columns; the null value (null_value) and the offset (base_id) are two of the columns I chose not to include).


    Hugo Kornelis, SQL Server/Data Platform MVP (2006-2016)
    Visit my SQL Server blog: https://sqlserverfast.com/blog/
    SQL Server Execution Plan Reference: https://sqlserverfast.com/epr/