Why are columnstore indexes not very popular?

  • They can play tricks in the engine to use less space in a way you cannot with row storage. The index itself can be smaller than a rowstore index.

  • The paper I mentioned does give some insight into what happens under the covers. I could not find their previous paper, which would give even further insight into the compression algorithm.

    Although it might seem to some that a "look-up table holding unique values and tokenized with a Primary Key on its Identity column" would behave the same way, it is not as simple as that: the engine deals with them differently, and even with integers the VertiPaq engine will do things that "help" reduce, where possible, the disk space taken by each key.

    This link https://www.microsoftpressstore.com/articles/article.aspx?p=2449192&seqNum=3 explains this in a bit more detail with particular examples.

    As people start looking at how this engine works, they may also decide to change their tables and how some columns are defined - at the cost of a slight code change.

    Take, for example, a datetime column: stored in VertiPaq, its dictionary would have to hold the distinct values of every year/month/day/hour/minute/second combination, which is a considerable number of possible distinct values - five years of per-second data is up to 157,766,400 of them. By splitting the column into its date and time components you reduce the possible distinct values (for those five years, at most 1,826 dates plus 86,400 times), so compression is higher, i.e. a shorter dictionary (see the sketch at the end of this post).

    See https://www.sqlbi.com/articles/optimizing-high-cardinality-columns-in-vertipaq/ for examples - although targeted at SSAS Tabular/PowerPivot, the same technique can be applied to SQL Server in some cases.
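
    A minimal sketch of the idea (table and column names are hypothetical, not taken from the articles):

    -- split a datetime into date and time components so each columnstore
    -- dictionary holds far fewer distinct values
    create table dbo.EventLog
    (event_date date     -- at most 366 distinct values per year
    ,event_time time(0)  -- at most 86,400 distinct values
    );

    -- the original datetime can be rebuilt when needed:
    -- dateadd(second, datediff(second, 0, event_time), convert(datetime, event_date))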

  • Steve Jones - SSC Editor wrote:

    They can play tricks in the engine to use less space in a way you cannot with row storage. The index itself can be smaller than a rowstore index.

    Steven, where do you think that extra space saving could come from?

    Do you have a test case to prove it?

    _____________
    Code for TallyGenerator

  • Frederico, do you have any test case to support your speculations?

    I (and I guess Jeff Moden too) would be particularly interested in a test script which demonstrates advantages of splitting datetime into date and time.

    _____________
    Code for TallyGenerator

  • Sergiy wrote:

    Steve Jones - SSC Editor wrote:

    Columnstore would work by tokenizing and limiting the repetition of values. A simple example is gender. Say we have 3 values (M, F, Other). If we had these stored as single bytes, we could (in a columnstore index) store very few copies of these values. These can be easily tokenized and compressed, but more importantly, the index can be read more quickly.

    Can you please explain to me how it is different from (and more effective than) a look-up table holding unique values, tokenized with a Primary Key on its Identity column?

    Is that a serious question?  We have to trust that, because the SQL Engine is handling the dictionary maintenance at a low level, it will be both transparent to the end users (in this case, DBAs and developers) and have better performance.

    That's somewhat like asking why use update or delete statements when you can just do it in a cursor - the end result is the same, right?
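
    For contrast, here is a rough sketch of the manual approach Sergiy describes (hypothetical names) - note the extra table, surrogate key and join you maintain by hand, all of which the columnstore dictionary does for you internally:

    -- hand-rolled dictionary table
    create table dbo.Gender
    (gender_id tinyint identity(1,1) primary key
    ,gender_code varchar(10) not null unique
    );

    -- the fact table stores the surrogate key...
    create table dbo.Person
    (person_id int identity(1,1) primary key
    ,gender_id tinyint not null references dbo.Gender(gender_id)
    );

    -- ...and every query needs a join back to recover the value
    select p.person_id, g.gender_code
    from dbo.Person p
    inner join dbo.Gender g
    on g.gender_id = p.gender_id;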

  • I should not have needed to do this - all the links I supplied have this information, one way or another, and more.

    Small test case - 5 years' worth of data - all days × all seconds of the day (1,826 days × 86,400 seconds = 157,766,400 rows).

    create table test1
    (datefull datetime
    )

    create clustered columnstore index cci_test1 on test1 with (data_compression = columnstore)

    create table test2
    (dateonly date
    ,timeonly time
    )

    create clustered columnstore index cci_test2 on test2 with (data_compression = columnstore)


    declare @startdate datetime = '1899-12-31';
    declare @years int = 5;

    WITH E1(N) AS (
        SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
        SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
        SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
    )
    , E2(N) AS (SELECT 1 FROM E1 a, E1 b)
    , E4(N) AS (SELECT 1 FROM E2 a, E2 b)
    , E6(N) AS (SELECT 1 FROM E4 a, E2 b)
    , secs(n, sec) as (select top 86400 *
                            , ROW_NUMBER() over(order by n) as sec
                       from e6) -- number seconds in a day
    , days(n, day) as (select top (convert(int, @years * 365.25)) *
                            , ROW_NUMBER() over(order by n) as day
                       from e6)

    -- insert full date into test1
    insert into test1
    select final.newdate
    from days y1
    outer apply (select dateadd(day, y1.day, @startdate) as basedate) base
    outer apply (select dateadd(second, secs.sec, base.basedate) as newdate
                 from secs) final
    order by final.newdate
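    -- note: loading in date order should leave each rowgroup covering a narrow
    -- date range, which is what enables the segment elimination ("segment
    -- skipped") visible in the timing statistics later in the thread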

    -- from test1, insert into test2, splitting date and time
    insert into test2
    select convert(date, datefull), convert(time, datefull)
    from test1

    -- now report on sizes of both tables
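    -- sys.column_store_segments exposes per-segment metadata: min/max data_id,
    -- on-disk size, and the magnitude used by value-based encoding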

    select pa.tablename
         , pa.indexname
         , css.column_id
         , min(min_data_id) as min_data_id
         , max(max_data_id) as max_data_id
         , sum(row_count) as row_count
         , sum(on_disk_size) as on_disk_size
         , count(*) as number_segments
         , max(magnitude) as magnitude
    from sys.column_store_segments css
    outer apply (select top 1 ix.name as indexname
                      , object_name(pa.object_id) as tablename
                 from sys.partitions pa
                 inner join sys.indexes ix
                         on ix.index_id = pa.index_id
                        and ix.object_id = pa.object_id
                 where pa.partition_id = css.partition_id
                ) pa
    group by pa.tablename
           , pa.indexname
           , css.column_id
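
    -- and a standard per-table size summary from the allocation units (pages / 128.0 = MB)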

    select s.Name as schemaname
         , t.Name as tablename
         , p.rows as rowcounts
         , cast(round((sum(a.used_pages) / 128.00), 2) as numeric(36, 2)) as used_mb
         , cast(round((sum(a.total_pages) - sum(a.used_pages)) / 128.00, 2) as numeric(36, 2)) as unused_mb
         , cast(round((sum(a.total_pages) / 128.00), 2) as numeric(36, 2)) as total_mb
    from sys.tables t
    inner join sys.indexes i
            on t.object_id = i.object_id
    inner join sys.partitions p
            on i.object_id = p.object_id
           and i.index_id = p.index_id
    inner join sys.allocation_units a
            on p.partition_id = a.container_id
    inner join sys.schemas s
            on t.schema_id = s.schema_id
    group by t.Name
           , s.Name
           , p.Rows
    order by s.Name
           , t.Name

    Output of the above queries:

    tablename  indexname  column_id  min_data_id  max_data_id   row_count  on_disk_size  number_segments  magnitude
    test2      cci_test2  1          693595       695421        157766400  114032        151              -1
    test2      cci_test2  2          0            863990000000  157766400  420799808     151              10000000
    test1      cci_test1  1          3007842      610282496     157766400  1262222224    151              1

    schemaname  tablename  rowcounts  used_mb  unused_mb  total_mb
    dbo         test1      157766400  1231.56  2.20       1233.77
    dbo         test2      157766400  411.91   3.42       415.33
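
    In short: the same 157,766,400 rows drop from 1,231 MB as a single datetime column to 412 MB split into date and time columns, and the magnitude of 10000000 on the time column shows its segments using value-based encoding.
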
  • ZZartin wrote:

    Sergiy wrote:

    Steve Jones - SSC Editor wrote:

    Columnstore would work by tokenizing and limiting the repetition of values. A simple example is gender. Say we have 3 values (M, F, Other). If we had these stored as single bytes, we could (in a columnstore index) store very few copies of these values. These can be easily tokenized and compressed, but more importantly, the index can be read more quickly.

    Can you please explain to me how it is different from (and more effective than) a look-up table holding unique values, tokenized with a Primary Key on its Identity column?

    Is that a serious question?  We have to trust that, because the SQL Engine is handling the dictionary maintenance at a low level, it will be both transparent to the end users (in this case, DBAs and developers) and have better performance.

    That's somewhat like asking why use update or delete statements when you can just do it in a cursor - the end result is the same, right?

    Are you trying to dupe me into some religious cult?

    Sorry, I don't do cults.

    _____________
    Code for TallyGenerator

  • frederico_fonseca wrote:

    Small test case - 5 years' worth of data - all days × all seconds of the day.

    Thanks Frederico. Cool stuff.

    I only made one change to your script:

    create clustered index cci_test1 on test1  (datefull)

    as we're trying to compare use cases of columnstore vs. b-tree indexes.

    I hope you don't mind.

    Now, let's say we need to find some event(s) which happened on a Christmas night in a year, say, 1900.

    For Test1 I've got no question:

    select * from Test1
    where datefull >= '19001224 19:00' and datefull < '19001225 07:00'

    How do you write a query returning the same set for Test2?

    _____________
    Code for TallyGenerator

  • Sergiy wrote:

    ZZartin wrote:

    Sergiy wrote:

    Steve Jones - SSC Editor wrote:

    Columnstore would work by tokenizing and limiting the repetition of values. A simple example is gender. Say we have 3 values (M, F, Other). If we had these stored as single bytes, we could (in a columnstore index) store very few copies of these values. These can be easily tokenized and compressed, but more importantly, the index can be read more quickly.

    Can you please explain to me how it is different from (and more effective than) a look-up table holding unique values, tokenized with a Primary Key on its Identity column?

    Is that a serious question?  We have to trust that, because the SQL Engine is handling the dictionary maintenance at a low level, it will be both transparent to the end users (in this case, DBAs and developers) and have better performance.

    That's somewhat like asking why use update or delete statements when you can just do it in a cursor - the end result is the same, right?

    Are you trying to dupe me into some religious cult?

    Sorry, I don't do cults.

    /shrug I think the fact that dictionaries are maintained transparently in the internal guts of a columnstore index, vs. having to maintain and join a lookup table by hand, is a significant point in favor of columnstores being more effective.

  • ZZartin wrote:

    /shrug I think the fact that dictionaries are maintained transparently in the internal guts of a columnstore index, vs. having to maintain and join a lookup table by hand, is a significant point in favor of columnstores being more effective.

    Is doing less typing your only concern when you choose a database design approach?

    _____________
    Code for TallyGenerator

  • Sergiy wrote:

    Frederico, do you have any test case to support your speculations?

    I (and I guess Jeff Moden too) would be particularly interested in a test script which demonstrates advantages of splitting datetime into date and time.

    The best place to learn about ColumnStore is from Niko Neugebauer. Niko has examples that show how the compression works.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Again, some basics - a few options to do it, certainly not exhaustive - and if anyone has a "faster" method, please supply it and I can test it on the same machine to compare performance.

    The server is an Intel i7 with 4 GB allocated to SQL Server - 4 cores/8 HT - MAXDOP 4, cost threshold for parallelism 50.

    Locally attached SSD drive.

    I've also created two new tables - same columns and data as the original two, but with a normal clustered index and page compression.

    Table sizes below:

    schemaname  tablename       rowcounts  used_mb  unused_mb  total_mb
    dbo         test1           157766400  1231.56  2.20       1233.77
    dbo         test1_rowstore  157766400  1678.48  0.32       1678.80
    dbo         test2           157766400  411.91   3.42       415.33
    dbo         test2_rowstore  157766400  1683.83  0.22       1684.05

    set statistics io on
    set statistics time on
    set nocount on
    print '-- test1 - datefull'
    select *
    from Test1
    where datefull >= '19001224 19:00'
    and datefull < '19001225 07:00'

    -- this is the method I would use
    print '-- test2 - method 1 - dateadd'
    select *
    from Test2
    where dateonly >= '19001224'
    and dateonly <= '19001225'
    and (dateadd(second, datediff(second, 0, timeonly), convert(datetime, dateonly)) >= '19001224 19:00')
    and (dateadd(second, datediff(second, 0, timeonly), convert(datetime, dateonly)) < '19001225 07:00')
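    -- datediff(second, 0, timeonly) gives seconds since midnight, so the two
    -- dateadd predicates rebuild the original datetime for the boundary checks,
    -- while the plain dateonly range above stays sargable for segment elimination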

    -- we could also do it like this, but it's not an option I would pick unless absolutely required for performance

    print '-- test2 - method 2 - union all'
    select *
    from Test2
    where dateonly > '19001224'
    and dateonly < '19001225'

    union all

    select *
    from Test2
    where (dateonly = '19001224' and timeonly >= '19:00')

    union all

    select *
    from Test2
    where (dateonly = '19001225' and timeonly < '07:00')



    print '-- test2 - method 3 - or condition'
    -- the following method would most often be the one used by developers - not the best option,
    -- as the predicates aren't pushed down to the vertipaq engine in the same way, resulting in more reads
    select *
    from Test2
    where dateonly > '19001224'
    and dateonly < '19001225'
    or (dateonly = '19001224' and timeonly >= '19:00')
    or (dateonly = '19001225' and timeonly < '07:00')

    print '-- rowstore test1'
    select *
    from Test1_rowstore
    where datefull >= '19001224 19:00'
    and datefull < '19001225 07:00'

    -- this is the method I would use
    print '-- rowstore test2 - method 1 - dateadd'
    select *
    from Test2_rowstore
    where dateonly >= '19001224'
    and dateonly <= '19001225'
    and (dateadd(second, datediff(second, 0, timeonly), convert(datetime, dateonly)) >= '19001224 19:00')
    and (dateadd(second, datediff(second, 0, timeonly), convert(datetime, dateonly)) < '19001225 07:00')

    Timings, warm cache first:

    -- test1 - datefull
    Table 'test1'. Scan count 4, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 6291, lob physical reads 0, lob read-ahead reads 0.
    Table 'test1'. Segment reads 3, segment skipped 148.

    SQL Server Execution Times: CPU time = 31 ms, elapsed time = 222 ms.

    -- test2 - method 1 - dateadd
    Table 'test2'. Scan count 4, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 2856, lob physical reads 0, lob read-ahead reads 0.
    Table 'test2'. Segment reads 4, segment skipped 147.

    SQL Server Execution Times: CPU time = 47 ms, elapsed time = 324 ms.

    -- test2 - method 2 - union all
    Table 'test2'. Scan count 12, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 8562, lob physical reads 0, lob read-ahead reads 0.
    Table 'test2'. Segment reads 12, segment skipped 441.

    SQL Server Execution Times: CPU time = 16 ms, elapsed time = 242 ms.

    -- test2 - method 3 - or condition
    Table 'test2'. Scan count 4, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 106995, lob physical reads 0, lob read-ahead reads 0.
    Table 'test2'. Segment reads 151, segment skipped 0.

    SQL Server Execution Times: CPU time = 922 ms, elapsed time = 317 ms.

    -- rowstore test1
    Table 'test1_rowstore'. Scan count 1, logical reads 64, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times: CPU time = 0 ms, elapsed time = 232 ms.

    -- rowstore test2 - method 1 - dateadd
    Table 'test2_rowstore'. Scan count 1, logical reads 240, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times: CPU time = 62 ms, elapsed time = 219 ms.

    Now the cold-cache timings - after an instance restart:

    -- test1 - datefull
    Table 'test1'. Scan count 4, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 6291, lob physical reads 0, lob read-ahead reads 19317.
    Table 'test1'. Segment reads 3, segment skipped 148.

    SQL Server Execution Times: CPU time = 46 ms, elapsed time = 218 ms.

    -- test2 - method 1 - dateadd
    Table 'test2'. Scan count 4, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 2856, lob physical reads 0, lob read-ahead reads 10040.
    Table 'test2'. Segment reads 4, segment skipped 147.

    SQL Server Execution Times: CPU time = 31 ms, elapsed time = 253 ms.

    -- test2 - method 2 - union all
    Table 'test2'. Scan count 12, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 8565, lob physical reads 0, lob read-ahead reads 0.
    Table 'test2'. Segment reads 12, segment skipped 441.

    SQL Server Execution Times: CPU time = 31 ms, elapsed time = 299 ms.

    -- test2 - method 3 - or condition
    Table 'test2'. Scan count 4, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 106995, lob physical reads 0, lob read-ahead reads 367381.
    Table 'test2'. Segment reads 151, segment skipped 0.

    SQL Server Execution Times: CPU time = 1234 ms, elapsed time = 579 ms.

    -- rowstore test1
    Table 'test1_rowstore'. Scan count 1, logical reads 64, physical reads 3, read-ahead reads 60, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times: CPU time = 32 ms, elapsed time = 191 ms.

    -- rowstore test2 - method 1 - dateadd
    Table 'test2_rowstore'. Scan count 1, logical reads 240, physical reads 3, read-ahead reads 236, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times: CPU time = 47 ms, elapsed time = 269 ms.
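
    Worth noting in the statistics above: the dateadd form reads 4 segments and skips 147, while the OR form reads all 151 and skips none - that lack of segment elimination is where its extra CPU time (922/1234 ms vs 31-47 ms) comes from.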


  • Sergiy wrote:

    ZZartin wrote:

    /shrug I think the fact that dictionaries are maintained transparently in the internal guts of a columnstore index, vs. having to maintain and join a lookup table by hand, is a significant point in favor of columnstores being more effective.

    Is doing less typing your only concern when you choose a database design approach?

    Not needlessly reinventing the wheel is a concern.  Especially when the proposed alternative is clunkier and harder to deal with than just letting a columnstore work as intended.

  • Grant Fritchey wrote:

    Sergiy wrote:

    Frederico, do you have any test case to support your speculations?

    I (and I guess Jeff Moden too) would be particularly interested in a test script which demonstrates advantages of splitting datetime into date and time.

    The best place to learn about ColumnStore is from Niko Neugebauer. Niko has examples that show how the compression works.

    Lordy.  I've obviously not read all of that (129 items currently in his top level TOC) but the man does appear to have covered everything that one could ask.  Thanks for the link, Grant.  Bookmarked for sure!

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Sergiy wrote:

    I (and I guess Jeff Moden too) would be particularly interested in a test script which demonstrates advantages of splitting datetime into date and time.

    That's very true.  And, sorry to say, I've never done a Yabingooducklehoo to see if such a thing exists.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
