Clustered Columnstore Indexing Tips and ETL Load Performance

  • Hi,

    I've got a table with 110M+ rows.  It is an SCD2 table with historic data.  Most of the queries return the bulk of the table, say 50% or more of the total rows, eg. where current_record='Y'.  It is only updated every 6 weeks, and those updates can add 10M or so new rows.

    It was under-indexed and suffering from poor performance, so I've added non-clustered rowstore indexes to support the most common queries (IX_* in the picture below).

    However, I also note https://docs.microsoft.com/en-us/sql/relational-databases/indexes/columnstore-indexes-design-guidance?view=sql-server-ver15:

    Clustered columnstore index

    Use for:

    1) Traditional data warehouse workload with a star or snowflake schema

    2) Internet of Things (IOT) workloads that insert large volumes of data with minimal updates and deletes.

    This is very much the use case of this table.

    So, I've converted it to clustered columnstore index and am pleased with the results.

    Here is my current structure:

    2020-04-11_17-31-53

     

    The PK is a unique non-clustered constraint, not a primary key.  The FK can't be in place when I created the Clustered Columnstore (from the UI), but can be added after the fact (not sure why???).

    Questions:

    1. Is it now over-indexed?  All of the rowstore indexes would be used in typical queries, if there wasn't a Clustered Columnstore index, but are they necessary given the Clustered Columnstore index?
    2. In the past I "played" with Clustered Columnstore indexes, and suffered atrocious performance during updates (eg. inserting 10M rows then closing out the old expired rows).  Should I disable ALL indexes during updates, then rebuild after the update?  Or just disable the non-clustered indexes (but see #1)?  Given the infrequent updates to this table, I'm happy for the performance hit of rebuilding all indexes if it results in faster queries for the next 6 weeks (rebuild all takes about 20 mins).
    3. I also note this thread:  https://www.sqlservercentral.com/forums/topic/disabling-indexes-for-load-performance.  See Jeff Moden's comment.  And yeah, I tried:
    ALTER INDEX CIX_RLDX_ARCHIVE_ColumnStore ON new.RLDX_ARCHIVE DISABLE
    GO

    SELECT TOP 1000 * FROM new.RLDX_ARCHIVE

    And got this:

    Warning: Foreign key 'FK_RLDX_ARCHIVE_RLDX_AUDIT_TRAIL_rldx_audit_key' on table 'RLDX_ARCHIVE' referencing table 'RLDX_AUDIT_TRAIL' was disabled as a result of disabling the index 'CIX_RLDX_ARCHIVE_ColumnStore'.
    Warning: Index 'PK_RLDX_ARCHIVE_rldx_archive_key' on table 'RLDX_ARCHIVE' was disabled as a result of disabling the clustered index on the table.
    Warning: Index 'IX_RLDX_ARCHIVE_NaturalKeys' on table 'RLDX_ARCHIVE' was disabled as a result of disabling the clustered index on the table.
    Warning: Index 'IX_RLDX_ARCHIVE_hospital_type' on table 'RLDX_ARCHIVE' was disabled as a result of disabling the clustered index on the table.
    Warning: Index 'IX_RLDX_ARCHIVE_recnum' on table 'RLDX_ARCHIVE' was disabled as a result of disabling the clustered index on the table.
    Warning: Index 'IX_RLDX_ARCHIVE_rldx_audit_key' on table 'RLDX_ARCHIVE' was disabled as a result of disabling the clustered index on the table.
    Warning: Index 'IX_RLDX_ARCHIVE_rldx_current_record' on table 'RLDX_ARCHIVE' was disabled as a result of disabling the clustered index on the table.
    Msg 8655, Level 16, State 1, Line 4
    The query processor is unable to produce a plan because the index 'CIX_RLDX_ARCHIVE_ColumnStore' on table or view 'RLDX_ARCHIVE' is disabled.

     

    So disabling the Clustered Columnstore index isn't the correct approach.

    Wrapping up, if I want to use a Clustered Columnstore index (and I do, the performance gains are amazing for the queries I run), then what are:

    1) best practices for other indexes (of any type), and

    2) any "gotchas" in large ETL insert/updates?

    Thanks,

    Scott

     

     

    1. Is it now over-indexed?  All of the rowstore indexes would be used in typical queries, if there wasn't a Clustered Columnstore index, but are they necessary given the Clustered Columnstore index?

    I think rowstore indexes would not be required with Clustered Columnstore Index. However, I've not tried this yet. This needs to be authenticated.

    2. In the past I "played" with Clustered Columnstore indexes, and suffered atrocious performance during updates (eg. inserting 10M rows then closing out the old expired rows).  Should I disable ALL indexes during updates, then rebuild after the update?  Or just disable the non-clustered indexes (but see #1)?  Given the infrequent updates to this table, I'm happy for the performance hit of rebuilding all indexes if it results in faster queries for the next 6 weeks (rebuild all takes about 20 mins).

    3. I also note this thread:  https://www.sqlservercentral.com/forums/topic/disabling-indexes-for-load-performance.  See Jeff Moden's comment.  And yeah, I tried:And got this:So disabling the Clustered Columnstore index isn't the correct approach.

    Columnstore Indexes out-performs as compared to Rowstore Indexes because :

    1. It allows better compression. Data to be physically stored column wise as compared to its ancestor Rowstore Indexes, that stores data row wise.
    2. Since data is highly compressed, more data can be accommodated in memory. More data in memory means less Disk IO and associated CPU utilization.

    Compression is an overhead during WRITE operation whereas helps in READ operation.

    If huge data is being deleted or updated, then you can try DROPPING the clustered columnstore index before DELETE/UPDATE, and CREATING the clustered columnstore index after  DELETE/UPDATE.

    However, you need to compare the time taken to REBUILD vs DROP/CREATE the clustered columnstore index.

     

    Wrapping up, if I want to use a Clustered Columnstore index (and I do, the performance gains are amazing for the queries I run), then what are:

    1) best practices for other indexes (of any type), and

    2) any "gotchas" in large ETL insert/updates?

    You can give a try to the Get started with Columnstore for real-time operational analytics article available on Microsoft official documentation.

    I've also recently wrote an article Probing Columnstore Indexes, in relation to use of columnstore index in OLTP workload. You can have a look into it. I'll hope you'll find it purposeful!

  • So, before I comment on anything, remember, testing is your best friend in these situations. Anyone can have opinions on behavior. However, you should test those opinions and assumptions against your system in order to validate them.

    1. I don't know. It doesn't look like a lot of indexes, but it really depends on your table, and the queries being run against it. In general, you want only as many indexes as you need. However, if you need an index, you need that index. For me, it's all about are you getting the performance you need from your queries and are the indexes being used. Check your execution plans and the index usage stats to validate they're in use.
    2. Testing is your buddy. Mostly, I leave indexes in place during updates because the cost of rebuilding indexes frequently is just as high as any costs incurred during updates. However, it's all down to testing. Different situations have different results. There isn't a single right answer in all situations. Probably, leaving the indexes in place is better depending on the data load and how you're doing it. Test, test, test.

      HOWEVER, we're talking about a clustered index, so, see the next point.

    3. A clustered index, columnstore or not, is the table. It's where the data lives. You can't disable a clustered index because it's like disabling the table. It's just not something you do. This is also why, maybe disabling the non-clustered indexes is a good idea (test, see #2), but probably dropping and recreating the clustered columnstore is a bad idea. Instead, go and look up Niko Negabauer's blog and read what he has to say about how to load columnstore indexes. There are some changes you should make to your load process, absolutely.

    With modern systems we get to pick & choose behaviors. So, if the predominant access of the data is analytical, warehouse, reporting, style queries, a clustered columnstore makes sense. Then, as needed, you can add nonclustered indexes for point lookups. If, on the other hand, the predominant access of the data is more OLTP, small batch insert/update/delete, lots of point lookups, then you use a clustered rowstore index. And, as before, add nonclustered indexes for point lookups. And, as needed, add a nonclustered columnstore for analytics. You can combine them, mix and match. Just understand your data access patterns so that you can choose the right indexes.

    In terms of "over indexing", I wouldn't sweat that unless you're not testing and monitoring. Then, yeah, worry about that until you get testing and monitoring set up. For detailed query analysis, Extended Events are your best buddy. For general analysis, you can rely on the DMVs, or enable Query Store. In addition to these, I recommend getting a third party tool for monitoring, not because you couldn't do it yourself, but because you have better places to spend your time than building a monitoring tool.

    Hope all that helps. For a lot more details on query tuning, I'd suggest getting copies of my two books, below are the links.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Brahmanand Shukla wrote:

    I think rowstore indexes would not be required with Clustered Columnstore Index. However, I've not tried this yet. This needs to be authenticated.

    I'd be really cautious here. You can combine clustered columnstore and nonclustered indexes. The idea is simple. Columnstore indexes are excellent at analytical style queries, but they stink at point lookup queries (single row or small batches). So, when you need the point lookups, you can add the nonclustered indexes.

    If huge data is being deleted or updated, then you can try DROPPING the clustered columnstore index before DELETE/UPDATE, and CREATING the clustered columnstore index after  DELETE/UPDATE.

    However, you need to compare the time taken to REBUILD vs DROP/CREATE the clustered columnstore index.

    Again, a note of caution. Since a clustered columnstore defines data storage, to a degree, dropping and recreating it means we're literally moving the data twice, once into a heap and then into the clustered columnstore. Chances are very high that's a much more costly operation. Considerations around the nonclustered indexes just goes back to testing.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Hey Grant,

    Thanks for taking the time to write your detailed reply.

    The single biggest thing I got from your post is to test, test, test.  I do know I'm getting MUCH better performance from my typical queries and joins than without the clustered columnstore index.  I threw a number of the typical queries against this table and the query execution plan almost always chose the CCI, even when there was a rowstore index that could have also been used.  I'm guessing this was due to the quantity of rows being returned (50% or more).  The only time the rowstore index was used was when I queried a specific natural key:  hospital='AAA' and stay_number='123' and episode_sequence_number=1.  This is actually not a typical query, but might be used to drilldown and debug coding errors.

    I did test the ETL *for this particular table*, which added about 2M rows to the 110M row table.  The performance when leaving all indexes in place was acceptable.  But I'll keep in mind your general advice should I have ETL (insert/update) performance issues in the future.

    BTW, I *do* have your book:  SQL Server Query Performance Tuning, Fourth Edition from Apress.  I've only read about 1/4, then got OBE ("overcome by life events").  Seems I should revisit it.  I've also read Nico's blog, and even some of his YouTube videos.  Some of it went over my head at the time; again, perhaps time to re-read his blog.

  • It makes sense that the columnstore index is going to get used for queries that are moving large chunks of data. That's pretty much why they exist. If testing shows it's helping, then it's helping. That's the only measure that counts. Not my opinion or anyone else's, the test.

    Happy to help a little. Good luck going forward.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • 1 The nonclus indexes should be unique, not non-unique.  Worst case, add an identity to the table and use that to make unique non-clus keys.

    2 Are the nonclus indexes page compressed?  If not, look into that immediately.  Because they can be compressed, I wouldn't worry about being over-indexed, at least not for now.  Get it performing well, later you can see if you can remove / combine an index(es).

    3 Make sure the parallel threshold setting is above the minimum setting.  The best number depends on your specific system, but if it's still at a super-low number like 5, that's not good and needs corrected.

    4 Updating a columnstore clustered index can cause performance pangs.  You especially don't want to increase the size of a lot of different columns (that can be an even bigger issue with rowstore tables).

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

Viewing 7 posts - 1 through 6 (of 6 total)

You must be logged in to reply to this topic. Login to reply