Stairway to ColumnStore Indexes Level 7: Optimizing Nonclustered Columnstore Indexes

  • Comments posted to this topic are about the item Stairway to ColumnStore Indexes Level 7: Optimizing Nonclustered Columnstore Indexes


    Hugo Kornelis, SQL Server/Data Platform MVP (2006-2016)
    Visit my SQL Server blog: https://sqlserverfast.com/blog/
    SQL Server Execution Plan Reference: https://sqlserverfast.com/epr/

  • Historically our environment has been fairly loosy-goosy with creating indexes. It looks like it would benefit us greatly to put a lot more thought into indexes in general and columnstore specifically. Thanks for the article.

  • This is a most excellent series on ColumnStore indexes!
    One question that I have is in regards to pre-order optimizations. Most examples, including in this article, create a btree clustered index with a single column. My existing clustered index includes three columns. They were selected based upon common queries against the table. I am assuming that the clustered index for pre-order purposes can also (and probably should) have multiple columns. Yes?

     - Chris

  • Chris Harrington - Tuesday, October 31, 2017 1:05 PM

    This is a most excellent series on ColumnStore indexes!
    One question that I have is in regards to pre-order optimizations. Most examples, including in this article, create a btree clustered index with a single column. My existing clustered index includes three columns. They were selected based upon common queries against the table. I am assuming that the clustered index for pre-order purposes can also (and probably should) have multiple columns. Yes?

     - Chris

    Yes, it can.
    But you should be aware that the  point of creating a clustered index before building the columnstore index is to maximize the benefit of rowgroup elimination. Create an index on lastname, then the first one million rows read (which will become the first rowgroup) will have only last names starting with A or B, the second will have the rest of the B's and the C's, etc.
    Create an index on (lastname, firstname), and the first rowgroup will still be last names A-B, the second B-C, etc. And the first names will be all over the place. (Everyone with last name Alvarez will be in the first rowgroup, and those everyone can have every possible first name, from Ann to Zach. And the same goes for all people with lastname Zlatov, who are all in the last rowgroup)

    In other words:
    * If you are creating a new clustered index for the sole purpose of optimizing rowgroup elimination, do not bother with a second column. It won't help (and it will slow down the index creation process).
    * If you have an existing multi-column clustered index and the leading column is the column you want to optimize for rowgroup eliminiation, you are safe to use the existing index and there is no need to rebuild with a single-column clustered index.
    * If you are looking for a way to optimize rowgroup elimination for multiple columns, then you are out of luck. Unless the columns are correlated (as described in the article), this is not possible,


    Hugo Kornelis, SQL Server/Data Platform MVP (2006-2016)
    Visit my SQL Server blog: https://sqlserverfast.com/blog/
    SQL Server Execution Plan Reference: https://sqlserverfast.com/epr/

  • Thanks Hugo. That totally makes sense now that I think about ut.

  • Hugo Kornelis - Tuesday, October 31, 2017 8:50 PM

    Chris Harrington - Tuesday, October 31, 2017 1:05 PM

    This is a most excellent series on ColumnStore indexes!
    One question that I have is in regards to pre-order optimizations. Most examples, including in this article, create a btree clustered index with a single column. My existing clustered index includes three columns. They were selected based upon common queries against the table. I am assuming that the clustered index for pre-order purposes can also (and probably should) have multiple columns. Yes?

     - Chris

    Yes, it can.
    But you should be aware that the  point of creating a clustered index before building the columnstore index is to maximize the benefit of rowgroup elimination. Create an index on lastname, then the first one million rows read (which will become the first rowgroup) will have only last names starting with A or B, the second will have the rest of the B's and the C's, etc.
    Create an index on (lastname, firstname), and the first rowgroup will still be last names A-B, the second B-C, etc. And the first names will be all over the place. (Everyone with last name Alvarez will be in the first rowgroup, and those everyone can have every possible first name, from Ann to Zach. And the same goes for all people with lastname Zlatov, who are all in the last rowgroup)

    In other words:
    * If you are creating a new clustered index for the sole purpose of optimizing rowgroup elimination, do not bother with a second column. It won't help (and it will slow down the index creation process).
    * If you have an existing multi-column clustered index and the leading column is the column you want to optimize for rowgroup eliminiation, you are safe to use the existing index and there is no need to rebuild with a single-column clustered index.
    * If you are looking for a way to optimize rowgroup elimination for multiple columns, then you are out of luck. Unless the columns are correlated (as described in the article), this is not possible,

    What about this case.  Let's assume that the index has (bit, date). Say there are 100m records. The first 50m have bit false. The next 50m have bit true. Those first 50m records are then sorted by a date. Unlike your (lastname, firstname) scenario, the second index column won't be "all over the place" - those date column segments will be ordered by date. Wouldn't this still give the benefits of rowgroup elimination?

  • Chris Harrington - Wednesday, November 1, 2017 10:13 AM

    Hugo Kornelis - Tuesday, October 31, 2017 8:50 PM

    Chris Harrington - Tuesday, October 31, 2017 1:05 PM

    This is a most excellent series on ColumnStore indexes!
    One question that I have is in regards to pre-order optimizations. Most examples, including in this article, create a btree clustered index with a single column. My existing clustered index includes three columns. They were selected based upon common queries against the table. I am assuming that the clustered index for pre-order purposes can also (and probably should) have multiple columns. Yes?

     - Chris

    Yes, it can.
    But you should be aware that the  point of creating a clustered index before building the columnstore index is to maximize the benefit of rowgroup elimination. Create an index on lastname, then the first one million rows read (which will become the first rowgroup) will have only last names starting with A or B, the second will have the rest of the B's and the C's, etc.
    Create an index on (lastname, firstname), and the first rowgroup will still be last names A-B, the second B-C, etc. And the first names will be all over the place. (Everyone with last name Alvarez will be in the first rowgroup, and those everyone can have every possible first name, from Ann to Zach. And the same goes for all people with lastname Zlatov, who are all in the last rowgroup)

    In other words:
    * If you are creating a new clustered index for the sole purpose of optimizing rowgroup elimination, do not bother with a second column. It won't help (and it will slow down the index creation process).
    * If you have an existing multi-column clustered index and the leading column is the column you want to optimize for rowgroup eliminiation, you are safe to use the existing index and there is no need to rebuild with a single-column clustered index.
    * If you are looking for a way to optimize rowgroup elimination for multiple columns, then you are out of luck. Unless the columns are correlated (as described in the article), this is not possible,

    What about this case.  Let's assume that the index has (bit, date). Say there are 100m records. The first 50m have bit false. The next 50m have bit true. Those first 50m records are then sorted by a date. Unlike your (lastname, firstname) scenario, the second index column won't be "all over the place" - those date column segments will be ordered by date. Wouldn't this still give the benefits of rowgroup elimination?

    Short answer: Yes.
    Longer answer: In a lot of cases it would PROBABLY make more sense to have the clustered index on (date, bit).


    Hugo Kornelis, SQL Server/Data Platform MVP (2006-2016)
    Visit my SQL Server blog: https://sqlserverfast.com/blog/
    SQL Server Execution Plan Reference: https://sqlserverfast.com/epr/

  • Hugo:

    I'm running your demo scripts on a SQL 2016 server.  I noticed that when I rebuilt the column store index in Step 7 without using MAXDOP that I got the results shown in listing 7.3 when I expected to see the results in 7.2.
    Does SQL 2016 have some improvements over 2014 in ordering the column store index to the clustered index when the index is built with parallelism enabled?   

    Thank you for the series, it's been very useful for me.
    Andre Ranieri - @sqlinseattle

  • Sql_In_Seattle - Tuesday, April 3, 2018 7:19 PM

    Hugo:

    I'm running your demo scripts on a SQL 2016 server.  I noticed that when I rebuilt the column store index in Step 7 without using MAXDOP that I got the results shown in listing 7.3 when I expected to see the results in 7.2.
    Does SQL 2016 have some improvements over 2014 in ordering the column store index to the clustered index when the index is built with parallelism enabled?   

    Thank you for the series, it's been very useful for me.
    Andre Ranieri - @sqlinseattle

    Hi Andre,

    Thanks for the feedback! I have tried to reproduce this on my SQL Server 2016 instance and I did not get the results you describe. When I leave out the MAXDOP hint, the data in the CS rowgroups is more similar to figure 7.2 (not exactly the same, but a similar partly-orderred-partly-random distribution of MinProductKey/MaxProductKey) and I do need to add the MAXDOP hint to get the maximum rowgroup elimination potential as shown in figure 7.3.

    Can you please do the following for me::
    1. Rerun listing 7-3, *WITH* the MAXDOP hint, and with the "Include Actual Execution Plan" option enabled. When it's done, switch to the execution plan tab, right-click, choose "Save Execution Plan As" to sasve it as a .sqlplan file.
    2. Rerun listing 7-1 and save the results (as screenshot or as data, either way is fine)
    3. Repeat steps 1 and 2, but now WITHOUT the MAXDOP hint.
    4. Post all the files you created (two execution plans and two sets of query results) here.
    I will then look at those files and PERHAPS that will allow me to find an explanation. No promises.

    Oh, and to answer your actual question - you probably already figured it out from the above 😉 - as far as I am aware there have been no improvements in how the column store is ordered when the index is built. There might have been changes to the ordering of the rows within each rowgroup to get even better compression benefits, but the distribution of data among rowgroups is as far as I know still basically the same.


    Hugo Kornelis, SQL Server/Data Platform MVP (2006-2016)
    Visit my SQL Server blog: https://sqlserverfast.com/blog/
    SQL Server Execution Plan Reference: https://sqlserverfast.com/epr/

  • Hugo:

    I'm sorry for the delay in answering you; it's been an insane week.

    The requested files are uploaded here.   My first observation from looking at the execution plan is that it includes an OPTION (MAXDOP 1) query hint even when the hint is commented out in the query window.    The box I'm using to test this is a smaller dev box, 4 vCPU  with server settings MAXDOP 0, cost threshold for parallelism was dropped down to 5 for this experiment.

    https://www.dropbox.com/s/8td8lh92jy5y5jw/ColumnStore%20Stairway%20Step%207.zip?dl=0

    Thanks again,

    Andre

  • Hi Andre!

    My weekend was at least as insane as your week was. That's why you had to wait for a bit.

    Thanks for providing the additional files!
    Looking at the execution plans, I see that when you run without MAXDOP hint you still have a serial plan. I also saw, in the plan properties, that the "EstimatedAvailableDegreeOfParallelism" property is 1. This probably means that you are running this on a single-core server, or that the query is executing in a resource group that is limited to one core per query, or that the server MAXDOP setting is set to 1.
    Either way, whatever the reason - the columnstore rebuild query is always running in serial for you, with or without the hint, and that's why the results are always the same.

    I hope this helps!


    Hugo Kornelis, SQL Server/Data Platform MVP (2006-2016)
    Visit my SQL Server blog: https://sqlserverfast.com/blog/
    SQL Server Execution Plan Reference: https://sqlserverfast.com/epr/

Viewing 11 posts - 1 through 10 (of 10 total)

You must be logged in to reply to this topic. Login to reply