clustered index usage

  • Looking over a few data warehouse procs internally that we created by contracting firm. Noticed that several of the large tables do not have clustered index. Only a non clustered primary key. The only reason i would not use clustered index is if the table needs to be dropped on occasion and recreated and the clustered index rebuild. This could take forever. Is there any other reason when someone would not use a clustered index and only use non clustered primary key? the table in question has 10 million records

     

  • I would argue that every table should have a clustered index unless serious testing suggests otherwise. SQL Server is optimized around storing and retreiving data from clustered indexes. Heaps just don't work as well under most circumstances. I suspect this is just a legacy from ignorance in how things work.

    Additional thought, if you are looking at a data warehouse, consider if columnstore will be better (and, yeah, clustered columnstore).

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • I have a different opinion here as far as the OLTP is concerned. Having Clustered Index (but not utilizing it) will not benefit, instead will be overhead.

    If you do not do the following then not having the clustered index would be better, but otherwise Clustered Index is a must.

    1. UPDATE
    2. DELETE specific records
    3. SORTING on Clustered Index Key column in the sort order as specified in the index.
    4. Frequent SELECT statements with all/most columns with WHERE clause having Clustered Index Key columns.

    But as far as OLAP/Data warehouse is concerned, I totally agree with Grant. Clustered Columnstore Indexes shall definitely give a boost to the analytical queries. I did some research around it and posted an article Probing Columnstore Indexes.

  • For doing single row lookups, a nonclustered index (NI) on a heap is no worse performance than a clustered index (CI).  However, as Grant noted, SQL does truly prefer CIs.

    If the key values in the index are inserted sequentially -- for example, say the current NI key is based on datetime of when the row was INSERTed -- then you should use a CI rather than the NI.

    But, if the key values are inserted randomly -- the classic example is a guid -- then you should continue using the NI.

    If no other CI key value is indicated and the table has an identity column, then you can cluster on the identity column.  But never just default to clustering on identity for (nearly) all tables.  The CI is the single most important performance factor for most tables, therefore it should be selected carefully and with clear reasons for using that column(s) as the key, never just by default.

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • Yes, agree that it would benefit in the read loads. But we should also consider the overhead of CI on write operations. So if the database is read-intensive then CI would definitely help in the SELECT queries (for quickly locating the rows). But if the database is write-intensive or a balanced load then we should always consider it wisely.

  • Snargables wrote:

    i would not use clustered index is if the table needs to be dropped on occasion and recreated and the clustered index rebuild. This could take forever.

    That's a part of the very reason that some people don't use a clustered index.  Yep... it "could" take forever... if you don't know what you're doing. And unless your server is ready for the junk heap , 10 million rows is almost nothing to today's servers.  A rebuild of such an index would usually take something less than 90 seconds and can actually be a whole lot faster than that.  "It Depends".

    If a table needs to be "dropped on occasion", there are methods to Truncate and Insert into a properly formed index using minimal logging that makes it all nasty fast.  You can also use the "Swap'n'Drop" method of building the new table while the old table is still working and then flip a synonym to point at the new table when it's ready.  It takes only milliseconds of "down time" for the users and they won't even see that because it'll wait until no one is using the original table to do it.

    There's really not much justification for not having a Clustered Index and, because of things like a deduplicator in the index if it's not unique, it's usually best to have a Unique Clustered Index.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Brahmanand Shukla wrote:

    Yes, agree that it would benefit in the read loads. But we should also consider the overhead of CI on write operations. So if the database is read-intensive then CI would definitely help in the SELECT queries (for quickly locating the rows). But if the database is write-intensive or a balanced load then we should always consider it wisely.

    If it's IS write intensive and you don't have the right kind of clustered index, then you end up with a hot spot at the logical end of the index that can "eat your face off" for performance of your inserts.  If you also have the nasty habit of doing inserts and follow those with "ExpAnsive" updates, then you have a triple whammy of slow inserts, very slow updates with massive page splits which further slow down your inserts, and the massive fragmentation and excessive log file usage that page splits cause.

    You really can't afford to not have a clustered index.

    So what is the fix for preventing the massive page splits of a "close" "ExpAnsive" update?  Obviously, the BEST method is to make it so the updates aren't expansive and there are a whole lot of different methods to do that.  Of course, that won't help the hot spot away... it will just make it less obnoxious.

    How can you make the hot spot totally disappear and have no fragmentation for months at a time?

    Believe it or not, the use of a Random GUID clustered index can fix all of that and, yeah... you heard me  correctly... using a Random GUID index properly will make fragmentation go away for months at a time even when inserting something like 100,000 rows per day.  Random GUID indexes are actually the very epitome of what people think an index should operate as.

    Think not?  Ok... here's the proof.  We've all literally been stupid for more than 2 decades.  We all have Purple Kool-Aid stains on our lips because we've been drinking it so long.  See the following 'tube.  And that's just scratching the surface as to how wrong we've been and continue to do things.

    https://www.youtube.com/watch?v=rvZwMNJxqVo

    WARNING IF YOU WEARING A HEADSET!!!!

    There are ads they've added to the presentation at time mark 15:00, 30:15, and 45:25!  They are sudden and loud.  They WILL cause you to rip your headset off!  Be prepared!

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Brahmanand Shukla wrote:

    Yes, agree that it would benefit in the read loads. But we should also consider the overhead of CI on write operations. So if the database is read-intensive then CI would definitely help in the SELECT queries (for quickly locating the rows). But if the database is write-intensive or a balanced load then we should always consider it wisely.

    The trouble here is we cannot predict the future. We dont always know when the amount of Select traffic will increase, and the need to delete after deciding later we need to be archiving older records. Having a CI is a good insurance policy all around. And as Jeff mentioned, the GUID variety eliminates the hotspots. That is, not doing all the writes in one area of the index (at the end) and distributing them instead at random points.

    ----------------------------------------------------

  • MMartin1 wrote:

    Brahmanand Shukla wrote:

    Yes, agree that it would benefit in the read loads. But we should also consider the overhead of CI on write operations. So if the database is read-intensive then CI would definitely help in the SELECT queries (for quickly locating the rows). But if the database is write-intensive or a balanced load then we should always consider it wisely.

    The trouble here is we cannot predict the future. We dont always know when the amount of Select traffic will increase, and the need to delete after deciding later we need to be archiving older records. Having a CI is a good insurance policy all around. And as Jeff mentioned, the GUID variety eliminates the hotspots. That is, not doing all the writes in one area of the index (at the end) and distributing them instead at random points.

    But is that necessarily a good thing for INSERTs?  You have to rewrite parts of many pages rather than just writing a single page.  That will generate lots of extra physical I/O.  I don't think this is as cut and dried as Jeff made it seem.  I think there are more complex considerations here.  Especially since this approach would not work well for child tables, which the plurality of tables are.

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • ScottPletcher wrote:

    MMartin1 wrote:

    Brahmanand Shukla wrote:

    Yes, agree that it would benefit in the read loads. But we should also consider the overhead of CI on write operations. So if the database is read-intensive then CI would definitely help in the SELECT queries (for quickly locating the rows). But if the database is write-intensive or a balanced load then we should always consider it wisely.

    The trouble here is we cannot predict the future. We dont always know when the amount of Select traffic will increase, and the need to delete after deciding later we need to be archiving older records. Having a CI is a good insurance policy all around. And as Jeff mentioned, the GUID variety eliminates the hotspots. That is, not doing all the writes in one area of the index (at the end) and distributing them instead at random points.

    But is that necessarily a good thing for INSERTs?  You have to rewrite parts of many pages rather than just writing a single page.  That will generate lots of extra physical I/O.  I don't think this is as cut and dried as Jeff made it seem.  I think there are more complex considerations here.  Especially since this approach would not work well for child tables, which the plurality of tables are.

    BWAAA-HAA-HAAA!!!!  Lordy.. where in anything I said either on this thread or in the 'tube I pointed to did you get the idea that I made it a "cut and dry" thing?  There's a shedload of analysis to be done including trying to figure out if you have a hot spot issue with inserts and then if you have a hot spot issue with ExpAnsive Updates and then whether fragmentation can be prevented during the hot spot updates using any one or several of the different methods for preventing the fragmentation.

    My overarching suggestion does have to do with whether or not to have a Clustered Index rather than just a HEAP and NCI's.  If you have a HEAP and you do ExpAnsive Updates , you end up with row forwarding and all those rows are forwarded to the same hot spot as inserts, which hurts the insert rate and concurrency as well as a shedload of logging.  If you have a Clustered Index with that suffers the same kind of updates, then you end up with page splits and a shedload of logging, whether it's an ever-increasing index or not but especially if it is and you're in the habit of doing such updates on the hot spot, which is always at nearly 100% page fullness unless you've done an index REBUILD (REORGANIZE won't fix this) between the time you inserted and the time you did the update AND you have a lower Fill Factor.

    And THAT can only be solved by making the hot spot updates go away through (like I said) one or several methods and one of those methods is to use a Random GUID Clustered Index with the appropriate Fill Factor.

    And, like I said and repeated above and repeat now, that means the you need to study the issue and take all the gazintas and gotcha's into consideration.  In other words you "Must Look Eye" and "It Depends".  Nothing "cut and dry" about that.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • ScottPletcher wrote:

    MMartin1 wrote:

    Brahmanand Shukla wrote:

    Yes, agree that it would benefit in the read loads. But we should also consider the overhead of CI on write operations. So if the database is read-intensive then CI would definitely help in the SELECT queries (for quickly locating the rows). But if the database is write-intensive or a balanced load then we should always consider it wisely.

    The trouble here is we cannot predict the future. We dont always know when the amount of Select traffic will increase, and the need to delete after deciding later we need to be archiving older records. Having a CI is a good insurance policy all around. And as Jeff mentioned, the GUID variety eliminates the hotspots. That is, not doing all the writes in one area of the index (at the end) and distributing them instead at random points.

    But is that necessarily a good thing for INSERTs?  You have to rewrite parts of many pages rather than just writing a single page.  That will generate lots of extra physical I/O.  I don't think this is as cut and dried as Jeff made it seem.  I think there are more complex considerations here.  Especially since this approach would not work well for child tables, which the plurality of tables are.

    Hi Scott,

    How does this 'rewrite' parts of many pages? Are you saying the likelyhood will be that inserts to the various pages will result in page splits, more likely than not?

    ----------------------------------------------------

  • MMartin1 wrote:

    ScottPletcher wrote:

    MMartin1 wrote:

    Brahmanand Shukla wrote:

    Yes, agree that it would benefit in the read loads. But we should also consider the overhead of CI on write operations. So if the database is read-intensive then CI would definitely help in the SELECT queries (for quickly locating the rows). But if the database is write-intensive or a balanced load then we should always consider it wisely.

    The trouble here is we cannot predict the future. We dont always know when the amount of Select traffic will increase, and the need to delete after deciding later we need to be archiving older records. Having a CI is a good insurance policy all around. And as Jeff mentioned, the GUID variety eliminates the hotspots. That is, not doing all the writes in one area of the index (at the end) and distributing them instead at random points.

    But is that necessarily a good thing for INSERTs?  You have to rewrite parts of many pages rather than just writing a single page.  That will generate lots of extra physical I/O.  I don't think this is as cut and dried as Jeff made it seem.  I think there are more complex considerations here.  Especially since this approach would not work well for child tables, which the plurality of tables are.

    Hi Scott,

    How does this 'rewrite' parts of many pages? Are you saying the likelyhood will be that inserts to the various pages will result in page splits, more likely than not?

    As new rows are INSERTed into many different pages across the data set, each page has to be "dirtied" to write the new row.  Then, of course, each of those dirty pages has to be flushed to disk.  If we say each page can hold 12 rows, then 12 INSERTs require 12 page rewrites rather than 1 page write.

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • Jeff Moden wrote:

    ScottPletcher wrote:

    MMartin1 wrote:

    Brahmanand Shukla wrote:

    Yes, agree that it would benefit in the read loads. But we should also consider the overhead of CI on write operations. So if the database is read-intensive then CI would definitely help in the SELECT queries (for quickly locating the rows). But if the database is write-intensive or a balanced load then we should always consider it wisely.

    The trouble here is we cannot predict the future. We dont always know when the amount of Select traffic will increase, and the need to delete after deciding later we need to be archiving older records. Having a CI is a good insurance policy all around. And as Jeff mentioned, the GUID variety eliminates the hotspots. That is, not doing all the writes in one area of the index (at the end) and distributing them instead at random points.

    But is that necessarily a good thing for INSERTs?  You have to rewrite parts of many pages rather than just writing a single page.  That will generate lots of extra physical I/O.  I don't think this is as cut and dried as Jeff made it seem.  I think there are more complex considerations here.  Especially since this approach would not work well for child tables, which the plurality of tables are.

    BWAAA-HAA-HAAA!!!!  Lordy.. where in anything I said either on this thread or in the 'tube I pointed to did you get the idea that I made it a "cut and dry" thing?  There's a shedload of analysis to be done including trying to figure out if you have a hot spot issue with inserts and then if you have a hot spot issue with ExpAnsive Updates and then whether fragmentation can be prevented during the hot spot updates using any one or several of the different methods for preventing the fragmentation.

    My overarching suggestion does have to do with whether or not to have a Clustered Index rather than just a HEAP and NCI's.  If you have a HEAP and you do ExpAnsive Updates , you end up with row forwarding and all those rows are forwarded to the same hot spot as inserts, which hurts the insert rate and concurrency as well as a shedload of logging.  If you have a Clustered Index with that suffers the same kind of updates, then you end up with page splits and a shedload of logging, whether it's an ever-increasing index or not but especially if it is and you're in the habit of doing such updates on the hot spot, which is always at nearly 100% page fullness unless you've done an index REBUILD (

    REORGANIZE won't fix this) between the time you inserted and the time you did the update AND you have a lower Fill Factor.

    And THAT can only be solved by making the hot spot updates go away through (like I said) one or several methods and one of those methods is to use a Random GUID Clustered Index with the appropriate Fill Factor.

    And, like I said and repeated above and repeat now, that means the you need to study the issue and take all the gazintas and gotcha's into consideration.  In other words you "Must Look Eye" and "It Depends".  Nothing "cut and dry" about that.

    This sounds very definitive to me:

    "So what is the fix for preventing the massive page splits of a "close" "ExpAnsive" update?  ...

    How can you make the hot spot totally disappear and have no fragmentation for months at a time?

    Believe it or not, the use of a Random GUID clustered index can fix all of that and, yeah... you heard me correctly... using a Random GUID index properly will make fragmentation go away for months at a time even when inserting something like 100,000 rows per day.  Random GUID indexes are actually the very epitome of what people think an index should operate as.

    Think not?  Ok... here's the proof.  We've all literally been stupid for more than 2 decades.  We all have Purple Kool-Aid stains on our lips because we've been drinking it so long.  See the following 'tube.  And that's just scratching the surface as to how wrong we've been and continue to do things."

    Again, this won't work well for child tables where you want to keep all child rows for a parent together (given that the parent key is also a guid, based on the ideas above).  Which is the large plurality of tables.

    The best way to fix this is, first and foremost, to do a logical table design before doing a physical one.  But, sadly, almost no one does that.  Another important thing is, yes, to avoid expansive UPDATEs.  The first would help hugely with the second, especially if (longer) strings were to be properly encoded.

    Absent encoding, data compression is the next-best alternative.  But you don't seem to use that -- seem to refuse to use that -- for some reason, I'm not sure why.  It can lower the overhead of expansive UPDATEs, for one thing, although not as much as normalization (based on a true logical design) would do.

    • This reply was modified 2 years, 4 months ago by  ScottPletcher.

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • Brahmanand Shukla wrote:

    I have a different opinion here as far as the OLTP is concerned. Having Clustered Index (but not utilizing it) will not benefit, instead will be overhead.

    If you do not do the following then not having the clustered index would be better, but otherwise Clustered Index is a must.

    1. UPDATE
    2. DELETE specific records
    3. SORTING on Clustered Index Key column in the sort order as specified in the index.
    4. Frequent SELECT statements with all/most columns with WHERE clause having Clustered Index Key columns.

    But as far as OLAP/Data warehouse is concerned, I totally agree with Grant. Clustered Columnstore Indexes shall definitely give a boost to the analytical queries. I did some research around it and posted an article Probing Columnstore Indexes.

    This was my initial response.

    Brahmanand Shukla wrote:

    Yes, agree that it would benefit in the read loads. But we should also consider the overhead of CI on write operations. So if the database is read-intensive then CI would definitely help in the SELECT queries (for quickly locating the rows). But if the database is write-intensive or a balanced load then we should always consider it wisely.

    Then this one followed next.

    Quoting my initial response

    If the following operations are NOT performed on a database then NOT HAVING the clustered index would be better, but OTHERWISE Clustered Index is a MUST.

    1. UPDATE
    2. DELETE specific records
    3. SORTING on Clustered Index Key column in the sort order as specified in the index.
    4. Frequent SELECT statements with all/most columns with WHERE clause having Clustered Index Key columns.

    When CI is MUST? 

    UPDATE, DELETE and SELECT, all these involve search operation, and the clustered index would be the BEST choice for quickly locating the rows. Also, CI helps to minimize the hot spot by limiting the lock-escalations, that too if the UPDATEs and DELETEs are done on CI and if the batch size (transaction size) of the update/delete is such that SQL Server thinks holding row-lock (or lowest possible locking such as Page lock) would be the BEST choice. CI would also be the BEST DECISION on a read-intensive workload if the querying is done frequently on CI Key if the CI is the natural key. It would help avoid the Key Lookups. CI would also be a WISE DECISION for a workload where the sorting (ORDER BY) is done on the CI Key columns on the same sort order as specified in the CI.

    Clustered Index Key - Natural or Surrogate?

    I published an article some time ago which includes research data of certain parameters, and the conclusion of what should be the Clustered Index Key – Natural or Surrogate? I believe this should give some insights to help choose the right CI Key columns.

    When CI would NOT make sense?

    In certain workloads where there is TRUNCATE...INSERT of the entire table data done every day. Such as, in the Staging or certain Reporting DB's where there are no UPDATEs or DELETEs being performed, and also the SELECT is not leveraging the CI (to avoid the Key Lookup cost) then having CI would not make sense. I've worked on several applications which ride on the external data being received as a flat file every day on an SFTP. The processing time was going beyond 4-5 hours, leading to the unavailability of the system during business hours. CI's were the key contributors here. The tables were holding millions of rows. They were cleaned and filed every day. The CI defined on the table was not used anywhere in the SELECT queries. There were no UPDATEs and DELETEs being performed on these tables. So removing the CI's brought down the total execution time from 4-5 hours to less than 2 hours.

    Crucial thoughts came out of this thread!

    Interesting insights from Jeff about the GUID helping control fragmentation. This is a good discovery that came out of the healthy discussion we had on this thread. Whether it only helps control the CI fragmentation or the NCI fragmentation too? This needs to be evaluated. I'm still to watch the complete video shared by Jeff.  But this is a superb finding I must say!  I would definitely try to do a test around this and extend my article Clustered Index Key – Natural or Surrogate? to include the additional parameters such as Page Splits, Fragmentation, etc., and the final conclusion that will also consider these parameters.

    Scott has also thrown lights on the CRUCIAL yet lesser focussed areas such as Data Modelling and Data Compression. Data Compression can greatly improve performance. It can also reduce the storage cost and the overall H/W cost (to some extent) that you need to spend if this feature is not used (or, used inappropriately).

    Closing note

    I would say - there is no panacea solution. It’s always advisable to evaluate every recommendation/best practice and its trade-offs holistically, before actually applying it. We should always consider "nothing is free in this world". There is some cost associated with everything (including the indexes). It is up to us to decide (with proper evaluation) if the said cost is beneficial for us or not.

    P.S. Cost is not the money in all cases. It could be time and other resources as well.

  • ScottPletcher wrote:

    Absent encoding, data compression is the next-best alternative. But you don't seem to use that for some reason, I'm not sure why.

    I do use that.  It's a part of the "other" methods that I spoke of.  I just don't want to write a 10-20 chapter book on a forum post.

    Doing things like changing a Last_Modified_By column from a VARCHAR(50) to an int and then doing the encoding for a person or process to store their int to a reference table will totally eliminate the normal "ExpAnsive" update that occurs there.

    As for data compression... as you well know, that can actually turn all of your nice, fixed width, non-expanisve, integer columns into an explosion of "ExpAnsive" updates thanks to Row Compression or Page Compression which includes Row Compression (I know you know that but saying it out loud for those that may not).  That also holds true for SPARSE columns, in spades!

    There's also the method of forcing LOBs out of rows so that the small ones that fit "in row" don't cause "ExpAnsive" Updates and excessive page usage (which really slows down any queries that do scans or range scans) and "short trapped rows".

    There's a ton of stuff on a subject that has been misconstrued, improperly test/demonstrated, and flat out wrong for more than 2 decades and, on a forum like this, you can only cover tiny parts of it at a time.

    So, let me ask you... did you actually take the time to watch the 'tube I provided the link to?  Keep in mind that it's also an introduction to a whole series.

    So, please... I understand where you're coming form but stop making statements about me not considering things on this subject... I actually have but, like I said, I don't want to post a large book on this single thread. 😉

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 15 posts - 1 through 15 (of 22 total)

You must be logged in to reply to this topic. Login to reply