Introduction to Indexes: Part 2 – The clustered index

  • Comments posted to this topic are about the item Introduction to Indexes: Part 2 – The clustered index

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Hi,

    I find it difficult to believe this article, especially the reference that all non clustered indexes use the clustered index as their lowest level address (hence a large clustered index key increases the size of all other indexes)!

    If this is true, how is underlying data addressed (by an index) when no clustered index has been specified? Different rules for different scenarios (I doubt it)? What if a clustered index is Dropped? Are the 'different' rules re-applied retrospectively?

    Some time ago I had some involevement back in DBMS design and we used a Table Row Address (TRA) which was effectively the Page number of where a row was stored etc (6 bytes). Pages were added to a table (on demand) in physically adjacent 'blocks' and were numbered sequentially from 1.

    Each index at leaf level would consist of the Key value and its TRA (for the parent row) hence each 'Index entry' consisted of rows that were the length of the Key + 6 bytes in length. (Index rows in the tree followed the same construction with the highest key in the page at the lower level and its resident page number as a TRA).

    True, the leaf level would be the data pages for a clustered index however, the clustered data was supported by a standard B tree index in the same way a non clustered index is implemented.

    Where several rows are required and the query references the clustered key, the index tree would be used to locate the start point (seek) and once determined, successive reads would be at the data page (leaf) level whereas the article suggests that a full table read would be necessary, even to locate the first row in the series.

    Incidentally, "never choose a clustered key that can change" - agree however on many DBMS systems this is a standard constraint and is a good rule to follow for the sake of transportability.

    Choose a key that minimises page splitting is always a good rule to adopt however there are occassions where this can be an overall benefit.

    If interactive data accessing and updating is more likely to occur on the most recently added rows and the insertion rate is relatively high with many system users, then selecting a cluster key that 'physically scatters' data throughout the table at the data level can help to reduce the liklihood of deadlocks.

    Where this can't be avoided, it's also useful to dump and reload the table on a regular basis to optimise the structure.

    Jim

    Trainee Novice:w00t:

  • Esalter (11/11/2009)


    I find it difficult to believe this article, especially the reference that all non clustered indexes use the clustered index as their lowest level address

    Why do you find that hard to believe? The nonclustered indexes must have some way to refer to the actual data row, much as an index in a book as the page number, to reference the actual entry.

    If you want an authoritative source to confirm what I've said, see the MSDN article on Nonclustered index architecture: http://msdn.microsoft.com/en-us/library/ms177484.aspx

    If this is true, how is underlying data addressed (by an index) when no clustered index has been specified? Different rules for different scenarios (I doubt it)?

    The rules are indeed different, though only slightly. If you refer to part 1[/url] of the indexing series (or the aforementioned MSDN article)

    Nonclustered indexes are separate from the table. The leaf level of a nonclustered index has a pointer as part of each index row. That pointer is either the clustered index key in the cases where the base table has a clustered index or the RID (Row Identifier) in the cases where the table is a heap. The RID is an 8-byte structure comprised of File ID, Page Number and Slot Index and will uniquely identify a row in the underlying heap. Either way, the each row of a nonclustered index has a reference to the complete data row.

    What if a clustered index is Dropped? Are the 'different' rules re-applied retrospectively?

    Yes. If the clustered index is dropped all nonclustered indexes on the table are rebuilt with the RID as the row locator. If a clustered index is recreated, all the nonclustered indexes are again rebuilt, replacing the RID with the new clustering key.

    That's why it's often suggested to build the clustered index first and then the nonclustered indexes. Doing it the other way around is inefficient as the nonclustered indexes would be built and then later rebuilt.

    True, the leaf level would be the data pages for a clustered index however, the clustered data was supported by a standard B tree index in the same way a non clustered index is implemented.

    SQL's clustered index is a b-tree as well, just like for a nonclustered index. The main difference is that the clustered index has the entire data row at the leaf level.

    whereas the article suggests that a full table read would be necessary, even to locate the first row in the series.

    Where is that stated/implied? If you can read that into something I said, I'll edit the article to make it clearer.

    If SQL needs a range of rows from a clustered index it'll seek to the beginning of the range (or sometimes the end of the range) and then read along the leaf pages until it finds all the rows that it needs.

    If interactive data accessing and updating is more likely to occur on the most recently added rows and the insertion rate is relatively high with many system users, then selecting a cluster key that 'physically scatters' data throughout the table at the data level can help to reduce the liklihood of deadlocks.

    It can reduce contention, I would argue that if there are deadlocks there's something else wrong. The 'hotspot' reason for not choosing an ascending key kinda went away in SQL 7 with the change to the locking mechanisms. I recall someone doing a test a while back, can't recall who or specific numbers, but thousands of inserts/sec were required to cause enough contention to slow the inserts down. YMMV.

    Where this can't be avoided, it's also useful to dump and reload the table on a regular basis to optimise the structure.

    Dump and reload is not necessary. An index rebuild does the job.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Very good series on the basics of indexes. Unfortunately, I was not able to read Part 3 as of yet since there seems to be an error on the web page. One comment on Part 2 however regarding the three ways a Clustered Index is used. In particular, the Bookmark Lookup. As Gail states in her article it is a very expensive operation and should be avoided if at all possible, With the advent of SQL 2005 and 2008 there is a very easy way to avoid the classic bookmark lookup and still maintain the narrow selectivity of your original non-clustered index. For example, lets take Gail's example and put a little twist on it and then run the execution plan once again and see the difference. First lets drop her current non-clustered index on ProductNumber:

    DROP INDEX AK_PRODUCT_PRODUCTNUMBER

    ON PRODUCTION.PRODUCT

    GO

    Ok, piece of cake. Now, let's recreate it again, this time including the NAME column in her query thus making this a truly COVERED INDEX, Notice that we have not changed the original narrow selectivity of this index.

    CREATE UNIQUE NONCLUSTERED INDEX AK_PRODUCT_PRODUCTNUMBER

    ON PRODUCTION.PRODUCT (PRODUCTNUMBER)

    INCLUDE (NAME)

    GO

    SELECT ProductID, ProductNumber, Name

    FROM production.Product

    WHERE ProductNumber = 'HN-1224'

    Now, run Gail's query and execution plan again and notice the Bookmark Lookup is gone! Replaced with an Index Seek operation. Now notice how fast that row is returned? Use the INCLUDE columns in your NC indexes whenever you can to fix those expensive Bookmark Lookups using the Clustered Index. The Query Optimizer will thank you for it, trust me 🙂 HTH Travis.

    "Technology is a weird thing. It brings you great gifts with one hand, and it stabs you in the back with the other. ...:-D"

  • talltop-969015 (11/11/2009)


    Unfortunately, I was not able to read Part 3 as of yet since there seems to be an error on the web page.

    That's because it hasn't been published yet.

    Use the INCLUDE columns in your NC indexes whenever you can to fix those expensive Bookmark Lookups using the Clustered Index.

    Maybe. Just because a query can be covered doesn't necessarily mean it should be covered. As with everything, there are downsides. Covered in part 3 of the series.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Hmmm, the link is there at the bottom of Part 2 though. Anyway, sorry about that.. Also, I never meant to say that every column should be covered in every query. I was merely pointing out the vast difference it makes in performance and the execution plan in your particular query.. 🙂 I agree that every query needs to be analyzed first on a case by case basis,,,

    "Technology is a weird thing. It brings you great gifts with one hand, and it stabs you in the back with the other. ...:-D"

  • talltop-969015 (11/11/2009)


    Hmmm, the link is there at the bottom of Part 2 though.

    Yes it is. Steve added the links, probably so that he wouldn't have to repeatedly edit the articles as the subsequent ones were published.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Excellent article Gail, as always, thank you for that! 🙂

    ---------------------------------------------------------
    How best to post your question[/url]
    How to post performance problems[/url]
    Tally Table:What it is and how it replaces a loop[/url]

    "stewsterl 80804 (10/16/2009)I guess when you stop and try to understand the solution provided you not only learn, but save yourself some headaches when you need to make any slight changes."

  • Jim, first off, welcome to the wonderful world of SQL Server! Also, what is your primary RDBMS - i.e. from whence do you come to us? 🙂

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • I am often-times disappointed that I cannot have more than one clustered index. Do you think they will ever put this most useful feature in SQL Server Gail?? :hehe:

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • Hi,

    Points taken - "hard to believe" (whole clustered index value in non clustered index) - I'm surprised that non clustered indexes don't also use the RID to provide a more efficient use of space.

    I know space isn't so much of a problem nowadays however disk reads/writes are still (generally) the most time consuming function that a PC etc performs, so reducing the physical data storage size can overall, increase the data density and reduce the amount of I/O operations required, particularly head movement.

    This is of course countered by the need to update all non clustered indexes when a row is moved to a different page as the result of a split operation but this (as you stated in your article) should be minimised by the clustered key design (i.e. stable, non changing).

    "Article suggest a full able scan required".. my sincere apologies, I have re-read the para and you didn't make that suggestion - I mis-read the content of the SCAN method of accessing a table....

    Sorry if I misled anybody or gave the impression that I thought the article was inaccurate, this wasn't my intent.

    I will be reading part 3 when I get the opportunity....

    Jim

    Trainee Novice:w00t:

  • Very nice overview. Gail, one thing that I was wondering if you could comment on... Towards the end there was a discussion of the best type of clustering key to use. I totally understand you're providing some very good general rules, but one of the attributes was that it is unique. Often times a very important use of a clustered index is to organize data in the detail table of a summary-detail table pair, for example, orders and order items. The most common query against the detail table will be to find those rows that correspond to a given summary row (ethier when looking up a specific order, or aggregating detail rows for all or a group of summary rows). In this case it is terribly important to use the one and only clustered index in the detail table to be on the foreign key linking back to the summary table, I think? I'm pretty sure that's generally the right way to do things so that range queries on the detail table (again, assuming they're the most common queries on the detail table).

    Anyways, I just thought that since summary-detail table relationships are very common that it's probably worth calling out this exception to the uniqueness guideline specifically. What do you think?

  • TheSQLGuru (11/11/2009)


    I am often-times disappointed that I cannot have more than one clustered index. Do you think they will ever put this most useful feature in SQL Server Gail?? :hehe:

    😀 Were you in my session on index DMVs?

    CREATE INDEX ON SomeTable (KeyColumn) INCLUDE (<All other columns in table)

    Voilà, a second 'clustered' index.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Esalter (11/11/2009)


    I'm surprised that non clustered indexes don't also use the RID to provide a more efficient use of space.

    A RID is 8 bytes. An integer column defined as identity (a very good and popular choice for clustered index) is 4 bytes.

    This is of course countered by the need to update all non clustered indexes when a row is moved to a different page as the result of a split operation

    And the need to rebuild every single nonclustered index when the cluster is rebuilt. Painful in terms of logging.

    The need to update all the nonclustered indexes is, I believe, one of the reasons the RID is not used as the row 'pointer' when there's a clustered index. Even when the underlying table is a heap, when a row is moved the RID in the NC indexes is not updated, rather a forwarding pointer is left behind which says where the row has moved to.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Hi again,

    Sadly we're still on SQL2000 so I'm reading these articles for interest (just starting to move stuff to SQL2005) however, one thing I've frequently come across that would be extremely useful is the ability to define Foreign keys that are a compound key (concatenated columns). Do you know if this is available in current versions of SQL (or ever likely to be)?

    Jim

    Trainee Novice:w00t:

Viewing 15 posts - 1 through 15 (of 122 total)

You must be logged in to reply to this topic. Login to reply