Clustered Indexes

  • Comments posted to this topic are about the item Clustered Indexes


    My blog: SQL Soldier[/url]
    SQL Server Best Practices:
    SQL Server Best Practices
    Twitter: @SQLSoldier
    My book: Pro SQL Server 2008 Mirroring[/url]
    Microsoft Certified Master: SQL Server, Data Platform MVP
    Database Engineer at BlueMountain Capital Management[/url]

  • Nice question. Thanks.

    Thanks

  • Whe I saw this question, my expectation was that most people would get it right - after all, there have already been several questions that basicallly tested the same concept. The word "superfluous" already started to form in my head.

    After answering, I was shocked to see a score of only 39% (at that time) correct answers. Almost half the people (48%, to be precise) apparently STILL think that data in a table is always physically ordered by the clustering key. So I have to apologize for the critique I (fortunately) never uttered - this question is far from superfluous, it is still very relevant and will probably have to be repeated yet more often.

    It's too bad that the author did not include a reference. For instance, the books online description of clustered indexes at http://msdn.microsoft.com/en-us/library/ms177443.aspx (found after a one-minute search), that contains a nice graphic of a clustered index - though it unfortunately never mentions that the actual locations of the depicted pages on disk can be completely random.

    But apart from this missing reference, the question is good. Thanks! 😉


    Hugo Kornelis, SQL Server/Data Platform MVP (2006-2016)
    Visit my SQL Server blog: https://sqlserverfast.com/blog/
    SQL Server Execution Plan Reference: https://sqlserverfast.com/epr/

  • Good question!

    At the time of writing, only 34% have the right answer. So it is a good question indeed to refreshen some concepts about clustered index.

    Another great resource is this article from Gail Shaw:

    http://www.sqlservercentral.com/articles/Indexing/68439/

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP

  • I am not satisfied with the answer, indeed I find it very misleading.

    As data is inserted or the clustering key values are updated, SQL Server only preserves logical ordering of the data. It would require far too many resources to preserve physical ordering of the data.

    What you are referring to is the physical order within a page.

    What you have left out here, is the fact that rows are indeed sorted in physical order accross pages.

    So if you have 2 records on one page, and a new record that logically belongs between them does not fit on the same page anymore, the logically last row of the existing records will be moved to a new page, and the new record will be stored on the existing page to preserve physical order.

    Here is some more background information which I think is necessary to really understand the discussed concepts:

    Clustered indexes are not a good choice for:

    Columns that undergo frequent changes

    This results in the entire row moving (because SQL Server must keep the data values of a row in physical order). This is an important consideration in high-volume transaction processing systems where data tends to be volatile.

    http://blogs.msdn.com/b/sqlserverstorageengine/archive/2006/06/26/647005.aspx

    It's a very common misconception that records within a page are always stored in logical order.

    http://www.sqlskills.com/blogs/paul/2007/10/03/InsideTheStorageEngineAnatomyOfAPage.aspx

    http://www.sqlskills.com/BLOGS/PAUL/post/Inside-the-Storage-Engine-Proof-that-records-are-not-always-physically-stored-in-index-key-order.aspx

    (Again - note that this is referring to order within a page only)

    Best Regards,

    Chris Büttner

  • Christian Buettner-167247 (10/5/2010)


    What you have left out here, is the fact that rows are indeed sorted in physical order accross pages.

    So if you have 2 records on one page, and a new record that logically belongs between them does not fit on the same page anymore, the logically last row of the existing records will be moved to a new page, and the new record will be stored on the existing page to preserve physical order.

    True, but that new page is not necessarily physically located after the existing page. There is no problem whatsoever with a page setup like this (assume ID is an int clustered index)

    File 1 page 200 ID 5-20

    File 1 page 201 ID 86-140

    File 1 page 208 ID 51-85

    File 1 page 210 ID 21-50

    It's fragmented (50% logical fragmentation) but it's perfectly acceptable. The pages have next and previous pointers that point to the page that's logically next or previous to maintain the index order. The pages are not necessarily physically contiguous or in order.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Hugo Kornelis (10/5/2010)


    After answering, I was shocked to see a score of only 39% (at that time) correct answers. Almost half the people (48%, to be precise) apparently STILL think that data in a table is always physically ordered by the clustering key.

    Currently 33% and 51% respectively :w00t:

  • GilaMonster (10/5/2010)


    True, but that new page is not necessarily physically located after the existing page.

    Agreed. The rows are physically ordered within the "page universe"*, but not the pages themselves

    *and neither the rows within one page

    Best Regards,

    Chris Büttner

  • Christian Buettner-167247 (10/5/2010)


    I am not satisfied with the answer, indeed I find it very misleading.

    I have to disagree. I am completely satisfiied with the answer, as it is 100% correct. I find the explanation lacking, but definitely not misleading.

    As data is inserted or the clustering key values are updated, SQL Server only preserves logical ordering of the data. It would require far too many resources to preserve physical ordering of the data.

    What you are referring to is the physical order within a page.

    I don't think so. In fact, the cost of changing the physical order within a page is not that much; the page is in cache anyway, it is alwayys read and written as a whole, so the only cost is that of moving at most 8,000 bytes around in a memory block. Measured in fractions of a microsecond.

    The prohibitive use of resources would come when a page overflows. If physical order has to be preserved and the free space between rows runs out, all rows have to be physically moved. Just think for a while what would happen if you have a 10-billion row table, and you need to insert a row somewhere halfway in the table? Do you reallly expect SQL Server to physically relocate 5 billion rows, to make place for the new row? It does not. It allocates a new page, somewhere, and changes the pointer chain to preserve the logical order.

    What you have left out here, is the fact that rows are indeed sorted in physical order accross pages.

    So if you have 2 records on one page, and a new record that logically belongs between them does not fit on the same page anymore, the logically last row of the existing records will be moved to a new page, and the new record will be stored on the existing page to preserve physical order.

    This would only preserve physical order if SQL Server allocated the new page "between" the two old pages. It does not. (I have some repro code down below).

    Here is some more background information which I think is necessary to really understand the discussed concepts:

    Clustered indexes are not a good choice for:

    Columns that undergo frequent changes

    This results in the entire row moving (because SQL Server must keep the data values of a row in physical order). This is an important consideration in high-volume transaction processing systems where data tends to be volatile.

    http://blogs.msdn.com/b/sqlserverstorageengine/archive/2006/06/26/647005.aspx

    A change in a clustered index value can indeed result in the row moving to another page. But the use of the words "physical order" in this quote is sloppy - not up to Pauls normal standard.

    It's a very common misconception that records within a page are always stored in logical order.

    http://www.sqlskills.com/blogs/paul/2007/10/03/InsideTheStorageEngineAnatomyOfAPage.aspx

    http://www.sqlskills.com/BLOGS/PAUL/post/Inside-the-Storage-Engine-Proof-that-records-are-not-always-physically-stored-in-index-key-order.aspx

    (Again - note that this is referring to order within a page only)

    Though I don't think this is what Robert intended with his question, it does further support the correct answer. Within a single page, the rows can physically be stored out of order. The slot array is used to represent the logical order.

    To demonstrate that the physical order of pages is not equal to the logical order, run the script below and check the output. Here is an explanation of what happens.

    I first create a new database, then a table with a large data column so don't need lots of rows to fill a few pages. I then fill the row with data in ascending clustered index order. If the database files are not fragmented, all free space is continuous, the database files do not reside on SAN or RAID, and nobody else is using the database at the same time, then it is likely (though I won't vow for it) that the pages allocated will actually be in physical order.

    I then insert a single extra row that logically goes in the beginning of the table. There are four rows per page, so the row to be added goes to the second page (with key values 9, 11, 13, and 15). This page can't accomodate a fifth row, so it has to be split. The first two rows remain on their page, the other two are moved to a new page. This new page is not squeezed in between the second page and the "old" third page, but just allocated where there is place - probably right after the last page allocated to this table (assuming no other activity in the database). Pointers are used to preserve logical order.

    Then, I run a simple query. Watch the TABLOCK hint - either that or a NOLOCK hint is required to get the intended results. That is because only a full table lock or no locking at all allows SQL Server to use the "IAM scan" - a scan that uses the index allocation map to see what pages are allocated to the table and then process those pages in their physical allocation order. When the table exceeds a certain size, the cost of this kind of scan is lower than a scan that follows the pointers that define the logical order. (This size threshold is the reason I had to add 500 rows).

    If you run the query, you will see that the results are "almost" ordered - the only "disorder" is the gap between values 12 and 17; the rows with key values 13 and 15 are not missing; they appear at the very end of the results, after the vallue 999. This is what the physical order of the tablle looks like after we forced a page split.

    CREATE DATABASE Demo;

    go

    USE Demo;

    go

    CREATE TABLE Demo

    (KeyCol int NOT NULL PRIMARY KEY,

    DataCol char(2000));

    go

    DECLARE @i int;

    SET @i = 1

    WHILE @i < 1000

    BEGIN;

    INSERT INTO Demo (KeyCol, DataCol)

    VALUES (@i, 'Some data');

    SET @i = @i + 2;

    END;

    go

    INSERT INTO Demo (KeyCol, DataCol)

    SELECT 12, 'Added later';

    go

    SELECT *

    FROM Demo WITH (TABLOCK);

    go

    USE master;

    go

    DROP DATABASE Demo;

    go


    Hugo Kornelis, SQL Server/Data Platform MVP (2006-2016)
    Visit my SQL Server blog: https://sqlserverfast.com/blog/
    SQL Server Execution Plan Reference: https://sqlserverfast.com/epr/

  • Hi Hugo,

    In your explanation you have switched from order of rows accross pages to order of pages.

    I have not meant that a physical order is imposed on the pages. I have meant that a physical order is imposed on the rows accros pages. This physical ordering of rows accross pages does not result in physical ordering of the pages, and neither does it result in ordered data on the hard disk itself.

    Best Regards,

    Chris Büttner

  • Thanks Hugo. Nice explanation.

    I have used the below DBCC to see how page allocation happening and how it logically arranged using NextpageID and PrevpageID.

    DBCC IND(Demo, Demo, -1);

    GO

    [font="Verdana"]Regards,
    Rals
    [/font].
  • Christian Buettner-167247 (10/5/2010)


    Hi Hugo,

    In your explanation you have switched from order of rows accross pages to order of pages.

    I have not meant that a physical order is imposed on the pages. I have meant that a physical order is imposed on the rows accros pages. This physical ordering of rows accross pages does not result in physical ordering of the pages, and neither does it result in ordered data on the hard disk itself.

    Hi Christian,

    I must have misunderstood you - my apologies. I was convinced you were refering to both (order of rows in a page and order of pages on disk). And I tried to make clear that for both, the physical order can be different from the logical order. The logical order follows the clustered index, and is implemented using the pointer chain (for the order of pages), and the slot array (for the order of rows within a page).


    Hugo Kornelis, SQL Server/Data Platform MVP (2006-2016)
    Visit my SQL Server blog: https://sqlserverfast.com/blog/
    SQL Server Execution Plan Reference: https://sqlserverfast.com/epr/

  • Would have been nice to have a definition of what physically ordered and logically ordered meant?

  • No Problem Hugo. I was not that clear in my first post, so I can see where you were coming from. And that just emphasizes how important it is to be very precise and clear when allegeing facts;-)

    Best Regards,

    Chris Büttner

  • It's not surprising that so many people get this wrong. Not much effort is given by authors to fully explain this, perhaps a few don't know either. I used to misunderstand it as well, but the QoD have enlightened me... As I understand it, both the pages and rows are physically sorted ONLY after an index rebuild (or the table has had no page splits yet). From there, new pages and page splits can be placed anywhere and logically ordered through double linkage. This is where fragmentation starts to occur. Rows remain physically sorted within the pages. If a single row takes up more than 1/2 of the free space in the page (> 4030 of the 8060 bytes), you will have a lot of unused space within pages and fragmentation would become very high, as every row requires it's own page.

    The following excerpts are not wrong, but don't give the full picture either.

    From Wikipedia: "Since the physical records are in this sort order on disk, the next row item in the sequence is immediately before or after the last one, and so fewer data block reads are required. The primary feature of a clustered index is therefore the ordering of the physical data rows in accordance with the index blocks that point to them."

    http://en.wikipedia.org/wiki/Index_(database)#Clustered

    MSDN: "A clustered index determines the physical order of data in a table."

    http://msdn.microsoft.com/en-us/library/aa933131(SQL.80).aspx

Viewing 15 posts - 1 through 15 (of 38 total)

You must be logged in to reply to this topic. Login to reply