Clustered index on identity column vs Heap

  • It seems that a lot of people design tables with a clustered index on an identity column. Given the 2 tables below, is there really any advantage to the one with the CI? I guess my real question is, "why is (if it is) a query using a non-clustered index against a table with a CI more efficient than one against a table without a CI (heap)"

    Assumptions:

    1. 90% of activity is inserts, updates and queries. 10% is deletes.

    2. All updates and queries will use PKID column and its index.

    3. I realize the tables/indexes could be designed better, but I'm just interested in the examples below.

    USE Test

    CREATE TABLE Test_NC (PKID UNIQUEIDENTIFIER PRIMARY KEY NONCLUSTERED, Col1 BIGINT, Col2 VARCHAR(20))

    CREATE TABLE Test_CL (Id INT IDENTITY, PKID UNIQUEIDENTIFIER PRIMARY KEY NONCLUSTERED, Col1 BIGINT, Col2 VARCHAR(20))

    CREATE CLUSTERED INDEX CI_Clustered ON dbo.Test_CL (Id)

    John Deupree

  • While I'm sure that there are other reasons - here are a few:

    - a heap's nonclustered indexing relies on a "heap row ID", which is basically a compound identifier (a page ID + a row/slot ID within the page). each of those is a bigint, so using an identity is more efficient size-wise.

    - forwarding pointers: when you update a record in a heap, unless it's the same size, the update will be written to a new position (on some other page). in its place there will be a pointer to the new location. When you then use the N-on-clustered index to go find this record it may have to jump through a series of these forwarding pointers until it finds the latest version of the record.

    - page fill: again based on point #2, a heavily updated heap will tend to fragment heavily (since the pages aren't reused). there's also no practical way to defragment a heap (short of copying it to a brand new table and dropping the old one).

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • Matt Miller (#4) (7/26/2012)


    While I'm sure that there are other reasons - here are a few:

    - a heap's nonclustered indexing relies on a "heap row ID", which is basically a compound identifier (a page ID + a row/slot ID within the page). each of those is a bigint, so using an identity is more efficient size-wise.

    - forwarding pointers: when you update a record in a heap, unless it's the same size, the update will be written to a new position (on some other page). in its place there will be a pointer to the new location. When you then use the N-on-clustered index to go find this record it may have to jump through a series of these forwarding pointers until it finds the latest version of the record.

    - page fill: again based on point #2, a heavily updated heap will tend to fragment heavily (since the pages aren't reused). there's also no practical way to defragment a heap (short of copying it to a brand new table and dropping the old one).

    +1

  • Beyond Matt's general data about heaps, these two particular tables have other differences that are important. One has a GUID as the leading edge of the PK, one has an Integer, for one thing. That will make a difference in a lot of things.

    First, storage. GUIDs take more than Ints. Simple as that.

    Next, querying. Most queries will end up using the PK as the main component of Where clauses, either for Selects, or for Update/Delete actions. The GUID will, again, require more space in RAM for variables, require more bandwidth to pass back and forth between servers, will take more buffer space and more I/O bandwidth on the database server, and all the other things that happen when data is bigger than other data. On a server with a high resources:load ratio, that probably won't matter, or not much, but on a server with a high load:resources ratio, it will likely make a difference cummulatively.

    Next, as an advantage for inserts, the GUID can be generated by the application and given to the database, without a round-trip. Again, on some systems that are insert-heavy and have a high load:resources ratio, this can make a big speed difference and be a solid advantage.

    None of these points address your question about heaps, but you did ask about the two specific tables, and I thought it should be mentioned that this isn't purely a question of heap vs non-heap with these two.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

Viewing 4 posts - 1 through 3 (of 3 total)

You must be logged in to reply to this topic. Login to reply