Why Allow Heaps at All?

  • Given all the advantages of clustered tables (tables with a clustered index) over heaps, why does SQL Server even allow heaps to exist? If a table is created without a clustered index specified, why not give it an internal identity-valued clustering index key with 100% fill factor, so new rows are only added at the end, and store it that way? The disadvantage would be rebuilding the clustered index when the user ever got around to specifying one, and that too only if the existing clustering key needed to be changed, but that would still be a one-time operation. Isn't that worth the performance gains to be had?

    Hakim Ali
    www.sqlzen.com

  • hakim.ali (12/28/2012)


    Given all the advantages of clustered tables (tables with a clustered index) over heaps, why does SQL Server even allow heaps to exist? If a table is created without a clustered index specified, why not give it an internal identity-valued clustering index key with 100% fill factor, so new rows are only added at the end, and store it that way? The disadvantage would be rebuilding the clustered index when the user ever got around to specifying one, and that too only if the existing clustering key needed to be changed, but that would still be a one-time operation. Isn't that worth the performance gains to be had?

    I suspect theres a mechanism already in place to efficiently index a heap, ie., whatever passes for the row id. For example, an index on a table that is a heap has to be able to point to any given row when the index key matches the search term. Since the row id for rows in the heap is internally managed by the server, I would also think that the mechanism for coming up with a new row id for each new inserted row is going to be engineered pretty well for its purpose.

  • patrickmcginnis59 (12/28/2012)


    I suspect theres a mechanism already in place to efficiently index a heap, ie., whatever passes for the row id.

    The RID is nothing more than an 8-byte combination of file ID, page number and slot index. It's the physical position of the row within the database.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • GilaMonster (12/28/2012)


    patrickmcginnis59 (12/28/2012)


    I suspect theres a mechanism already in place to efficiently index a heap, ie., whatever passes for the row id.

    The RID is nothing more than an 8-byte combination of file ID, page number and slot index. It's the physical position of the row within the database.

    Sounds like a pretty efficient index to me 🙂

  • patrickmcginnis59 (12/28/2012)


    Sounds like a pretty efficient index to me 🙂

    If it were an index, then why would the database scan every record in a heap when you were searching for something? I can see this as an addressing mechanism, not so much as an index.

    Hakim Ali
    www.sqlzen.com

  • patrickmcginnis59 (12/28/2012)


    GilaMonster (12/28/2012)


    patrickmcginnis59 (12/28/2012)


    I suspect theres a mechanism already in place to efficiently index a heap, ie., whatever passes for the row id.

    The RID is nothing more than an 8-byte combination of file ID, page number and slot index. It's the physical position of the row within the database.

    Sounds like a pretty efficient index to me 🙂

    It's not an index at all. It's unique, but that's all.

    For it to be an index, there would have to be the upper levels of the b-tree. There isn't. You can't seek on a heap, only scan.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • hakim.ali (12/28/2012)


    If it were an index, then why would the database scan every record in a heap when you were searching for something?

    SQL only scans if you haven't got a useful index (clustered or nonclustered). It's perfectly possible to put nonclustered indexes on a heap.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • hakim.ali (12/28/2012)


    patrickmcginnis59 (12/28/2012)


    Sounds like a pretty efficient index to me 🙂

    If it were an index, then why would the database scan every record in a heap when you were searching for something? I can see this as an addressing mechanism, not so much as an index.

    I think thats a fair comment. However, if you could actually spec the page, row, etc in your SQL statement then you would be able to fetch the row directly without a scan (if of course the SQL would honor the direct request). Also, if you index a heap, you could then include the columns you index against in your query, and that index might help avoid a scan.

    But I agree that its probably not an index in any sense except that the row ID can be used (internally by systems programming) to directly fetch a row, so I'd like to equate that to an index as opposed to the original comment that heaps should default to having a clustered index, ie., direct access works (somewhat) like a clustered index with an internally generated key. Not that the original point was all that objectionable either.

  • GilaMonster (12/28/2012)


    patrickmcginnis59 (12/28/2012)


    GilaMonster (12/28/2012)


    patrickmcginnis59 (12/28/2012)


    I suspect theres a mechanism already in place to efficiently index a heap, ie., whatever passes for the row id.

    The RID is nothing more than an 8-byte combination of file ID, page number and slot index. It's the physical position of the row within the database.

    Sounds like a pretty efficient index to me 🙂

    It's not an index at all. It's unique, but that's all.

    For it to be an index, there would have to be the upper levels of the b-tree. There isn't. You can't seek on a heap, only scan.

    I was thinking more generally, comparing the end effect of the functionality he originally suggested, more like:

    http://xlinux.nist.gov/dads/HTML/arrayindex.html but I can absolutely agree that its not an index as SQL users know it.

  • hakim.ali (12/28/2012)


    patrickmcginnis59 (12/28/2012)


    Sounds like a pretty efficient index to me 🙂

    If it were an index, then why would the database scan every record in a heap when you were searching for something? I can see this as an addressing mechanism, not so much as an index.

    Yes, a RID a physic address for a specific row and should not be considered an entry in an index primarily because you would not query for a row based on the RID as you would with a clustering key. That said, if your clustered index is not unique you still cannot guarantee your table only contains one row per clustering key, and there would be no way to ask for the unique row using a uniqueifier (added by SQL Server to make rows with the same clustering key distiguishable from one another). A RID is just a physical row locator the storage engine uses to retrieve data but it says nothing about the data actually stored in the row.

    Heaps are good storage structures for staging tables (tables that are truncated and reloaded regularly) where maintaining a clustered index would force the storage engine to store the incoming data in a particular order and therefore would introduce overhead. You can test this, even with an ever-increasing IDENTITY column as the clustered index, and see that loading large quantities of data into a heap will outperform a clustered table. Heaps are also good for tables that are only ever appended to or where the only updates that occur are to fixed-width columns (e.g. numeric types, CHAR, NCHAR or BINARY).

    Now, if you are maintaining a table in an OLTP system that accepts lots of variable-length updates or deletes then using heaps can become very problematic due to the lack of reuse of page slots where rows were deleted (unless a specific type of table-lock is taken, which is not practical in an OLTP system) and the generation of forwarding pointers when updates cause rows to overflow the page they are on.

    Heaps do have their place. Use the right tool for the job.

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • As noted, heaps are great for "work" / "staging" tables.

    Also, I guess in theory a heap could be more efficient for certain types of tables with multiple nonclust indexes, particularly if they were never updated after being inserted. That is, drop the indexes on the heap, truncate it, load it, rebuild the indexes, SELECTs from the heap based on different columns but (almost) never DELETE / UPDATE from it.

    No reason for SQL to prevent you from using one if you determine it's better for your particular need.

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • opc.three (12/28/2012)


    Heaps are good storage structures for staging tables (tables that are truncated and reloaded regularly) where maintaining a clustered index would force the storage engine to store the incoming data in a particular order and therefore would introduce overhead. You can test this, even with an ever-increasing IDENTITY column as the clustered index, and see that loading large quantities of data into a heap will outperform a clustered table. Heaps are also good for tables that are only ever appended to or where the only updates that occur are to fixed-width columns (e.g. numeric types, CHAR, NCHAR or BINARY).

    ...

    Heaps do have their place. Use the right tool for the job.

    ScottPletcher (12/28/2012)


    As noted, heaps are great for "work" / "staging" tables.

    ...

    No reason for SQL to prevent you from using one if you determine it's better for your particular need.

    Thanks for shedding the light, and good points. However, given that these scenarios are probably the exceptions rather than the norm, why not default to all tables being clustered, and allow users to override them as heaps if/when required?

    Hakim Ali
    www.sqlzen.com

  • hakim.ali (12/28/2012)


    Thanks for shedding the light, and good points. However, given that these scenarios are probably the exceptions rather than the norm, why not default to all tables being clustered, and allow users to override them as heaps if/when required?

    They may be an exception from your perspective but I have worked with many databases with hundreds of tables in them and not one of them had a clustered index. The ones that were staging databases had no issues related to only containing heaps because all that was done in the tables was truncating and loading or appending data. The ones that were transactional databases where row-widening updates and deletes were issued did not do as well. It can be very difficult to maintain high performance levels as a heap becomes disorganized on disk. In SQL 2008 R2 and newer it is a little easier to reorganize a heap with the addition of ALTER TABLE...REBUILD but if you find yourself needing that statement you should be thinking of when you can add a clustered index.

    As to why a table is not clustered by default, I suppose it could be done but it presents many technical challenges and difficult decisions. How would you propose SQL Server decide which clustering key to use if the user does not provide one? A clustered index decouples the data access from the physical storage details unlike in the case of a heap where the RID, a rows physical location, is stored in nonclustered indexes as the lookup key. Are you saying SQL Server should randomly choose a column and use that, plus maybe a uniqueifier? What if we decide to drop the column it chooses? Or should SQL Server just append it's own inaccessible column (like a uniqueifier) to all rows added to the table and use that to cluster on? What would it use, an INT, a BIGINT? What if it runs out of values internally?

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • hakim.ali (12/28/2012)


    Given all the advantages of clustered tables (tables with a clustered index) over heaps, why does SQL Server even allow heaps to exist? If a table is created without a clustered index specified, why not give it an internal identity-valued clustering index key with 100% fill factor, so new rows are only added at the end, and store it that way? The disadvantage would be rebuilding the clustered index when the user ever got around to specifying one, and that too only if the existing clustering key needed to be changed, but that would still be a one-time operation. Isn't that worth the performance gains to be had?

    Because there are occasional benefits to HEAPS especially when it comes to Temp Tables that are supposed to contain ONLY the data you need or staging tables that will be made to suffer full scans even if you were to index them. The Temp Table that I built in the first "Hierarchies On Steroids" article I wrote fits that description. Adding a Clustered index actually slowed the code down two ways... 1) adding the index and 2) the code was actually twice as slow with the clustered index.

    There aren't many places where I'd say a clustered index shouldn't be used but there are places.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • You may be interested to know that SQL Azure (or whatever it's called this week), does not allow heaps.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass

Viewing 15 posts - 1 through 15 (of 19 total)

You must be logged in to reply to this topic. Login to reply