index on big table

  • dva2007 (10/26/2011)


    Again according to this article here it says following:

    http://www.sql-server-performance.com/2007/clustered-indexes/

    Here are some more good reasons to add a clustered index to every table.

    Keep in mind that a clustered index physically orders the data in a table based on a single or composite column. Second, the data in a heap (a table without a clustered index) is not stored in any particular physical order.

    That article simply is wrong. It's not alone, Books Online has the same statement in it. Doesn't make it right, just makes it an incorrect statement that appears in many places.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • alexander.suprun (10/25/2011)


    drew.allen (10/25/2011)


    dva2007 (10/25/2011)


    If your index doesn't return all of the fields in your query, then it still needs to look up the record in order to retrieve those fields. If I understand it correctly, the optimizer will use the clustered index--if it is available--for this lookup, which is why having one will speed up your queries.

    Drew

    Actually it's vice versa. Having a clustered key will slow down a lookup operation. Because if it's a heap then on the leaf level of any non-clustered indexes there is an RID which directly points to the data, so only one logical operation is needed to get to the data (assuming that there are no forwarding pointers). But if you have clustered key then at the bottom of non-cl index SQL engine will find just a value which will be used to perform Clustered Index Seek. And if you have, let's say, 4 levels in clustered B-Tree then 4 pages must be accessed to perform the seek, comparing to 1 page in case you don't have clustered index.

    Alexander,

    The number of times I've seen a heap out perform a clustered index on this sort of thing can be counted on one hand with some fingers left over. You mentioned assuming there are no forward pointers. Well, this condition only exists on very static data. A table with insert/update/delete type activity will surely split the heck out of a heap - especially updates to varchar columns that make them larger. Once a heap gets split up and fragmented it really is a mess. Also, after some massive deletes the pages aren't given back for re-use.

    I'd rather traverse a few node levels to get to a key lookup rather than deal with the headaches caused by heaps.

    Todd Fifield

  • tfifield (10/27/2011)


    Alexander,

    The number of times I've seen a heap out perform a clustered index on this sort of thing can be counted on one hand with some fingers left over. You mentioned assuming there are no forward pointers. Well, this condition only exists on very static data. A table with insert/update/delete type activity will surely split the heck out of a heap - especially updates to varchar columns that make them larger. Once a heap gets split up and fragmented it really is a mess. Also, after some massive deletes the pages aren't given back for re-use.

    I'd rather traverse a few node levels to get to a key lookup rather than deal with the headaches caused by heaps.

    Todd Fifield

    Todd,

    I agree that a heap is a mess of data and it's hard to maintain such a table but this is another question. The original was - is it faster to perform a lookup on a heap table rather than clustered index? The answer is yes. Even if there are forward pointers then number of reads becomes 2 as a maximum. Which is usually less than number of levels in clustered index on a big table.

    As you mentioned, a heap table has a lot of disadvantages, so I don't say that it's a very good option. But it would be wrong to assume that clustered index speeds up a lookup operation and that's the only thing I was going to say.


    Alex Suprun

  • Alexander,

    Point taken. I just didn't want the OP to get the idea that heaps give better performance than clusters.

    Todd Fifield

  • dva2007 (10/26/2011)


    Steve, Gail,

    I always assumed that clustered index put the data in the physical order in the table.

    ------------------------------------------------------------------------------------------------------------

    Again according to this article here it says following:

    http://www.sql-server-performance.com/2007/clustered-indexes/

    Here are some more good reasons to add a clustered index to every table.

    Keep in mind that a clustered index physically orders the data in a table based on a single or composite column. Second, the data in a heap (a table without a clustered index) is not stored in any particular physical order.

    ------------------------------------------------------------------------------------------------------------

    I would look forward for the blog of gail to get proper understanding.

    Thank you all for your comments.

    The pages and rows in a clustered table are "clustered" in a b-tree fashion, which means they are ordered somewhat at the leaf level. However, all rows in the table are not ordered from top to bottom.

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • thanks all for your comments.

    As i said i do understand how the table is structured with the index but i am still bit confused.

    I have a table which doesnt have small unique key (unique key is 70 character long with combination of alphabet and numeric). The 70 character field doesnt make good clustered index.

    If i create autoid and do the clustered index on auto id (Identity) then whats the advantage?

    i am never going to write query like this.

    select * from table where id > 1000 and id < 2000

    If i am never going to use auto id then whats the point of having clustered index on auto id? I cant have index on 70 characted long as it says it has to be narrow. it takes almost 500 MB to create index on 70 character long key.

  • You need an index on the 70 char key, regardless if it's the primary key. If you use it as the clustered index, then there isn't extra space created. If it's 500MB, then that's the table size.

    If you create a clustered index on an autoID, it can prevent page splits and it will always insert at the end of the table, or allocate new space, for the new rows.

    If you have other fields in the table, they might make better clustered keys, but a clustered index is better organized, and tends to perform better than a heap for queries.

    Don't confuse clustering with keys. You need indexes to improve query performance. The choice of which index to choose for the clustering key may vary from table to table. It is not necessarily the primary key for any of them.

  • dva2007 (11/1/2011)


    thanks all for your comments.

    As i said i do understand how the table is structured with the index but i am still bit confused.

    I have a table which doesnt have small unique key (unique key is 70 character long with combination of alphabet and numeric). The 70 character field doesnt make good clustered index.

    If i create autoid and do the clustered index on auto id (Identity) then whats the advantage?

    i am never going to write query like this.

    select * from table where id > 1000 and id < 2000

    If i am never going to use auto id then whats the point of having clustered index on auto id? I cant have index on 70 characted long as it says it has to be narrow. it takes almost 500 MB to create index on 70 character long key.

    By default, SQL Server will assign it's own internal 8 byte identifier (File:Page:Slot) as the ROWID for each record in a "heap" (non-clustered) table, and this ROWID is used as the lookup id in non-clustered indexes. This is the ID that non-clustered indexes use to reference a specific record in the table. However, if a table has a unique clustered key (like a 4 byte Int for example), then SQL Server will use that for the ROWID. So using that 4 byte Int will decrease the size and probably the performace of your non-clustered indexes. Based on how you're described the column definition of your table, I'd bet that you have several non-clustered indexes.

    Based on the above assumptions, it's very possible that adding the following column to your table will decrase the size of your indexes and increase their efficiency.

    ...

    uid int not null identity constraint pk_mytable primary key clustered),

    ...

    However, read up on it more, and experiment by reproducing two versions of your table in a dev environment, one with the clustered uid and one without, and then loading it with data to confirm that index sizes and query i/o and performance have actually improved.

    http://msdn.microsoft.com/en-us/library/ms190639.aspx

    Also, articles by Paul Randal and Kimberly Tripp are a goldmine of information on this topic.

    http://www.sqlskills.com/BLOGS/KIMBERLY/post/The-Clustered-Index-Debate-again!.aspx

    If your queries are not "sargable", meaning it doesn't leverage an index and is performing full table scans, then adding this unique identifier may add no benefit or perhaps even hurt.

    As it stands now, the natural order of your heap table (ordered roughly by the sequence that the records were inserted) may be better than the order of records after clustering on an identity column. Therefore, if your table has a datetime column based on insert time or reporting period, then that column may actually work out better as the clustered key than an (int identity) column.

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • if there are more records with the same date in the table, can i create clustered index on that field?

    We have 5 years records and the same day have few hundred records. If i create clustered index then it is not unique field and ever increasing (like auto id).

  • dva2007 (11/1/2011)


    if there are more records with the same date in the table, can i create clustered index on that field?

    We have 5 years records and the same day have few hundred records. If i create clustered index then it is not unique field and ever increasing (like auto id).

    If you're queries always, or just freqently, filter on something like insert_date or period_date, then adding a clustered index (rather than a non-clustered index) on that datetime column will benefit, because whenver you re-build the table, the records will be clustered in basically the same natural order that they were originally inserted. If this is a history table, and you're only querying records for specific date ranges, then keeping records clustered together based on that date makes sense performance wise, even if it isn't a unique clustered key. A datetime column itself is 8 bytes long, so it isn't any more narrow than an 8 byte ROWID.

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • thanks you. that makes more sense as it is not going to take more space then default ROWID (8 bytes).

  • dva2007 (11/1/2011)


    thanks you. that makes more sense as it is not going to take more space then default ROWID (8 bytes).

    If the clustered index is on a non-unique key, then SQL Server will still use the default 8 byte rowid. Still, there are practical reasons to cluster a table even on a non-unique column, like transaction_date, that keeps the records clustered by datetime, if that's how they are typically selected.

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Eric M Russell (11/1/2011)


    If the clustered index is on a non-unique key, then SQL Server will still use the default 8 byte rowid.

    The 8-byte RID is only used in nonclustered indexes when the base table is a heap. As soon as a clustered index is added (any clustered index), the pointer in nonclustered indexes is the clustering key. If the clustered index is not defined unique, SQL adds a uniquifier, a 4-byte integer that's used to distinguish between duplicate values of the clustering key.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • GilaMonster (11/1/2011)


    Eric M Russell (11/1/2011)


    If the clustered index is on a non-unique key, then SQL Server will still use the default 8 byte rowid.

    The 8-byte RID is only used in nonclustered indexes when the base table is a heap. As soon as a clustered index is added (any clustered index), the pointer in nonclustered indexes is the clustering key. If the clustered index is not defined unique, SQL adds a uniquifier, a 4-byte integer that's used to distinguish between duplicate values of the clustering key.

    Now that you mention it, I do recall something like that, so a non-unique clustered key may end up resulting in an rowid that is wider than the default heap one. Until these alternate table def scenarios are staged in a dev environment and compared side by side, it's difficult to anticipate how it will ultimately impact the space used and query performance. I've seen some formulas on Kim and Paul's site, but I don't know off the top if there is a tool where one can plug in the table / index defs, supply some parameters, and have it give a good estimate based just on math.

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Eric M Russell (11/1/2011)


    Now that you mention it, I do recall something like that, so a non-unique clustered key may end up resulting in an rowid that is wider than the default heap one. Until these alternate table def scenarios are staged in a dev environment and compared side by side, it's difficult to anticipate how it will ultimately impact the space used and query performance. I've seen some formulas on Kim and Paul's site, but I don't know off the top if there is a tool where one can plug in the table / index defs, supply some parameters, and have it give a good estimate based just on math.

    Eric,

    As long as you brought up the space issue, I've found that heap tables that have any sort of DELETE activity can end up with empty pages that still have to be traversed. Any sort of update activity to VARCHAR/NVARCHAR type columns that increase their size can easily cause pages to have mostly forward pointers rather than data due to page splits

    An 8 byte DATETIME plus 4 bytes worth of unique identifier (12 bytes) as the clustering key isn't all that bad and would probably end up with better performance in the long run.

    That's my 2 cents worth.

    Todd Fifield

Viewing 15 posts - 16 through 29 (of 29 total)

You must be logged in to reply to this topic. Login to reply