Clustered Indexes? Sedimentary, my dear Watson

  • For me, "it depends" is a good answer. I believe tables should have a clustered index (for much the same reasons as Gail). I don't always create it on an ever-increasing integer field, but that is the case most of the time.

    Jason...AKA CirqueDeSQLeil
    _______________________________________________
    I have given a name to my pain...MCM SQL Server, MVP
    SQL RNNR
    Posting Performance Based Questions - Gail Shaw
    Learn Extended Events

  • I have a table in a database with hundreds of millions of rows. The records are quite small (30 bytes) with a clustered index on TagID (Int) and SampleDateTime (BigInt). I didn't design the table, but I am almost convinced that the clustered index should be reversed, because SampleDateTime has much greater uniqueness, always increases (obviously), and we ALWAYS specify a DateTime range in our queries.

    I experimented and found that the execution time for queries that specify a DateTime range but no TagIDs came down from 45 seconds to less than 1 second. The only problem is that most existing queries take much longer, presumably because they are structured for the original clustered index.
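
    Something like this sketches the two layouts under discussion (table and index names are invented; the remaining small columns of the ~30-byte row are omitted):

    -- Hypothetical recreation of the table described above.
    CREATE TABLE dbo.TagSample
    (
        TagID          int    NOT NULL,
        SampleDateTime bigint NOT NULL
        -- ... other small columns omitted ...
    );

    -- Current key order: TagID leads.
    CREATE CLUSTERED INDEX CIX_TagSample ON dbo.TagSample (TagID, SampleDateTime);

    -- Proposed reversal: SampleDateTime leads, so a date-range-only query
    -- touches one contiguous slice of the clustered index instead of one
    -- slice per TagID:
    -- CREATE CLUSTERED INDEX CIX_TagSample ON dbo.TagSample (SampleDateTime, TagID);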

  • I agree with the Microsoft approach... but mainly because when I design tables, I reference them through the identity keys from the get-go. Therefore having a clustered index on a sequential ID provides a double benefit when you actually reference a table through the ID key. For name or address searches, I recommend nonclustered indexes tailored to both the WHERE and SELECT clauses of a common SQL search (a sketch follows at the end of this post). The main thing you want to avoid is bookmark lookups, because they are expensive.

    But hey, if there was only one right way of doing things then we wouldn't need programmers or DBAs.
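
    As a loose sketch of that point (all names invented), an index keyed on the search column, with the selected columns included, lets the optimizer satisfy the query without a bookmark lookup:

    -- Hypothetical table clustered on an identity key.
    CREATE TABLE dbo.Customer
    (
        CustomerID int IDENTITY PRIMARY KEY CLUSTERED,
        LastName   varchar(50) NOT NULL,
        FirstName  varchar(50) NOT NULL,
        City       varchar(50) NOT NULL
    );

    -- Keyed on the WHERE column, INCLUDEs the SELECT columns, so the query
    -- below never has to look back into the clustered index.
    CREATE NONCLUSTERED INDEX IX_Customer_LastName
        ON dbo.Customer (LastName)
        INCLUDE (FirstName, City);

    SELECT FirstName, City
    FROM   dbo.Customer
    WHERE  LastName = 'Fehr';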

  • garnet.fehr (5/25/2010)


    But hey, if there was only one right way of doing things then we wouldn't need programmers or DBAs.

    There'll always be a need for DBAs to slap the programmers upside the head when they do it in the most boneheaded RBAR way imaginable. 😉

  • I've just been testing fragmentation on a table that has a clustered index on a guid (ouch!), not my design. But I've found that with an appropriate fill factor, in this case 90%, adding 100,000 rows to a table of 8 million rows fragments that index by just a fraction of a percent.

    I conclude that page splits can be minimized with an appropriate fill factor. I also defrag all indexes that are more than 10% fragmented on a weekly basis (a sketch of that check is at the end of this post).

    Too often I think the clustering on the identity column is done out of laziness - not spending the time to analyze the queries hitting the table to determine the best choice for clustering.
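
    For reference, a minimal version of that weekly check (run in the database in question; the index name in the rebuild line is a placeholder):

    -- Report average fragmentation for every index in the current database;
    -- anything over 10% is a candidate for the weekly rebuild/reorganize.
    SELECT OBJECT_NAME(ips.object_id) AS table_name,
           i.name                     AS index_name,
           ips.avg_fragmentation_in_percent
    FROM   sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') AS ips
    JOIN   sys.indexes AS i
           ON  i.object_id = ips.object_id
           AND i.index_id  = ips.index_id
    WHERE  ips.avg_fragmentation_in_percent > 10;

    -- Rebuild a specific index with the fill factor discussed above:
    -- ALTER INDEX CIX_MyGuidTable ON dbo.MyGuidTable REBUILD WITH (FILLFACTOR = 90);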

  • Richard Campbell says "...TagID (Int)...SampleDateTime (BigInt)...but I am almost convinced that the clustered index should be reversed, because SampleDateTime has much greater uniqueness, always increases (obviously), and we ALWAYS specify a DateTime range in our queries."

    First, I hope your "DateTime range" is in the form of two bigint variables. (Don't get me started on the people who put functions that produce constants in where clauses. ARRRG.)

    What would it take to convince you to not switch? (I am hard pressed to think of reasons why you wouldn't.)

    1. The clustered index's binary tree length would increase by 4/3 times the current size.

    2. You can build covering indexes for every selection criteria you use that starts with SampleDateTime as the first index field (The first covering index you build wipes out reason 1.)

    3. Convince yourself you need every record for a specific TagID and redesign your code to stop querying on datetime.

    What would it take to convince you to switch?

    1. All binary trees of all indexes only keep the first field defined in the index. That means each data page is read, leaf by leaf, from the beginning record until the range you are looking for is found for the current TagID value, and that process is repeated for every TagID value. Not so bad when you only want the first 1% of the entire table, because you scan only 1% of the table to read your data. But when the 1% date range you want falls in the 98-99% region of the date range, you have to scan 99% of your table to find that 1%.

    2. If you ever do get a need to retrieve all the records for one TagID value, just add an index or a covering index that starts with the TagID. (The covering index gets the SampleDateTime field for free, or as part of the cost of defining the index, however you want to look at it.)
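
    As a concrete (hedged) sketch of the SampleDateTime-led covering index mentioned above, reusing the hypothetical dbo.TagSample table sketched earlier in the thread:

    -- Led by SampleDateTime, with TagID named second for clarity (as noted
    -- later in the thread, the clustering key gets appended to a nonclustered
    -- index anyway). Add INCLUDE (...) for any extra columns the queries
    -- select, so the date-range queries never have to touch the base table.
    CREATE NONCLUSTERED INDEX IX_TagSample_SampleDateTime
        ON dbo.TagSample (SampleDateTime, TagID);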

  • don.schaeffer (5/25/2010)


    I've just been testing fragmentation on a table that has a clustered index on a guid (ouch!), not my design. But I've found that with an appropriate fill factor, in this case 90%, adding 100,000 rows to a table of 8 million rows fragments that index by just a fraction of a percent.

    Depends. I've seen a table go from 0% to 99.975% fragmentation in under two days, and that with a fill factor of 85%.

    I conclude that page splits can be minimized with an appropriate fill factor.

    Be careful of making broad conclusions based on a couple of tests.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Ken Lee-263418 (5/25/2010)


    1. All binary trees of all indexes only keep the first field defined in the index.

    Maybe I'm misunderstanding you, but...

    If you have a composite index (index defined on multiple key columns), the key columns are present at all levels of the index, leaf and non-leaf. Any include columns are only present at the leaf level.

    It's statistics where only the first column of the index contributes to the histogram.
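
    A quick way to see the statistics side of that, with placeholder object names:

    -- Composite nonclustered index: both key columns are present at every
    -- level of the index; the INCLUDE column exists only at the leaf.
    CREATE NONCLUSTERED INDEX IX_Orders_Customer_Date
        ON dbo.Orders (CustomerID, OrderDate)
        INCLUDE (TotalDue);

    -- The statistics object created for this index builds its histogram on
    -- the leading key column (CustomerID) only; the other key columns show
    -- up only in the density vector.
    DBCC SHOW_STATISTICS ('dbo.Orders', 'IX_Orders_Customer_Date');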

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Ken Lee-263418 (5/25/2010)


    First, I hope your "DateTime range" is in the form of two bigint variables. (Don't get me started on the people who put functions that produce constants in where clauses. ARRRG.)

    Rest assured I use only BigInt (for SampleDateTime) in any WHERE clauses.

    The only reason I am holding back on this is that it would involve ferreting around in a lot of queries that I didn't write and changing them.

    Your reply is enough to make me give it serious thought, I must say.
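
    For what it's worth, the sargable shape of such a range filter looks something like this (variable values and table name are invented):

    -- Compute the bigint boundaries once, outside the WHERE clause, rather
    -- than wrapping SampleDateTime in a conversion function.
    DECLARE @StartTime bigint = 634105152000000000,  -- example tick values
            @EndTime   bigint = 634105238400000000;

    SELECT TagID, SampleDateTime
    FROM   dbo.TagSample
    WHERE  SampleDateTime >= @StartTime
      AND  SampleDateTime <  @EndTime;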

  • Ken Lee-263418 (5/25/2010)


    2. You can build covering indexes for every selection criteria you use that starts with SampleDateTime as the first index field (The first covering index you build wipes out reason 1.)

    When you consider building such nonclustered "covering" indexes, consider that the columns of the clustered index are always appended to the end of each nonclustered index row. From BOL:

    If the table has a clustered index, the column or columns defined in the clustered index are automatically appended to the end of each nonclustered index on the table. This can produce a covered query without specifying the clustered index columns in the definition of the nonclustered index. For example, if a table has a clustered index on column C, a nonclustered index on columns B and A will have as its key values columns B, A, and C.
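
    A small sketch of that BOL example (invented table, int columns assumed):

    CREATE TABLE dbo.T (A int NOT NULL, B int NOT NULL, C int NOT NULL);
    CREATE CLUSTERED INDEX CIX_T_C ON dbo.T (C);
    CREATE NONCLUSTERED INDEX IX_T_BA ON dbo.T (B, A);

    -- Covered by IX_T_BA even though C never appears in its definition,
    -- because the clustering key (C) rides along in every nonclustered row.
    SELECT B, A, C FROM dbo.T WHERE B = 42;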

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (5/26/2010)


    Ken Lee-263418 (5/25/2010)


    2. You can build covering indexes for every selection criteria you use that starts with SampleDateTime as the first index field (The first covering index you build wipes out reason 1.)

    When you consider building such nonclustered "covering" indexes, consider that the columns of the clustered index are always appended to the end of each nonclustered index row. From BOL:

    However, explicitly adding the clustering key doesn't make SQL include it twice in the index. If you want the clustering key anywhere other than as the last column of the key (in the case of a non-unique index) or as an include column (in the case of a unique index), it needs to be explicitly specified in the index definition.
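
    Continuing the same sketch with invented names, the placement difference looks like this:

    -- Non-unique nonclustered index: the clustering key (C) is silently added
    -- to the end of the key, after B and A. To get C earlier in the key
    -- order, name it explicitly:
    CREATE NONCLUSTERED INDEX IX_T_B_C_A ON dbo.T (B, C, A);

    -- Unique nonclustered index: C is carried only as an include column
    -- unless it is named in the key.
    CREATE UNIQUE NONCLUSTERED INDEX UQ_T_BA ON dbo.T (B, A);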

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • GilaMonster (5/26/2010)


    don.schaeffer (5/25/2010)


    I've just been testing fragmentation on a table that has a clustered index on a guid (ouch!), not my design. But I've found that with an appropriate fill factor, in this case 90%, adding 100,000 rows to a table of 8 million rows fragments that index by just a fraction of a percent.

    Depends. I've seen a table go from 0% to 99.975% fragmentation in under two days, and that with a fill factor of 85%.

    I conclude that page splits can be minimized with an appropriate fill factor.

    Be careful of making broad conclusions based on a couple of tests.

    Gail, the appropriate fill factor is going to depend on a number of things, including the ratio of inserted rows to existing rows before defragmentation, and the row size. The larger the row size, the fewer new rows can be added before a page has to split.

    In your example you didn't specify how many rows you started with in the table, how many you added, nor the row size. These are all critical factors. In my case I have a row size under 100 bytes.
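
    As a rough back-of-envelope illustration (approximate numbers): a data page holds roughly 8,000 bytes of row space, so a 90% fill factor leaves on the order of 800 bytes free per page. With rows of around 100 bytes that is only about 7 or 8 inserts into any one page before a split becomes possible; with 1,000-byte rows it is essentially none.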

  • don.schaeffer (5/26/2010)


    In your example you didn't specify how many rows you started with in the table, how many you added, nor the row size. These are all critical factors. In my case I have a row size under 100 bytes.

    No, I didn't. I just wanted to point out that drawing broad conclusions ("I conclude that page splits can be minimized with an appropriate fill factor.") based on a couple of cases is dangerous.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Gail says "If you have a composite index (index defined on multiple key columns), the key columns are present at all levels of the index, leaf and non-leaf. Any include columns are only present at the leaf level.

    It's statistics where only the first column of the index contributes to the histogram. "

    I thought that was true too, until I was rudely and graphically proven wrong. It was my first time creating a partitioned view. There were 7 tables, each with a tinyint field constrained by a check constraint to one of the values 1 through 7. The second key field was a varchar(10) that was unique. In order for the partitioned view to work, the tinyint field MUST be in the primary key. Since both fields are stored in the binary tree and the tinyint was incredibly important, I put it first. (That bothered me because it went against something a teacher had said - that the first field should be fairly distinct - though he never said why.)

    WRONG! WRONG! WRONG! WRONG!

    ONLY the first field values are stored in the binary tree of ANY index. I was called in because my partitioned view was taking 3 seconds to insert or retrieve a value with only 50K records in the table. My group had hired a new DB developer who was called in as well. When I said, "Could it be..." I was told in no uncertain terms that was the reason why. Switching the fields in the clustered primary key put the response in the 3-10 millisecond range.

    Thankfully this was found in the test bed and never made it to production. Use the undocumented DBCC commands to read the index's binary tree data yourself.
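
    For context, a stripped-down sketch of the layout being described (all names invented; only two of the seven member tables shown):

    CREATE TABLE dbo.Part1
    (
        PartNo tinyint     NOT NULL CHECK (PartNo = 1),
        Code   varchar(10) NOT NULL,
        CONSTRAINT PK_Part1 PRIMARY KEY CLUSTERED (PartNo, Code)  -- the key order described above (later switched)
    );
    CREATE TABLE dbo.Part2
    (
        PartNo tinyint     NOT NULL CHECK (PartNo = 2),
        Code   varchar(10) NOT NULL,
        CONSTRAINT PK_Part2 PRIMARY KEY CLUSTERED (PartNo, Code)
    );
    GO
    CREATE VIEW dbo.AllParts AS
    SELECT PartNo, Code FROM dbo.Part1
    UNION ALL
    SELECT PartNo, Code FROM dbo.Part2;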

  • Richard Campbell says "The only reason I am holding back on this is that it would involve ferreting around in a lot of queries that I didn't write and changing them."

    Why change them? If they are using the date field as a selection criterion, switching the clustered fields gives you all of the benefit at the cost of 4 bytes per record in the binary tree of your clustered index. The only downside is for queries that use the current first field (TagID) alone as their selection criterion.

    You can easily ferret them out with "EXEC sp_depends 'tablename'", or get a bit more context around the referencing code with:

    "select name, case when loc<51 then substring(text,1,150) else substring(text,loc-50,150) end text

    from (select OBJECT_NAME(id) name, CHARINDEX('tablename',text) loc,text

    from dbo.syscomments

    where text like '%tablename%') d"
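
    On instances where the newer catalog views are available, a rough equivalent of that query (still with 'tablename' as the placeholder to replace) is:

    -- Same idea using sys.sql_modules instead of the deprecated syscomments:
    SELECT OBJECT_NAME(m.object_id) AS object_name,
           SUBSTRING(m.definition,
                     CASE WHEN CHARINDEX('tablename', m.definition) < 51
                          THEN 1
                          ELSE CHARINDEX('tablename', m.definition) - 50
                     END,
                     150) AS context
    FROM   sys.sql_modules AS m
    WHERE  m.definition LIKE '%tablename%';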
