Clustered Indexes? Sedimentary, my dear Watson

  • Comments posted to this topic are about the item Clustered Indexes? Sedimentary, my dear Watson

  • Heh... It certainly "depends" but try rebuilding indexes without a clustered index and see what happens. Keep in mind that indexes also "split" except they frequently do so by extent rather than by page.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Tony,

    Awesome point that I think too few people consider (or even know to consider). I always think about this when creating a new table, but I don't always know exactly how that table will be used in the "wild". I've even re-clustered tables upon seeing how they're typically queried.

    Even when I recluster, I recognize that there will be a price to pay with inserts. I personally haven't dug deeply enough to know when that price is worse than the SELECT improvements. I generally just consider the situation -- more INSERTs or more SELECTs?

    Unfortunately, I work with a crew that feels that every table should have a clustered IDENTITY index that "resolves any ambiguity" -- even though 90% of the time, the tail end of a relation will NEVER query that relational ID!

    In general, our DBs are small enough that that is an OK approach. HOWEVER, we have one MASSIVE DB that has benefited from some reclustering, and could benefit more, but there is resistance due to the above beliefs. We'll see how that sorts out over time, as we approach the 5-6 million-plus records (which is fast approaching).

    -----------

    Wish I had a clever tagline here... D.

    Just thinking -- on the tagline -- what do you all think of "Inspiration / Perspiration -- smell is in the nose of the beholder" or something like that? Thanks! D.

  • Well I'm strongly in the 'clustered index to organise the table' camp, and do have reasons.

    Firstly, note that these reasons are for OLTP-type systems. Lots of changes, lots of fast, small queries. The indexing rules for decision support/OLAP can be very different because of the radically different usage pattern.

    There can only be one clustered index, and it's going to be the largest index that exists for that table. Hence if it needs rebuilding, it will take the longest and have the greatest log impact, so I want to minimise the need to rebuild.

    Because there can only be one clustered index, I personally prefer to use the cluster mostly to organise the table (hence the ever-increasing and non-changing attributes) and use the nonclustered indexes for queries.

    If I can get the cluster so it's quite narrow, unique (or nearly so), ever increasing, non-changing and have it useful for queries as well, that's absolute first prize. My favourite here is cluster on a date inserted column where the table is often queried by date. Many transaction-type tables fit in this category.
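    A minimal sketch of that "first prize" pattern (the table and column names here are hypothetical, not from the article):

```sql
-- Hypothetical transaction table: narrow, ever-increasing, non-changing
-- clustering key on the date-inserted column, which is also queried on.
CREATE TABLE dbo.AccountTransaction
(
    TransactionID  int IDENTITY(1,1) NOT NULL,
    DateInserted   datetime2(3)      NOT NULL
        CONSTRAINT DF_AccountTransaction_DateInserted DEFAULT (SYSDATETIME()),
    AccountID      int               NOT NULL,
    Amount         decimal(18,2)     NOT NULL
);

-- Cluster on DateInserted (nearly unique, ever increasing, non-changing);
-- keep the primary key as a nonclustered index for lookups by ID.
CREATE CLUSTERED INDEX CIX_AccountTransaction_DateInserted
    ON dbo.AccountTransaction (DateInserted);

ALTER TABLE dbo.AccountTransaction
    ADD CONSTRAINT PK_AccountTransaction
    PRIMARY KEY NONCLUSTERED (TransactionID);
```

    Date-range queries then seek on the cluster directly, while point lookups by TransactionID go through the nonclustered primary key.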

    If the system has lots and lots of inserts (OLTP), I need those inserts as fast as possible. Hence I do not want frequent page splits on the cluster, nor do I want data moving around in the table if the clustering key gets updated. Both of those, if occurring enough, can have nasty effects on insert performance.

    Heaps have some interesting downsides. Until 2008, they couldn't be rebuilt, and there's occasionally odd behaviour around reuse of space within a heap. Plus those forwarding pointers.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Hi,

    I think we should not generalize; it is purely a case-by-case decision. If an entity has an ordered attribute and that attribute makes sense in the underlying domain, then we don't need an identity column for the clustered index (assuming the unique key is not too long). But if the entity only has a time-based attribute, then the better choice is an identity column (sedimentary) for the clustered index.

    E.g. in the banking domain, take the Savings Account Number attribute of the Savings entity: in most cases the data needs to be fetched in Savings Account Number order, so Savings Account Number can be used for the clustered index. If new account numbers are not created in sequence, though -- say a prefix/suffix is applied based on savings type or some other attribute -- then we can again go back to an identity column for the clustered index.

    For savings transactions, on the other hand, the data normally needs to be arranged by transaction time, so we can have an identity column and use the same for the clustered index.

    thanks

    Nirmal

  • Timely post for me. Two days ago, I was troubleshooting a deadlock issue and found that deadlocks were centered around updates to a table which had a 3 column clustered index and all 3 columns were being updated.

    In a case where the clustered key is likely to change, I think it's a bad idea to cluster.

    That led me to the conclusion that a particular unique composite key is ok to cluster if it rarely changes. In the absence of a suitable candidate, adding an Identity column or a uniqueidentifier with a default of newsequentialid() is my next choice for clustered index.
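    As a sketch of that fallback choice (table and column names are hypothetical), a sequential GUID default keeps inserts append-only at the end of the index, unlike plain NEWID():

```sql
-- Hypothetical table with no stable natural key: NEWSEQUENTIALID() is
-- only valid as a column DEFAULT, and generates ever-increasing GUIDs,
-- avoiding the random page splits that NEWID() would cause.
CREATE TABLE dbo.Document
(
    DocumentID  uniqueidentifier NOT NULL
        CONSTRAINT DF_Document_DocumentID DEFAULT (NEWSEQUENTIALID()),
    Title       nvarchar(200)    NOT NULL,
    CONSTRAINT PK_Document PRIMARY KEY CLUSTERED (DocumentID)
);
```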

    - Sam

    I come from a data analyst background, but this has got me into DBA and database design work. My area of work is demographic statistics, so I've built up a database to store large census datasets and other administrative datasets. Each dataset usually has its own table, although many tables, particularly the population table, contain multiple datasets.

    Anyway, a while ago I realised that my clustered identity key both did and didn't make logical sense, i.e. as per your article. It didn't sit well with me because, to me, the data should be grouped according to the dataset and then mostly by the various demographic fields, i.e. age, gender, etc.

    But the reality is that I only upload datasets one at a time, and always in sorted order. Therefore, by coincidence, my data was ordered logically anyway -- not by design but more by good fortune.

    I've found that my queries are always by dataset, so the first thing the query engine does is pull out the range of IDs that fit the dataset (I have non-clustered indexes through which I presume the range gets easily identified). After that it does table joins by ID, and then (or possibly before) the column filters get applied.

    I realised then that the clustered identity column works for me in this instance; however, if my database weren't so OLAP-like, it could be less than ideal. I love non-composite primary keys -- they are just so easy to write queries against -- but I wouldn't put that above server performance.

    Thanks for the article; it is good to read about the pros and cons, especially with regard to page splitting and inserts, updates and deletes.

    What? NOBODY mentions the major reason why identity fields were frowned on? Has this reason disappeared? Each insert goes to the same page, at least until the page is filled, and even in 2005 that will cause a page lock until the insert finishes. (No more inserts allowed until the transaction commits.) It caused the term "hot spot" to be invented. There is also the problem with the old datetime format when trying to insert by date and you have 20 records waiting to insert in the same 3-millisecond gap (when your key was just a datetime field).

    Since when does sequential mean organizing data together? We've had a two-column unique, clustered primary key in one table. The first column identifies the client (a value found in another table); the second is definitely unique in the table because it is always generated with a newid() value on insert. This definitely organizes the data together, because everything is retrieved for a client when that client shows activity. Now that's organizing data together using a completely random identification process.

    I feel for the person with the deadlocking problem. That's one argument, among a host of reasons, why artificial keys like identity fields are better than natural keys (keys whose values have intrinsic meaning).

    I have a small 12 GB database to look after. I am not a DBA; I am actually one of the developers who built this database, and I know by now how popular developers are amongst DBAs, but I base my indexing on the following.

    1. A primary key is the clustered index

    2. All other indexes are nonclustered since you can anyway only have one clustered index.

    The reason why is, as someone has said already, that with a clustered index the data gets ordered according to the index columns, and therefore the data gets added that way in the index. Less scanning.

    I use identity columns only when absolutely necessary. :-D

    Manie Verster
    Developer
    Johannesburg
    South Africa

    I can do all things through Christ who strengthens me. - Holy Bible
    I am a man of fixed and unbending principles, the first of which is to be flexible at all times. - Everett McKinley Dirksen (Well, I am trying. - Manie Verster)

    We manage a system where there are thousands of reads for every write operation, so we use clustered indexes on the keys used for the majority of joins (not always the primary key). Performance is woeful if we rely on ordinary nonclustered indexes for these key join relationships. We also have the added issue that the majority of key values in the system carry additional meaning, i.e. the range of numbers (INT) also implies business group ownership, which means clustering has the side effect of storing all the groups in order. Rather than being a burden, this happenstance appears to offer many benefits.

  • It's good to see such sensible advice here. Yes, a table should always have a clustered index, and IMO an Identity helps solve many of the issues with page splits etc that are referenced in the article.

    The key to all of this is to ensure that the Identity column is there, and most importantly that it has some meaning within the database.

    If you can design your database around identity columns, and have them actually mean something in the context of your system, then you have a clustered index that can be used for searches and joins, whilst also having it organise your table's data.

    Personally, I have used this approach on some very large systems. It works.

  • How I do things:

    The choice of IDENTITY over something else really depends on whether there is already a user-defined key. For example, in trucking, the Terminal table would use the user-defined TerminalID (NV5) as the PK and not an identity column. CustomerID, however, would be an IDENTITY column.

  • We have a DB which is heavily queried in many bizarre ways (user-definable queries). After a lot of profiling, we found that the IDENTITY of one table was NOT the best clustered index, it was three other fields which are almost always used together, so we changed the Clustered index over.

    The next morning, the system was effectively frozen. According to the profiler, this should have been the 'perfect' solution. Unfortunately what we'd neglected to fully take into account was that although more than 60% of the queries were SARGed on that triplet, 40% were on other columns, including the ID, and the "Key Lookup" cost which was being incurred was now amazingly prohibitive.

    For OLTP systems that drive UIs, the main problem is that the app typically pulls out most of the columns for display. While making the clustered index something a bit complex (or non-standard) fixes the bulk of the queries, it can severely impact other queries, especially if the UI allows searches such as Search By multiple (optional) arguments.
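    One way to blunt that Key Lookup cost without moving the clustered index back is a covering nonclustered index; the index and column names below are hypothetical, not from this system:

```sql
-- Hypothetical covering index: queries that still SARG on OrderID avoid
-- Key Lookups because the displayed columns are carried in the leaf
-- level via INCLUDE, at the cost of extra storage and write overhead.
CREATE NONCLUSTERED INDEX IX_Orders_OrderID_Covering
    ON dbo.Orders (OrderID)
    INCLUDE (CustomerName, OrderDate, OrderStatus, OrderTotal);
```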

    We also noticed that once the CI was moved, we had to add a lot of new statistics to 'fill in' for the old CI.

    For this reason, unless there's a good reason to do otherwise, we take the "hot spot" hit (low trans/sec, so not too bad). It may not be perfect, but if your system grows 'organically' (i.e. the developers add queries and screens without a Change Management Board), a non-IDENTITY CI may do more harm than good.

    I agree with one of the comments above that said we need to avoid over-generalization. There are many design factors that need to be considered. I personally believe that you need to focus on what adds value to the process. If adding an identity column as primary key serves some functional purpose, then by all means use it.

    If you put it there just for the sake of an easy-to-use ordering and it serves no real value, you may want to look more closely and see if there are better candidates, 'natural keys', that could be utilized.

    In general, I like the identity-based clustered index, but only if it has a good reason to exist. Thus, when I design a system, I plan to use them in a meaningful way so that they add value.

    I agree with the posts that say this "should not be generalized". In my case I have a few sales tables with millions of records. We load data early in the morning and have a few inserts/updates throughout the day, but heavy reads through reports. Each table has a unique identity column, but I have put the clustered index on the date and location columns (composite), with a few more nonclustered indexes on other columns. This arrangement serves best in our scenario.
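    A rough sketch of that arrangement (the table and column names are hypothetical): the composite clustered index serves the date/location reporting queries, while the identity column is kept as a nonclustered primary key.

```sql
-- Hypothetical sales table: cluster on the reporting columns, keep the
-- unique identity column as a nonclustered primary key.
CREATE TABLE dbo.Sales
(
    SalesID     bigint IDENTITY(1,1) NOT NULL,
    SaleDate    date          NOT NULL,
    LocationID  int           NOT NULL,
    Amount      decimal(18,2) NOT NULL,
    CONSTRAINT PK_Sales PRIMARY KEY NONCLUSTERED (SalesID)
);

-- Non-unique composite cluster: report queries by date range and
-- location become range seeks on the clustered index.
CREATE CLUSTERED INDEX CIX_Sales_Date_Location
    ON dbo.Sales (SaleDate, LocationID);
```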

Viewing 15 posts - 1 through 15 (of 49 total)
