Heap or clustered, What to do ...

  • Well - I sure feel silly. I thought all these replies (above) were going to reference the original posters. Sadly, I am a bit new to posting here. I apologize for the error.

  • So is XXXXXXXXXXX by itself unique across all sites?

    Or do you need the YYY and ZZZ values to get a unique key? Do you always specify YYY and ZZZ when querying the table?

    If the latter, you should cluster the table on:

    YYYZZZXXXXXXXXXXX

    which makes each Sales Point and Site ever-ascending, and restricts each query to a contiguous range of values that match the clustered index keys.

    But, as others have noted, don't combine them into a single column. Just make a 3-column clustering key: (Ycol, Zcol, Xcol).

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • SeanNerd (11/13/2014)


    My question is about indexing strategy for the following scenario:

    I have a table where we cannot use the typical int identity primary key column. The table can have millions of rows. Data for this table is generated by client applications that are off-line, so we used the following strategy to create our keys for the TransNo bigint column:

    XXXXXXXXXXXYYYZZZ

    XXXXXXXXXXX is an incrementing value tracked locally at the client applications

    YYY is the SalesPointId 000 to 999

    ZZZ is the SiteId 000-999

    This way we can generate keys off-line, and there aren't any primary key violations during the consolidation process when the data from the clients is merged together into the main database. Note - the client databases are old-school .dbf files, so we can't use a distributed (ranged) identity column strategy.

    I have read that placing a clustered index on this column is bad since the data is not inserted in an "ever increasing" manner (5077001 can come in before 4012001, for instance). I have also read that since clustered indexes physically reorganize the data on disk in this heavily inserted table, we would likely see performance problems as the disk system tries to keep up with SQL Server's demand.

    The idea of having an increasing value for a clustered index is for an optimal, cluster. But it's not the only way to go, nor a requirement for the clustered index. Putting a clustered index on columns other than an increasing one just means you may deal with more page splits in your cluster, something to be avoided, but not the end of the world.

    What we have done before (from former research - perhaps faulty research) was to create a bogus column that is an int identity, and create a clustered index on that column. It wasn't used for anything - no foreign keys, no lookups based on this value - but we put it there because we read that every table should have a clustered index.

    Yes, every table should have a clustered index, but, it should be in the right place. Just putting a cluster on there to have a cluster is actually a poor choice. Instead, the goal has to be to put the cluster in the right place. My argument for this is to put it on the most frequent access path to the data. If this is an identity style primary key, great. If it's a foreign key from another table, fine. Whatever the primary access path to the data is, generally, you're going to get the best performance out of the cluster by using that because a clustered index stores the data for the table.

    Putting the cluster on a column that isn't used is a waste of the clustered index.

    I recently read that having a clustered index might actually cause a performance problem - because the Row Locator is kept with the clustered index - so a lookup on our meaningful key would then have to look in the clustered index to get the Row Locator, and then it would go to the row. If there was no clustered index, then it would use the internal RowLocater and skip the extra lookup - and actually be faster (Simple Talk Link[/url])

    If your clustered index is not unique, an additional uniquifier is added to make it so. Again, the data is stored with the cluster, so it has to be able to retrieve it uniquely from the disk. In general it's better to have a unique clustered index, but you don't have to.

    From the sounds of things, your column is a good candidate for a clustered key. You might want to consider reordering the numbers so that instead of what is effectively a random number at the start you get put the two classifying numbers at the start. This would lead to clustering of inserts within certain ranges, reducing the fragmentation that the random nature of that leading number would cause.

    I also understand that tables without clustered indexes (heaps) are more prone to fragmentation and can't be physically reordered, as there is no clustered index. Though, I'm not sure how important that is.

    For point lookups, if you only ever retrieve a single row, fragmentation doesn't matter much. But as soon as you start to scan indexes at the page level, fragmentation means that it has to jump all over the disk, back and forth, to retrieve data. Even on an SSD this is costly.

    So - should I make our TransNo column a primary key (but not a clustered index)? Should I put in the unused int identity column and put a clustered index on that? Should I just make it a clustered primary key and suffer with the IO issues?

    I'm anxious to hear what you think. Thanks in advance for you thoughts.

    - Sean

    I lean towards yes, making transno the primary key and the clustered index, if that's how you retrieve your data most of the time. But, as I said above, I'd probably rearrange the key otherwise you're looking at quite a lot of fragmentation due to page splits because of the somewhat random nature of the data you're inserting. The fact that this column will be unique will help it as a clustered index. Also the selectivity of the data will be quite high. The histogram might be a small issue requiring somewhat frequent statistics updates, but probably no worse than anything you're currently dealing with.

    A couple of things I haven't heard mentioned. What do your statistics look like currently and how are you maintaining them? Are you doing index fragmentation maintenance currently? Have you looked at the execution plans for your queries to ensure all these indexes are being used appropriately currently?

    A note on posting, you have this question on the 2012 forum, but you're telling us you're limited to only 2005. I'd suggest posting 2005 questions in the 2005 forums will get you more 2005 answers and fewer suggestions you can't support.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • SeanNerd (11/14/2014)


    Well - I sure feel silly. I thought all these replies (above) were going to reference the original posters. Sadly, I am a bit new to posting here. I apologize for the error.

    Not a problem. Hitting the Reply button replies to the thread. Hitting the Quote button puts the previous post into your post, making the response a little more clear.

    Welcome to the forum.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • And, i agree with the others, just making the three columns into a clustered key, especially if it's unique will work well. But, in that case, reverse what I said about making the other two columns leading. They won't be very selective and would stink as the first column (the only one used in the histogram), so you'd be better off with the current order.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • SeanNerd (11/14/2014)


    Well - I sure feel silly. I thought all these replies (above) were going to reference the original posters. Sadly, I am a bit new to posting here. I apologize for the error.

    No problem. It's a learning experience. 🙂 To make it so the reference to the post you're responding to is include, use QUOTE instead of just REPLY.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (11/15/2014)


    SeanNerd (11/14/2014)


    Well - I sure feel silly. I thought all these replies (above) were going to reference the original posters. Sadly, I am a bit new to posting here. I apologize for the error.

    No problem. It's a learning experience. 🙂 To make it so the reference to the post you're responding to is include, use QUOTE instead of just REPLY.

    Heh... and here's another problem. Grant already posted this suggestion and I forgot to continue reading down before posting. :blush:

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 7 posts - 16 through 21 (of 21 total)

You must be logged in to reply to this topic. Login to reply