Heap or clustered: what to do?

  • My question is about indexing strategy for the following scenario:

    I have a table where we cannot use the typical int identity primary key column. The table can have millions of rows. Data for this table is generated by client applications that are off-line, so we used the following strategy to create our keys for the TransNo bigint column:

    XXXXXXXXXXXYYYZZZ

    XXXXXXXXXXX is an incrementing value tracked locally at the client applications

    YYY is the SalesPointId, 000-999

    ZZZ is the SiteId, 000-999

    This way we can generate keys off-line, and there aren't any primary key violations during the consolidation process when the data from the clients is merged together into the main database. Note - the client databases are old-school .dbf files, so we can't use a distributed (ranged) identity column strategy.
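
    To illustrate, a quick sketch of the arithmetic behind the combined key (the variable names are made up for the example):

        -- Sketch only: TransNo = XXXXXXXXXXX * 1,000,000 + YYY * 1,000 + ZZZ
        DECLARE @Counter      bigint,
                @SalesPointId int,
                @SiteId       int;

        SELECT @Counter = 5, @SalesPointId = 77, @SiteId = 1;

        -- 5 * 1,000,000 + 77 * 1,000 + 1 = 5077001
        SELECT @Counter * 1000000 + @SalesPointId * 1000 + @SiteId AS TransNo;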

    I have read that placing a clustered index on this column is bad since the data is not inserted in an "ever increasing" manner (5077001 can come in before 4012001, for instance). I have also read that since clustered indexes physically reorganize the data on disk in this heavily inserted table, we would likely see performance problems as the disk system tries to keep up with SQL Server's demand.

    What we have done before (from former research - perhaps faulty research) was to create a bogus column that is an int identity, and create a clustered index on that column. It wasn't used for anything - no foreign keys, no lookups based on this value - but we put it there because we read that every table should have a clustered index.

    I recently read that having a clustered index might actually cause a performance problem - because the Row Locator is kept with the clustered index - so a lookup on our meaningful key would then have to look in the clustered index to get the Row Locator, and then it would go to the row. If there was no clustered index, then it would use the internal Row Locator (the RID) and skip the extra lookup - and actually be faster (Simple Talk link).

    I also understand that tables without clustered indexes (heaps) are more prone to fragmentation and can't be physically reordered, as there is no clustered index. Though, I'm not sure how important that is.

    So - should I make our TransNo column a primary key (but not a clustered index)? Should I put in the unused int identity column and put a clustered index on that? Should I just make it a clustered primary key and suffer with the IO issues?

    I'm anxious to hear what you think. Thanks in advance for your thoughts.

    - Sean

  • SeanNerd (11/13/2014)


    My question is about indexing strategy for the following scenario:

    I have a table where we cannot use the typical int identity primary key column. The table can have millions of rows.

    use bigint

    Data for this table is generated by client applications that are off-line, so we used the following strategy to create our keys for the TransNo bigint column:

    XXXXXXXXXXXYYYZZZ

    XXXXXXXXXXX is an incrementing value tracked locally at the client applications

    YYY is the SalesPointId, 000-999

    ZZZ is the SiteId, 000-999

    consider using a sequence object instead of identity
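
    A minimal sketch of that suggestion - it needs SQL Server 2012 or later, and the object name here is hypothetical:

        -- A sequence lives apart from any one table, so several tables
        -- or processes can draw from the same number stream
        CREATE SEQUENCE dbo.TransNoSeq
            AS bigint
            START WITH 1
            INCREMENT BY 1
            CACHE 100;      -- hand out values in blocks for speed

        -- fetch the next value at insert time
        SELECT NEXT VALUE FOR dbo.TransNoSeq AS NextTransNo;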

    This way we can generate keys off-line, and there aren't any primary key violations during the consolidation process when the data from the clients is merged together into the main database. Note - the client databases are old-school .dbf files, so we can't use a distributed (ranged) identity column strategy.

    I have read that placing a clustered index on this column is bad since the data is not inserted in an "ever increasing" manner (5077001 can come in before 4012001, for instance). I have also read that since clustered indexes physically reorganize the data on disk in this heavily inserted table, we would likely see performance problems as the disk system tries to keep up with SQL Server's demand.

    the reorg happens when you create the index or rebuild it during maintenance, not continuously

    What we have done before (from former research - perhaps faulty research) was to create a bogus column that is an int identity, and create a clustered index on that column. It wasn't used for anything - no foreign keys, no lookups based on this value - but we put it there because we read that every table should have a clustered index.

    most tables should have (actually, *be*) a CI except in some DW staging situations. The thing is, since every NCI has the CI key in it, a small CI key (int or bigint) keeps the NCIs smaller.

    I recently read that having a clustered index might actually cause a performance problem - because the Row Locator is kept with the clustered index - so a lookup on our meaningful key would then have to look in the clustered index to get the Row Locator, and then it would go to the row. If there was no clustered index, then it would use the internal Row Locator (the RID) and skip the extra lookup - and actually be faster (Simple Talk link).

    right about the lookup, but the advantages usually make it worth the (very small) price

    I also understand that tables without clustered indexes (heaps) are more prone to fragmentation and can't be physically reordered, as there is no clustered index. Though, I'm not sure how important that is.

    depends on your application, but usually leads to more table scans with the disk heads jumping all over creation

    So - should I make our TransNo column a primary key (but not a clustered index)? Should I put in the unused int identity column and put a clustered index on that? Should I just make it a clustered primary key and suffer with the IO issues?

    make transno a PK backed by a CI and use a sequence object for it
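
    In DDL, that recommendation looks something like this sketch (the table name and the extra column are placeholders):

        -- TransNo as the primary key, backed by the clustered index
        -- (CLUSTERED is the default for a PK; spelled out here for clarity)
        CREATE TABLE dbo.Trans
        (
            TransNo   bigint   NOT NULL,
            TransDate datetime NOT NULL,   -- placeholder payload column
            CONSTRAINT PK_Trans PRIMARY KEY CLUSTERED (TransNo)
        );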

    I'm anxious to hear what you think. Thanks in advance for your thoughts.

    - Sean

    Gerald Britton, Pluralsight courses

  • g.britton (11/13/2014)


    SeanNerd (11/13/2014)


    My question is about indexing strategy for the following scenario:

    I have a table where we cannot use the typical int identity primary key column. The table can have millions of rows.

    use bigint

    Int goes up over 2 billion. Bigint is in no way necessary for tables with 'millions of rows'

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • You should have a clustered index. The storage engine is built around tables having clustered indexes.

    Unless you're doing millions of single-row inserts per second, you're unlikely to hit problems from hotspots. Test, though, and remember you can always reduce the fill factor.
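
    For example, a sketch with hypothetical table and index names:

        -- leave 10% free space on leaf pages to absorb the out-of-order inserts
        ALTER INDEX PK_Trans ON dbo.Trans
            REBUILD WITH (FILLFACTOR = 90);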

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • g.britton (11/13/2014)


    SeanNerd (11/13/2014)


    My question is about indexing strategy for the following scenario:

    I have a table where we cannot use the typical int identity primary key column. The table can have millions of rows.

    use bigint

    I would actually recommend using 3 different columns that can be a single primary key. 1000 values might seem enough, but you might find they're not. Also, it will make your queries easier to code.

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • GilaMonster (11/13/2014)


    g.britton (11/13/2014)


    SeanNerd (11/13/2014)


    My question is about indexing strategy for the following scenario:

    I have a table where we cannot use the typical int identity primary key column. The table can have millions of rows.

    use bigint

    Int goes up over 2 billion. Bigint is in no way necessary for tables with 'millions of rows'

    And, you could always seed the table with a negative value so that it starts at -2 billion and can then increment to positive 2 billion, incorporating 4 billion all told.
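
    A sketch of that, with hypothetical names:

        -- seed the identity at int's minimum to get the full ~4.29 billion range
        CREATE TABLE dbo.Trans
        (
            TransId   int IDENTITY(-2147483648, 1) NOT NULL
                      CONSTRAINT PK_Trans PRIMARY KEY CLUSTERED,
            TransDate datetime NOT NULL   -- placeholder payload column
        );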

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Luis Cazares (11/13/2014)


    g.britton (11/13/2014)


    SeanNerd (11/13/2014)


    My question is about indexing strategy for the following scenario:

    I have a table where we cannot use the typical int identity primary key column. The table can have millions of rows.

    use bigint

    I would actually recommend using 3 different columns that can be a single primary key. 1000 values might seem enough, but you might find they're not. Also, it will make your queries easier to code.

    I would second this suggestion. You can use a computed column if needed to get the combination out in one piece.
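
    Something along these lines - the column names are just for illustration:

        -- three-part natural key, plus a persisted computed column that
        -- reassembles the combined value when one piece is handy
        CREATE TABLE dbo.Trans
        (
            Counter      bigint   NOT NULL,  -- XXXXXXXXXXX
            SalesPointId smallint NOT NULL,  -- YYY
            SiteId       smallint NOT NULL,  -- ZZZ
            TransNo AS (Counter * 1000000 + SalesPointId * 1000 + SiteId) PERSISTED,
            CONSTRAINT PK_Trans PRIMARY KEY (Counter, SalesPointId, SiteId)
        );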

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • GilaMonster (11/14/2014)


    Luis Cazares (11/13/2014)


    g.britton (11/13/2014)


    SeanNerd (11/13/2014)


    My question is about indexing strategy for the following scenario:

    I have a table where we cannot use the typical int identity primary key column. The table can have millions of rows.

    use bigint

    I would actually recommend using 3 different columns that can be a single primary key. 1000 values might seem enough, but you might find they're not. Also, it will make your queries easier to code.

    I would second this suggestion. You can use a computed column if needed to get the combination out in one piece.

    And I third it. Something like XXXXXXXXXXXYYYZZZ will eventually start working against you.

    XXXXXXXXXXX, YYY, ZZZ are distinct pieces of information - keep them separate (as per rules of normalisation).

    If you're worried about fragmentation, stick an int/bigint identity column on the table and cluster on that, but create your PK across 3 fields.
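
    As a sketch, with hypothetical names:

        -- identity column takes the clustered index; the natural
        -- three-part key becomes a nonclustered primary key
        CREATE TABLE dbo.Trans
        (
            TransId      bigint IDENTITY(1, 1) NOT NULL,
            Counter      bigint   NOT NULL,
            SalesPointId smallint NOT NULL,
            SiteId       smallint NOT NULL,
            CONSTRAINT PK_Trans PRIMARY KEY NONCLUSTERED (Counter, SalesPointId, SiteId)
        );

        CREATE UNIQUE CLUSTERED INDEX CIX_Trans ON dbo.Trans (TransId);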

  • Gazareth (11/14/2014)


    GilaMonster (11/14/2014)


    Luis Cazares (11/13/2014)


    g.britton (11/13/2014)


    SeanNerd (11/13/2014)


    My question is about indexing strategy for the following scenario:

    I have a table where we cannot use the typical int identity primary key column. The table can have millions of rows.

    use bigint

    I would actually recommend using 3 different columns that can be a single primary key. 1000 values might seem enough, but you might find they're not. Also, it will make your queries easier to code.

    I would second this suggestion. You can use a computed column if needed to get the combination out in one piece.

    And I third it. Something like XXXXXXXXXXXYYYZZZ will eventually start working against you.

    XXXXXXXXXXX, YYY, ZZZ are distinct pieces of information - keep them separate (as per rules of normalisation).

    If you're worried about fragmentation, stick an int/bigint identity column on the table and cluster on that, but create your PK across 3 fields.

    ...and if your table is ever referenced in a foreign key, the ID column would be a handy surrogate key for that purpose.

    Gerald Britton, Pluralsight courses

    g.britton - thanks very much for your views. I am using bigint. (Actually, I'm forced to use numeric(17, 0) because our FoxPro clients don't understand bigints.) I am pretty sure the physical arrangement of records occurs continuously, though. Pages may get fragmented as they fill up and split, but with a clustered index the physical placement happens during inserts, not just at index creation or rebuild time.

    You say that the advantages make it worth the very small price - but you didn't say what the advantages were... unless I missed it.

    I cannot use sequence objects because I must maintain compatibility with SQL Server 2005 for as long as Microsoft supports it.

    @gilamonster - I have to use bigint because of the way the keys are constructed, as I mentioned in the post. If I used ints, I would only be able to get 2147 unique values of the incrementing portion per sales point, since int tops out at 2,147,483,647 = 2147 * 1,000,000 + 483,647 (actually 2146 for any sales point with an id > 483).

    Besides, the question was about indexing, not data type. Thanks anyway.

    @gilamonster - Well - we are certainly not doing millions per second, but surely tens or hundreds per second. Finding the breaking point will certainly be possible with lots and lots of testing time - which I'm willing to put in. I just wanted some sort of definitive reason why clustered will be better in this case.

    @Luis Cazares - Sorry if I confused anyone - we are using bigints. We did discuss three separate key columns, and maybe this is the right way. We avoided it because of the crazy joins we would have to write for every query that touches the transaction table. As I write this, it seems like it could be construed as a lazy way of thinking. But there are hundreds of places where we join to the transaction table, and that is only one table that would have the compound keys - I would say something like 50 to 100 tables would have keys like this, including some linking tables.

    Thanks, Grant - but we are using bigints, and the question is really about whether or not to use clustered indexing - not data type.

    @gilamonster - Interesting idea - I never thought of a combined computed column! Thanks. I will review this idea with the team.
