Non-Clustered Primary Key and Non-Unique Clustered Index Questions

  • I'm working on a table design and I think the table will have a non-clustered primary key, but I'm wondering how that's implemented behind-the-scenes.

    Suppose my table looks like this (sort of, the other_data is actually about 20 other columns, all containing fixed-length data, some of it nullable):

    CREATE TABLE test (

    thing_id CHAR(15) NOT NULL PRIMARY KEY NONCLUSTERED,

    customer_id CHAR(8) NOT NULL,

    other_data CHAR(40) NOT NULL,

    insert_timestamp DATETIME2(0) NOT NULL DEFAULT SYSDATETIME()

    );

    CREATE CLUSTERED INDEX test_idx ON test(customer_id, thing_id);

    This particular table is re-populated nightly (effectively, dropped and re-created) and most of the prior day's rows are not kept. During a day, it has the following characteristics:

    - Heavy insert activity, one row at a time, the key is in random sort order, the end-of-day table will far exceed the beginning-of-day table in size

    - Heavy update activity, one row at a time, always using the thing_id in the where clause, should be update in place because only fixed-length, non-null columns are updated, and never the thing_id or customer_id

    - Heavy select activity, always with customer_id in the where clause, selecting all of the columns

    - No deletes

    Given the activity, it seems that the single-row update does not benefit sufficiently from a clustered primary key, while the select should get enormous benefit of reading from entire and/or contiguous pages every time. Assume that the indexes are re-created nightly with an appropriate fill factor.

    Out of curiosity, given the table above, what does the non-clustered unique index (the primary key) have in the rows of its index pages? It's supposed to have the index columns followed by the clustered index columns. So is it composed of (thing_id, customer_id, thing_id)? Does it repeat thing_id or does it only include it once, recognizing that it's already in the index row? And is there SQL to run to figure out how it build the index rows?

    As for the clustered index, besides starting it with customer_id, I don't know what else (if anything) should be in it. Options include:

    - customer_id, with uniqueifier

    - customer_id, thing_id (no uniqueifier)

    - customer_id, insert_timestamp, with uniqueifier (matches how the client will usually sort the results, but the SELECT itself does not have an ORDER BY)

    - customer_id, insert_timestamp, thing_id (no uniqueifier)

  • Stephanie Giovannini (4/24/2015)


    I'm working on a table design and I think the table will have a non-clustered primary key, but I'm wondering how that's implemented behind-the-scenes.

    Suppose my table looks like this (sort of, the other_data is actually about 20 other columns, all containing fixed-length data, some of it nullable):

    CREATE TABLE test (

    thing_id CHAR(15) NOT NULL PRIMARY KEY NONCLUSTERED,

    customer_id CHAR(8) NOT NULL,

    other_data CHAR(40) NOT NULL,

    insert_timestamp DATETIME2(0) NOT NULL DEFAULT SYSDATETIME()

    );

    CREATE CLUSTERED INDEX test_idx ON test(customer_id, thing_id);

    This particular table is re-populated nightly (effectively, dropped and re-created) and most of the prior day's rows are not kept. During a day, it has the following characteristics:

    - Heavy insert activity, one row at a time, the key is in random sort order, the end-of-day table will far exceed the beginning-of-day table in size

    - Heavy update activity, one row at a time, always using the thing_id in the where clause, should be update in place because only fixed-length, non-null columns are updated, and never the thing_id or customer_id

    - Heavy select activity, always with customer_id in the where clause, selecting all of the columns

    - No deletes

    Given the activity, it seems that the single-row update does not benefit sufficiently from a clustered primary key, while the select should get enormous benefit of reading from entire and/or contiguous pages every time. Assume that the indexes are re-created nightly with an appropriate fill factor.

    Just remember that al the UPDATE activity is going to require a read as well. And, if the primary key is nonclustered, then a lookup to the clustered index to find the row location will required for every single UPDATE.

    Out of curiosity, given the table above, what does the non-clustered unique index (the primary key) have in the rows of its index pages? It's supposed to have the index columns followed by the clustered index columns. So is it composed of (thing_id, customer_id, thing_id)? Does it repeat thing_id or does it only include it once, recognizing that it's already in the index row?

    No, it won't store that value twice because it doesn't have to. But, if the clustered index is created as you have it above, since it's not marked as UNIQUE, it's going to add an internal column to make the key unique. If that combination of columns is unique, you should mark it as such to avoid that extra column, which will be included in any nonclustered indexes.

    And is there SQL to run to figure out how it build the index rows?

    As for the clustered index, besides starting it with customer_id, I don't know what else (if anything) should be in it. Options include:

    - customer_id, with uniqueifier

    - customer_id, thing_id (no uniqueifier)

    - customer_id, insert_timestamp, with uniqueifier (matches how the client will usually sort the results, but the SELECT itself does not have an ORDER BY)

    - customer_id, insert_timestamp, thing_id (no uniqueifier)

    I think you'll want to test whether you have more reads or more updates + reads in order to determine which will make a better clustered key, the customer_id/thing_id or just thing_id. I wouldn't go with customer_id alone if it's not unique and if the customer_id/thing_id is unique, make sure you mark it that way. Other than that, I don't think I'd consider anything else for the clustered index based on what you've stated.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Grant Fritchey (4/24/2015)


    But, if the clustered index is created as you have it above, since it's not marked as UNIQUE, it's going to add an internal column to make the key unique. If that combination of columns is unique, you should mark it as such to avoid that extra column, which will be included in any nonclustered indexes.

    Doh! I forgot that. Thanks.

    I think you'll want to test whether you have more reads or more updates + reads in order to determine which will make a better clustered key, the customer_id/thing_id or just thing_id. I wouldn't go with customer_id alone if it's not unique and if the customer_id/thing_id is unique, make sure you mark it that way. Other than that, I don't think I'd consider anything else for the clustered index based on what you've stated.

    The number of reads is determined by user behavior while the number of updates is determined by external systems. The process that reads and the one that updates are different. By the time I am able to test it, I will not be (administratively) allowed to change whatever clustered index has been decided upon short of a glaring performance problem, and if I have such a glaring performance problem, then something besides the index choice is very wrong.

    I'm leaning toward the customer_id, thing_id UNIQUE clustered index because the update of a single row only has one row's worth of indirection. The update will do a non-clustered index seek followed by a clustered index seek, both of which will return exactly one row. Then it (should) be able to do an update-in-place. The select, on the other hand, will usually be returning multiple rows. If the rows themselves are arranged by customer_id, then it may be able to read them off the same or contiguous pages. The speed of all of these operations is critical. I expect single-row inserts and updates not to pose much of a problem, but the select (per customer_id) also must be very fast, no matter how many rows are being returned.

  • I think your logic is sound. I'm just a big fan of testing things to be sure.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • I'd consider clustering by only customer_id, unless the selects themselves order by both customer_id and thing_id.

    I'd also use a lower fillfactor for these indexes, maybe starting somewhere between 80% and 90%. A few page splits aren't that bad, but you probably don't want every page immediately splitting.

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

Viewing 5 posts - 1 through 4 (of 4 total)

You must be logged in to reply to this topic. Login to reply