t-sql key on a table

  • Below is the sample sql and what is the best way to create a primary key on it?

    Is it ok if a column is added as ...... [ID] [int] IDENTITY(1,1) NOT NULL and/or create a composite key?

    IF EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[UserTable1]') AND type in (N'U'))

    DROP TABLE [dbo].[UserTable1]

    GO

    CREATE TABLE UserTable1(Item VARCHAR(5) NOT NULL,

    ItemNumber INT NOT NULL)

    INSERT INTO UserTable1(Item, ItemNumber)

    VALUES ('ABC11',58),

    ('ABC11',20),

    ('ABC11',58),

    ('ABC11',43),

    ('ABC11',16),

    ('ABC22',08),

    ('ABC22',53),

    ('ABC22',90),

    ('ABC22',43),

    ('ABC22',101),

    ('ABC33',43),

    ('ABC33',58),

    ('ABC33',105),

    ('ABC33',32),

    ('ABC33',63),

    ('ABC44',54),

    ('ABC44',58),

    ('ABC44',76),

    ('ABC44',99),

    ('ABC44',25),

    ('ABC55',54),

    ('ABC55',65),

    ('ABC55',90),

    ('ABC55',42),

    ('ABC55',100)

    SELECT * FROM UserTable1

  • For this table, it is fine if you add a column as IDENTITY (1, 1) and define it as a primary key. It'll be a 4-byte key.

    It appears that the combination of the two columns is unique, so the other possibility is to define a natural key using the combination of the two columns. This would be called a composite key because it contains more than one column. There is great debate between using artificial keys and natural keys and I'm not trying to ignite a debate. Ami Levin has a chapter on the debate in "SQL Server Deep Dives Volume 2" and it covers both sides to the argument.

    What matters in the decision here is the overall design of the table, the uniqueness of the columns and if the values are ever-increasing. If you add more columns to this table and add nonclustered indexes to them, all nonclustered indexes inherit the values of the clustered index, which increases the size, so you'd want the size of the clustered index to be as narrow as possible. If you add an artificial primary key and the two columns are unique, you should also consider adding a unique constraint to the two columns.

    I know I've thrown out a lot of theory here, but as with so very many things, "it depends" strikes again.

  • Identity is a terrible idea on that table. Use the existing columns, in whichever order best matches your use of the table: either ( Item, ItemNumber ) or ( ItemNumber, Item ).

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • *grin* Ed did not want to ignite debate, but it was inevitable. I agree with Scott, creating a third column of meaningless data makes no sense, if a natural, meaningful key exists. If you didn't do that, you'd want to index your two columns anyhow, you're clearly going to filter or search on them.

  • Actually, it's not a terrible idea to add an IDENTITY column to this table especially if it suffers a lot of inserts. If you make the clustered index (the default for PKs) on the column pairs given, you will constantly be fighting massive page splits that could actually cause timeouts for a GUI. I agree that the PK should not be on the IDENTITY column, but the clustered index should not be on the column pair... it should be on an IDENTITY column.

    [EDIT] Please see my next post below for an exception to what I've stated above.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • I did not know that a clustered index is the default for a PK. Doesn't a clustered index mean the entire row is written in the index ?

  • Just to add to that, "It Depends".

    If this table is a lookup table that is mostly static, leaving out an IDENTITY column and using the the two columns as both the PK and the Clustered Index would be the way to go especially for performance.

    Like I said previously, if this table suffers a large number of INSERTs, then add the IDENTITY column and make it the UNIQUE Clustered Index and make the PK Non-Clustered on the other two columns.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Christian Graus (1/6/2014)


    I did not know that a clustered index is the default for a PK. Doesn't a clustered index mean the entire row is written in the index ?

    To be semantically correct, it means that the rows of the table form the leaf level of the Clustered Index and a B-Tree is built over the top of it. That's why a Clustered Index Scan is actually the same as a Table Scan except that a Table Scan can only happen on HEAPs because there is no Clustered Index on those.

    And, yeah... if you assign a Primary Key without explicitly assigning it as a Non-Clustered index and no Clustered Index has yet been assigned, the PK will be created as a UNIQUE Clustered Index along with the special PK meaning of "NOT NULL".

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Thanks - good to know.

  • Christian Graus (1/6/2014)


    Thanks - good to know.

    Just so you don't think that I'm making this up, here's the entry from Books Online under the title of "Creating and Modifying PRIMARY KEY Constraints ". I know I hate it when someone tells me something about SQL Server and can't actually back it up.

    The Database Engine automatically creates a unique index to enforce the uniqueness requirement of the PRIMARY KEY constraint. If a clustered index does not already exist on the table or a nonclustered index is not explicitly specified, a unique, clustered index is created to enforce the PRIMARY KEY constraint.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Ed Wagner (1/3/2014)


    For this table, it is fine if you add a column as IDENTITY (1, 1) and define it as a primary key. It'll be a 4-byte key.

    It appears that the combination of the two columns is unique, so the other possibility is to define a natural key using the combination of the two columns. This would be called a composite key because it contains more than one column. There is great debate between using artificial keys and natural keys and I'm not trying to ignite a debate. Ami Levin has a chapter on the debate in "SQL Server Deep Dives Volume 2" and it covers both sides to the argument.

    What matters in the decision here is the overall design of the table, the uniqueness of the columns and if the values are ever-increasing. If you add more columns to this table and add nonclustered indexes to them, all nonclustered indexes inherit the values of the clustered index, which increases the size, so you'd want the size of the clustered index to be as narrow as possible. If you add an artificial primary key and the two columns are unique, you should also consider adding a unique constraint to the two columns.

    I know I've thrown out a lot of theory here, but as with so very many things, "it depends" strikes again.

    All good points. The only thing that you've left out of the "It Depends" scenario is how the table will be used. See my "It Depends" entry a couple of posts up.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Actually, it is a terrible idea to add an identity to this table.

    Depending on your specific data, future inserts could be sequential anyway. Even if not, rows are inserted once and typically read 10s, or even 100s, 1000s or more times, particularly for lookup tables, which this certainly seems to be. Besides, tables can of course be rebuilt when truly needed, and rebuilt online for this table (assuming Enterprise or other qualifying edition).

    Overall, the biggest performance gain you'll get is to stop using dopey identity clusters by default and instead consider carefully how the table is accessed. I've removed over 6000 indexes and improved SELECT performance dramatically, particularly for joins, the majority of it from just replacing identity clusters with properly-keyed clustered indexes.

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • ScottPletcher (1/6/2014)


    Actually, it is a terrible idea to add an identity to this table.

    Depending on your specific data, future inserts could be sequential anyway. Even if not, rows are inserted once and typically read 10s, or even 100s, 1000s or more times, particularly for lookup tables, which this certainly seems to be. Besides, tables can of course be rebuilt when truly needed, and rebuilt online for this table (assuming Enterprise or other qualifying edition).

    Overall, the biggest performance gain you'll get is to stop using dopey identity clusters by default and instead consider carefully how the table is accessed. I've removed over 6000 indexes and improved SELECT performance dramatically, particularly for joins, the majority of it from just replacing identity clusters with properly-keyed clustered indexes.

    First of all, stop using words like "dopey". There are some very good uses of IDENTITY columns that I use and your comments infer that I'm dopey. Unless you really like pork chops, please keep such comments to yourself. They don't become you as a professional, they add nothing to the conversation, and they can easily be taken as attempts to belittle or as ad hominem attacks.

    It certainly is possible that "future inserts could be sequential". If that's true, then that fits one of the commonly accepted best practices of making a Clustered Index "ever increasing" and you would, indeed, not need an identity column for this table.

    If, however, they are not entered sequentially, then it's not so much the performance of SELECTs that I'd be concerned with. Rather it would be the performance of the INSERTS. If there are a lot of INSERTs to this table by a lot of concurrent users, there will be massive page splits that can and will cause timeouts at the GUI.

    If the table is mostly static and is of manageable size, then I agree, it almost doesn't matter if an occasional page split occurs. That brings up another point. You have to also plan on maintenance requirements and space. If this is a large table, then massive page splits can easily cause the table to double in size due to the space wasted by the page splits and, contrary to popular belief, disk space isn't cheap. Yeah... you could REORGANIZE the Clustered Index on a regular basis, but even though you can do that in an online fashion, it's expensive CPU and IO wise. The other thing is that REORGANIZE is FULLY LOGGED no matter what recovery mode you’re in and REORGANIZE does NOT rebuild the B-TREE.

    If you decide to REBUILD a Clustered Index, you might be able to get away with the WITH DROP EXISTING optimization to keep your MDF file from growing. If you can't, then remember that any index over 128 extents (just 8MB), will have its page usage preserved until the new copy of the index has been created. If that's the Clustered Index of a 1TB table, then you better have a free TB+ hanging around to REBUILD it.

    The only good part about REBUILDing indexes is that they will be minimally logged in the BULK-LOGGED and SIMPLE recovery modes. Just keep in mind that if a minimally logged operation occurs while in the BULK-LOGGED mode, any log backups made during that timeframe cannot be partially used for true Point-in-Time recovery. You either have to use the whole log backup or you have to stop at a point before it.

    You ALSO have to consider the point (prior to 2014) that if a table had ANY blob columns in it, then you can NOT rebuild the Clustered index online. If the table has a blob in it, that automatically means that the Clustered Index cannot be rebuilt online.

    So, lots to consider. You can't just say, "Besides, tables can of course be rebuilt when truly needed, and rebuilt online for this table" because you might not actually be able to either because of blobs or because you don't actually have the disk space.

    The overall problem with this thread, so far, is that no one but the OP actually knows HOW the table will be used and he hasn't said yet.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (1/6/2014)


    ScottPletcher (1/6/2014)


    Actually, it is a terrible idea to add an identity to this table.

    Depending on your specific data, future inserts could be sequential anyway. Even if not, rows are inserted once and typically read 10s, or even 100s, 1000s or more times, particularly for lookup tables, which this certainly seems to be. Besides, tables can of course be rebuilt when truly needed, and rebuilt online for this table (assuming Enterprise or other qualifying edition).

    Overall, the biggest performance gain you'll get is to stop using dopey identity clusters by default and instead consider carefully how the table is accessed. I've removed over 6000 indexes and improved SELECT performance dramatically, particularly for joins, the majority of it from just replacing identity clusters with properly-keyed clustered indexes.

    First of all, stop using words like "dopey". There are some very good uses of IDENTITY columns that I use and your comments infer that I'm dopey. Unless you really like pork chops, please keep such comments to yourself. They don't become you as a professional, they add nothing to the conversation, and they can easily be taken as attempts to belittle or as ad hominem attacks.

    It certainly is possible that "future inserts could be sequential". If that's true, then that fits one of the commonly accepted best practices of making a Clustered Index "ever increasing" and you would, indeed, not need an identity column for this table.

    If, however, they are not entered sequentially, then it's not so much the performance of SELECTs that I'd be concerned with. Rather it would be the performance of the INSERTS. If there are a lot of INSERTs to this table by a lot of concurrent users, there will be massive page splits that can and will cause timeouts at the GUI.

    If the table is mostly static and is of manageable size, then I agree, it almost doesn't matter if an occasional page split occurs. That brings up another point. You have to also plan on maintenance requirements and space. If this is a large table, then massive page splits can easily cause the table to double in size due to the space wasted by the page splits and, contrary to popular belief, disk space isn't cheap. Yeah... you could REORGANIZE the Clustered Index on a regular basis, but even though you can do that in an online fashion, it's expensive CPU and IO wise. The other thing is that REORGANIZE is FULLY LOGGED no matter what recovery mode you’re in and REORGANIZE does NOT rebuild the B-TREE.

    If you decide to REBUILD a Clustered Index, you might be able to get away with the WITH DROP EXISTING optimization to keep your MDF file from growing. If you can't, then remember that any index over 128 extents (just 8MB), will have its page usage preserved until the new copy of the index has been created. If that's the Clustered Index of a 1TB table, then you better have a free TB+ hanging around to REBUILD it.

    The only good part about REBUILDing indexes is that they will be minimally logged in the BULK-LOGGED and SIMPLE recovery modes. Just keep in mind that if a minimally logged operation occurs while in the BULK-LOGGED mode, any log backups made during that timeframe cannot be partially used for true Point-in-Time recovery. You either have to use the whole log backup or you have to stop at a point before it.

    You ALSO have to consider the point (prior to 2014) that if a table had ANY blob columns in it, then you can NOT rebuild the Clustered index online. If the table has a blob in it, that automatically means that the Clustered Index cannot be rebuilt online.

    So, lots to consider. You can't just say, "Besides, tables can of course be rebuilt when truly needed, and rebuilt online for this table" because you might not actually be able to either because of blobs or because you don't actually have the disk space.

    The overall problem with this thread, so far, is that no one but the OP actually knows HOW the table will be used and he hasn't said yet.

    Even if some splits do occur, the table won't "easily double in size", since that would require every page to split.

    Automatically adding an identity as the clustering key by default simply is dopey. The clustered index is the most important index on any table, and so it should always be considered carefully.

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

Viewing 15 posts - 1 through 15 (of 15 total)

You must be logged in to reply to this topic. Login to reply