Why Primary Keys can be important on large tables

  • Comments posted to this topic are about the item Why Primary Keys can be important on large tables

  • Proposal of modification (elimination of small tables):

    SELECT
        s.name
      , t.name
      , p.rows
    FROM sys.tables t
    JOIN sys.schemas s
      ON s.schema_id = t.schema_id
    JOIN (SELECT x.[object_id], SUM(x.rows) AS rows
          FROM sys.partitions x
          WHERE x.index_id IN (0, 1)   -- heap or clustered index only, summed over partitions
          GROUP BY x.[object_id]) AS p
      ON p.[object_id] = t.[object_id]
    WHERE t.type = 'U'
      AND p.rows > 1000
      AND NOT EXISTS (SELECT k.name
                      FROM sys.key_constraints k
                      WHERE k.parent_object_id = t.object_id
                        AND k.type = 'PK')
    ORDER BY t.name;

  • Interesting article, but there is no explanation of WHY declaring a primary key caused the database to shrink, or why not having one caused excessive growth.

  • but there is no explanation of WHY declaring a primary key caused the database to shrink, or why not having one caused excessive growth.

    ..or whether it was the clustered index, not the primary key, that did it.

    Best wishes,
    Phil Factor

  • Well of course the unused space is going to shrink, since the table is being rebuilt. After many inserts and updates, the table will get fragmented again and the unused space would increase. Of course the amounts would depend on several things. I would like to see the stats on this particular table if it were rebuilt, but still a heap?

    Mind you, I'm all for having a primary key on a table.

  • In spite of declaring that PKs and CIs are not the same, the author seems to conflate the two. Pretty much every table should have a CI -- hopefully with a narrow key, like an identity column or date+identity, and hopefully UNIQUE as well. PKs are generally a good idea too, but I try to use business columns (possibly with the identity column as a uniquifier) to help the heaviest/most-frequent queries.
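
    As a sketch of that pattern (the table and index names here are hypothetical), the clustered index can go on a narrow date+identity pair while the primary key stays on the business column:

    CREATE TABLE dbo.Orders
    (
        OrderID     int IDENTITY(1,1) NOT NULL,
        OrderDate   date        NOT NULL,
        OrderNumber varchar(20) NOT NULL
    );

    -- Narrow, ever-increasing, UNIQUE clustered index: new rows append at the end
    CREATE UNIQUE CLUSTERED INDEX CIX_Orders ON dbo.Orders (OrderDate, OrderID);

    -- Primary key on the business column, enforced by a nonclustered index
    ALTER TABLE dbo.Orders
        ADD CONSTRAINT PK_Orders PRIMARY KEY NONCLUSTERED (OrderNumber);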

  • In case anyone is curious as to the relationship between a Primary key and a Clustered index in SQL Server, I cover the topic here: Primary Key Primer for SQL Server

    Best wishes,
    Phil Factor

  • Basically, the article is describing the need for clustered indexes, not precisely the need for Primary Keys.

    PKs are used for referential integrity and are part of the logical design.

    CIs are used to organize the contents of a table and are part of the physical design.

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • Is there some level where frequent random inserts and deletions would make a clustered index a poorer choice, because of the constant movement of data (whereas it would seem that a non-clustered index would just add and delete pointers)?

    ...

    -- FORTRAN manual for Xerox Computers --

  • @jay-h

    A 'table' without a clustered index is a bad idea when its data is usually returned in an aggregated form, or in a sorted order.

    However, I've noticed a few occasions where a heap (a table that has no indexes at all) can perform well as a log or a 'staging' table used for bulk inserts, since it is read very infrequently and there is less overhead in writing to it. A table with a non-clustered index, but without a clustered index, can in unusual circumstances perform well, even though the index has to reference individual rows via a Row Identifier rather than a more efficient clustered index key. The arrangement can be effective for a table that isn't often updated, if the table is always accessed via the non-clustered index and there is no good candidate for a clustered index, but you don't come across this often.

    The general rule is that things work best with a primary key enforced by a clustered index, but there are unusual cases where a primary key enforced by a non-clustered index, with the clustered index elsewhere on the table, performs better, and also, more rarely, where no indexes at all perform best! It is best just to play safe and follow the general rule.
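
    For what it's worth, a minimal sketch of that staging-heap arrangement (all object names here are made up): a table with no indexes at all, bulk-loaded with TABLOCK so SQL Server can use its bulk-load optimisations:

    -- Hypothetical staging heap: no clustered index, no nonclustered indexes
    CREATE TABLE staging.CustomerLoad
    (
        CustomerID   int           NOT NULL,
        CustomerName nvarchar(100) NULL
    );

    -- TABLOCK allows minimally logged, append-style writes into the heap
    INSERT INTO staging.CustomerLoad WITH (TABLOCK) (CustomerID, CustomerName)
    SELECT CustomerID, CustomerName
    FROM   dbo.CustomerSource;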

    Best wishes,
    Phil Factor

  • This article is one of the weaker ones, because it is dealing with recommendations which - concerning the original requirements of the database - are nonsense:

    "because if you don’t have a primary key / clustered index and your table does inserts, updates and deletes, SQL Server will fragment the table, and it will take up a lot of extra space"

    This is not true, because Microsoft SQL Server will scan the PFS for free space. If the percentage value in the PFS is sufficient, the record will be inserted on an arbitrary page which has enough space (see my article in the TECHNET WIKI here: http://social.technet.microsoft.com/wiki/contents/articles/21877.sql-server-how-does-sql-server-allocate-space-in-a-heap.aspx)

    The most important sentence concerning the reason for using a heap seems to be this one:

    "This database was used as a staging area to load data into the live database."

    A heap can load data much faster than a clustered index. As long as it is used only for staging and not for consolidation, it would be the best choice. But that is an assumption which may fail once the workload is known 🙂

    "In general, if your table is doing inserts, update and deletes, it is a bad idea to have a table without a clustered index."

    NO, that is nonsense - a heap will ALWAYS perform better than a clustered index, for the following reasons:

    - in a heap you won't create "hot spots"; if your clustered key is a contiguous one, all INSERTs will be concentrated at the end of the table!

    - in a heap you won't have fragmentation when you INSERT data.

    AND - given that we don't know the metadata of the table...

    - will it be fragmented if the data are only fixed-length numerics?

    - will it be fragmented if the table is using (N)CHAR instead of (N)VARCHAR?

    NO - it won't, because fixed-length elements cannot expand; strings, for example, are padded with 0x20 (blanks).

    The author quite often mixes up PK and CI. In one sentence he is talking about a PK (which is a constraint!) and in the next sentence about a CI (which determines the physical allocation of the data).

    "So I altered each table and made the existing identity column the primary key / clustered index. The effect was the total size of the database dropped significantly. Here is an example of one table"

    Aha - and then you have reduced fragmentation, rebuilt the table (and ALL other indexes), and the log has bloated!

    The described values are absolutely normal and have NOTHING to do with any benefit of a CI (not to mention the PK).

    You describe it like a witch doctor who wants to sell some kind of "ghost medicine". If you had checked sys.dm_db_index_physical_stats for index_id = 0, I bet you would have found thousands of forwarded records. If you don't know what they are, you can get some info here: http://www.sqlskills.com/blogs/paul/forwarding-and-forwarded-records-and-the-back-pointer-size/
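
    For anyone who wants to run that check, something along these lines (the table name is a placeholder) reports forwarded records for a heap; note that forwarded_record_count is only populated in DETAILED or SAMPLED mode, not LIMITED:

    -- index_id = 0 targets the heap itself
    SELECT ips.index_type_desc,
           ips.forwarded_record_count,
           ips.avg_fragmentation_in_percent,
           ips.page_count
    FROM   sys.dm_db_index_physical_stats(
               DB_ID(), OBJECT_ID(N'dbo.MyLargeTable'), 0, NULL, 'DETAILED') AS ips;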

    "After a primary key was added the data space and the unused space dropped and the index space used didn’t go up that much."

    WTH! It is not because of the PK but because of the creation of the clustered index.
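
    The two can in fact be created independently, which makes the point (the names here are illustrative): it is the clustered index alone that forces the rebuild, with or without a primary key:

    -- A clustered index with no primary key at all...
    CREATE CLUSTERED INDEX CIX_Demo ON dbo.Demo (ID);

    -- ...and, if wanted, a primary key enforced by a NONCLUSTERED index
    ALTER TABLE dbo.Demo
        ADD CONSTRAINT PK_Demo PRIMARY KEY NONCLUSTERED (ID);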

    "I hope this article has shown the importance of adding primary keys to large tables to reduce the overall size of the database. "

    No, it didn't. This article may confuse a novice reading this stuff.

    I would recommend that the author:

    - read the article again

    - review it

    - rewrite some passages to get rid of the conflicts in the article!

    Microsoft Certified Master: SQL Server 2008
    MVP - Data Platform (2013 - ...)
    my blog: http://www.sqlmaster.de (german only!)

  • I would personally stay away from the GUI when creating PKs. This is not only because of the additional locks that the GUI will create, but also because of the extra control the DBA has over the code and the PK when creating it with pure T-SQL.

  • sql-lover (6/30/2014)


    I would personally stay away from the GUI when creating PKs. This is not only because of the additional locks that the GUI will create, ...

    There will be no additional locks when you create the PK with the GUI. The moment the GUI fires the ALTER TABLE command, it will hold exactly the same locks as the command itself 😉

    All the additional overhead is attributable to the use of any GUI which has to make metadata / user data visible!

    If you check it with PROFILER or extended events you will see what I mean...

    Microsoft Certified Master: SQL Server 2008
    MVP - Data Platform (2013 - ...)
    my blog: http://www.sqlmaster.de (german only!)

  • Uwe Ricken (6/30/2014)


    sql-lover (6/30/2014)


    I would personally stay away from the GUI when creating PKs. This is not only because of the additional locks that the GUI will create, ...

    There will be no additional locks when you create the PK with the GUI. The moment the GUI fires the ALTER TABLE command, it will hold exactly the same locks as the command itself 😉

    All the additional overhead is attributable to the use of any GUI which has to make metadata / user data visible!

    If you check it with PROFILER or extended events you will see what I mean...

    Not correct. Or I didn't express myself properly.

    It may be due to additional GUI overhead, but deploying the PK via the GUI takes longer, much longer than via T-SQL. And I'm not even mentioning the time it takes to open the GUI, browse, write, etc. Typing the command is faster.

    I've been a DBA for over 10 years and I only used it when I was learning.

    Professional DBAs should always use T-SQL for creating a PK, in my opinion.

    The GUI helps with other stuff, but using it or recommending it for creating a PK may be better suited to newbies. At least, that's my opinion. Of course, others may differ, depending on work experience, etc.
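
    For the record, the plain T-SQL equivalent of what the designer generates is a one-liner (the table and column names here are illustrative):

    ALTER TABLE dbo.MyLargeTable
        ADD CONSTRAINT PK_MyLargeTable PRIMARY KEY CLUSTERED (ID);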

  • I'd like to ask what everyone's thoughts are on the use of non-clustered unique indexes.

    I've found these useful for the following reason.

    Our DBAs re-index on a weekly schedule; the default option is ONLINE = ON, and the sort option is not set to tempdb.

    This means that the free space required in the database needs to be roughly equivalent to the size of the biggest table with a clustered index.

    We have a database of ~500GB and the biggest table (primary key clustered index) is 250GB, hence we needed ~750GB just to cater for the online re-index.

    We changed the index to non-clustered unique and saved ourselves ~220GB.

    What are the performance problems we may face with this approach?
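
    One option worth testing before changing the index type (the index and table names below are placeholders): redirecting the sort to tempdb reduces, though does not eliminate, the free space the online rebuild needs in the user database, since the intermediate sort runs go to tempdb instead:

    -- ONLINE = ON requires Enterprise Edition; SORT_IN_TEMPDB moves the sort work out of the user database
    ALTER INDEX PK_BigTable ON dbo.BigTable
    REBUILD WITH (ONLINE = ON, SORT_IN_TEMPDB = ON);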

Viewing 15 posts - 1 through 15 (of 28 total)
