Using a Surrogate vs Natural Key

  • Comments posted to this topic are about the item Using a Surrogate vs Natural Key

    Steven Henderson
    The SQL Guy

  • Good article but you didn't mention IMO a more important reason to use surrogate key rather than natural key.

    In your example, assume that an employee changes his SSN (which is not impossible situation in case of identity theft).

    In that case what we can do? updating the SSN on Employee table and all related tables might be horrible task.

  • Using the Employee table as an example: What if you need to have an unique index on the social security number to prevent duplicates.  Would that not mute everything?  (As most MVP's always says: It depends)

    5ilverFox
    Consulting DBA / Developer
    South Africa

  • I've always found the following points be in favour of artificial keys

    if my key is "firstname, surname, dateofbirth" then for every linked table I have to use 2 varchars and a date column

    imagine that I store something in key value pair format that has a foreign key reference to my "person" table - then I get massive amounts of storage in my child tables

    now looking at your sql code, your join structure becomes very difficult to read

    if I put an identity as my PK on the person table then my child tables become narrower (those child tables have many more rows than the parent table and scanning though that size foreign key is plain silly)

    To be clear - the PK doesn't have to be the clustered index - and you can still put a unique constraint on the fields that define your uniqueness, but for performance, I pretty much always go for a surrogate/artificial key and use unique indexes along side them

     

    MVDBA

  • Primary Key can have Clustered Index or Non Clustered Index.

    If you just wanted to have Primary Key with Non Clustered Index then it hardly matters whether you choose Surrogate Key or Natural Key. If you choose the Surrogate Key then you can have Unique Key / Unique Non Clustered Index on the column identified as Natural Key candidate and there won't be much difference.

    But here is the catch. If you wish to have Surrogate Key / Natural Key with Clustered Index then you will have to evaluate your decision very wisely considering your workload and the kind of queries being executed including the joins. You will also have to evaluate whether your workload is read intensive or write intensive? You also need to keep in mind the following important points considering your workload.

    Non Clustered Index involves Key Loop-up if the column is not present in the Index column or is not included in the index. Your decision can have significant performance impact.

    Clustered Index also stores the copy of the data. Best practices recommended for Clustered Index column(s).

    1. It should be as narrow as possible in terms of the number of bytes it stores.
    2. It should be unique to avoid SQL Server to add Uniquefier to duplicate key values.
    3. It should be static, ideally, never updated like Identity column.
    4. It should be ever-increasing, like Identity column to avoid fragmentation and improve the write performance.

    Data is sorted based on clustered index key values and then stored. The only time the data is stored in sorted order when the table contains a clustered index. Sort operation is most costly operator in SQL Server. Since data is stored in sorted order hence sorting on the clustered key column avoids the sort operator and makes it best choice for sorting.

  • Japie Botma wrote:

    Using the Employee table as an example: What if you need to have an unique index on the social security number to prevent duplicates.  Would that not mute everything?  (As most MVP's always says: It depends)

    Actually, in this case I personally would use a surrogate key as the primary key and create a unique index on SSN as an alternate key.

    This is still better than using SSN as the primary key because the primary key must be used in all foreign keys. Assuming your "lifetime scope" of employees will never exceed 2 billion (what company would *ever* hire two billion people over its lifetime? :)) the surrogate key could be an INT identity without needing to use the negative values in the domain. That's 4 bytes vs 9, of a sequentially increasing value, so 0 fragmentation, only 1 index for SSN (of very limited size) and that 4 vs 9 bytes happens for every single foreign key.

    Plus, (assuming a char(9) SSN) the primary key being INT also takes advantage of all the specialized hardware in the CPU to deal with 4 byte integers, making them inherently faster.

  • Not a bad article but no mention of multi-column natural keys?  in my experience, every database I've worked with that uses natural keys ends up with some tables having to use two, three, or even four to five fields to make up a natural key.  It's a foreign key nightmare.  I've long been of the opinion that it is best to think of a key as a row identifier and not a data identifier.  When you can wrap your head around that principal database management and development is galactically simpler.

    the latest example of a natural key nightmare was when we merged our systems with a shipping company we had just purchased. Both companies  used natural keys to identify their orders, but when we merged systems there were clashes.  It would have been so simple had we just used sequential big ints.  In fact, we took a little bit of time and converted everything to surrogate keys across the board and it has been wonderful.

  • The problem with using the social security number as a "natural" unique ID is that, erm, it isn't.

    See this: https://www.idanalytics.com/press-release/20-million-americans-multiple-social-security-numbers-associated-name/

    Thomas Rushton
    blog: https://thelonedba.wordpress.com

  • If there is a natural candidate key and if you decide to use a surrogate key, test adding a unique constraint on a candidate key’s columns. Test using a simulated or a profiled concurrent workload, with simulated or real data, and compare their actual or estimated plans. You may discover a unique constraint is beneficial, especially when it covers predicates. A side benefit is that unique constraints prevent duplicates (obviously :).

  • Apart from the performance aspects, from a programmer's point of view I prefer using a surrogate key (mostly named IDPK and of type int or bigint) because it standardizes the structure of my tables (first column is always IDPK) and simplifies joins. A Foreign Key to e.g. tblCustomer always  is named CustomerFK and refers to tblCustomer.IDPK. One less thing that I have to think about when creating and using my database.

  • The problem with using natural keys is that they can change. Suppose a VersionNumber is added to the table and becomes part of the natural key. The index upheaval would be pronounced and ugly. Using a surrogate key renders changes to the natural key relatively low impact. While a non-clustered index on the natural key would be affected at rebuild time, the impact on normal operations would simply be the additional storage required for the version number. As was pointed out in this article, using a surrogate key decouples the clustered index from the business value of the natural key making it much simpler for the underlying table to evolve without impacting the referential integrity.

    "Beliefs" get in the way of learning.

  • Great point.  We standardize all table names and keys, also.  This helps with joins and other database programming not to mention helping the other guys who are creating applications that interface with the database.  If the table is named tbl_wtf then rest assured the pk is pk_wtf and all other tables referencing said data have a foreign key of fk_wtf.  Sooooooo simple and efficient.  KISS at it's best.

  • I'm also firmly in the surrogate camp.  Key reasons for me are:

    • simplicity of natural key updates (no down-stream updates)
    • simplicity of joins (no multi-column keys)
    • performance

    I also very often use int for my PKs, but you need to make a realistic assessment of whether there is any chance you'll exceed the type's range in the life of the application.  I find more times than not int is just fine.

  • Absolutely....we use INT and sometimes BIGINT.  Who really knows what this maximum big int limit is:  9,223,372,036,854,775,807

    I think that's 9.2 Google  🙂

  • Thomas Rushton wrote:

    The problem with using the social security number as a "natural" unique ID is that, erm, it isn't.

    and we could say the same in the UK for our national insurance number - except you don't get one until you are 16 - and under GDPR we would classify it as personally identifiable data - probably have to mask it or encrypt it.... who really wants to use an encrypted PK????

    PersonID int identity(1,1) is plenty fine for me

    MVDBA

Viewing 15 posts - 1 through 15 (of 34 total)

You must be logged in to reply to this topic. Login to reply