Give Up on Natural Primary Keys

  • Comments posted to this topic are about the item Give Up on Natural Primary Keys

  • In the data warehouse side, it makes sense to use surrogate keys and replace all natural keys. I don't really deal with PII data, but even in cases where I get keys from other systems, everything is replaced with the data warehouse surrogate keys. Everything is cleansed and controlled by the data warehouse with signatures/ownership of systems so nothing is amiss. I feel this is becoming the standard at least in this space and it makes sense to me!

  • xsevensinzx - Sunday, September 30, 2018 5:18 PM

    In the data warehouse side, it makes sense to use surrogate keys and replace all natural keys. 

    Definitely agree.  A data warehouse will outlive the operational systems that feed it and succeeding operational systems won't necessary use the same keys or even data types.  For that reason surrogate keys in the data warehouse are a necessity.
    In operational systems I have learned that there are very few immutable natural keys. 

    3rd party keys, even from ISO standard sets tend to change.  For example ISO3166 country codes change rather more frequently than you might imagine.  ISO5218 Gender seems relatively safe.  Thanks to the LBGTQIA community for an example to us all of how thorough testing should be done.

  • I completely agree with you Steve. Another very strong case for surrogate keys.

  • Maybe I'm missing something regarding the backup example. But I would expect the right to be forgotten to include being removed from backups.

  • dsor - Monday, October 1, 2018 3:08 AM

    Maybe I'm missing something regarding the backup example. But I would expect the right to be forgotten to include being removed from backups.

    Depending upon how the data is backed-up it can be technically difficult to do such a thing. I certainly don't fancy the idea of restoring my 7TB databases from each of the dozen or so back ups, to remove a person's data and then re-create the backup. Just to have another request land a few minutes later.
    So, one option is to keep a separate file that can be used to identify a customer, for the purposes of removal only. If a backup is restored then a process must run immediately that removes the person's data, as identified by this separate file.
    When a person makes a request to have their data removed it must be made clear to them that it will still exist in backups, that the backups are secure and if the backups are used then their data will be removed before any processing against that restored data can commence.

  • Following our own GDPR compliancy efforts earlier this year, we concluded that keeping a list of e.g. email addresses or user ids for the purpose of removing those users post-restore was not acceptable, and that we would be better off storing a hash of the users' PK and using that to remove the users later if required.

  • Surrogate keys are practically always the only way to go for ease of development as well as performance, especially in large databases. A single column int or bigint will always perform better than some multi-column natural key with one or two char/varchar columns. Think of the costly index(es) on that beast and the cost of having to potentially have that combination in lower tables as well instead of the one int PK that could be down there.

    The column means nothing to users and you know you can always identify one row by one int key. Programming / development is made much easier by this as well.

  • BrainDonor - Monday, October 1, 2018 3:54 AM

    Depending upon how the data is backed-up it can be technically difficult to do such a thing. I certainly don't fancy the idea of restoring my 7TB databases from each of the dozen or so back ups, to remove a person's data and then re-create the backup. Just to have another request land a few minutes later.
    So, one option is to keep a separate file that can be used to identify a customer, for the purposes of removal only. If a backup is restored then a process must run immediately that removes the person's data, as identified by this separate file.
    When a person makes a request to have their data removed it must be made clear to them that it will still exist in backups, that the backups are secure and if the backups are used then their data will be removed before any processing against that restored data can commence.

    Pretty much how things work in call centres.  Anything you do, and any dataset is always matched against the national (and internal if kept) Do Not Call list, held in a separate system, and the data cleansed.  It soon becomes second nature.

    I'm a DBA.
    I'm not paid to solve problems. I'm paid to prevent them.

  • GDPR-like laws are something I hadn't considered, but it makes sense in an absolutist sort of way.

    At least here in the U.S. lawyers are more than happy to try and cheat sue anyone for any possible reason unlawful trespass on their clients' rights, so I can see natural keys being the target of a scum-sucking shark over zealous legal beagle fine upstanding pillar of the legal community slavering to strike reluctantly noting a deliberately misconstrued blatant violation of the law.

    Regardless, I've always been a firm believer in surrogate keys over natural ones for any number of purely mechanical reasons. First, natural keys change--a lot, meaning fragmentation of the table and indexes. Second, using something like an identity lets you append new records, and third, ala GDPR a surrogate key doesn't leak PII--at least, not by itself.

  • I've been through several bank acquisitions by other banks and how it affects our data.  I WISH that people had been smart enough to use only the surrogate keys that someone actually did setup.  It's amazing that people thought the bank name and account numbers would never change.  Instead of making the changes in just two tables, the changes need to be made to virtually all non-reference tables.  It's a week long evolution when it happens.

    Surrogate keys aren't just for DWs.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • this is becoming the standard at least in this space and it makes sense to me!

    This has been the standard in data warehouse design since the mid-nineties when Kimball published his first book on DW design.  

    A single column int or bigint will always perform better than some multi-column natural key with one or two char/varchar columns.

    Performance is not the overriding concern.  I use identity keys if there is no good natural key or at some point at three or four different columns I'll use an identity key instead.  But then you have to ensure that the columns don't have what in reality is a duplicate, which I've seen in this type of design.  A unique constraint would solve this, but what impact does that have rather than just making it the key to begin with.

    Programming / development is made much easier by this as well.

    I'm not current on the state of programming.  But back in my programming days, when an identity was necessary and also involved a child table, it was not easy to add the parent, retrieve the generate key, and then insert the key.  The SEQUENCE, which didn't exist in SQL Server when I was programming, helps that greatly.  But I certainly don't see how it makes it easier.  It may be *not be* harder.  But that's not the same as easier.

    //Edited to add "not be" as noted.

  • A single column int or bigint will always perform better than some multi-column natural key with one or two char/varchar columns.

    Performance is not the overriding concern. I use identity keys if there is no good natural key or at some point at three or four different columns I'll use an identity key instead. But then you have to ensure that the columns don't have what in reality is a duplicate, which I've seen in this type of design. A unique constraint would solve this, but what impact does that have rather than just making it the key to begin with.

    Performance is a major concern when you're dealing with tens of millions or billions of rows in parent / child relationships with multiple levels, and wide tables at that.


    Programming / development is made much easier by this as well.

    I'm not current on the state of programming. But back in my programming days, when an identity was necessary and also involved a child table, it was not easy to add the parent, retrieve the generate key, and then insert the key. The SEQUENCE, which didn't exist in SQL Server when I was programming, helps that greatly. But I certainly don't see how it makes it easier. It may be harder. But that's not the same as easier. 

    I was speaking mainly about joining tables, and especially multiple levels. Having 1 column to join down to on each level is a huge plus, in terms of performance and clarity, and not to mention the data space savings in not having to have the upper level natural key columns repeated in your lower level table just to have the matching relationship. Surrogate keys rock. I'd rather have that one column relationship and do the key generation and insert downstream (an easy operation, really) when needed, which is rare. Stored procedures take care of that very easily when adding data normally.
    To each his own... this entire subject is very polarizing for db folks.

  • I agree with all of the reasons given for use of surrogate keys.  Two thoughts come up for me around this:

    1.     I’ve found that when surrogate keys are used,depending on the person in charge of the database, duplication of information increases.  Whether due to hiccups in process, laziness, or something else.  I see administrators assign a new surrogate to an existing entity because apparently that is easier than determining if the entity already exists.   With natural keys (at least anecdotally) this does not seem to happen as much.  

    2.     Is there any risk of some ‘master surrogate’ keybeing associated with a given entity across most databases (think a cyber SSN),unintentionally creating an even bigger security risk.

  • dsor - Monday, October 1, 2018 3:08 AM

    Maybe I'm missing something regarding the backup example. But I would expect the right to be forgotten to include being removed from backups.

    That's not correct. As mentioned above, the effectiveness is what's important. Once you've asked to be forgotten, your data can't be used and shouldn't be exposed in a data breach. You could view this as removing data from backups, but that isn't practical. Most companies would merely have a way to limit that data from any online system and protect their backups.

Viewing 15 posts - 1 through 15 (of 108 total)

You must be logged in to reply to this topic. Login to reply