Give Up on Natural Primary Keys

  • nova wrote:

    A key is a key whether it is implemented using a PRIMARY KEY constraint, UNIQUE constraint or other means. Referential integrity is something else because not all keys are referenced by a FOREIGN KEY constraint - a table can have several keys but typically only one of them will be referenced by foreign key(s).

    GDPR is not concerned with how keys or referential integrity are implemented but it is very much concerned with identifiers, and in a database the keys (or more accurately the superkeys) of a table are usually the most important identifiers.

    Again, the idea that surrogate keys are "superior" to natural keys suggests a very partial understanding of those concepts. They are not comparable because they serve different purposes and one is not a replacement for the other. You might as well say a hammer is superior to a screwdriver - but would you use a hammer to turn a screw?

    Good stuff!

    I also get the sense that people are confusing physical and logical design and maybe de-emphasizing one in favor of the other.

    Also I believe that changes in physical implementation shouldn't be considered some sort of "get out of jail free card" when considering any legal obligations. You can call the rows and columns limburger cheese, and if it contains data that is of legal concern, then your efforts in renaming will all be for naught.

    And finally, lol @ "primary keys are not data".

  • Thanks X. There is a (perhaps understandable) tendency for technologists like us to stick with "patterns" as a substitute for the more fundamental issues. Sometimes you have to deal with fundamentals though and that's when discussions like this one can get confusing.

    As for keys not being data, well I'll just say that's... controversial. And I'm trying to ignore the sound of Ted Codd turning in his grave.

  • A few comments:

    1. I often need to use incrementing integers as a primary key.  However, that is never my first choice.  I have seen incrementing integers used for state tables, e.g. Ohio is not OH, but 23 or whatever number it would be.  That's even when OH is there.  So rather than have OH in the tables that reference, which would be human readable and often not even require the join as in many cases OH is sufficient, it's 23.  The forces the join to always be made because 23 makes no sense.  This to me seems utterly absurd.
    2. There are many cases that an item is given a business key with some letter prefix to indicate what it does.  I will not add an integer column to that in an OLTP design when there is a perfectly acceptable business key.  Adding an integer column and making the other column unique seems like unnecessary work, and you are losing the prefix.
    3.  Primary keys aren't data?  C'mon.
  • RonKyle wrote:

    A few comments:

     

      <li style="list-style-type: none;">

    1. I often need to use incrementing integers as a primary key.  However, that is never my first choice.  I have seen incrementing integers used for state tables, e.g. Ohio is not OH, but 23 or whatever number it would be.  That's even when OH is there.  So rather than have OH in the tables that reference, which would be human readable and often not even require the join as in many cases OH is sufficient, it's 23.  The forces the join to always be made because 23 makes no sense.  This to me seems utterly absurd.

     

      <li style="list-style-type: none;">

    1. There are many cases that an item is given a business key with some letter prefix to indicate what it does.  I will not add an integer column to that in an OLTP design when there is a perfectly acceptable business key.  Adding an integer column and making the other column unique seems like unnecessary work, and you are losing the prefix.

     

      <li style="list-style-type: none;">

    1.  Primary keys aren't data?  C'mon.

     

    Ok, let me clarify "keys vs data" since I seem to have annoyed the purists. 🙂

    When I said that natural keys are data I was being somewhat literal. Natural keys are indeed data. For instance, someone's social security number may indeed be a natural key that points to another table or another database and in that sense, yes, it's a key. But it is also data. So is an employee's name, or just about every other natural key.

    A surrogate primary key (such as an Identifier as used in SQL Server) is a little different. While technically it's data (a fact about something) it's also a mechanical linkage between tables, either the primary key of a table or a foreign key in a different table. As a mechanical linkage it needs to follow certain purely mechanical constraints. It has to be small for performance reasons. It has to be the only FK candidate for other tables for both performance and security reasons. An arbitrary number means nothing unless you can link it to PII (for GDPR compliance) or sensitive data (for hacking). Using natural keys isn't desirable mechanically, because they're larger, they themselves are sensitive or PII and thus can more easily leak.

    As for the state name example 🙂 well, yes OH may be sufficient instead of creating an identity where OH is keyed by 23--but what if the state name was itself sensitive information? Also, what if the state was actually a full table that you needed to audit? Having 23 in the audit logs instead of OH means that's one less place to leak info from. 🙂

    And finally, if you're storing some kind of ID number that is only used outside your system, but you put a unique constraint on it, is that really an alternate key? Or just data that needs to be unique?

  • Hi Roger,

    Maybe this off-topic part of the discussion has gone on a bit too long, but on a matter of terminology, maybe you might make yourself clearer if you avoided this idea of "mechanical linkage". I never heard anyone else use those words about a database. You are thinking of foreign keys I'm sure, but foreign keys constraints are logical data integrity rules rather than mechanical(!) links. The point of the relational model after all is that there are no predefined "links" but that tables can be joined in any number of different ways through queries.

    Your argument that surrogates are an obfuscation feature seems a bit unusual. For the overwhelming majority of cases, obfuscation of that kind is obviously unnecessary and adds nothing to real security. Most data doesn't need to be obfuscated and if it does there are simpler and better ways of achieving it.

    roger.plowman wrote:

    if you're storing some kind of ID number that is only used outside your system, but you put a unique constraint on it, is that really an alternate key? Or just data that needs to be unique?

    Data that needs to be (irreducibly) unique and has suitable uniqueness constraints on it is a key. What makes you think otherwise?

    • This reply was modified 4 years, 8 months ago by  nova.
  • Ok, let me clarify "keys vs data" since I seem to have annoyed the purists.

    I am not a purist, but a practicalist.  That may not be a real word, but it means that I understand why the rules exist and know that there are good reasons to follow them.  I have too often been had to deal with the issues caused by poorly designed databases because the designers did not want to apply the rigor that is required to do it right.

  • I have climbed up on this soap box many times before and cannot resist this time.  I have been a database designer for Oracle and MS SQL Server for decades and i believe that the natural unique identifier is a safety net for the application to prevent duplicate data that is hard to unravel when different records (different surrogate keys) were selected to represent the same entity (e.g. the same person has two records).  I also believe in surrogate keys for foreign key constraints.  In logical design (not considered important to many inexperienced developers because they create tables on the fly) every entity should have a unique identifier.  When you move to the relational or physical model, you decide if you want to add surrogate keys for technical reasons such as performance and storage space conservation but you keep the unique identifiers for date integrity.  It is my opinion that you always need both.  The definition of surrogate is a substitute for something else. So without a natural key, an identity column is just a record number. If a particular table needs the natural key, the foreign key columns can reference the unique key columns, but in my experience that is rarely needed.  And with the naming convention of MyTableID, it is very easy to create joins without having to script out the PK/UK to figure out column names and you can easily identify the parent table.

    Calling the natural unique key or the surrogate unique key the primary key is more of a technical differentiation.  Many tools that create FK constraints automatically, based on relations, use the PK by default for the reference in the foreign key column, so I make the surrogate key the PK for convenience and the natural key the unique key.  There is a difference between a unique key enforced with a unique index and just a unique index.  Unique indexes cannot be referenced by the foreign key constraint.

    One of the discussion points in this thread is that surrogate keys help with GDPR compliance, with which I partially agree.  But if a database is not properly normalized, the surrogate key does not help.  I have seen systems, using surrogate keys, where names, DOB and SSN were denormalized in multiple tables "to make it easier for the developers."

    If you are tired of reading you can stop reading now because I am going to give an example of how "surrogate keys only" can create a nightmare database.  On a recent contract the either/or decision was made (before I arrived) to use surrogate keys only.  (I started adding natural keys on all new tables I created).  We  spent a large portion of our time running data fixes to try to repair the damage caused by duplicate records.  We created tools to allow the end users to try to fix some of these issues but they used the tool for additional purposes, not understanding the consequences.  Both of these methods of  "data fixes" lead to more bad data that had to be fixed.  As I stated earlier,   I strongly support the use of surrogate keys to use as foreign key columns while keeping natural keys to prevent duplicate records, one of the tenets of relational theory, data integrity.    On the People table it was argued that you cannot come up with a natural key, using the lame excuse  that you cannot uniquely identify George Forman's children (if you don't know what I am talking about, search the internet for an explanation).  There was e a case where the same person was created over 700 times because of a glitch in the code.  While this was an extreme case of code gone wild, there were other cases where same person was created by the end user multiple times.  Then when users were given a list of people's names they were able to pick different records for the same person with multiple People records or two people with the same name.  As this propagated down the process, it was not possible to know if this was the same person or not.  Maybe it was two different people with the same name or the same person with two different names.

    There is another problem with no natural unique key.  I have heard it said many times, by a developer or business anaylyst , "Just create another row."  I started saying "NO JCAR!"  Like NO RBAR!  While this is not a problem if you inactivate the old row, the developers wanted to be able to see the old row as well.  We tried to introduce a soft delete but that was interpreted in different ways by different developers so that failed.  My solution would is to have unique records in the working table and save the previous records in an journal table.  Most code would only use the working table.

    Sorry for the long post but I keep hoping that designers understand it is not an either/or decision.  You need both a natural unique identifier and a surrogate key.

  • OH is actually an alfa surrogate for Ohio.  And SSN is a semi-arbitrary surrogate key that originally carried some geographic information but not so much any more.  Often businesses have standard abbreviations for things so I use those as FK's since they are usually short and there are not many rows in the table.  If you are creating a small lookup table and you have the abbreviation, short name and description I see no need for a surrogate key.  I have done it both ways.

  • Some great points there, hmbacon. The problems you describe happen when developers look at things only in the abstract and lose sight of the fact that databases succeed or fail based on how well people can rely on and use their information content.

    I'm uncomfortable with your description of state codes and SSNs as surrogates however. Ultimately all identifiers are symbolic "surrogates" for the things they identify, including the strings OHIO and OH. I think everyone understands that. But surely in database design the term surrogate key has a much more specific meaning. A surrogate means a key which is not used as an external identifier but only exists for technical reasons in a database. (E.F.Codd: "Database users may cause the system to generate or delete a surrogate, but they have no control over its value, nor is its value ever displayed to them"). By that measure, SSN and state codes do not qualify as surrogates.

    Some people do like to distinguish between a key that is "artificial" and one that isn't. So SSN might qualify as an artificial key I suppose. That has never seemed like a very useful distinction to me. It's pretty much impossible to pin down what is meant by artificial, but simply being artificial (whatever it means) doesn't make something a surrogate.

  • hmbacon wrote:

    I believe that the natural unique identifier is a safety net for the application to prevent duplicate data that is hard to unravel when different records (different surrogate keys) were selected to represent the same entity (e.g. the same person has two records).

    I think the point was made earlier that you can just add a unique constraint or index to any column that should be unique, it doesn't have to be a primary key.

  • OH is actually an alfa surrogate for Ohio.  And SSN is a semi-arbitrary surrogate key that originally carried some geographic information but not so much any more.

    This is not according to the standard definitions.  You seem to be playing fast and loose with the definition, but this just muddles things up.  Although there may not be a central dictionary for database design, most designers would probably accept the following definitions as valid:

    A surrogate key in a database is a unique identifier for either an entity in the modeled world or an object in the database. The surrogate key is not derived from application data, unlike a natural (or business) key which is derived from application data. From http://www.quora.com/What-is-a-surrogate-key-in-a-relational-database.

    A natural key (also known as business key) is a type of unique key, found in relational model database design, that is formed of attributes that already exist in the real world. It is used in business-related columns. In other words, a natural key is a candidate key that has a logical relationship to the attributes within that row. A natural key is sometimes called domain key.

    From https://en.wikipedia.org/wiki/Natural_key

    OH is not a surrogate alpha.  It is a natural key.  And in most databases is would be the best choice for primary key for a lookup table of US states.

    For all of the long post, it's not clear to me at the end what you mean by:

    Sorry for the long post but I keep hoping that designers understand it is not an either/or decision.  You need both a natural unique identifier and a surrogate key.

    Do you mean to say that you will need to have recourse to both in the database, or that you need both in every table?  If the first, then yes, I agree.  There are many cases where there is not a good natural key.  If you mean the second, that you would use a surrogate even when there is a natural key available (based on the definitions above), then I disagree.  You would have to explain your case further.

    That said, this only applies to OLTP designs.  OLAP designs require a surrogate key even when there is a good business key available as it might become a 2SCD table at some point.

     

    • This reply was modified 4 years, 8 months ago by  RonKyle. Reason: Missing punctuation
  • Hi Nova.  I was loose with the words I chose.  Sorry.   On my last project, the one with no natural unique keys, the end users wanted to have the surrogate keys displayed because they realized that duplicates caused problems and they wanted to be able to refer to a specific instance of the person being displayed.  It is a slippery slope when there are no natural unique keys.

    Good points, RonKyle.  And yes, this only applies to OLTP.  For a simple lookup table, no surrogate key needed, just use the business key/abbreviation.  In other cases I advocate a natural key composed of one or more  attributes of the entity to enforce uniqueness and a surrogate key to be used in FK's.  In all the designs I have been involved in that has been my practice and I never saw any disadvantages.

    Jonathan, what do you mean by "you can just add a unique constraint or index to any column that should be unique?"  What is the purpose?  Putting a UK or unique index on a single column without considering candidate keys could lead to problems.  As a simple,  trivial and unrealistic example, suppose you decided that you wanted your People table to have unique DOB's and you put a unique index on that one field.  Suddenly you can have only one person with the DOB of 9/19/2000 but that might not have been what was intended.  To me, DOB is part of the candidate key to identify a person. There will be a  point when you want to find something in a table and you don't/shouldn't know the surrogate key.  If there are duplicates, you never know which one is the right one.  Sometimes I miss the point so could you give an example of a  single column that you would want to make unique after the table was created?  Why wasn't the uniqueness identified during design?

  • hmbacon wrote:

    Jonathan, what do you mean by "you can just add a unique constraint or index to any column that should be unique?"  What is the purpose?  Putting a UK or unique index on a single column without considering candidate keys could lead to problems.  As a simple,  trivial and unrealistic example, suppose you decided that you wanted your People table to have unique DOB's and you put a unique index on that one field.  Suddenly you can have only one person with the DOB of 9/19/2000 but that might not have been what was intended.  To me, DOB is part of the candidate key to identify a person. There will be a  point when you want to find something in a table and you don't/shouldn't know the surrogate key.  If there are duplicates, you never know which one is the right one.  Sometimes I miss the point so could you give an example of a  single column that you would want to make unique after the table was created?  Why wasn't the uniqueness identified during design?

    Well clearly in the example you've given you wouldn't want to put a unique index on a DOB column as it's not a column that should be unique.  I was referring to columns that could be a natural primary key. You could put a unique index/constraint on them and that would prevent duplicates being inserted as in your initial statement of the problem of duplicates.

  • Jonathan,

    Yes of course, that's what hmbacon is saying. Not to identify and implement the natural keys would be a serious mistake. Natural keys are almost essential and it's difficult to imagine many/most databases working accurately and effectively without them. In functional terms surrogates are almost by definition usually optional, whereas natural keys are close to being essential. That's why it's a huge error to think that there's a choice to be made between natural or surrogate.

    Don't get distracted by the arbitrary idea that one key or other has to be primary key. In practical terms it doesn't matter which of the keys in a table is called "primary". What matters is what keys you have and how you use them.

  • nova wrote:

    Don't get distracted by the arbitrary idea that one key or other has to be primary key. In practical terms it doesn't matter which of the keys in a table is called "primary". What matters is what keys you have and how you use them.

    It makes a huge difference when you trying to get developers and other people to use one or the other and how they use it.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 15 posts - 91 through 105 (of 108 total)

You must be logged in to reply to this topic. Login to reply