Do You Really Need Surrogate Keys? Part 1 – Concepts and Methods

  • However, if you want to see the best point made, it is the one by JCamburn at 05/23/2003 : 23:19:21

    See http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=6136&whichpage=6

    He actually gets it.

    Maybe you referenced the wrong post, but that one completely misunderstands what normalization even is (he believes it applies only to single tables), then compounds his error by claiming surrogate keys "make normalization more difficult". Total nonsense.

    The reason the "surrogate vs. natural" debate even exists is because there is no one right answer in all situations. It depends not only on the data being modeled, but a whole host of other factors: whether you're OLAP vs. OLTP, query vs. update ratio, and even things like code maintenance costs.

    Surrogate keys often require additional indexes. And, because you're eliminating natural columns in child tables as join keys, additional joins as well. You may need to define additional unique constraints because your key no longer enforces business uniqueness. Finally, they make your tables slightly larger, which for very narrow tables, may have performance and space implications. They used to be prone to index contention issues in some RDBMs (though modern products have pretty much entirely corrected this).

    That said, in **practice** these caveats are usually (but not always) minor, often to the point of insignificance. Surrogate keys reduce cascading updates and other annoying problems. They can save space when a natural key is very wide. But most importantly, they tend to make a system more resilient to changes in business logic. In nearly all firms, programmer time is a resource far more valuable than cpu cycles or disk i/o's.

    The rule of thumb I always tell junior developers is, if you can clearly and concisely state why a natural key should be preferred in a specific instance -- use it. If not, then a surrogate key should be your default.

  • RichB (10/25/2010)


    You seriously embrace his inane ramblings? :hehe: Good luck with that!

    Totally! Sure, it takes a bit of time to get the good stuff out, but everything said is solid and shows that it's written from experience. Harsh experience where someone's followed the info from the original article and created a nightmare for this guy to pick apart and repair. That'd tend to make me slightly bitter too.

    But... Is there really much to disagree with? Not his writing style, but the actual information provided?

  • What if the natural key changes? (It can happen and it DOES happen.) This is the primary reason why I always use surrogate keys.

    Any rule that has the word "always" in it is usually a bad rule.

    Surrogate keys are generally preferable, but there are countless cases where they are not. Consider an audit table that contains only a single column: the timestamp an event occurred. That value will *never* change. Period. Adding a surrogate will vastly bloat the table (by 100% if you're using a smalldatetime), double the number of required indexes, force page lookups whenever you need to perform a join (since your PK is no longer a covering index). And in return, gives you -- nothing whatsoever.

    The only reason to ever surrogate-key a table like that is if multiple rows might possibly have the same timestamp. But in cases where business logic forbids this, there is no reason to do so.

    Now you may rightfully ask "what if your business logic changes in the future". If there's even a remote possibility of that, then you likely want to bite on a surrogate key anyway. But in cases where that just isn't possible because of the data being modelled, then a "always use surrogate rule" is pointless and damaging.

  • this article serves a perfect example of data bases designed by those DBAs who understand little of the real business worlds and tend to take a static view of the data at hand.

    If the world could stand still, the DBA can do a superb job of optimizing a database. Unfortunately, that fantasy only exists in toy databases such as "AdventureWorks".

    Never use a candidate key that has any business meaning in it as your PK, period!

  • Another issue that is milder but chafes me is the concept that the PK field is usually the clustered index. Certainly this is default, and certainly this sometimes makes sense, but in large tables it may well be that other fields make more sense. If one table joins to another table frequently and that join returns multiple rows, like a claim table joined to a notes table on claim ID, it may make sense to cluster the notes table on claim ID since that will allow the query that paints the notes screen on the claim system to work far faster (and since note ID is almost never used for any queries). Again, one needs to think about what makes sense performance-wise and not just blindly make all primary keys clustered.

  • Michael Wang (10/25/2010)


    this article serves a perfect example of data bases designed by those DBAs who understand little of the real business worlds and tend to take a static view of the data at hand.

    If the world could stand still, the DBA can do a superb job of optimizing a database. Unfortunately, that fantasy only exists in toy databases such as "AdventureWorks".

    Never use a candidate key that has any business meaning in it as your PK, period!

    To "almost" quote ebay: Any rule that has the word "never" in it is usually a bad rule.

    Consider a history table, where the "old" version of a row in the Cutomers table is stored when something changes in the row. My choice of PK would be whatever the PK of the Customers table is, plus the datetime column that holds the time of the change. This key has business meaning, so it violates your rule. Why is this bad?

    And consider what so many people call a junction table - a table that represents the many-to-many relationship between two tables. I will usually model these as just the two foreign keys to the two tables in the relationship, and a primary key over the combination of those columns. Again, a PK with business meaning. And again, I see no advantages to be gained by adding a surrogate, only a few disadvantages.

    For tables that are being referenced by other tables, it might make sense to use a surrogate key, though still not in all cases. I would neveer apply a generalized rule here, but rather decide on a case by case bases which tables would benefit from a surrogate key, and which tables would not.


    Hugo Kornelis, SQL Server/Data Platform MVP (2006-2016)
    Visit my SQL Server blog: https://sqlserverfast.com/blog/
    SQL Server Execution Plan Reference: https://sqlserverfast.com/epr/

  • An interesting article.

    The numbers used in the tests are rather low though.

    In my world, surrogate keys are primarily used in Datamarts or Datawarehouses.

    I don't think the test cases are a fair indication of what to expect when you are working with billions of rows.

    I also don't see a better way of dealing with Slowly-changing dimensions than the use of surrogate keys.

    They are fast to index and fast to seek.

    If you pick their datatypes appropriately they take up much less space.

    In the case of a fact table that stores monthly data, you have 2 choices "Store the Date" or "Store a tinyint or smallint" as the generated surrogate key.

    If you picked the tinyint, you will save yourself 7bytes per row.

    If you have a billion rows in that fact table you would have saved yourself 6GB of space in data storage, some more for the indexes.

    Less space makes for smaller files to scan.

    For Data Warehousing and Datamarts...

    GO Surrogate Keys!!!

  • ebay-1087924 (10/25/2010)


    What if the natural key changes? (It can happen and it DOES happen.) This is the primary reason why I always use surrogate keys.

    Any rule that has the word "always" in it is usually a bad rule.

    Surrogate keys are generally preferable, but there are countless cases where they are not. Consider an audit table that contains only a single column: the timestamp an event occurred. That value will *never* change. Period. Adding a surrogate will vastly bloat the table (by 100% if you're using a smalldatetime), double the number of required indexes, force page lookups whenever you need to perform a join (since your PK is no longer a covering index). And in return, gives you -- nothing whatsoever.

    The only reason to ever surrogate-key a table like that is if multiple rows might possibly have the same timestamp. But in cases where business logic forbids this, there is no reason to do so.

    Now you may rightfully ask "what if your business logic changes in the future". If there's even a remote possibility of that, then you likely want to bite on a surrogate key anyway. But in cases where that just isn't possible because of the data being modelled, then a "always use surrogate rule" is pointless and damaging.

    You're right of course. Always is the new never. (Never say never!) It would have been more accurate to say that my default methodology is to use surrogate keys but rare indeed is the case where it isn't applicable.

    I'm afraid I disagree with your example primarily because you think the value of the key is what is significant. When I said a natural key can change, I meant the fields that comprise a natural key can change not the values in the fields. I worked for a client once that had an order table and an order line item table. The natural key on the order line item table was the OrderID and the LineItemID. Of course, then some genius decided that it was okay to logically delete a line item and then reuse the line item number. Suddenly, the natural key has duplicates. That scenario can't occur with surrogate keys.

    One of the more signifcant benefits of surrogate keys is scalability. You can change the underlying data structures including changing the structure of the natural key without affecting relational integrity.

    "Beliefs" get in the way of learning.

  • Totally agree with Hugo...Our database standards doc says to make sure you have a Natural Business Key (Unique Constraint) on all tables. This is what identifies the day.

    Hugo,

    Thanks for the good examples.

    Thanks,

    Thomas LeBlanc

    @TheSmilingDBA

    Thomas LeBlanc, MVP Data Platform Consultant

  • I'm glad you caught me saying the word that I never use to myself. Apparently this is a bad word for you. But for countless DBAs who are unfortunately tasked with designing a data model for business and cannot properly decipher the ever evolving business world, the word is a good word. I'd rather see them *always* use a surrogate key than see them being *smart* in choosing a natural key (except for timestamps that are naturally sequential and never change) in the name of *optimizing* the data model. Such optimization rarely yields good results in the long run.

  • that's right: make sure you have a unique business key to constrain a record. But this is different from making it the PK. Another confusion between PK and constraints.

  • In the applicaiton database world I would say the use of surrogate key are very situational and would be one of the latter methods of addressing performance issues.

    In the data warehouse world they are in my mind requiered. They are by far the easiest way to manage Slowly Changing Dimentions as well as manageing updates to a dimentional model.

  • Michael Wang (10/25/2010)


    ...But for countless DBAs who are unfortunately tasked with designing a data model for business and cannot properly decipher the ever evolving business world, the word is a good word. I'd rather see them *always* use a surrogate key than see them being *smart* in choosing a natural key (except for timestamps that are naturally sequential and never change) in the name of *optimizing* the data model. Such optimization rarely yields good results in the long run.

    a) We're paid professionals, not trade-school pipe-welders. Justifying a rule on the basis of "I/he/she/they can't understand to how do any better" is rather specious. I recognize the attractiveness of cookie-cutter set-in-stone rules, but for anyone trying to be at the top of their game, I strongly urge a deeper understanding.

    b) Optimizing out spurious surrogates can sometimes yield performance benefits exceeding 100%. Gains of such magnitude are not common, but prevalent enough that it behooves us to understand the situations under which they can occur.

    c) The timestamp example I gave was just one situation. There are many others. In databases which must hold older or original values of rows, for instance, a natural key is often preferred, as even if the key later changes, you don't *want* that change cascading downward.

    d) Even in situations where the natural key may be updated, claiming this somehow precludes its use is a facile argument. It all depends on the situation. Balance the cost of a cascading update vs. the additional index maintenance of a surrogate. Often, the cascade is essentially "free", since any change to the natural key data must also trigger some other business logic anyway. OTOH, a changed key may cause complications beyond a simple cascade, in which case the balance tilts towards the surrogate.

    Having said all this, I still prefer surrogates as a *general* rule. And there are also cases where surrogate key use can *boost* performance. That's why the point in my original post stands: there is no hard and fast rule as to which key is better in all cases. Well-educated professionals understand why to use one versus the other, and make the choice appropriately.

  • Hugo Kornelis (10/25/2010)


    Michael Wang (10/25/2010)


    this article serves a perfect example of data bases designed by those DBAs who understand little of the real business worlds and tend to take a static view of the data at hand.

    If the world could stand still, the DBA can do a superb job of optimizing a database. Unfortunately, that fantasy only exists in toy databases such as "AdventureWorks".

    Never use a candidate key that has any business meaning in it as your PK, period!

    To "almost" quote ebay: Any rule that has the word "never" in it is usually a bad rule.

    Consider a history table, where the "old" version of a row in the Cutomers table is stored when something changes in the row. My choice of PK would be whatever the PK of the Customers table is, plus the datetime column that holds the time of the change. This key has business meaning, so it violates your rule. Why is this bad?

    And consider what so many people call a junction table - a table that represents the many-to-many relationship between two tables. I will usually model these as just the two foreign keys to the two tables in the relationship, and a primary key over the combination of those columns. Again, a PK with business meaning. And again, I see no advantages to be gained by adding a surrogate, only a few disadvantages.

    For tables that are being referenced by other tables, it might make sense to use a surrogate key, though still not in all cases. I would neveer apply a generalized rule here, but rather decide on a case by case bases which tables would benefit from a surrogate key, and which tables would not.

    Hugo,

    I'm a little hazy on what you defined as the primary key. If it's a many-to-many relationship how can the combination of the two foreign keys represent a primary key? There is no guarantee of uniqueness in a many-to-many relationship. That's why we created the junction table after all. Perhaps I'm misunderstanding you. I like to use surrogates in this instance so I can have a legitimate primary key as well as a clustered index to avoid having the table be a heap. For me, being able to create a clustered index is where the surrogate key provides value.

    "Beliefs" get in the way of learning.

  • Robert Frasca (10/25/2010)


    I'm a little hazy on what you defined as the primary key. If it's a many-to-many relationship how can the combination of the two foreign keys represent a primary key?

    Hugo's example was actually a FK + a non-FK column. However, there are plenty of cases where two FKs do constitute a natural PK -- a link table implementing a M:M relationship, for example. Consider a table of customers, and a table of airline flights. The table linking customers to flights will have a customer FK and a flight FK, and their concatentation is the natural key for that table.

Viewing 15 posts - 16 through 30 (of 178 total)

You must be logged in to reply to this topic. Login to reply