Do You Really Need Surrogate Keys? Part 1 – Concepts and Methods

  • David Portas (11/4/2010)


    Michael Wang (11/4/2010)


    David Portas (11/4/2010)


    Michael Wang (11/4/2010)


    Two problems keep surfacing from this thread: confusion between a primary key and unique key constraints, and more fundamentally why the two are required in the first place.

    A primary key is required to identify a data item - it's purely physical and has nothing to do with the real world but everything to do with the data storage. This is where the physical model is insulated from the real world.

    No, this is not a helpful distinction in any way.

    ....

    Either you see it or you don't: the distinction is to separate your model world from the real world. It's where the real line is drawn.

    Sure but in the words of Chris Date it's a "purely psychological" distinction, calling a key "primary" being just an aide-mémoire or a convenient label. It's only as important as you want it to be. What I objected to was your implication that this was somehow a definition of a primary key - it obviously isn't because the name does not define any special type of key.

    OK, I accept your criticism: by primary key I meant a surrogate key and by unique key I meant natural unique key.

  • ta.bu.shi.da.yu (11/4/2010)


    the formal definition of the normal forms relates to the tables themselves. You seem to be confused with how the normal forms work.

    Note that I'm not saying that what you say is a bad idea - if you have two tables that do the same thing, then that is quite silly and you should of course rationalize the table design.

    Good god, did they stop teaching data modelling entirely in college? Or have people just stopped going in the first place?

    I've already not only explained why this is false, I even gave examples of normalization that required modifications to multiple tables simultaneously. I suggest you read them....or just read a basic primer on normalization.

  • Michael Ebaya (11/4/2010)


    ta.bu.shi.da.yu (11/4/2010)


    I am really curious to know why it's a good idea to not enforce the immutability of the key!

    This has already been covered in excruciating detail. First of all, a surrogate reduces the likelihood of an update, but it doesn't eliminate it entirely. More importantly, immutability is a desirable quality of key, but its not a requirement of a key. There are other desirable qualities that natural keys have (in some cases), and surrogates do not (again, in some cases). Neither is best in any and all cases.

    Still trying to see what desirable qualities that natural keys have. Could you point out some examples?

    Random Technical Stuff[/url]

  • Michael Ebaya (11/4/2010)


    ta.bu.shi.da.yu (11/4/2010)


    the formal definition of the normal forms relates to the tables themselves. You seem to be confused with how the normal forms work.

    Note that I'm not saying that what you say is a bad idea - if you have two tables that do the same thing, then that is quite silly and you should of course rationalize the table design.

    Good god, did they stop teaching data modelling entirely in college? Or have people just stopped going in the first place?

    Which particular normal form do you believe helps you rationalize particular tables? I'm seriously interested in which one you refer to. Please, just either answer:

    a. 1st normal form,

    b. 2nd normal form,

    c. 3rd normal form

    I won't go into further normal forms, unless you want to, that is.

    Also, which basic primer on normalization would you recommend? I'm curious which one supports your idea.

    Random Technical Stuff[/url]

  • ta.bu.shi.da.yu (11/4/2010)


    Still trying to see what desirable qualities that natural keys have. Could you point out some examples?

    A natural key enforces a data integrity rule that stops some meaningful data being duplciated. A surrogate key does not. I already discussed the example of a unique login name where it's desirable to have a key that is unique but changing. Not really sure what other example you need.

  • ta.bu.shi.da.yu (11/4/2010)


    Michael Ebaya (11/4/2010)


    ta.bu.shi.da.yu (11/4/2010)


    I am really curious to know why it's a good idea to not enforce the immutability of the key!

    This has already been covered in excruciating detail. First of all, a surrogate reduces the likelihood of an update, but it doesn't eliminate it entirely. More importantly, immutability is a desirable quality of key, but its not a requirement of a key. There are other desirable qualities that natural keys have (in some cases), and surrogates do not (again, in some cases). Neither is best in any and all cases.

    No, you are an anonymous poster who claims that he knows a great deal about databases, and not only that make claims about normal forms that aren't true.

    Such as what? Your little myth that "no normal form affects more than one table" has already been roundly debunked.

    Actually, it hasn't. As has been pointed out to you a number of times, the formal definition of the normal forms relates to the tables themselves. You seem to be confused with how the normal forms work.

    Note that I'm not saying that what you say is a bad idea - if you have two tables that do the same thing, then that is quite silly and you should of course rationalize the table design.

    What's seems to be confusing a number of you is that you keep referring to entities as "tables". A data entity is NOT a table. For example: An Invoice entity might look something like this:

    Invoice Number

    Invoice Type

    Account Number

    Invoice Date

    Fiscal Period

    Billing cycle number

    Amount

    While it looks like a "table" this is an abstraction of a business object, i.e. an entity. It can be normalized into a number of other entities, each with their own set off attributes including:

    Invoice

    InvoiceType

    Account

    FiscalDate

    BillingCycle

    When I drill into each of these entities I may normalize into even more entities in order to create relationships to other areas in the data model or I may choose to make each of these a table when I ultimately create the physical model. In other words, many of the attributes of my invoice entity might actually turn out to be foreign keys to other entities.

    Getting back to the original topic for a moment, if I use surrogate keys as the primary keys of these entities I am not bound to existing naming or numbering conventions. For example, my "natural" account number might be numeric; however, I may acquire another business that uses alphanumeric account numbers. By using surrogate keys I can merge the new business into my existing account structure by simply changing the datatype of my AccountNumber attribute without having to make changes to my relational infrastructure.

    "Beliefs" get in the way of learning.

  • David Portas (11/4/2010)


    ta.bu.shi.da.yu (11/4/2010)


    Still trying to see what desirable qualities that natural keys have. Could you point out some examples?

    A natural key enforces a data integrity rule that stops some meaningful data being duplciated. A surrogate key does not. I already discussed the example of a unique login name where it's desirable to have a key that is unique but changing. Not really sure what other example you need.

    David, there are other ways to enforce the uniqueness of the natural key. It doesn't need to be the primary key. By the way, a natural key that can change isn't a key. It's just another attribute. I didn't see your example but a login name isn't a key. It's an attribute of a user account. In Windows, for example, there is a surrogate key in active directory called a SID (Security ID). An excellent example of the power and flexibility of a surrogate key.

    "Beliefs" get in the way of learning.

  • Robert Frasca (11/4/2010)


    What's seems to be confusing a number of you is that you keep referring to entities as "tables". A data entity is NOT a table.

    However, in the relational database model, normalization applies specifically to relations, not "entities" (or tables for that matter). The idea of distinguishing between entities and relationships is an ER modelling concept that came along well after the relational model and normalization were invented. Many people find ER modelling useful but not everyone uses it. So I think it's best to avoid ER modelling terms if we are going to discuss relational database normalization. There is too much ambiguity in the term "entity" unless you make it clear that you are talking specifically and only about ER modelling concepts rather than just relational databases.

    Robert Frasca (11/4/2010)


    there are other ways to enforce the uniqueness of the natural key. It doesn't need to be the primary key. By the way, a natural key that can change isn't a key.

    If uniqueness is enforced declaratively then by definition that is a key. SQL provides at least two ways to implement keys, using either a PRIMARY KEY constraint or a UNIQUE constraint (on non-nullable columns of course). They both achieve the same thing.

    As already stated, in the relational model a key is only required to be unique and irreducible. Stability is never a requirement because in principle a tuple is identifiable only by attributes and it is arbitrary and pointless to say that one tuple "cannot" be replaced by a different one in the same relation. Therefore keys are no more or less updateable than any other attribute. SQL is not truly relational of course but in this respect the SQL standard fortunately follows the relational model and allows keys to be updated.

  • David Portas (11/4/2010)


    As already stated, in the relational model a key is only required to be unique and irreducible. Stability is never a requirement

    Exactly so. Too many developers are so used to the convenience of using a PK as a rowid to allow single-row updates that they conflate the terms, and erroneously believe a key can't be updated and remain a key.

    You're also spot-on with the statement that normalization relates to relationships, not tables. A relationship (or the somewhat more fuzzy term 'entity') can be captured within one table, or in multiple tables...but in all cases, normalization defines a rule for the that relationship is expressed. It's not something that's "specific to one table at a time".

  • I don't understand why any discussion about surrogate vs natural keys always degenerates into an idealogical 'flame-war'. At least this particular thread has remained fairly tame.

    So a further 2c...

    What we're in danger of forgetting in these 'enthusiatic' discussions, is that databases don't exist for their own benefit, and that the people who pay us (directly or indirectly) to create said databases (i.e. end users) don't care about database theory (no matter how much we try and convince them that they should).

    Then, in between the end users and our databases is a 'abstraction layer' known as the 'application software developer', these good people have a moral obligation to understand basic database theory, but also have the task of converting 'Data' into 'Information' for their end users.

    To me, the *prime* obligation of the database designer is to design databases that a) accurately model the end users' data requirements b) help our developers to be effective in producing applications that their end-users find user friendly, and c) help ourselves and our successors in database maintenance, by making our designs sensible, reliable, simple but well normalized, as 'future-proof' as we can practically make them and as 'self documenting' as possible.

    If we forget these, we are failing in the duty of care we have for our clients and our co-workers.

    To this end, as someone who operates regularly at all 3 levels (user, developer and db designer), and usually working on someone else's long neglected application, my view is that it doesn't matter so much which methodology (surrogate vs natural) you choose to use in your database design, as long as you normalize your database structure properly, and apply your preferred methodology consistently. Do that, and I can work with what you give me.

    If your choice of key (be it natural or surrogate) for any given table makes it difficult for your software development team to deliver good information to their end-user, then you have failed.

    For example, if your product development team need to use an application generation tool that works better with surrogate keys than with natural keys, then to me it would make sense to implement surrogate keys. To not do so just because it would 'corrupt' your database structure would seem to me to be counter-productive.

    I've seen databases that work using both techniques, and I can't say that I have a really good argument that one is better than the other.

    Although I *do* have a previously stated preference for surrogates, with a recent experience of migrating a client's data from an old accounting package to a new ERP package highlighting why this is - I would have saved an awful lot of time if both systems used surrogate keys instead of slightly different sizes of 'natural' key, that my client decided they wanted to keep. But that's a different story.

    /Ryan

    [EDIT] Just in case someone misses this in my points: My main concern here is about 'practical implementation' - a theoretical data model should always have relationships based on 'natural' attributes of the relevant entities.

    [EDIT] Took out some unneeded, distracting 'rhetoric' (Thanks David)

    P.S. my 'rhetoric' was trying (badly) to point out that many people fall into the trap of equating 'surrogate vs natural keys' with 'normalization', as if they are the same discussion. They are not, IMHO; quality of normalization has little to do with whether surrogate or natural keys are used. I've seen both bad and good designs from both 'camps'. Since I can't figure out where to put this point within my original text, I've added it as a postscript.

  • Ryan C. Price (11/4/2010)


    I don't understand why any discussion about surrogate vs natural keys always degenerates into an idealogical 'flame-war'.

    Part of the problem is a widespread lack of understanding about relational databases in general and keys in particular. Some people unwisely assume that a table can / should have only one key, therefore they see natural keys as a "threat" because they think they are being forced to choose between either a surrogate or a natural key. That's nonsense of course. Surrogates and natural keys achieve different things and a table should have as many keys as needed to model the busines domain accurately. Surrogate keys alone can't achieve that because they won't stop business data being duplicated in a table. That doesn't mean DON'T use surrogates, it just means DO use natural keys.

    To me, the *prime* obligation of the database designer is not to design databases to 'twelfth normal form'*, but to design databases that a) accurately model the end users' data requirements b) help our developers to be effective in producing applications that their end-users find user friendly, and c) help ourselves and our successors in database maintenance, by making our designs sensible, reliable, simple but well normalized, as 'future-proof' as we can practically make them and as 'self documenting' as possible.

    I highly recommend you avoid using rhetoric like "twelfth normal form". It might be (mis)interpreted as a lack of respect for data integrity and sound design practices. Data integrity is one of the biggest challenges in our industry and lack of it is one of our biggest problems. It is not to be taken lightly. Obviously normalization helps make sure you achieve all of your stated a), b) and c).

  • David Portas (11/5/2010)


    I highly recommend you avoid using rhetoric like "twelfth normal form". It might be (mis)interpreted as a lack of respect for data integrity and sound design practices. Data integrity is one of the biggest challenges in our industry and lack of it is one of our biggest problems. It is not to be taken lightly. Obviously normalization helps make sure you achieve all of your stated a), b) and c).

    point taken. I will revise my post; I was just having a 'moment', it was early in the morning when I posted that.... reading on a new day, I can see that I could have expressed my point better.

    I was only intending to highlight that this discussion is supposed to be about surrogate keys vs natural keys, not normalization; while they're related, they're not the same. As I think you pointed out, if not in those exact words.

  • JJ B (10/25/2010)


    Hugo wrote:

    And consider what so many people call a junction table - a table that represents the many-to-many relationship between two tables. I will usually model these as just the two foreign keys to the two tables in the relationship, and a primary key over the combination of those columns. Again, a PK with business meaning. And again, I see no advantages to be gained by adding a surrogate, only a few disadvantages.

    I can think of a giant advantage--at least to my programs.

    Suppose I have a such a junction table (I call it a link table) with a primary key on the two FKs as you describe. Now suppose the user in the front-end tries to change one of the values in an existing row and unwittingly tries to change one of the FK values to a value that would result in a duplicate row.

    Of course, you wouldn't think about creating such a table without also having a unique constraint across the two foreign keys right? As far as I can tell, assuming you use a proper design, your problem is no different whether you have a surrogate key or not.

    If you have a junction table with nothing but a set of FK values (i.e. no other columns), then it represents a child list to the parent object. Your code to save the user's data would look something like:

    Find a rows with the values entered by the user (notice I don't care if there is a duplicate)

    If you find that row, then update it and delete the old row

    If you do not find that row, the insert a new one and delete the old one.

    In SQL 2008, I believe you could do this entire operation in a MERGE statement.

  • I think Hugo said it best many pages ago, but it bears repeating. For every table in which you use a surrogate key, you should also have at least one additional unique constraint. Without it, users can create row upon row of duplicate data except for the surrogate key (I've seen it too many times). This is probably the most common mistake made by new database developers. Using a surrogate key does not obviate the need for another key that enforces uniqueness comprised of user viewable values since we never show the surrogate to the user.

Viewing 14 posts - 166 through 178 (of 178 total)

You must be logged in to reply to this topic. Login to reply