The Identity Debate

  • Adding surrogate keys just to ensure a data record is unique may not be the best use of surrogate keys. But if a data record does not have a natural key, or the only natural key turns out to be the entire record, wouldn't it make sense to add a surrogate key? Especially if this table is used in multiple joins where the surrogate key would take a lot less space than using the entire record (if it is 16 columns or smaller).

    The case you provided, no the surrogate key doesn't make sense. This is where design and trade-offs occur, and they must be looked at carefully before just adding them. Also, surrogate keys do not eliminate the need for natural keys. A natural unique key on Country in your example would have eliminated the duplication of US and CA from the table.

    😎

  • Matt Miller (2/11/2008)


    That's an interesting thought, but there are times when you might get the same data (say - you're tracking outcomes of consecutive coin tosses). There really might not be anything unique other than something artificial. Your approach would preclude ever storing that data. That's the case where there isn't a natural key.

    Actually your natural key would be (TimeOfToss), or perhaps (TossNumber), in that case. If you were tossing multiple coins simultaneously the natural key could be something like (CoinNumber, TimeOfToss) or (CoinNumber, TossNumber). Generally there's always some way to uniquely identify a row, although people's data models don't always accurately reflect enough information to make unique row identification possible.

  • Matt Miller (2/11/2008)


    That's an interesting thought, but there are times when you might get the same data (say - you're tracking outcomes of consecutive coin tosses). There really might not be anything unique other than something artificial. Your approach would preclude ever storing that data. That's the case where there isn't a natural key.

    I would argue that in the case you describe the "toss number" *is* the natural key, and you want to assign that in a way that guarantees no gaps in the sequence, even if the data is moved around.

    If you don't have a natural key at all, you just don't have a relation. I'm assuming this discussion revolves around relational databases, if not, there does become many other ways to do things...

  • Mike C (2/11/2008)


    Matt Miller (2/11/2008)


    That's an interesting thought, but there are times when you might get the same data (say - you're tracking outcomes of consecutive coin tosses). There really might not be anything unique other than something artificial. Your approach would preclude ever storing that data. That's the case where there isn't a natural key.

    Actually your natural key would be (TimeOfToss), or perhaps (TossNumber), in that case. If you were tossing multiple coins simultaneously the natural key could be something like (CoinNumber, TimeOfToss) or (CoinNumber, TossNumber). Generally there's always some way to uniquely identify a row, although people's data models don't always accurately reflect enough information to make unique row identification possible.

    True - and that's where your description from earlier starts applying ("data model rarely survives contact with the real world").

    Assuming I have control over the data, and I CAN get the unique data, and I can ENSURE that it's unique, then I do; I also put up a fight when a situation arises that would lead me to implement without one.

    That being said, that's a heavy "burden of proof" and an impossible one if you can't get the end-user to participate (meaning - if the extra info is "not relevant"/not needed in their minds, etc...). At that point - adding any number of other arbitrary elements just to make things unique is really no better than any other arbitrary element. Also - what if you DON't want identifiable info stored (such as metrics, but not necessarily something traceable back).

    If you can't PROVE that you have a natural key, then you'd be better off not assuming you have one. Notice I didn't say anything about not fighting the good fight, and trying to get one: there just are circumstance where you might need to go to ludicrous extents to pull something out of the air just to make it unique. If that data doesn't fit within the scope of what you're out to store, then what value does it bring?

    And yes - we are talking about relational data. this kind of data would be at the "bottom" or child position in relations most of the time.

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • Without an SID, how would you store multiple addresses for a person, and be able to modify a single address, without using every field in the address? Would you really want to use the PersonID + StreetAddress1 + StreetAddress2 + City + State + ZipCode as the primary key of a personaddress table. Its not that big of a deal until you start having to link to an individual PersonAddress record, like say, tracking promotionals you have sent to a person, maybe you want to know which address as well. Even worse, what if 4 people shared the same address, and you did want to reflect that fact that they are all indeed the exact same address. Wouldn't it be best to have a People table (PersonID) and an Addresses table (AddressID) and a PersonAddresses table (PersonID, AddressID) (assuming that multiple people can share an address and people can have multiple addresses).

  • SteveRowland (2/11/2008)


    Without an SID, how would you store multiple addresses for a person, and be able to modify a single address, without using every field in the address? Would you really want to use the PersonID + StreetAddress1 + StreetAddress2 + City + State + ZipCode as the primary key of a personaddress table. Its not that big of a deal until you start having to link to an individual PersonAddress record, like say, tracking promotionals you have sent to a person, maybe you want to know which address as well. Even worse, what if 4 people shared the same address, and you did want to reflect that fact that they are all indeed the exact same address. Wouldn't it be best to have a People table (PersonID) and an Addresses table (AddressID) and a PersonAddresses table (PersonID, AddressID) (assuming that multiple people can share an address and people can have multiple addresses).

    I think this is a great example of where a SID does make sense. However, you would still need to ensure uniqueness on combination of StreetAddress1 + StreetAddress2 + City + State + ZipCode , and you would need to use each of those fields to check whether a given address already existed before adding a new address with a new AddressID. What you really don't want is to have 2 different records that point to the same physical address with different SIDs.

  • SteveRowland (2/11/2008)


    Without an SID, how would you store multiple addresses for a person, and be able to modify a single address, without using every field in the address? Would you really want to use the PersonID + StreetAddress1 + StreetAddress2 + City + State + ZipCode as the primary key of a personaddress table. Its not that big of a deal until you start having to link to an individual PersonAddress record, like say, tracking promotionals you have sent to a person, maybe you want to know which address as well. Even worse, what if 4 people shared the same address, and you did want to reflect that fact that they are all indeed the exact same address. Wouldn't it be best to have a People table (PersonID) and an Addresses table (AddressID) and a PersonAddresses table (PersonID, AddressID) (assuming that multiple people can share an address and people can have multiple addresses).

    For something like that, you would usually have some sort of Address Type identifier. For instance, you probably wouldn't want to mail items to a physical address if the person has a mailing address specified. Without the Address Type identifier (which would be part of the natural key), you would theoretically retrieve addresses "randomly" (in no particular order) and you wouldn't be able to pull up a specific type of address (home, work, mailing, billing, etc.) for any given person. You might want to split out the AddressType into a separate table that relates person to address in an m:n relationship for purposes of further normalization; and in that case a surrogate key definitely makes more sense than duplicating all address columns in an intermediate m:n relationship table.

    To me though, the important point is not whether your natural key takes 1 column or 20 columns, but rather that you can identify a natural key and zero or more candidate keys within the table.

  • To me the natural keys are like requirements. Both have the twisted tendency to change. In many cases what looks like a perfect natural key today looses it's quality tomorrow.

    Also each application has it's own particular relationships. Take the address - person relationship mentioned above. In my database the person and his/hers address are on the same record. I don't care if someone has multiple 'locations' - I just need one so I can mail them whatever I have to mail them. But maybe for some tax related software you need a different kind of relationship.

    ---------------------------------------------
    [font="Verdana"]Nothing is impossible.
    It is just a matter of time and money.[/font]

  • I use Identity columns in many, but not all tables.

    For example, and identity is the best key I've found yet for people. Same name different person, same person different name, etc., make it difficult/impossible to have a good natural key for people. Yes, I have had problems with the same person ending up with two different IDs, when an end-user was, for example, adding "Steve Jones" to a system, and didn't bother to find out that he was already in there, possibly under "Steven Jones" or "Stephan Jones" or "Steve A Jones", etc. I can't think of any way to really prevent that.

    On the other hand, many-to-many join tables, for example, I would never put a identity in again. (I did, several times, when I new to databases. Regretted it later every time.) Those tables should have the FKs of the tables being joined as the PK of the join table. Works much better.

    I've often been lazy and used state abbreviations as a PK/FK and not had an identity on those. Works for a modern commerce database, but would be a mess if I had to have more history than I've had to deal with. Also confuses some people when they see "Where State = 'VI'", and they think Virginia but it's actually the Virgin Islands territory, and I have to admit I sometimes have to think a bit to remember which one is "AL" and which "AK", etc.. (The Oregon Territory, for example, was eventually split up in to Oregon, Washington and, if I remember correctly, Idaho. A database that had historical data from that territory would definitely be better off with a more abstract key.)

    More modernly, what would happen if the US had split Iraq into three countries (or more), based on ethnic territories? What if one of the countries were still called "Iraq" and kept the ISO abbreviation for that name? Oil production databases would have a problem comparing the historical production versus the modern, unless a more abstract key were used.

    On pure log tables, I often don't include even a primary key or clustered index (just a heap table, no indexes of any sort). Since the only thing that really matters on those on a routine basis is speed of insert, that works quite nicely. Then, when I archive the logs, I add a primary key, clustered index, non-clustered indexes, and do deduping and unique indexes as needed at that point.

    So, I have to say, I use mixed methods depending on what the table is for and how it will be used.

    On the related subject of GUIDs, I generally haven't found them useful. They take up too much disk space, slow down the whole database (because of size), and have few benefits. Yes, they can cut down on the network traffic on complex, high transaction systems, because of local generation. In that instance, I probably would use them. Other than that, I don't. I find I can roll-my-own replication using a separate "database instance ID" as part of the PK, but again that's probably only useful to me because I've not had to deal with a lot of it.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • I have been in at least one design process in which the users seemed to regard it as a game: you ask if one set of columns is unique; they screw up their faces and come up with some outlandish situation in which the set of columns is not unique. You point out that no such thing has ever occurred; they reply that you have to allow for the possibility. You point out that we don't design for the 1 in 10,000,000 case; they say "Yes you do." And so on.

    Their work is so refined and complex that not even the most innocuous generalization holds. To suggest otherwise is tantamount to insult.

  • On the other hand, many-to-many join tables, for example, I would never put a identity in again. (I did, several times, when I new to databases. Regretted it later every time.) Those tables should have the FKs of the tables being joined as the PK of the join table. Works much better.

    In a join table, the identity is often nice-to-have, but not always as useful as in a PK table.

    As for the case of natural PKs as FKs in many-to-many tables: now you're in for some real pain when a "natural" PK needs to be updated: you can't update it in the PK table (FK violation), AND you can't update it in the join table (again, FK violation) without dropping the FK constraint (temporarily, of course). Played that game, no fun.

    The FK combinations should be unique.

  • In my 15 years of database design and software engineering experience, I have learned a few important aspects of design:

    1) Doing a detailed functional design before a database design is always required for the best results. This seems to be a “lost art” these days.

    2) A good functional design usually always drives the natural keys. This not only aids in database design, but significantly improves application performance.

    3) Identity columns (surrogate keys) are the best way to create one to many related tables. There should always be a single primary table with a unique natural key that is generally the primary read in a query. Joins to all related tables are then achieved through the surrogate key for best performance and integrity.

    4) Every table with significant amounts of data should have a clustered unique index, whether it be a surrogate or natural key. The lack of a clustered index, especially on large tables, will result in very poor performance.

    5) There is no problem with a natural unique key changing values as long as it remains unique (which you can control). Generally speaking, as long as the natural key is not duplicated in other tables (one reason for surrogate keys), then you have the best of both worlds.

    In summary, the functional design and application requirements should always determine how you setup your table keys, not whether we “like” this method or the other. The end result of all this is good performance, which is the key measure of any good application.

  • Well said Mike.

    I'm going to assume you include data integrity as part of a well performing application 😀

  • Thanks.

    Yes, good relational data integrity would be implied if the application performs well. A person should not be designing databases if they don't know how to maintain that integrity.

    Also, all good applications should establish well defined business rules to validate data whether it be technology or procedurally driven (again, another major focus of the functional design).

    Garbage is still garbage even if relational data integrity is solid. Don't get me started on user data accountability -- we could do a whole new thread on that !!

  • Same subject... slightly different slant...

    I, for one, would like to congratulate this fine community of professionals such as those witnessed in this thread. The subjects of auto-incrementing columns and the use of surrogate keys v.s. natural keys have been hotly debated several times... and never have I seen it done anywhere in a more professional, courteous, informational, and educational manner than on this thread!

    Thanks for the great read, folks!

    Steve, I think part of the professional responses to your article are because of the way you wrote the article. Well done, Sir... no wonder you're an "MVP". :w00t:

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 15 posts - 46 through 60 (of 129 total)

You must be logged in to reply to this topic. Login to reply