The Identity Debate

  • I religiously make use of surrogate keys, even still I do not use them where no advantage is gained by doing so(reference data tables, i.e. zip code, code tables, country tables, etc). Natural keys are still needed, for data integrity and ensuring that no duplicate data is generated, but these always find themselves being AK's in my designs.

    Identities as foreign keys have fallen out of favor with me. In today's world, we do not always have a single database. We have to replicate (I use the term loosely) data to multiple databases, or allow data entry in multiple locations. In many cases people generate a new identity in the destination DB, this creates too much pain and confusion in my opinion. Guids, while requiring more storage space, do not greatly impact performance of joins, and due to their nature, are ideal for surrogate keys.

    The first couple of systems I worked on in my career (set the way back machine to the early 90's) involved assumptions made regarding natural PK's (including SSN's and what were supposed to be unique air-can ids). In both those cases, I discovered that not only can these values change, they can be entered incorrectly due to human error with alarming frequency.

    Since then, I have been a zealous conscript. 🙂

  • Like most everybody who has posted, I'm also straddling the fence. And while it could have been put more nicely, I must agree that there is some loss in formal understanding of the design process and what, in the relational world, primary keys exist for.

    The one comment I waited for was that incremental clustered primary keys are sequential and don't cause page splits. True enough, but in the real world, isn't there some other sorting mechanism one might use in a database to improve search times that mimics the real relationships in the data? If all the data moves only one way (IN), the incremental clustered primary key is ideal. But when you are working on data that must be retrieved given various WHERE and ORDER BY conditions, that clustered key is gonna be useless when it comes to retrieval.

    This is in no way or means to be taken as advocacy against using an incremental surrogate key. However, it IS advocacy regarding learning and understanding how the data will be used so that your design also considers the best way to retrieve that data for the general case.

    ------------
    Buy the ticket, take the ride. -- Hunter S. Thompson

  • I definitely agree that surrogate primary keys should be used rather than natural primary keys. As mentioned earlier, you can always put a unique index on the set of columns that make a row "naturally unique".

    This may be a separate argument, but I don't like the use of identities. I've always advocated the use of either OIDs or GUIDs. Once I moved from Sybase to MSSql, I found GUIDs to be very useful since they can be generated on the server or on a .Net client app and are still guaranteed to be unique. The problem with identities is they aren't guaranteed to be unique across servers or multiple tables, and if you ever have to combine the data, you'll have to deal with the mess of translating those identity values to a new value, updating child table records, etc. In moving the data from one table to another, you'd also have to turn off the identity generation, move the data, turn it back on, and make sure the next generated identity doesn't conflict. I'm not sure of the details exactly how painful this is since I'm a developer and not a DBA, but in general identities don't handle change as well as GUIDs.

  • (G Bryant McClellan:) Well - now I'm thinking you are confusing the primary Key and clustered key functions. They are not the same and do not necessarily have anything to do with each other, and it's dangerous to use them interchangeably. I am actually not so thrilled by the fact that SQL Server implicitly encourages you to think of them as the same thing, when they ultimately have very little to do with each other (on the "purpose for being" side of things).

    It's of course a matter of balance. It's nice and all if the key dictating your table's physical order is also the one most used for filtering and sorting the rows. Not spending the time to figure out what is used most in those cases is essentially a problem.

    Personally - once the numbers of inserts occurring is small compared to the overall data set, I'm more likely to introduce some slack space to allow for inserts throughout, with an occasional reorg/rebuild of the table to weed out excess page fragmentation. But still - not having to hunt down and handle the DRI nightmare of continuously having changes to your natural Primary keys is not a small benefit to be disregarded lightly.

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • From my point of view, there is definitely room for both identities and natural keys, but I have to say that I much prefer natural keys for one main reason that I don't think has been mentioned: readability. A WHERE clause that says "where state = 'MN'" is instantly readable; a WHERE clause of "where state = 14" doesn't mean a lot by itself.

    Identities do definitely have their place - frequently-changing data should never be used as a key, for example - but they can definitely be overused, also:

    * One application that my company purchased from a third-party actually uses join tables to map dates to a table that has an identity value for each date. (This was discovered when the system ran out of pre-mapped dates and crashed.)

    * Another application that I've worked with had numerous tables that only consisted of dozens of columns of meaningless, autogenerated numbers.

    We also have a data warehouse (of sorts) in which entire tables of 500,000 rows are reloaded every night from dumps from the primary systems. If we used identities on these tables, the same row could potentially have a different identity value each day.

    As someone else said, it's all a matter of balance.

    Ed Leighton-Dick | Consultant | Microsoft Data Platform MVP | MCSE | PASS Regional Mentor

  • I wondered if anyone would mention this point. I'd refer everyone to search out Kimberly Tripp's analysis of the use of clustered indexes, and her reasons why ID fields make an ideal cluster index. Bottom line, she basically recommends clustering most tables that hold significant amounts of data, and using an ID for the cluster key. There may be exceptions, but they're the exceptions, not the rule. I won't rehash her logic, she explains it much better than I, and she's one of those MVPs that any SQL Server professional needs to know.

    Granted, you may not need generated keys for small lookup tables. But they probably don't cost that much.

    With regard to the sex example, I once mentioned the lack of necessity for a lookup table on sex to a colleague (we were discussing the notion of set domains). After all, everyone is one or the other, right? He told me that California recognizes something on the order of 2 dozen different sexes. What are the others? Male, formerly female. Anatomically male, living as female... You get the idea. The wonders of science and modern medicine.

    BTW, anyone ever have to update several hundred reports when the company name changed, and the old one was hardcoded as a literal everywhere? Funny the stuff IT professionals sometimes don't even recognize as data in the first place.

    I used to be a big believer in natural keys, until I realized that IT folk have a naive view of the workings of a lot of the fields we support. The fewer assumptions the better. When in doubt, normalize and abstract.

  • Wow! No time to read all of the comments right now, but I'll add my 2 cents.

    Steve Jones said:

    ...and above all, were easy for a DBA to remember and work with when building applications or troubleshooting issues.

    When I'm building a moderately sized application from the ground up, by myself. I would be designing the database, the user interface, and the data access code at each end. Identity columns are perfect for all of the reasons stated by Steve. The choice of foreign key constraints becomes a no brainer, and very easy to remember when going from the T-SQL code to the UI app's code, and back. Easy for joins. Easy to return to when you realize the need for additional functionality, or just for testing, or troubleshooting.

    Additional unique constraints should still be used when there is a role for them.

    I'm not preaching that Identity fields are the only way to go, just that it is the choice I usually will make in this type of scenario.

    Tom Garth
    Vertical Solutions[/url]

    "There are three kinds of men. The one that learns by reading. The few who learn by observation. The rest of them have to pee on the electric fence for themselves." -- Will Rogers
  • Having run into databases with large numbers of natural keys, I became a convert to surrogate keys when it became clear that many of these "natural" keys shared many properties that "natural" things have: mutability, inconsistency and complexity.

    For small lists of things that are unlikely to change I will use a natural key: gender, states and other small lookup tables of that nature. The pain of a one time cleanup on these is small and the ability to write more natural looking queries is useful. (Yes, I have had to clean up gender in the past; it is quite a shock when you realize that XX and XY are just the tip of the iceberg. If you are looking at genetic testing, triple forms such as XXX and XYY are fairly common (both are 1 in 1000 in the population of each gender, and there is a mosaic of more complex forms as well).

    However, I have started to even convert those tables over to surrogate keys simply because I am using an Object Relational Mapper for 99% of my projects now and the natural query problem goes out the window because I can have the ORM map the keys out for me... so now it depends on if there will be extensive access outside the ORM or not: if not, I am now 100% surrogate keys.

    That said, I do hate it when the natural key isn't covered with a proper index. Using surrogates does not mean just slapping an surrogate down and ignoring the search aspects that the natural key (in index form) provides.

  • I use identity keys on virtually every table for the same reasons mentioned:

    1. Natural keys have a way of proving themselves non-unique as time goes on

    2. Relational integrity is easier to maintain, and I believe that maintaining relational integrity trumps "purism" in a single table

    3. Trigger code is simplified immensely, especially on multi-row operations that fire the trigger a single time

    4. Developers are not given the "opportunity" to play games with smart keys or similar assumptions

    I am blissfully unaware of the debate about NULL. I use NULL when/where it is appropriate. Namely, not null when I demand a value and null when I don't. I cannot see any reason not to, frankly. I have found that incorrect use of NULL columns very often leads to "smart" values being stored in columns that actually mean NULL. When that happens, I believe that you are left with the scenario where all business logic must be implemented in the database layer, since any and all reports, etc. must understand the same "this means NULL" rules. Thus, your n-tier design collapses and the DBA catches all the work when the BL changes.

  • I think I see a misunderstanding of the use of natural keys vs surrogate keys in one of the posts. Yes, WHERE State = 'MN' is clearer than WHERE State = 14. Using a surrogate key does not mean you can't use natural keys, but to supplement them. Your queries can make use of the natural keys, while the system can use the surrogate keys behind the scenes.

    This is especial true where the never will change natural key suddenly changes, you have 100,000's of historical records tied to it in other tables. With a surrogate key, you could change the ProductID (a natural key) and still have it pointing to all the appropriate records with the ProductSID (a surrogate key). When writing queries in this case, you woud write the where clause using ProductID, but the join clauses between tables would use the ProductSID.

    😎

  • Hi Jaholbrook: Your post comes down firmly in the anti-sid camp, but you didn't really say anything about why they are poor design, etc. Any thoughts?

  • I also make extensive use of identity surrogate keys for all the reasons described above. For me, the question is why not use an identity key? There are some cases, e.g. a table of states, where it's not really needed, but generally it is a good idea.

    Another practice of mine that is correct, I think, but uncommon, is to use identity keys of the smallest size to speed queries and reduce DB size (in primary and foreign keys). I regularly use tinyints and small ints instead of ints where I'm sure the number of records will not exceed their limits. Since I prefix all field names with the data type, programmers are aware of the difference.

  • jaholbrook (2/11/2008)


    identity columns are like the fast food of the database world.

    ..."Super Size Me" ring a bell.

    Basically a lack of understanding on the upfront work involved in modeling. Does anyone still know what logical modeling, forward engineering, or even relational is??

    Way to much focus on the technology and not enough focus on DESIGN.

    I guess you can't blame it all on the IT world. Part of the problem is that businesses want a jack of all trades to be the DBA/network admin/developer rather than have a true

    DBA.

    Planning is a wonderful thing. The 10,000 foot-view is terrific, as long as you have a parachute. Unfortunately out here in the "real world" no plan survives "first contact with the enemy".

    -"The best laid plans o' mice an' men gang aft a-gley."

  • Many of the responses I'm seeing here just confirm for me why I'm very skeptical of people who just use some kind of artificial key across the board. Whenever I see "I like to add an identity field because it guarantees my data will be unique" I would have no confidence in that data. If you don't have another way to guarantee the uniqueness of your data, how in the world can your system possibly know when it should or shouldn't be creating a new record vs. updating an existing one?

    I've seen lookup tables of ISO country codes that as far as I'm concerned had duplicate data in them because I don't see how people can say a table with data like

    1 US

    2 US

    3 CA

    4 CA

    is guaranteeing the uniqueness of the data.

    I understand there are times when an identity (not my first choice because of the non-portability) or other artificial key is very helpful. But I don't think the system should even let you assign an identity column until you already have some other unique index defined.

  • pauls2 (2/11/2008)


    I understand there are times when an identity (not my first choice because of the non-portability) or other artificial key is very helpful. But I don't think the system should even let you assign an identity column until you already have some other unique index defined.

    That's an interesting thought, but there are times when you might get the same data (say - you're tracking outcomes of consecutive coin tosses). There really might not be anything unique other than something artificial. Your approach would preclude ever storing that data. That's the case where there isn't a natural key.

    If there IS a natural key, it would be irresponsible (IMO) to not implement a unique constraint on it. That still doesn't mean you necessarily need to use the natural key as your PRIMARY key. Like has been pointed out before - if your natural key is "very wide" (even if you are in one of those wonderful worklds where it never changes), then it might still be better to implement a "skinny" surrgoate key to (in the phrase well used by other posters) supplement, not replace, the natural key.

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

Viewing 15 posts - 31 through 45 (of 129 total)

You must be logged in to reply to this topic. Login to reply