Do You Really Need Surrogate Keys? Part 1 – Concepts and Methods

  • Robert Frasca (10/27/2010)


    Since when? With all due respect that's pretty much a nonsensical statement. What difference does it make whether it's a new schema or an existing one? Either way table definitions are still the product of normalization whether it's a new schema or a revised schema.

    There are several problems with your statement. First of all, saying "tables are the product of normalization" implies a cause-effect relationship that rarely exists outside the classroom. In the real world, preexisting tables are normalized to create new ones, then re-normalized over and over as business logic changes, ad infinitum through the life of the application.

    Your statement also implies that tables cannot exist without normalization. But tables are often created by people who don't know the rules of normalization, or by people intentionally breaking those rules. What happens when someone creates a new table by denormalizing an existing one? Would we call that "the product of de-normalization?"

    What would actually be accurate is to say that normalized tables are the product of normalization. But tables per se can and do exist without the normalization process-- unfortunately, in most cases.

  • ebaya (10/27/2010)


    Robert Frasca (10/27/2010)


    Since when? With all due respect that's pretty much a nonsensical statement. What difference does it make whether it's a new schema or an existing one? Either way table definitions are still the product of normalization whether it's a new schema or a revised schema.

    There are several problems with your statement. First of all, saying "tables are the product of normalization" implies a cause-effect relationship that rarely exists outside the classroom. In the real world, preexisting tables are normalized to create new ones, then re-normalized over and over as business logic changes, ad infinitum through the life of the application.

    Your statement also implies that tables cannot exist without normalization. But tables are often created by people who don't know the rules of normalization, or by people intentionally breaking those rules. What happens when someone creates a new table by denormalizing an existing one? Would we call that "the product of de-normalization?"

    What would actually be accurate is to say that normalized tables are the product of normalization. But tables per se can and do exist without the normalization process-- unfortunately, in most cases.

    ebaya (10/27/2010)


    Robert Frasca (10/27/2010)


    Since when? With all due respect that's pretty much a nonsensical statement. What difference does it make whether it's a new schema or an existing one? Either way table definitions are still the product of normalization whether it's a new schema or a revised schema.

    There are several problems with your statement. First of all, saying "tables are the product of normalization" implies a cause-effect relationship that rarely exists outside the classroom. In the real world, preexisting tables are normalized to create new ones, then re-normalized over and over as business logic changes, ad infinitum through the life of the application.

    Your statement also implies that tables cannot exist without normalization. But tables are often created by people who don't know the rules of normalization, or by people intentionally breaking those rules. What happens when someone creates a new table by denormalizing an existing one? Would we call that "the product of de-normalization?"

    What would actually be accurate is to say that normalized tables are the product of normalization. But tables per se can and do exist without the normalization process-- unfortunately, in most cases.

    There is a cause-effect relationship. When I'm data modeling, I don't consider existing tables in a schema to be "tables". I consider them to be "entities". Obviously, if I'm attempting to normalize them then the existing structure might be considered broken and therefore irrelevant. I am, for all intents and purposes, starting from scratch. After normalization, I will have new entities. Some of those entities may have the same name as the old entities but they are still new entities with new attributes. Some of the old entities may go away. If I've designed a solid model then I shouldn't have to "re-normalize" because I should have anticipated growth while designing the original model. I will grow the model but not have to start at the beginning again.

    Let's summarize. The cause is the process of normalization, the effect is the creation of new logical entities. An entity does not become a "table" until I (the architect) say it is and, at implementation time, the table in the schema may not match the logical entity.

    Yes, the product of de-normalization is also a table. Obviously, you're not familiar with the dimensional data model which is largely predicated on the de-normalization of a normalized model.

    I never said that tables couldn't exist without being normalized. That is a silly way to twist my words. I said that tables were the product of normalization just as gasoline is the product of oil refining. It's true that I could choose to leave the crude oil in its natural state or the badly designed data model in its natural state but there is little value in that.

    Part of the problem is that many people confuse the physical implementation, i.e. creation of a schema, with data modeling. Data modeling, including normalization and de-normalization, is a logical process not a physical process. A logical data "model" is platform independent until implementation time. In other words, theoretically it doesn't matter if the target is SQL Server or Oracle during the data modeling phase.

    There are a lot of good data modeling books and I suggest you read one. Then we can have a more interesting exchange of ideas on the topic.

    "Beliefs" get in the way of learning.

  • Robert Frasca (10/27/2010)


    When I'm data modeling, I don't consider existing tables in a schema to be "tables". I consider them to be "entities".

    When I'm standing in the middle of the freeway eating my pork sandwich, I don't consider oncoming cars to be "cars" either. How well do you think that will work out in practice?

    Let's say it all together now. A table is a table is a table. Tables can be created before or after normalization, or without it entirely. Renormalizing (or denormalizing) a schema can be considered as either modifying the original tables, or recreating new versions from scratch. Functionally, there is no difference between the two. You're invoking a pedantic, sterile quibble over semantics in an attempt to justify your remarks.

    If I've designed a solid model then I shouldn't have to "re-normalize" because I should have anticipated growth while designing the original model.

    This more than anything else you've posted displays a gross misunderstanding of normalization. Business rules change. When those changes impact functional or join dependencies, or create new ones, what once was a perfectly normalized schema now becomes one that is not. The need for renormalization wasn't a result of "poor design" up front, but of changing requirements. See the contract expiration example I posted above.

    I never said that tables couldn't exist without being normalized. That is a silly way to twist my words. I said that tables were the product of normalization just as gasoline is the product of oil refining.

    Can one create gasoline without refining? (I'll leave off the question as to whether Fischer-Tropsch qualifies).

    There are a lot of good data modeling books and I suggest you read one.

    Many thanks, but I've been modeling professionally since long before SQL Server existed as a product. For the last decade, I've been a principal architect on one of the 5 largest databases in the world.

    You, on the other hand, are the chap who grew confused when other posters here were explaining why the creation of a surrogate key should usually be paired with the creation of an additional unique constraint, and believes renormalization is only the result of "bad design". Given that, I'm going to offer you your own advice.

  • Robert Mahon-475361 (10/24/2010)


    Anytime this gets brought up (and it does, time and time again), before we argue further, I /really/ recommend reading;

    http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=6136&whichpage=2

    and quazibubble's posts. Yes he's argumentative, but puts up an impressive reasoning.

    Thank you so much for this. I'm reading quazibubble's super-mega-long post now about creating the world's most perfect natural key and I don't think I've laughed this hard in months.


    "If I had been drinking out of that toilet, I might have been killed." -Ace Ventura

  • autoexcrement (11/3/2010)


    Robert Mahon-475361 (10/24/2010)


    Anytime this gets brought up (and it does, time and time again), before we argue further, I /really/ recommend reading;

    http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=6136&whichpage=2

    and quazibubble's posts. Yes he's argumentative, but puts up an impressive reasoning.

    Thank you so much for this. I'm reading quazibubble's super-mega-long post now about creating the world's most perfect natural key and I don't think I've laughed this hard in months.

    quazibubble


    Natural keys (even if a single column) violate normal forms 1, 2, and 3

    What a load of rubbish. Please don't post links that might encourage others to read it.

  • In what way out of curiosity?

    SSN's keep being mentioned, but they can change/be missed/be duplicates. Most other 'natural' keys end up falling to the same traps. Therefore, they're not valid keys and don't meet the rule. The linked article, in an aggressive style true, appears to be very valid though. What's specifically wrong?

  • Robert Mahon-475361 (11/4/2010)


    In what way out of curiosity?

    SSN's keep being mentioned, but they can change/be missed/be duplicates. Most other 'natural' keys end up falling to the same traps. Therefore, they're not valid keys and don't meet the rule.

    This just isn't true at all. Natural keys violate no normalization rules.

    A key is valid if it uniquely identifies the entity, not because it uniquely identifies a row in a table. Most people don't understand the difference. Nor does a change in a key value violation any normal form ... it just creates a bit of a headache for developers.

    Take your example. A SSN uniquely identifies a person, period. If you're seeing duplicate SSNs in your data, and you create a synthetic key, you have't "normalized" anything. You've simply pushed your data integrity issue up a level -- from the database up to the application. In general, that's a very bad thing ... which is why even if you did choose to use a synthetic key for other reasons, you'd be wise to set a unique constraint on SSN anyway.

    Synthetic keys have a lot of advantages. But anyone who tells you they should be used in any and all situations, or that natural keys "violate a normal form" is an ignorant, unshaven buffoon who either has never taken a class on data modelling, or slept through it if he did.

  • ebaya (11/4/2010)


    Robert Mahon-475361 (11/4/2010)


    In what way out of curiosity?

    SSN's keep being mentioned, but they can change/be missed/be duplicates. Most other 'natural' keys end up falling to the same traps. Therefore, they're not valid keys and don't meet the rule.

    This just isn't true at all. Natural keys violate no normalization rules.

    A key is valid if it uniquely identifies the entity, not because it uniquely identifies a row in a table. Most people don't understand the difference. Nor does a change in a key value violation any normal form ... it just creates a bit of a headache for developers.

    Take your example. A SSN uniquely identifies a person, period. If you're seeing duplicate SSNs in your data, and you create a synthetic key, you have't "normalized" anything. You've simply pushed your data integrity issue up a level -- from the database up to the application. In general, that's a very bad thing ... which is why even if you did choose to use a synthetic key for other reasons, you'd be wise to set a unique constraint on SSN anyway.

    Synthetic keys have a lot of advantages. But anyone who tells you they should be used in any and all situations, or that natural keys "violate a normal form" is an ignorant, unshaven buffoon who either has never taken a class on data modelling, or slept through it if he did.

    I guess my question is: under what scenario would a natural key be appropriate?

    Incidentally, natural keys by themselves cannot violate a normal form. 100% agree with ebay here!

    Random Technical Stuff[/url]

  • That's what I don't get, an SSN doesn't uniquely identify a person. I've seen duplicates, missing data, or it's wrong. Is it an attribute of an entity? Aye. Can it be used as a key? Only if you have to then add things later to get around it not working all the time. In which case, why use it in the first place?

    If you put constraints on the SSN, then you have to deal with it at the application level when it already exists, should it be allowed (duplicates! so to fix that you add something else, but how do you handle Fred Smith with the same SSN when it's not the same person? (systems I've written recently encounter this frequently)).

  • Robert Mahon-475361 (11/4/2010)


    In what way out of curiosity?

    SSN's keep being mentioned, but they can change/be missed/be duplicates. Most other 'natural' keys end up falling to the same traps. Therefore, they're not valid keys and don't meet the rule. The linked article, in an aggressive style true, appears to be very valid though. What's specifically wrong?

    A key is a set of attributes that have exactly two properties: Uniqueness and Irreducibility. The fact that a key value might change doesn't make it any less of a key, stability being a very useful property of a key but not an absolute requirement. Normalization is entirely orthogonal to changing keys because all normal forms are defined in terms of relations, not relation variables. So stable keys are not any requirement of normalization.

    Of course stability is an important design consideration but in some cases it makes perfect sense to enforce unqiueness of key values that might change. A good example is a user login name. The user may be allowed to change his/her login name at any time as long as long as the new name is not a duplicate with any other.

    Ultimately any key value (including a surrogate) can change but what makes keys useful and important is the fact that the database enforces their uniqueness and that therefore users can rely on them to identify data correctly in the database. To say that a natural key violates Normal Form suggests a profound misunderstanding of what keys and normalization are all about. If that were so then normalization would be useless as a design tool for analysing and designing databases and ensuring that dependencies are correctly represented in the database.

  • David Portas (11/4/2010)


    Ultimately any key value (including a surrogate) can change but what makes keys useful and important is the fact that the database enforces their uniqueness and that therefore users can rely on them to identify data correctly in the database.

    That's the bit that throws me. When Natural Keys have been discussed, the way they change, and the faffing about to change everything else hanging from those tables, and even other systems that have connected up, it's been a nightmare. When using a surrogate, it can't change can it? Once set, if it's a huge sequential INT, or a GUID, it's never going to be amended, all that logic on checking 'does this key exist' isn't needed. If if IS changeable, then you've either got the potential for duplicated data that does break normalisation, or you've got bits of data meaning different things in different places.

    I guess it's not AS bad if those changes can cascade, though I've seen accountants Spreadsheets go to hell when things were changed and some macro's that someone created long ago didn't take into account changing keys (will never use an int PrimaryKey as an invoice number again *ahem*). Just keeping it as that INT, it'll never change.

    When WOULD a Primary Key's actual data (natural or surrogate) ever actually be changed? For a natural key I could see you'd have to, but a surrogate?

  • Robert Mahon-475361 (11/4/2010)


    When WOULD a Primary Key's actual data (natural or surrogate) ever actually be changed? For a natural key I could see you'd have to, but a surrogate?

    I alreadt gave an example of a natural key that might change (a login name). Surrogate keys can change for several reasons. For example if the schema has to be changed the data might need to be moved to new tables with pre-existing data, or if new data is merged from another database source then keys may need to be reallocated. To merge and deduplicate data from multiple sources you need a natural key.

    In the case of natural keys, key change may be easier to handle if foreign key references use a surrogate (i.e. in the situation where the natural key of a table changes but the surrogate doesn't). However, the natural keys are by definition still important from a data integrity perspective. Enforcing the natural keys prevents duplication of business data, which the surrogate key can never do. So it's more important to ensure that each table has appropriate natural keys defined on it and then decide whether you need to implement a surrogate on it as well. In all cases you may need to consider whether and how you want to support key changes.

  • Robert Mahon-475361 (11/4/2010)


    That's what I don't get, an SSN doesn't uniquely identify a person.

    Yes it does. You're still thinking at the database level. But a database is simply a model of some real world set of requirements. The real world is what counts here. In the real world, no two people have the same SSN.

    A duplicate SSN is a data error. Changing your key doesn't "solve the problem". It simply pushes it up a level. You still have the data error -- but now you've let that error into your table, rather than keeping it out. To repeat, this is why that even if you choose to use a synthetic key for other reasons, you'd be wise to put a unique constraint on SSN anyway.

    Saying "the natural key might have duplicates so I need a surrogate" is the hallmark of a lazy developer. It's a way for you to shift the buck, and allow you to claim your database is still technically functioning properly, because you allowed a bad row to get shoved in without throwing an error back up to the application.

    Similarly, claiming "the natural key may change so I need a surrogate" is the hallmark of a lackwit. Guess what? Data changes in databases. If your key changes, you update it. Many times natural keys change so infrequently that this is irrelevant. Many times a natural key is only used in one or two particular tables. And even when the natural key changes often, and/or is used as a FK in dozens of other tables, using a synthetic is simply a performance optimization, not some hard-and-fast requirement.

    You might think the missing SSN is a reason not to use a natural key. Wrong again. If your particular application allows people to be identified without a SSN, then SSN isn't a natural key to begin with. It's a dependent attribute of some other natural key. Maybe your business rules (NOT your database) has some other natural key for people. Likely it's some lengthy set of attributes, in which case you _could_ use it, but you'd be wise to substitute a surrogate key. Maybe it has no natural key at all, which is usually a real problem you need to take up with whoever's defining your business logic.

  • I see what you're saying, it's probably that I've written systems where duplicated SSN's were far more common than they should be and it was WANTED to allow them to be entered for later... purposes!

    Why I've probably got the view I have and see it just as an attribute of an entity. I've never had the issues Natural Keys appear to have (and why there's so many articles about them I guess!)

    But many thanks to try and explain a different point of view.

  • Robert Mahon-475361 (11/4/2010)


    I see what you're saying, it's probably that I've written systems where duplicated SSN's were far more common than they should be and it was WANTED to allow them to be entered for later... purposes!

    In that case you should ask the user / business owner how they intend to identify employees in the organisation and make sure you implement the key(s) they need to do that. In organisations I have worked for, employees were identified with a company-specific employee number that was required to be unique. If you ignore the problem of defining an appropriate natural key then you just create problems for yourself and your users down the line.

Viewing 15 posts - 136 through 150 (of 178 total)

You must be logged in to reply to this topic. Login to reply