Do You Really Need Surrogate Keys? Part 1 – Concepts and Methods

  • Hugo Kornelis (10/26/2010)


    I'd say that you are both correct. The normalization process deals with one table at a time. But you have to do this for all tables to get a normalized data model.

    No. The normalization process deals with how a fact is expressed within the database. Examples are usually illustrated with a single table purely for the sake of clarity.

    See my earlier example on 2NF, or consider what you must do to place the schema below into 3NF:

    [font="Courier New"]Table Books: ISBN, Title, Author, Agent

    Table Authors: Author, Age, Sex, Address

    [/font]
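
    To make that concrete, here is a minimal T-SQL sketch of one possible answer, assuming Agent is determined by Author (so Books contains the transitive dependency ISBN -> Author -> Agent); the data types are placeholders:

    [font="Courier New"]-- Assumption: each author has one agent, so Agent depends on Author,
    -- not on ISBN directly (a transitive dependency in Books).
    -- The 3NF version moves Agent from Books into Authors,
    -- i.e. both tables have to change.

    CREATE TABLE Authors (
        Author  varchar(100) NOT NULL PRIMARY KEY,
        Age     tinyint      NULL,
        Sex     char(1)      NULL,
        Address varchar(200) NULL,
        Agent   varchar(100) NULL  -- moved here from Books
    );

    CREATE TABLE Books (
        ISBN   char(13)     NOT NULL PRIMARY KEY,
        Title  varchar(200) NOT NULL,
        Author varchar(100) NOT NULL REFERENCES Authors (Author)
        -- Agent is no longer stored here; it is reachable via Author
    );[/font]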

  • I'll give a somewhat more realistic example. Say you have a chain of bookstores that sell products from different publishers, with each store specializing in one or more genres. Due to contractual obligations, if a store carries books from one publisher, it must sell all genres from that publisher that it currently sells. In 5NF, that schema is:

    [font="Courier New"]Table StoreGenres: Store, Genre

    Table StorePublishers: Store, Publisher

    Table PublisherGenres: Publisher, Genre[/font]

    Now ... the contract expires, and individual stores are free to sell (or not sell) any genre from any publisher. Fully normalize the above schema based on the new business model. As should be evident, you cannot do it without operating on all tables simultaneously (both designs are sketched below).

    Would an article on the subject be helpful, with some more real-world examples? I could probably throw something together rather quickly.
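
    For anyone who wants to see the two designs side by side, here is a rough T-SQL sketch; the column types and key choices are my assumptions:

    [font="Courier New"]-- Old rule (the join dependency holds): three binary tables are in 5NF.
    CREATE TABLE StoreGenres     (Store varchar(50) NOT NULL, Genre varchar(50) NOT NULL,
                                  PRIMARY KEY (Store, Genre));
    CREATE TABLE StorePublishers (Store varchar(50) NOT NULL, Publisher varchar(50) NOT NULL,
                                  PRIMARY KEY (Store, Publisher));
    CREATE TABLE PublisherGenres (Publisher varchar(50) NOT NULL, Genre varchar(50) NOT NULL,
                                  PRIMARY KEY (Publisher, Genre));

    -- New rule (contract expired, no join dependency): the three binary tables
    -- can no longer record which publisher/genre combinations each store sells,
    -- so the normalized design replaces all of them with one ternary table.
    CREATE TABLE StorePublisherGenres (
        Store     varchar(50) NOT NULL,
        Publisher varchar(50) NOT NULL,
        Genre     varchar(50) NOT NULL,
        PRIMARY KEY (Store, Publisher, Genre)
    );[/font]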

  • Tables are the product of normalization. Actually, it would be more accurate to say that they are the physical implementation of the product of normalization. The product is actually a set of entities, typically depicted in an ER (Entity Relationship) diagram. This is data modeling 101. There is a reason that data modeling has its own taxonomy, and that's to provide clarity for arguments such as this. The actual table definitions are the very last step of an elaborate process and can vary based on the engine they are deployed on and the choices made by the data architect, but the logical entities remain the same.

    A good example of this is with a dimensional model. The architect may choose to logically depict a snow-flaked (normalized) dimension for clarity but may choose to physically implement a de-normalized version for a variety of reasons.

    "Beliefs" get in the way of learning.

  • wagner crivelini (10/25/2010)


    Denis

    As I said on this forum, I'm also a fan of surrogate keys.

    Actually, my goal in this first article is to show the concepts and methods I'm using to compare a surrogate key against any other candidate key.

    I picked the product table (surrogate versus char natural key) just to illustrate the use of those methods.

    And, by the way, I built 4 new tables from scratch, so I would not have to consider the impact of extra indexes. OK, this is not a good implementation in the real world.

    There are so many issues impacting performance that I tried to isolate things and make the comparison a bit easier to understand.

    Wagner, the problem is that this is not a sensible comparison and does not show what you claim it shows because you are not comparing like for like. One table allows duplicate ProductNumbers and one does not. So if one design is correct then the other is incorrect. Your performance comparison is like saying that a car will take me from London to Manchester faster than a jet plane will take me from London to New York! Sure it will, but that doesn't help if New York was where I wanted to be!

    The more fundamental error you made was not to distinguish properly between the two important but different concepts being discussed, namely Keys and Indexes. A key is a logical rule that prevents duplicate data. An index is a physical implementation feature for optimising database performance. Separating those different logical and physical concerns is important if you want to take best advantage of SQL Server features and ensure correctness of data. The lesson here ought to be to first decide what keys you want to implement and then decide what indexes you need to support them (a small sketch of that separation follows below).

    Also, stability is usually a desirable feature of a key but it is not essential. Most keys are likely to get updated at some point. Even surrogate ones.
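
    To illustrate that separation on a hypothetical Product table (the names, types, and the choice of which key gets the clustered index are mine, not from the article): declare the keys first as constraints, then make the indexing decisions separately.

    [font="Courier New"]CREATE TABLE dbo.Product (
        ProductID     int          IDENTITY(1,1) NOT NULL,
        ProductNumber nvarchar(25) NOT NULL,
        ProductName   nvarchar(50) NOT NULL,
        -- Keys: logical rules that prevent duplicate data
        CONSTRAINT PK_Product PRIMARY KEY NONCLUSTERED (ProductID),
        CONSTRAINT UQ_Product_ProductNumber UNIQUE CLUSTERED (ProductNumber)
    );

    -- Indexes: physical, performance-only choices made afterwards
    CREATE NONCLUSTERED INDEX IX_Product_ProductName
        ON dbo.Product (ProductName);[/font]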

  • Robert Frasca (10/26/2010)


    Tables are the product of normalization.

    True-- when starting from a clean slate. In practice, the word "normalization" has come to mean aligning an existing schema with the appropriate normal form.

    In any case, the operative rule here is that normalization does not apply solely to single tables, but to how a database fact is distributed across one or more tables.

  • ebaya (10/26/2010)


    ta.bu.shi.da.yu (10/26/2010)


    I encourage you to read the following Wikipedia article. The standards didn't change, but if you stuck with 3-letter country codes then you'd find it fairly hard to move to the numeric country codes, should your organization decide that these are a better fit.

    You see? Even ISO country code standards bodies eventually found it easier to use a natural key.

    You disagreed with someone advocating a natural key...but saying they should use a natural key?

    If you meant to say they found it easier to use a surrogate key, you should be aware that ISO-3166-1 country codes are, by definition, natural, not synthetic.

    No actually, I didn't. I pointed out that if you wanted to change the basis of the surrogate key, or if a new, better standard gets adopted because the old one was inadequate, you'd have quite a time changing from one to another.

    My second point was a bit silly.

    Random Technical Stuff

  • ebaya (10/26/2010)


    Robert Frasca (10/26/2010)


    Tables are the product of normalization.

    True-- when starting from a clean slate. In practice, the word "normalization" has come to mean aligning an existing schema with the appropriate normal form.

    In any case, the operative rule here is that normalization does not apply solely to single tables, but to how a database fact is distributed across one or more tables.

    A good point. Perhaps confusion arises because Normal Forms are always defined in terms of individual relations (the properties of a relation that qualify it to be in some Normal Form) whereas Normalization itself is concerned with a whole schema.

    Another problem is that Normalization is synonymous (quite incorrectly) in many people's minds with a decomposition process, which is in fact just one possible procedure used to produce a database design - one that's rarely exercised much outside the classroom in my experience.

    I think that possibly the biggest cause of misunderstandings with normalization however is that practically everything written about it since Codd deals exclusively with normalization of relations and not relation variables. Seemingly all the classic papers and textbooks have this awkward limitation in common. It's no wonder if students have trouble making real world sense of the examples.

  • David Portas (10/26/2010)


    A good point. Perhaps confusion arises because Normal Forms are always defined in terms of individual relations (the properties of a relation that qualify it to be in some Normal Form) whereas Normalization itself is concerned with a whole schema.

    Another problem is that Normalization is synonymous (quite incorrectly) in many people's minds with a decomposition process, which is in fact just one possible procedure used to produce a database design - one that's rarely exercised much outside the classroom in my experience.

    I think that possibly the biggest cause of misunderstandings with normalization however is that practically everything written about it since Codd deals exclusively with normalization of relations and not relation variables. Seemingly all the classic papers and textbooks have this awkward limitation in common. It's no wonder if students have trouble making real world sense of the examples.

    Bingo. EBay's example is actually very interesting, and to my mind extremely sensible and yes, that is normalization. However, the example he gives is not an example of applying the second normal form. 2NF is (as Wikipedia puts it so well):

    No non-prime attribute in the table is functionally dependent on a proper subset of a candidate key.

    Random Technical Stuff

  • ta.bu.shi.da.yu (10/26/2010)


    Bingo. EBay's example is actually very interesting, and to my mind extremely sensible and yes, that is normalization. However, the example he gives is not an example of applying the second normal form. 2NF is (as Wikipedia puts it so well):

    No non-prime attribute in the table is functionally dependent on a proper subset of a candidate key.

    Thanks. In my 2NF example, the nonprime attribute "name" was functionally dependent only on SSN, rather than on SSN + email. A better example probably would have been to use a physical address rather than an email address, as that would have made it clearer that I intended a composite key. That's part of the difficulty of crafting good normalization examples. Hopefully I did a better job in the 5NF reverse-decomposition example I posted afterwards.
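
    Roughly, the shape of that example in T-SQL (reconstructed from the description above, so names and types are approximate):

    [font="Courier New"]-- Not in 2NF: Name depends on SSN alone, which is only part of
    -- the candidate key (SSN, Email).
    CREATE TABLE PersonEmails (
        SSN   char(11)     NOT NULL,
        Email varchar(320) NOT NULL,
        Name  varchar(100) NOT NULL,
        PRIMARY KEY (SSN, Email)
    );

    -- In 2NF: the partial dependency moves to its own table.
    CREATE TABLE Persons (
        SSN  char(11)     NOT NULL PRIMARY KEY,
        Name varchar(100) NOT NULL
    );
    CREATE TABLE PersonEmailAddresses (
        SSN   char(11)     NOT NULL REFERENCES Persons (SSN),
        Email varchar(320) NOT NULL,
        PRIMARY KEY (SSN, Email)
    );[/font]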

  • Guys,

    As usual - the amount of difficulty never disappears...

    Surrogate keys are excellent for Referential Integrity - small PK / FK.

    The default of having the PK as the clustering index is a very BAD choice in most cases - because it creates a hot spot on INSERT and table scans on SELECT.

    The point of choosing a clustering index is to have data stored logically together, to optimise SELECT and spread contention on INSERT, UPDATE, and DELETE.

    So, in most cases a PK constraint SHOULD NOT be a clustering one - just a UNIQUE index.

    A clustering index has to be chosen carefully... and the last column in the index should be the surrogate (PK) column to ensure uniqueness. And this is an "arty" part of table design (see the sketch below)...

    That's all.:-P
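
    To make that concrete, here is a hedged sketch on a hypothetical Orders table (the table, columns, and access pattern are my own example, not from the thread):

    [font="Courier New"]CREATE TABLE dbo.Orders (
        OrderID    int      IDENTITY(1,1) NOT NULL,
        CustomerID int      NOT NULL,
        OrderDate  datetime NOT NULL,
        -- The PK still enforces uniqueness for referential integrity,
        -- but it is NOT the clustered index.
        CONSTRAINT PK_Orders PRIMARY KEY NONCLUSTERED (OrderID)
    );

    -- The clustered index is chosen for the typical SELECT pattern
    -- (e.g. "all orders for a customer over a date range"), with the
    -- surrogate key added as the last column to make it unique.
    CREATE UNIQUE CLUSTERED INDEX CIX_Orders_Customer_Date
        ON dbo.Orders (CustomerID, OrderDate, OrderID);[/font]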

  • David Portas (10/26/2010)

    you are not comparing like for like

    David

    I understand your opinion, but I do not agree. I think the most difficult thing when talking about performance is exactly what to compare.

    Performance is affected by so many things that it is difficult (for me at least) to isolate factors and make sure you're comparing oranges to oranges. I agree the designs I considered are oversimplified, but I believe we need this simplification as a first step in any study.

    When you say the article mixes up keys and indexes, this is kind of intentional, as most DBMSs use primary keys as clustered indexes by default. So, although not ideal, this is what is implemented in a large portion of the databases we have nowadays. (I know this single paragraph might cause a whole new thread with comments, complaints and strong words).

    And btw, thanks for your comments

  • TheMOTU (10/26/2010)


    The default of having the PK as the clustering index is a very BAD choice in most cases - because it creates a hot spot on INSERT and table scans on SELECT.

    The hotspot contention issue for monotonically increasing clustered indexes hasn't been a problem for SQL Server since version 7.0. I remember quoting the exact same thing to my junior devs ... back around 1999 or so.

    IIRC, SQL 2005 *briefly* had the problem due to some changes in how they split pages. But again, that was quickly corrected.

  • wagner crivelini (10/26/2010)


    When you say the article mix up keys and indexes, this is kind of intentional, as most DBMSs use primary keys as clustered indexes as a default. So, although not ideal, this is what is implemented in a large portion of databases we have nowadays. (I know this single paragraph might cause a whole new thread with comments, complaints and strong words).

    In every DBMS I know of, the database implementor and not the software decides what indexes to implement. Surely that's an important choice that the readers ought to be made aware of when considering performance.

    Regardless of that, to mix up the different issues of keys and indexes still doesn't make sense in practical terms. The choice of keys determines the correctness of the results you get from the database (which data can be duplicated and which cannot). Indexes deal with performance. To conflate those two different things is a very bad habit. Business owners, project managers, developers and testers should surely agree that correct results and fast results are very different aspects of software development!

  • TheMOTU (10/26/2010)


    Guys,

    As usual - the amount of difficulty never disappears...

    Surrogate keys are excellent for Referential Integrity - small PK / FK.

    The default of having the PK as the clustering index is a very BAD choice in most cases - because it creates a hot spot on INSERT and table scans on SELECT.

    The point of choosing a clustering index is to have data stored logically together, to optimise SELECT and spread contention on INSERT, UPDATE, and DELETE.

    So, in most cases a PK constraint SHOULD NOT be a clustering one - just a UNIQUE index.

    A clustering index has to be chosen carefully... and the last column in the index should be the surrogate (PK) column to ensure uniqueness. And this is an "arty" part of table design...

    That's all.:-P

    I thought the surrogate as a PK could be a good choice for the clustered index. Isn't it good for inserts, since they are sequential, rather than forcing SQL Server to reorganize pages when inserting into the middle? And isn't it good for SELECTs, since the joining of tables is done through the PK?
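
    One way to see the trade-off for yourself (a throwaway test of my own, not from the article): create two tables whose clustered keys differ only in being sequential or random, load the same number of rows into each, and compare fragmentation.

    [font="Courier New"]-- Sequential surrogate: new rows append at the end of the clustered index.
    CREATE TABLE dbo.SeqKey (
        ID      int IDENTITY(1,1) NOT NULL,
        Payload char(200)         NOT NULL DEFAULT ('x'),
        CONSTRAINT PK_SeqKey PRIMARY KEY CLUSTERED (ID)
    );

    -- Random surrogate: new rows land all over the index and cause page splits.
    CREATE TABLE dbo.RandomKey (
        ID      uniqueidentifier NOT NULL DEFAULT (NEWID()),
        Payload char(200)        NOT NULL DEFAULT ('x'),
        CONSTRAINT PK_RandomKey PRIMARY KEY CLUSTERED (ID)
    );

    -- After inserting the same number of rows into each, compare
    -- avg_fragmentation_in_percent from sys.dm_db_index_physical_stats.[/font]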

  • ebaya (10/26/2010)


    Robert Frasca (10/26/2010)


    Tables are the product of normalization.

    True-- when starting from a clean slate. In practice, the word "normalization" has come to mean aligning an existing schema with the appropriate normal form.

    Since when? With all due respect, that's pretty much a nonsensical statement. What difference does it make whether it's a new schema or an existing one? Either way, table definitions are still the product of normalization.

    "Beliefs" get in the way of learning.
