Do You Really Need Surrogate Keys? Part 1 – Concepts and Methods

  • I agree with Quazibubbles post as well. The surrogate primary key needs to be paired with a unique key in order to maintain data integrity. This concept seems to be unknown to many people posting here.

  • tfriedman71 (10/25/2010)


    I agree with Quazibubbles post as well. The surrogate primary key needs to be paired with a unique key in order to maintain data integrity. This concept seems to be unknown to many people posting here.

    LOL. Now now!:-D

    jwcolby54

  • tfriedman71 (10/25/2010)


    I agree with Quazibubbles post as well. The surrogate primary key needs to be paired with a unique key in order to maintain data integrity. This concept seems to be unknown to many people posting here.

    Why would you need to do that? The primary key IS unique. What is the value in pairing a unique value with another unique value?

    "Beliefs" get in the way of learning.

  • ebay-1087924 (10/25/2010)


    Robert Frasca (10/25/2010)


    I'm a little hazy on what you defined as the primary key. If it's a many-to-many relationship how can the combination of the two foreign keys represent a primary key?

    Hugo's example was actually a FK + a non-FK column. However, there are plenty of cases where two FKs do constitute a natural PK -- a link table implementing a M:M relationship, for example. Consider a table of customers, and a table of airline flights. The table linking customers to flights will have a customer FK and a flight FK, and their concatentation is the natural key for that table.

    He said it was a junction table. There aren't typically any non-fk columns.

    "Beliefs" get in the way of learning.

  • A surrogate key is meaningless by itself, it's just a GUID or Identity. You also need the unique key to inforce the natural key/s relationship in the table.

  • tfriedman71 (10/25/2010)


    A surrogate key is meaningless by itself, it's just a GUID or Identity. You also need the unique key to inforce the natural key/s relationship in the table.

    You're joking right?

    "Beliefs" get in the way of learning.

  • Robert Frasca (10/25/2010)


    Why would you need to do that? The primary key IS unique. What is the value in pairing a unique value with another unique value?

    You're confusing a unique value with a uniqueness constraint. A natural key typically should be enforced as unique. When you define the natural key as a PK, this is done for you. However, when you substitute a surrogate for the natural key, that uniqueness is no longer enforced automatically.

    Consider a people list originally keyed by SSN. Replacing SSN with a surrogate key means that -- unless you add a secondary uniqueness constraint -- you can then have duplicate SSNs in the database.

  • If it is going to change then it isn't generally speaking a good candidate for a primary key.

    The thing with Surrogate Keys is that they create uniqueness, they don't enforce it. What they may add in join efficiency they may lose in storage efficiency and data quality.

    Obviously a wide table is not going to lose much for a 32bit integer surrogate key but what happens if someone uses a GUID as a surrogate key?

    If I am going to use a compound key then I try to keep it to fixed length data types.

    I am also wary of non-mechanical compound keys i.e those that rely on human input as those are going to be subject to change.

    I am wary of deliberately creating surrogate keys that mean I have to join a number of large tables together in order to run a purge strategy.

    If I had three tables, OrderHeader, OrderDetail and something hanging off OrderDetail then when I purge a huge number of OrderHeader records I want to be careful of creating extra load to get rid of data that is hanging around in the child of OrderDetail. Been there, worn the T-shirt, got called out at 3am etc.

  • Nope, not joking. Actually, it's pretty common - in my 12 years doing this stuff anyways. I didn't invent the concept. You might want to do some searches online for the the meaning of surrogate key.

  • Hugo wrote:

    And consider what so many people call a junction table - a table that represents the many-to-many relationship between two tables. I will usually model these as just the two foreign keys to the two tables in the relationship, and a primary key over the combination of those columns. Again, a PK with business meaning. And again, I see no advantages to be gained by adding a surrogate, only a few disadvantages.

    I can think of a giant advantage--at least to my programs.

    Suppose I have a such a junction table (I call it a link table) with a primary key on the two FKs as you describe. Now suppose the user in the front-end tries to change one of the values in an existing row and unwittingly tries to change one of the FK values to a value that would result in a duplicate row.

    The database will not allow this of course. Here's the problem: If my front-end program does nothing but submit the user's choice to the database, then the user will get a nasty database error that she will not understand.

    On the other hand, suppose I am a good developer and check her data entries before the row is allowed to go to the database. This way, I can see if an error would occur with her choices and give her a nice, understandable error message.

    How do I write such a query without a surrogate key? I have to ask the database: "Look for a row in the database with FK values 'A' and 'B', but don't look in this row." I need to exclude "this" row, because I'm looking for a duplicate and I do not know exactly what changes the user has made to the row. I want to look at other rows to see if the user's choices for this row would create a duplicate. How would I define "this" row without a surrogate key?

    You might suggest that I simply count rows that have FK values A and B. If the query returns a count of 1, then I know that another row out there already has the FK values of interest and the user's choice would create a duplicate. However, what if the count of 1 is for the current row? Unless I have the hassle of writing screen-specific extra code (instead of the more generic code I use now for these situations), I don't know which value(s) in the row the user changed, especially when there are one other fields in the link table (which I know was not exactly your example). All my code knows is that something about the record changed, but not what specific column - or even if the user changed a FK value to something else and then changed it back to the original value before submitting the row to the database. (It happens. Don't ask me to explain users.) In such a case, a simple counting query that does not exclude the current row could simply return the current row. The code would think it is a duplicate when in fact it is not.

    In the end, the surrogate key creates an easy, fool-proof system to be able to know exactly which row the user is modifying. So, then I have an easy way of making sure the user is given proper error messages and only when needed. With that surrogate key in place, I also get the added benefit of having a single column key should the link table ever be a parent in a join relationship in the future. (It happened at least once.)

    I used to listen to the DBAs who advocated against surrogate keys for "link" tables. In practice, it created a problem. I had to go back and add surrogate keys to several tables in order to create proper user-friendly applications in reasonable time frames. Now, I just add those surrogate primary keys up-front and avoid the headaches down the line.

  • Robert Frasca (10/25/2010)


    He said it was a junction table. There aren't typically any non-fk columns.

    For a child table, the table's natural key will almost always be the concatenation of one or more FKs, sometimes with the addition of non-FK columns.

  • admin

    I had 2 options here: either I used different SQL statements based on the primary key of each table (surrogate or natural key) or used one single SQL statement based on another candidate key which had the same functionality in any scenario.

    I chose the 2nd option. You are 100% right this is not the best approach from a performance standpoint, but I felt it would make it easier to compare the scenarios, since I was applying the very same conditions on both tests.

  • JJ B (10/25/2010)


    How do I write such a query without a surrogate key? I have to ask the database: "Look for a row in the database with FK values 'A' and 'B', but don't look in this row." I need to exclude "this" row, because I'm looking for a duplicate and I do not know exactly what changes the user has made..

    You're trying to short-circuit out the benefits of a uniqueness constraint in the database. Rather than writing custom code in the application to "pre-check" whether or not the insert is a duplicate, you simply intercept the constraint violation and translate it into a more friendly error message. Your "fix" winds up costing extra programmer effort, duplicating code, introducing an additional failure point, forces extra maintenance if business rules change, and fails to solve the program at all if a second application tries to use the same database.

    If uniqueness must be enforced - let the database do it, not the application.

  • Robert Frasca (10/25/2010)


    tfriedman71 (10/25/2010)


    I agree with Quazibubbles post as well. The surrogate primary key needs to be paired with a unique key in order to maintain data integrity. This concept seems to be unknown to many people posting here.

    Why would you need to do that? The primary key IS unique. What is the value in pairing a unique value with another unique value?

    I understood him to mean unique constraint. Maybe I misunderstood. The surrogate key is simply (and nothing more than) the pointer used in the child table FK to get back to the parent. The unique constraint (in any table) guarantees that any record is unique, iow that you cannot put the same data in the table twice.

    When you use a surrogate the unique constraint is applied to the candidate key that, were you not using a surrogate, would in the absence of the surrogate become the PK.

    Both the surrogate and the unique constraint are required, and neither has anything to do with the other.

    Surrogate = pointer between tables

    Unique constraint = constraint ensuring uniqueness.

    The confusion comes when you attempt to use a natural key. Then, in some peoples mind, the function of the constraint and the function of the PK get confused.

    Even in the case of a natural key, the PK (when taken as a PK) is just a pointer between tables. The natural PK just happens to have useful data (the surrogate does not) and the natural PK also has the (or a) unique constraint applied to it. As it happens, if you create your surrogate PK using the SQL Server tools, SQL Server automatically applies a unique constraint to that field as well as it builds the PK. So in the case of using a surrogate, there should be at least two unique constraints, one for the surrogate (created usually by the system) and one for the candidate key (you must manually create this one).

    jwcolby54

  • I may have missed a reply here but it sounds like the only discussion is around the use of Surrogate Keys in transactional models, which as someone that cares about the knowledge to gain from data, I don't really focus on. In a dimensional model (where knowledge is gained from data) however, it is required.

    The main reason why surrogate keys are required in dimensional models is to support Referential Integrity. Suppose we have a dimension in our data mart for customer information. This dimension may have many sources including text files, web services, and other DBMS systems, all of which have their own 'natural' key for a customer.

    Through the ETL process all of these sources are matched together to create a complete view of the customer, supporting the 'single version of the truth' montra. So when loading this into the DW it is required that we come up with a new key value to represent the customer inside the DW with it's own key and to protect against garbage source system data.

    So as this discussion progresses, my only point would be that the use of surrogate keys needs to be discussed in both context of transactional and dimensional models.


    Cheers,

    Ben Sullins
    bensullins.com
    Beer is my primary key...

Viewing 15 posts - 31 through 45 (of 178 total)

You must be logged in to reply to this topic. Login to reply