Do You Really Need Surrogate Keys? Part 1 – Concepts and Methods

  • Michael

    I totally agree with your comments on the FKs and the need to add these remarks on the conclusion of this article.

    I'll make sure to comment about FKs when I write the next articles for this series.

    Thx for your comments

  • yeah, I should have said unique constraint.

  • Ben Sullins-437405 (10/25/2010)


    The main reason why surrogate keys are required in dimensional models is to support Referential Integrity. Suppose we have a dimension in our data mart for customer information. This dimension may have many sources including text files, web services, and other DBMS systems, all of which have their own 'natural' key for a customer...When loading this into the DW it is required that we come up with a new key value to represent the customer

    You're not quite correct. If the "natural key" isn't unique, then its not a natural key, period. In a DW situation, it might have been a natural key in its original setting. But in the DW, if it no longer uniquely identifies the row, its not a key, natural or otherwise.

    What you have stated is a specific case of one of the primary cases for using a surrogate -- when no natural key otherwise exists. This happens rather often in DW situations, but also happens outside it as well.

  • tom,

    I agree.... I heard many times that the best advantage of surrogate keys is the performance improvement and this is not always true, as I intended to show in the article.

    The real benefit is to have a stable PK!

  • Hugo

    I respectfully disagree.

    Yes, no question all alternate keys should be implemented as unique indexes.

    But once no unique key is implemented at all, (but only the primary key in each table), then comparisons will show performance based on these different PKs.

    From a db modeling standpoint, this implementation is wrong, I agree.

    But from an analysis point of view, this approach allows me to see how good the PK is by itself.

    Notice that the examples I show are not based on either PK.

    Therefore, I would not say the results are meaningless.

  • Dan

    I got the point. This first article shows a simple example to illustrate the use of the approach I describe, but my goal is to analyze the use of a surrogate key on fact tables.

    On the second article of the series, I'll show the study of surrogate versus compound keys on a table with 100 million rows.

  • I understand this is article deals with an important subject on database design and it is natural to see some strong positions and different ideas.

    I'd like to thank everyone who took the time to read this article and express your comments, either compliments or criticism. They are all welcome and I'll do my best to consider them all when writing the other articles about this subject.

    I also want to apologize I wasn't able to reply all posts, since I'm having a busy day down here at the company 🙂 I'll try to reply more posts later tonight

    Thx

  • wagner crivelini (10/25/2010)


    tom,

    I agree.... I heard many times that the best advantage of surrogate keys is the performance improvement and this is not always true, as I intended to show in the article.

    The real benefit is to have a stable PK!

    With all due respect, surrogate keys have very little to do with performance although that is frequently a byproduct.

    The whole point of using surrogate keys is to create an additional layer of abstraction, an outer framework, as it were, that is independent of the business data. This becomes the framework for establishing relationships between entities that is not dependent on the natural relationships, i.e. the natural keys. This way, subtle changes to the natural relationships of the entities do not affect the relational integrity of the overall data structure. This promotes flexibility and scalability. It doesn't matter whether it's an OLTP or a dimensional data model, either way, the relational integrity is independent of the business problem.

    "Beliefs" get in the way of learning.

  • Robert Frasca (10/25/2010)


    The whole point of using surrogate keys is to create an additional layer of abstraction

    True.

    This promotes flexibility and scalability.

    It promotes neither. A database using surrogate keys is no more flexible or scalable than one without.

    Surrogate keys promote referential stability: a different concept altogether. However, in the case where a table already has a highly stable natural key, a surrogate can actually work against stability, doing nothing but creating an additional failure point.

  • i use surrogate keys all the time. over the last 2 years i've been building a reporting system to track event log data. the Windows event logs have a record number but i don't know if it ever gets reused, so i just created a column called pk_id with a bigint data type and set it up as identity (1,1) and as a non-clustered PK. i could create a PK on a combination of columns but i don't want to risk a duplicate PK entry and possibly losing a day's worth of log data being captured.

  • Robert Frasca (10/25/2010)


    I'm a little hazy on what you defined as the primary key. If it's a many-to-many relationship how can the combination of the two foreign keys represent a primary key? There is no guarantee of uniqueness in a many-to-many relationship. That's why we created the junction table after all. Perhaps I'm misunderstanding you. I like to use surrogates in this instance so I can have a legitimate primary key as well as a clustered index to avoid having the table be a heap. For me, being able to create a clustered index is where the surrogate key provides value.

    Hi Robert,

    I'll try to give an example. Extremely simplified, using two tables (Products and Suppliers), and a junction table (ProductSuppliers) for the many-to-many relationship. Just for fun (and to illustrate that I believe the surrogate or no surrogate choice should be made on a per-table bases), I'll use a surrogate in the Suppliers table, but not in the Products and SupplierProducts tables.

    CREATE TABLE Products

    (ProductCode varchar(15) NOT NULL,

    -- other columns,

    CONSTRAINT PK_Products PRIMARY KEY (ProductCode)

    );

    CREATE TABLE Suppliers

    (SupplierID int NOT NULL IDENTITY,

    SupplierName nvarchar(60) NOT NULL,

    -- other columns,

    CONSTRAINT PK_Suppliers PRIMARY KEY (ProductCode),

    CONSTRAINT UQ_Suppliers UNIQUE (SupplierName)

    );

    CREATE TABLE ProductSuppliers

    (ProductCode varchar(15) NOT NULL,

    SupplierID int NOT NULL,

    CONSTRAINT PK_ProductSuppliers PRIMARY KEY (ProductCode, SupplierID),

    CONSTRAINT FK_ProductSuppliers_Products FOREIGN KEY (ProductCode)

    REFERENCES Products(ProductCode)

    ON DELETE CASCADE ON UPDATE CASCADE,

    CONSTRAINT FK_ProductSuppliers_Suppliers FOREIGN KEY (SupplierID)

    REFERENCES Suppliers(SupplierID)

    ON DELETE CASCADE

    );

    As you see, the ProductSuppliers table has the expected two foreign keys, and a primary key over the combination of those columns. As long as there are no references to this table, there is absolutely no need for a surrogate key. If there are references to ProductSuppliers, then we'll have to weigh the factors and decide whether or not this table would benefit from a surrogate.


    Hugo Kornelis, SQL Server/Data Platform MVP (2006-2016)
    Visit my SQL Server blog: https://sqlserverfast.com/blog/
    SQL Server Execution Plan Reference: https://sqlserverfast.com/epr/

  • I'm interested about your article and your aproach however in my experience if you want to be able to clearly identify benefit in terms of performance, you need huge table IE: million or hundred million records. For tables with 10000 records, the benefit might not be clearly visible...

  • This article misses the fact that the tables will most likely be referenced by other tables. For instance, I have started working for a company where they didn't do surrogate keys. We have 4 part candidate keys and to do joins we put them in all the tables. So for a header you might have Company,Date,Job say. Then for the detail you have Company,Date,Job,LineNum. So, the article leaves out the JOIN piece where now the Detail table has 3 extra columns making it bigger and your code has extra complexity where each JOIN is 3 or 4 lines, and when you JOIN 6 or 7 tables, what a mess, my query fills a full screen. Also, if you need to modify those fields say Job becomes a bigint instead of an int, now you are modifying them in 40 to 60 tables (something that came up here, I cried a little inside).

  • JJ B (10/25/2010)


    Hugo wrote:

    And consider what so many people call a junction table - a table that represents the many-to-many relationship between two tables. I will usually model these as just the two foreign keys to the two tables in the relationship, and a primary key over the combination of those columns. Again, a PK with business meaning. And again, I see no advantages to be gained by adding a surrogate, only a few disadvantages.

    I can think of a giant advantage--at least to my programs.

    Suppose I have a such a junction table (I call it a link table) with a primary key on the two FKs as you describe. Now suppose the user in the front-end tries to change one of the values in an existing row and unwittingly tries to change one of the FK values to a value that would result in a duplicate row.

    The database will not allow this of course. Here's the problem: If my front-end program does nothing but submit the user's choice to the database, then the user will get a nasty database error that she will not understand.

    On the other hand, suppose I am a good developer and check her data entries before the row is allowed to go to the database. This way, I can see if an error would occur with her choices and give her a nice, understandable error message.

    How do I write such a query without a surrogate key? I have to ask the database: "Look for a row in the database with FK values 'A' and 'B', but don't look in this row." I need to exclude "this" row, because I'm looking for a duplicate and I do not know exactly what changes the user has made to the row. I want to look at other rows to see if the user's choices for this row would create a duplicate. How would I define "this" row without a surrogate key?

    You might suggest that I simply count rows that have FK values A and B. If the query returns a count of 1, then I know that another row out there already has the FK values of interest and the user's choice would create a duplicate. However, what if the count of 1 is for the current row? Unless I have the hassle of writing screen-specific extra code (instead of the more generic code I use now for these situations), I don't know which value(s) in the row the user changed, especially when there are one other fields in the link table (which I know was not exactly your example). All my code knows is that something about the record changed, but not what specific column - or even if the user changed a FK value to something else and then changed it back to the original value before submitting the row to the database. (It happens. Don't ask me to explain users.) In such a case, a simple counting query that does not exclude the current row could simply return the current row. The code would think it is a duplicate when in fact it is not.

    In the end, the surrogate key creates an easy, fool-proof system to be able to know exactly which row the user is modifying. So, then I have an easy way of making sure the user is given proper error messages and only when needed. With that surrogate key in place, I also get the added benefit of having a single column key should the link table ever be a parent in a join relationship in the future. (It happened at least once.)

    I used to listen to the DBAs who advocated against surrogate keys for "link" tables. In practice, it created a problem. I had to go back and add surrogate keys to several tables in order to create proper user-friendly applications in reasonable time frames. Now, I just add those surrogate primary keys up-front and avoid the headaches down the line.

    I really fail to see the point. Sure, a good front-end developer will not throw a database error in the users face. But who said that a good developer should never use TRY CATCH in either the T-SQL or the front-end code? It is so easy to catch errors and return more user-friendly messages to the user that this argument makes no sense at all.

    The whole story about finding duplicates makes no sense to me. The client should never execute needless code. You only submit an UPDATE query when the values have changed. If there are key values among the changed values, and you don't want to use (for whatever reason) TRY CATCH error handling, you can simply check for existence of the new key value with EXISTS. If a row with the new key values exists, the change is invalid, for it would create a duplicate. If it does not exist, the change is permitted. (This does introduce some potential concurrency issues, so I'd probably prefer the TRY CATCH approach).

    Your argument that the row to be modified would mess up the test makes no sense; you check for the NEW key values, and the row to be changed at that point still has the OLD key values. If the new and old key values are the same, there is no need for an update, hence no need for a test either.

    Your final point about the extra hassle this would require seems odd in a message that starts off about being a good developer. You can't be a good developer and a fast developer at the same time. Either you take the time to do it good, or you use a standardised ultra-fast approach and accept that the result is not perfect. (And even in a standardised ultra-fast approach, the front end code should never just throw a query at the database show the error (if any) to the user. There simply is no excuse for that.)


    Hugo Kornelis, SQL Server/Data Platform MVP (2006-2016)
    Visit my SQL Server blog: https://sqlserverfast.com/blog/
    SQL Server Execution Plan Reference: https://sqlserverfast.com/epr/

  • gryphonsclaw (10/25/2010)


    The article leaves out the JOIN piece where now the Detail table has 3 extra columns making it bigger and your code has extra complexity where each JOIN is 3 or 4 lines, and when you JOIN 6 or 7 tables, what a mess, my query fills a full screen.

    There is a flip side to that argument. When you extract out the natural columns from child tables, it can force you to perform a join where none had to be done before, which is not only extra code, but a large performance hit also.

    For instance, in the example you gave, if "company, invoice, linenum" is the natural key, then extracting company and invoice out, and replacing with a surrogate key means you replace the code:

    SELECT Company, Invoice, Linenum, LineTotal FROM ChildTable[/i]

    with:

    SELECT p.Company, p.Invoice, c.Linenum, c.LineTotal

    FROM ChildTable c JOIN Parent p ON c.ParentKey = p.ParentKey[/i]

    This just isn't a "one size fits all" situation. Surrogate keys have their place. They're not a panacea, nor an ultimate requirement.

Viewing 15 posts - 46 through 60 (of 178 total)

You must be logged in to reply to this topic. Login to reply