Do You Really Need Surrogate Keys? Part 1 – Concepts and Methods

  • ebay-1087924 (10/25/2010)


    Huh? What extra int column are you talking about? We're talking about a junction table with two FKs, period. The question is whether you have a clustered index or not on that table....and whether or not you do, you still have the same column count.

    The question of whether or not you add a surrogate to that table is a different story entirely. That WILL hurt performance...and you don't need any profiling to understand that.

    Okay, I wasn't sure which point you were arguing against. On the cluster vs heap question, it really depends on the use patterns. Often the activity on a junction table isn't truly random. If you imagine a table that maps people to their addresses, most of your activity is likely to still fall into ascending ID order on both FK's. But if you assume truly random inserts, updates, and deletes (a pair of GUIDs for example), my benchmarks came up consistently in favor of the heap.

  • ebay-1087924 (10/25/2010)


    However, if you want to see the best point made, it is the one by JCamburn at 05/23/2003 : 23:19:21

    See http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=6136&whichpage=6

    He actually gets it.

    Maybe you referenced the wrong post, but that one completely misunderstands what normalization even is (he believes it applies only to single tables), then compounds his error by claiming surrogate keys "make normalization more difficult". Total nonsense.

    Uh... no, he's right. You normalize an individual table, no normal forms work on multiple tables simultaneously. And he's also right about surrogate keys - Codd's idea of the tuple didn't not allow for duplicates. Note that he never says surrogate keys are problems in themselves, instead he merely says "If this is true, then how would the introduction of a surrogate key cause any problems from a theoretical point of view?"... if anything he's saying it won't.

    What he does say is:

    "With that said, let me explain what I have found through my own experiences. The main problem with a surrogate key is that it introduces another candidate key. This can make the process of normalization more difficult for those that are inexperienced."

    I don't really see how that is an invalid observation!

    Random Technical Stuff[/url]

  • cdesmarais 49673 (10/25/2010)


    But if you assume truly random inserts, updates, and deletes (a pair of GUIDs for example), my benchmarks came up consistently in favor of the heap.

    You might want to alert MS then, because their published benchmarks show the opposite result. See below for just one example:

    http://msdn.microsoft.com/en-us/library/cc917672.aspx#_Test_Procedure

    Heaps were faster up to SQL2000. The prevalence of clustered indexes, however, has caused MS to heavily optimize for this case.

  • wagner crivelini (10/25/2010)


    Denis

    As I said on this forum, I'm also a fan of surrogate keys.

    Actually, my goal in this first article is to show the concepts and methods I'm using to compare a surrogate key against any other candidate key.

    I pick the product table (surrogate versus char natural key) just to illustrate the use of those methods.

    And, by the way, I built 4 new tables from scratch, so I would not have to consider the impact of extra indexes. Ok, this is not a good implementation on the real world.

    There are so many issues impacting performance that I tried to isolate things and make the comparison a bit easier to understand.

    Fair enough - I just thought the safest thing would be to rebuild your indexes before running your tests Ex:

    ALTER INDEX ALL ON <table> REBUILD WITH (FILLFACTOR = 90)

  • ebay-1087924 (10/25/2010)


    Consider a people list originally keyed by SSN. Replacing SSN with a surrogate key means that -- unless you add a secondary uniqueness constraint -- you can then have duplicate SSNs in the database.

    A) Does the person have an SSN? It's more common than you thing with foreign workers. Being in the US but a Brit, I've had a terrible time getting things sorted without an SSN, often having to use made up SSN's so they've been able to enter me into the system (they made it up, not me! But they had to enter SOMETHING to get the system to take it. That's broke if you have to make things up to get data in) Maybe they've just not got it at that time, or are unconscious and can't provide it (real world experience here...)

    B) Has their SSN changed (ID Fraud the main reason it'd change, something that's becoming more common).

    C) Can a dupe occur? (It shouldn't, but if the system spits out it already exists, is it the new one being entered that's the problem so you can't enter it at that time, or the old one?) You'd not be able to enter any data until resolved, or, have to put in a fake one 'just to get around it for now'.

    Just use a PK and you'll not worry about that. SSN's an attribute of the entity, not the unique value in real life alas.

    Again, read Quazibubble's posts.

  • ebay-1087924 (10/25/2010)


    cdesmarais 49673 (10/25/2010)


    But if you assume truly random inserts, updates, and deletes (a pair of GUIDs for example), my benchmarks came up consistently in favor of the heap.

    You might want to alert MS then, because their published benchmarks show the opposite result. See below for just one example:

    http://msdn.microsoft.com/en-us/library/cc917672.aspx#_Test_Procedure

    Heaps were faster up to SQL2000. The prevalence of clustered indexes, however, has caused MS to heavily optimize for this case.

    I'm familiar with that article. It's a not a junction table, and it's not based on sequential random inserts. When you are dealing with +/- a couple of percent, it matters. Also, I don't need to tell MS anything, they are quite aware of the limits of their general statement.

    "Please note that these are general recommendations that apply to the test used in this study. They may not be well suited for your particular application."

  • ta.bu.shi.da.yu (10/25/2010)


    Uh... no, he's right. You normalize an individual table, no normal forms work on multiple tables simultaneously.

    I'm sorry, but this isn't the case at all. Several normalization rules require information in one table to be decomposed into multiple tables, the 1NF repeating-group rule, the BCNF superkey rule, or the 5NF join dependency rule for instance.

    Normalization is an attribute of the database, not a table.

  • Robert Mahon-475361 (10/25/2010)


    A) Does the person have an SSN? .... Just use a PK and you'll not worry about that.

    Robert, you've misunderstood my post. I wasn't advocating an SSN as a natural key, but simply illustrating to the OP that when you replace a SSN with a surrogate, you then have to add a separate constraint to enforce uniqueness. The PK isn't doing it for you any longer.

  • Hugo Kornelis (10/25/2010)


    Robert Frasca (10/25/2010)


    I'm a little hazy on what you defined as the primary key. If it's a many-to-many relationship how can the combination of the two foreign keys represent a primary key? There is no guarantee of uniqueness in a many-to-many relationship. That's why we created the junction table after all. Perhaps I'm misunderstanding you. I like to use surrogates in this instance so I can have a legitimate primary key as well as a clustered index to avoid having the table be a heap. For me, being able to create a clustered index is where the surrogate key provides value.

    Hi Robert,

    I'll try to give an example. Extremely simplified, using two tables (Products and Suppliers), and a junction table (ProductSuppliers) for the many-to-many relationship. Just for fun (and to illustrate that I believe the surrogate or no surrogate choice should be made on a per-table bases), I'll use a surrogate in the Suppliers table, but not in the Products and SupplierProducts tables.

    CREATE TABLE Products

    (ProductCode varchar(15) NOT NULL,

    -- other columns,

    CONSTRAINT PK_Products PRIMARY KEY (ProductCode)

    );

    CREATE TABLE Suppliers

    (SupplierID int NOT NULL IDENTITY,

    SupplierName nvarchar(60) NOT NULL,

    -- other columns,

    CONSTRAINT PK_Suppliers PRIMARY KEY (ProductCode),

    CONSTRAINT UQ_Suppliers UNIQUE (SupplierName)

    );

    CREATE TABLE ProductSuppliers

    (ProductCode varchar(15) NOT NULL,

    SupplierID int NOT NULL,

    CONSTRAINT PK_ProductSuppliers PRIMARY KEY (ProductCode, SupplierID),

    CONSTRAINT FK_ProductSuppliers_Products FOREIGN KEY (ProductCode)

    REFERENCES Products(ProductCode)

    ON DELETE CASCADE ON UPDATE CASCADE,

    CONSTRAINT FK_ProductSuppliers_Suppliers FOREIGN KEY (SupplierID)

    REFERENCES Suppliers(SupplierID)

    ON DELETE CASCADE

    );

    As you see, the ProductSuppliers table has the expected two foreign keys, and a primary key over the combination of those columns. As long as there are no references to this table, there is absolutely no need for a surrogate key. If there are references to ProductSuppliers, then we'll have to weigh the factors and decide whether or not this table would benefit from a surrogate.

    You're right and, in this case, I'm not sure what value a surrogate key would provide. As I said, this is why I usually default to using surrogate keys and then evaluate on a case by case basis. In this case, it probably wouldn't make sense. Fortunately, junction tables are fairly rare, relatively speaking. Of course, I would have had surrogate keys on the Product and Supplier table. 🙂

    "Beliefs" get in the way of learning.

  • ebay-1087924 (10/25/2010)


    ta.bu.shi.da.yu (10/25/2010)


    Uh... no, he's right. You normalize an individual table, no normal forms work on multiple tables simultaneously.

    I'm sorry, but this isn't the case at all. Several normalization rules require information in one table to be decomposed into multiple tables, the 1NF repeating-group rule, the BCNF superkey rule, or the 5NF join dependency rule for instance.

    Normalization is an attribute of the database, not a table.

    1NF takes a table with repeating groups and splits them into several tables. It doesn't take repeating groups from several tables... I fail to see your point here. When I normalize data, I don't look at multiple tables, I look at the one table and normalize that. This leads to multiple tables.

    If you can show me an example of where you find repeating groups on multiple tables, please feel free to correct me!

    BCNF also relates to an individual table - it cares about dependencies within the one table.

    5NF, again, relates to join dependencies in the one table.

    Random Technical Stuff[/url]

  • cdesmarais 49673 (10/25/2010)


    I'm familiar with that article. It's a not a junction table, and it's not based on sequential random inserts.

    I'm not sure what you mean by an insert that's simultaneously sequential and random, but I'd be interested in seeing what usage pattern you tested that gave a heap preference on a junction table. I've tested this myself on real-world tables having more 1B rows, and the results always *heavily* favored clustered indexes.

  • ta.bu.shi.da.yu (10/25/2010)


    1NF takes a table with repeating groups and splits them into several tables. It doesn't take repeating groups from several tables... I fail to see your point here. When I normalize data, I don't look at multiple tables, I look at the one table and normalize that. This leads to multiple tables.

    If you can show me an example of where you find repeating groups on multiple tables, please feel free to correct me!

    BCNF also relates to an individual table - it cares about dependencies within the one table.

    5NF, again, relates to join dependencies in the one table.

    You're not properly understanding normalization. It's not about "taking one table and splitting it into multiples" (though most examples do illustrate the rules in that manner).

    Put simply, normalization is about expressing a fact without modification anomalies. If a PK uniquely identifies a person, and sex is an attribute of the person, then having that sex column in more than one table anywhere in the database is a violation of normal form. Period. Redundancy violates 2NF .. regardless of what table that redundancy is in.

    Normalization analysis operates across all tables in your database. It's not something that can be fully applied on a table-by-table.

  • ebay-1087924 (10/25/2010)


    Robert Frasca (10/25/2010)


    The whole point of using surrogate keys is to create an additional layer of abstraction

    True.

    This promotes flexibility and scalability.

    It promotes neither. A database using surrogate keys is no more flexible or scalable than one without.

    Surrogate keys promote referential stability: a different concept altogether. However, in the case where a table already has a highly stable natural key, a surrogate can actually work against stability, doing nothing but creating an additional failure point.

    Ebay,

    I cited an example earlier where the underlying natural key changes. (It happens all the time by the way.) This could be a major pain in most cases. You might have to drop and re-create a primary/clustered key as well as drop and re-create some foreign keys. You might also have to change code that queries the table using what had been a unique natural key that could now return more than one row. I'm sure, with some thought, I could come up with some other objects that might be affected.

    If you had a surrogate key you wouldn't have to do anything except possibly modify a non-clustered index. It would be completely transparent to any apps and the rest of the data model. It doesn't get any more flexible or scalable.

    Maybe I should write an article...:-D

    "Beliefs" get in the way of learning.

  • Robert Frasca (10/25/2010)


    I cited an example earlier where the underlying natural key changes..If you had a surrogate key you wouldn't have to do anything except possibly modify a non-clustered index. It would be completely transparent to any apps and the rest of the data model. It doesn't get any more flexible or scalable.

    I understand the perils of a non-stable PK. (I've mentioned them myself in my prior postings). You're simply misusing the buzzword "scalable". A surrogate key makes (in most, not all cases) your design less brittle, and more maintainable. "Scalable" indicates your design more gracefully handles larger workloads. That has nothing to do with surrogate keys, which can increase your design's ability to adapt to changes in the workload requirements, but do not increase its ability to handle more work in general.

  • ebay-1087924 (10/25/2010)


    ...but do not increase its ability to handle more work in general.

    What does that even mean?!?

    Random Technical Stuff[/url]

Viewing 15 posts - 76 through 90 (of 178 total)

You must be logged in to reply to this topic. Login to reply