Click here to monitor SSC
SQLServerCentral is supported by Red Gate Software Ltd.
 
Log in  ::  Register  ::  Not logged in
 
 
 
        
Home       Members    Calendar    Who's On


Add to briefcase ««12345»»»

Do You Really Need Surrogate Keys? Part 1 – Concepts and Methods Expand / Collapse
Author
Message
Posted Monday, October 25, 2010 4:26 AM


SSCertifiable

SSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiable

Group: General Forum Members
Last Login: Wednesday, November 19, 2014 3:56 PM
Points: 6,086, Visits: 8,354
When I saw the subject of today's featured article in the daily newsletter, I was enthusiastic. But after reading the article, I am quite disappointed.

I was enthusiastic because there are many misconceptions and bad practices around surrogate keys, and I hoped this article would help eradicate them. But instead, the article itself falls victim to one of the most common misconceptions.

A surrogate key should never replace the "business" key. It should be provided as an additional key, side by side with the original one. If ProductNumber identifies a product in the first table (without the surrogate key), it is also a candidate key in the second table (with surrogate). To prevent duplicate data entry, it should be declared as such, using a UNIQUE constraint. The table design screenprints posted don't carry this information, but the posted details from the execution plan of the UPDATE statement clearly show that such a constraint is not in place - otherwise, the execution plan would have been much more efficient.

This renders all the comparisons in the article completely useless. A properly implemented surrogate key always goes hand in hand with a UNIQUE constraint on the "business" key, so such an implementation should be the basis of all comparisons.
(And that being said, I also question the usefullness of a comparison that is based on one single query and one single update. Plus, the use of the estimated plan instead of the actual execution stats also make the exercise less meaningful, as the estimated plan can sometimes be completely off).

But I'll remain positive, and hope that the next part(s) of this article will be better!



Hugo Kornelis, SQL Server MVP
Visit my SQL Server blog: http://sqlblog.com/blogs/hugo_kornelis
Post #1009868
Posted Monday, October 25, 2010 4:31 AM
Forum Newbie

Forum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum Newbie

Group: General Forum Members
Last Login: Tuesday, October 22, 2013 6:36 AM
Points: 7, Visits: 26
An interesting start, but the discussion doesn't go far enough. There needs to be a distinction: what is the target? Is the target OLTP systems? or is the target a data warehouse or an OLAP cube? The function will determine part of (not all of) the need

There is not enough data here to make formal statements. The optimizer will change it's path (as will the results change) for the following points:

The optimizer will change...
* at 500,000 rows, 1M, 5M, 10M, 100M rows.
* what it's doing when joins are introduced across two tables each at 100M rows.
* based on the type of indexes or compound indexes used on the table.
* based on statistics being up-to-date (also the type and depth of statistics run).
* based on the DOP (degree of parallelism)
* based on the amount of RAM
* based on 64 bit or 32 bit core
* based on the version of SQLServer
* based on the size of the Database Block
* Size of the surrogate (not all surrogates can be integer driven)
* Width of the row in the table
* based on the logging
* based on the settings of the database
* based on the partitioning across I/O channels

I have seen too many tests run at "low volume" (76k rows, or 245 rows) and stand up and say: Eureka! I have performance.
Then when the queries make it to production (even a year later), they die a miserable death because of the lack of foresight to test at high volume levels.

I think a test like this should be done, but it should be done to scale (80M rows in 1 table, then 2 tables joined, then 4 tables, then 8 tables joined).
Then it should be done again with different settings (as indicated above).

Great start, keep up the good work, just carry it further.

Cheers,
Dan Linstedt
http://danLinstedt.com
Post #1009871
Posted Monday, October 25, 2010 6:02 AM
Ten Centuries

Ten CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen Centuries

Group: General Forum Members
Last Login: Tuesday, November 11, 2014 2:51 AM
Points: 1,070, Visits: 912
Robert Mahon-475361 (10/25/2010)
kasisriharsharao (10/25/2010)
Thanks for your post... If the Table has already has INTEGER type Natural Key, would you still recommend the Surrogate Key ...


Read;
http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=6136&whichpage=2

All of quazibubble's posts, including his later posts from another account when the first one gets banned and you'll have your answer.

Out of curiosity though, what integer natural key are you thinking?


You seriously embrace his inane ramblings? Good luck with that!



Post #1009924
Posted Monday, October 25, 2010 6:12 AM
Grasshopper

GrasshopperGrasshopperGrasshopperGrasshopperGrasshopperGrasshopperGrasshopperGrasshopper

Group: General Forum Members
Last Login: Thursday, January 31, 2013 1:23 PM
Points: 15, Visits: 112
michael vessey (10/25/2010)
An interesting article, but there seems to be one aspect that has not been covered.

The article suggests that there is a slight overhead using a surrogate key (as an extra column is created ), however actually this can lead to a massive saving where foreign keys are concerned.



Precisely. Until you actually use the PK as an FK 4 levels down you cannot begin to appreciate how much crap is carved out of the great grandchild table by the use of a surrogate key. All of that crap has to be used in joins if you use a natural key. So with a natural key there is a storage overhead as well as performance overhead. And the overhead of the indexes (storage and maintenance) required on all that crap in each succeeding generation.

The PK is (or is intended to be) nothing more or less than a pointer from parent to child. It is not what defines uniqueness. The column or columns which define uniqueness can change on a moment's notice. The PK/FK should never change. With a surrogate that is in fact the case, it will never change. With a natural key Murphy says it absolutely will change, often and at the worst possible time.

Surrogate keys were introduced long ago for a very good reason. They allowed us as designers to have a tool which the DB engine uses to maintain referential integrity in the most efficient manner possible, while simultaneously removing the herculean (and impossible) task of selecting a candidate key that would never change. We get to let the db engine do its job while we do ours, defining correctly normalized entities.

There is nothing about using a surrogate key that removes the responsibility of proper normalization, or defining candidate keys, or of creating the constraints used to enforce uniqueness. That is absolutely part of our job, and if we don't do all of that stuff every single time we are not doing our job. Telling the db engine how to do its job is not my job however.

Surrogate PK / FK. Set it and forget it.
Post #1009930
Posted Monday, October 25, 2010 6:42 AM
Mr or Mrs. 500

Mr or Mrs. 500Mr or Mrs. 500Mr or Mrs. 500Mr or Mrs. 500Mr or Mrs. 500Mr or Mrs. 500Mr or Mrs. 500Mr or Mrs. 500

Group: General Forum Members
Last Login: Wednesday, November 19, 2014 12:30 PM
Points: 554, Visits: 672
kasisriharsharao (10/25/2010)
If the Table has already has INTEGER type Natural Key, would you still recommend the Surrogate Key ...


What if the natural key changes? (It can happen and it DOES happen.) This is the primary reason why I always use surrogate keys. Performance is a nice bonus but I'd be willing to take a small performance hit for consistency and concurrency. Why is it that so many people think that a key needs to add value to the data? Why can't its value just be that it's a key?


"Beliefs" get in the way of learning.
Post #1009948
Posted Monday, October 25, 2010 6:55 AM
Valued Member

Valued MemberValued MemberValued MemberValued MemberValued MemberValued MemberValued MemberValued Member

Group: General Forum Members
Last Login: Saturday, November 27, 2010 10:41 PM
Points: 58, Visits: 113
However, if you want to see the best point made, it is the one by JCamburn at 05/23/2003 : 23:19:21

See http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=6136&whichpage=6

He actually gets it.
Maybe you referenced the wrong post, but that one completely misunderstands what normalization even is (he believes it applies only to single tables), then compounds his error by claiming surrogate keys "make normalization more difficult". Total nonsense.

The reason the "surrogate vs. natural" debate even exists is because there is no one right answer in all situations. It depends not only on the data being modeled, but a whole host of other factors: whether you're OLAP vs. OLTP, query vs. update ratio, and even things like code maintenance costs.

Surrogate keys often require additional indexes. And, because you're eliminating natural columns in child tables as join keys, additional joins as well. You may need to define additional unique constraints because your key no longer enforces business uniqueness. Finally, they make your tables slightly larger, which for very narrow tables, may have performance and space implications. They used to be prone to index contention issues in some RDBMs (though modern products have pretty much entirely corrected this).

That said, in **practice** these caveats are usually (but not always) minor, often to the point of insignificance. Surrogate keys reduce cascading updates and other annoying problems. They can save space when a natural key is very wide. But most importantly, they tend to make a system more resilient to changes in business logic. In nearly all firms, programmer time is a resource far more valuable than cpu cycles or disk i/o's.

The rule of thumb I always tell junior developers is, if you can clearly and concisely state why a natural key should be preferred in a specific instance -- use it. If not, then a surrogate key should be your default.
Post #1009959
Posted Monday, October 25, 2010 7:10 AM
Grasshopper

GrasshopperGrasshopperGrasshopperGrasshopperGrasshopperGrasshopperGrasshopperGrasshopper

Group: General Forum Members
Last Login: Thursday, July 10, 2014 9:44 PM
Points: 12, Visits: 40
RichB (10/25/2010)

You seriously embrace his inane ramblings? Good luck with that!


Totally! Sure, it takes a bit of time to get the good stuff out, but everything said is solid and shows that it's written from experience. Harsh experience where someone's followed the info from the original article and created a nightmare for this guy to pick apart and repair. That'd tend to make me slightly bitter too.
But... Is there really much to disagree with? Not his writing style, but the actual information provided?
Post #1009966
Posted Monday, October 25, 2010 7:14 AM
Valued Member

Valued MemberValued MemberValued MemberValued MemberValued MemberValued MemberValued MemberValued Member

Group: General Forum Members
Last Login: Saturday, November 27, 2010 10:41 PM
Points: 58, Visits: 113
What if the natural key changes? (It can happen and it DOES happen.) This is the primary reason why I always use surrogate keys.
Any rule that has the word "always" in it is usually a bad rule.

Surrogate keys are generally preferable, but there are countless cases where they are not. Consider an audit table that contains only a single column: the timestamp an event occurred. That value will *never* change. Period. Adding a surrogate will vastly bloat the table (by 100% if you're using a smalldatetime), double the number of required indexes, force page lookups whenever you need to perform a join (since your PK is no longer a covering index). And in return, gives you -- nothing whatsoever.

The only reason to ever surrogate-key a table like that is if multiple rows might possibly have the same timestamp. But in cases where business logic forbids this, there is no reason to do so.

Now you may rightfully ask "what if your business logic changes in the future". If there's even a remote possibility of that, then you likely want to bite on a surrogate key anyway. But in cases where that just isn't possible because of the data being modelled, then a "always use surrogate rule" is pointless and damaging.
Post #1009967
Posted Monday, October 25, 2010 7:29 AM
SSC Rookie

SSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC Rookie

Group: General Forum Members
Last Login: Sunday, April 10, 2011 11:42 AM
Points: 25, Visits: 37
this article serves a perfect example of data bases designed by those DBAs who understand little of the real business worlds and tend to take a static view of the data at hand.
If the world could stand still, the DBA can do a superb job of optimizing a database. Unfortunately, that fantasy only exists in toy databases such as "AdventureWorks".
Never use a candidate key that has any business meaning in it as your PK, period!



Post #1009978
Posted Monday, October 25, 2010 7:38 AM
SSC Eights!

SSC Eights!SSC Eights!SSC Eights!SSC Eights!SSC Eights!SSC Eights!SSC Eights!SSC Eights!

Group: General Forum Members
Last Login: Yesterday @ 11:36 AM
Points: 985, Visits: 1,838
Another issue that is milder but chafes me is the concept that the PK field is usually the clustered index. Certainly this is default, and certainly this sometimes makes sense, but in large tables it may well be that other fields make more sense. If one table joins to another table frequently and that join returns multiple rows, like a claim table joined to a notes table on claim ID, it may make sense to cluster the notes table on claim ID since that will allow the query that paints the notes screen on the claim system to work far faster (and since note ID is almost never used for any queries). Again, one needs to think about what makes sense performance-wise and not just blindly make all primary keys clustered.
Post #1009983
« Prev Topic | Next Topic »

Add to briefcase ««12345»»»

Permissions Expand / Collapse