﻿<?xml version='1.0' encoding='UTF-8'?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/"><channel><title>SQLServerCentral / Article Discussions / Article Discussions by Author / Discuss content posted by Marvin Elder  / Enforcing Data Quality while using Surrogate Keys / Latest Posts</title><generator>InstantForum.NET v2.9.0</generator><description>SQLServerCentral</description><link>http://www.sqlservercentral.com/Forums/</link><webMaster>notifications@sqlservercentral.com</webMaster><lastBuildDate>Fri, 24 May 2013 10:32:46 GMT</lastBuildDate><ttl>20</ttl><item><title>RE: Enforcing Data Quality while using Surrogate Keys</title><link>http://www.sqlservercentral.com/Forums/Topic784006-1643-1.aspx</link><description>@baconmIf I understand Brian's post he's not entirely out in left field, although I think running with *no* data integrity is a huge mistake.I think what he's saying is designing an application so that it's impossible for bad data to get to the SQL engine is the best solution.And I half-way agree with him--but I'm a belt, suspenders and jumpsuit kind of guy myself... :)I run a one man shop and it's very nice to be developer, DBA, and systems analyst all rolled into one. Of course most folks don't understand the impact of that "tiny" new feature, but hey, that's life...At any rate I think letting the SQL engine handle basic integrity tasks is a much better solution. Things like referential integrity, bounds checking, unique constraints are no brainers for the DB, no question. Auditing's another DB level task that as a developer I'm *happy* to let the DB handle. The coding overhead is minimal and the results damn near foolproof. Howsomever:Some things are better handled by the application. There's no need to bother the DB with things like checking to see if input to a numeric field is actually numeric! Using combo boxes to make it impossible to choose an impossible field value is another way to leverage local processing power.The database is busy enough, no need to make it work harder than it must. Not to mention eliminating excessive talk on the wire, which is a good thing too.If you're a DBA you tend to want *everything* in the back end. Not a good idea. Your server is a limited resource being devoured by a swarm of users. Those users have abundent local resources. The best designs take advantage of those local resources.It's all about balance. Let the DB do what it's good at and the application do what it's good at.</description><pubDate>Fri, 11 Sep 2009 09:06:52 GMT</pubDate><dc:creator>roger.plowman</dc:creator></item><item><title>RE: Enforcing Data Quality while using Surrogate Keys</title><link>http://www.sqlservercentral.com/Forums/Topic784006-1643-1.aspx</link><description>Wow, never had a problem!  Are you a one man shop?  How big is you largest database?  How much data?  What tool do you use to develop applications?  How many applications and what is the biggest?  Why do you use access instead of just storing the data in a OS file with your own structure?Today's RDBMS's have a lot of capabilities that you do not have to use but I prefer to have a safety net.  I like to have constraint enforcement in the application to give the user prompt feedback when they enter something incorrectly and in the db so that when a developer forgets to enforce a data rule the mistake is caught before bad data is stored in the database.  Even if we had perfect programmers, our users are very creative and they can find navigation paths that get around application constraints but they have never been able to bypass database constraints.  As a matter of fact we have some data errors that we cannot reproduce and we had to add a unique constraint to prevent the problem.  Also, the data rules are easy to find if they are located in one place, the database, and application developers can reference this information to make sure they are implementing all the data rules.I have never had a severe wreck but I don't intend to disable the airbags in my car since there are idiots out there that I have no control over.</description><pubDate>Fri, 11 Sep 2009 07:39:15 GMT</pubDate><dc:creator>baconm-1145631</dc:creator></item><item><title>RE: Enforcing Data Quality while using Surrogate Keys</title><link>http://www.sqlservercentral.com/Forums/Topic784006-1643-1.aspx</link><description>[quote][b]baconm (9/8/2009)[/b][hr]This article makes it sound like you have to choose between a natural key with a unique index or a surrogate key with a unique index.  Every time I use a surrogate primary key I create a unique key on the columns that comprise the natural key.  Surrogate and natural keys each have their role and work best when used for the purpose they were intended.  Natural keys are the basis of relational database theory and if you do not have them I do not Know why you would use a relational database.  Surrogate keys can be used to simplify joins and allow the natural key to change without having to cascade all the dependent foreign keys. I have to agree that surrogate keys have no role during logical modeling.  All entities should have a unique identifier and at the physical level you can add surrogate keys for all tables that have children and multiple column unique identifiers.  I have seen "designers" that immediately switch to surrogate keys in the logical modal as soon as the unique identifier goes beyond 2 or 3 attributes without ever identifying the full natural key.  It always leads to duplicates and application failure when more than one row is returned.  Yes, I agree, your application is perfect and doesn't allow duplicates but users are devious and will always find that odd navigation path that will allow them to insert duplicates!  It is also difficult to find the surrogate key for a given row if you do not know and are not enforcing uniqueness on the natural key. How often to you SELECT * FROM blahblah WHERE surrogate_key_id = 2468?  Usually your first query is SELECT surrogate_key_id FROM blahblah WHERE COMPANY_NAME = "ACME" AND LOCATION = 'CINCINATTI' to get the surrogate key to use in subsequent queries.   Company name and location are the natural key in BLAHBLAH and should be indexed.  Using my method they are indexed as a result of being in a unique key and the surrogate key would also be indexed as the primary key.As a practical matter, carrying all the unique identifier columns down through all generations can lead to a natural key of 10's of columns.  Here is an extreme example using what I believe are natural keys for a person and a department.  I propose the natural key of a person is first, middle and last name, suffix, date and time of birth, location of birth, father and mother (who also have the same natural identifier) of the person.  In this case the father and mother natural keys are comprised of 8 columns which is 16 columns for two parents plus the other 6 columns for a person makes 22 columns to identify one employee.  I propose that the natural key for a department is company name, company location, date incorporated, incorporation location, (natural key for company) department name, department location and date department created.  That is 6 columns.  Now associate an employee with a department and you have 28 columns in the natural key for the employee department intersection table.  You are looking at 34 predicates in your where clause to join these 4 tables.On the other hand, if you add a surrogate key to the person table (employee and parents) you drop the number of columns in the employee table to 2 surrogate columns for the parents and the 6 others for a total of 8.  Now if the company table has a surrogate key the department table drops to 4 columns and if department has a surrogate key the intersaction table now has 2 columns.  To join the same 4 tables there are only 3 predicates in the join clause.I have a question because I am not a Business Rules person.  What is this Unification Business Rule?  I do not understand the purpose.  I searched the web and didn't get any hits in the first few pages.  Can anyone give me some references so I can understand?[/quote]I think exactly the same.</description><pubDate>Thu, 10 Sep 2009 10:35:38 GMT</pubDate><dc:creator>herreragab</dc:creator></item><item><title>RE: Enforcing Data Quality while using Surrogate Keys</title><link>http://www.sqlservercentral.com/Forums/Topic784006-1643-1.aspx</link><description>I can't say I spend any time on enforcing referential data integrity since I moved from Access to SQL years ago.  I've never had a problem in 10 years.  As I see it, enforcing data integrity is to compensate for something that should never happen in the first place.  It would be like adding if/then/else statements in .NET that would never be executed unless somewhere else in your program you did something incorrect.  What it comes done to, is that it is hard to justfy spending time on it on a per application basis.</description><pubDate>Thu, 10 Sep 2009 08:52:14 GMT</pubDate><dc:creator>Question Guy</dc:creator></item><item><title>RE: Enforcing Data Quality while using Surrogate Keys</title><link>http://www.sqlservercentral.com/Forums/Topic784006-1643-1.aspx</link><description>If you are doing data warehousing then I think surrogate keys are a must for two reasons1. Storage - particularly in large fact tables, a 4 byte int is going to be smaller and easyier to work with and so will the corresponding index you'll probably put on it.2. Unknown Records - You have a nice easy way to reference unknown dimension valuesI don't see any problem in a data warehouse of having some additional indexes to enforce integrity rules.OLTP maybe a different story but thats not my area of expertise. :-)</description><pubDate>Tue, 08 Sep 2009 15:41:02 GMT</pubDate><dc:creator>Martin N</dc:creator></item><item><title>RE: Enforcing Data Quality while using Surrogate Keys</title><link>http://www.sqlservercentral.com/Forums/Topic784006-1643-1.aspx</link><description>[quote][b]G² (9/8/2009)[/b][hr]I'm a big fan of using surrogate keys when designing databases, but they are not intended to enforce data integrity.  They should be used to simplify joins and used solely to establish relationships to other entities in the database.  Example 1 should never happen because any table with a surrogate key should also have a unique constraint placed on the natural key of the table to enforce data integrity.Example 2 should also never happen because it's not a good practice to use a surrogate key on a linking/intersection table.  Those tables should always consist of the combination of the two surrogate keys from the two tables that you are establishing the relationship between.  Thank you for the article.  It was definitely a interesting read.  I've just never encountered either of the issues that you describe because of the practices that I've listed above.Greg[/quote]+1</description><pubDate>Tue, 08 Sep 2009 15:29:49 GMT</pubDate><dc:creator>Paul White</dc:creator></item><item><title>RE: Enforcing Data Quality while using Surrogate Keys</title><link>http://www.sqlservercentral.com/Forums/Topic784006-1643-1.aspx</link><description>I agree with you that you should not use a count of the physical insertion attempts in a schema.  At best it is metadata and usually it is a "pseudo-pointer chain" that you maintain by hand. At least back in the days of TOTAL, IDMS, IMS, etc. the DB handled the pointer chains for us.  I first thought was something like this skeleton:CREATE TABLE Organizations (org_id INTEGER NOT NULL PRIMARY KEY, &lt;&lt; organizational data &gt;&gt;);CREATE TABLE OrgDatabases(org_id INTEGER NOT NULL    REFERENCES Organizations (org_id)    ON DELETE CASCADE    ON UPDATE CASCADE, db_id INTEGER NOT NULL, &lt;&lt; database data &gt;&gt; PRIMARY KEY (org_id, db_id));CREATE TABLE OrgDatabaseGroups(org_id INTEGER NOT NULL    REFERENCES Organizations (org_id)    ON DELETE CASCADE    ON UPDATE CASCADE, db_id INTEGER NOT NULL,   FOREIGN KEY (org_id, db_id)    REFERENCES Organizations (org_id, db_id)    ON DELETE CASCADE    ON UPDATE CASCADE, grp_id INTEGER DEFAULT '{unassigned}' NOT NULL, &lt;&lt; user groups data &gt;&gt; PRIMARY KEY (org_id, db_id, grp_id));CREATE TABLE DatabaseUsers(org_id INTEGER NOT NULL     REFERENCES Organizations (org_id)    ON DELETE CASCADE    ON UPDATE CASCADE, db_id INTEGER NOT NULL, FOREIGN KEY (org_id, db_id)     REFERENCES OrgDatabase (org_id, db_id)    ON DELETE CASCADE    ON UPDATE CASCADE, grp_id INTEGER DEFAULT '{unassigned}' NOT NULL, FOREIGN KEY (org_id, db_id, grp_id)     REFERENCES OrgDatabaseGroups(org_id, db_id, grp_id)    ON DELETE CASCADE    ON UPDATE CASCADE, user_id INTEGER NOT NULL, &lt;&lt; user data &gt;&gt; PRIMARY KEY (org_id, db_id, grp_id, user_id));This is a sequence of nested compound keys.  This should avoid what Tom Johnston called Non-Normal Form redundancies.  It is also fast in products like Sybase where a REFERENCE is implemented as a pointer chain back to the unique row in the referenced table; it is more expensive in MS SQL Server thanks to the redundant copies of the FK values.  People are afraid of overlapping UNIQUE constraints, but they can be very useful.</description><pubDate>Tue, 08 Sep 2009 13:47:56 GMT</pubDate><dc:creator>CELKO</dc:creator></item><item><title>RE: Enforcing Data Quality while using Surrogate Keys</title><link>http://www.sqlservercentral.com/Forums/Topic784006-1643-1.aspx</link><description>I would have to agree with most of the commenters on this post.  I am currently using SQL server 2005 for most of my development, and yes I do create logical data models and identify what the unique entity keys should be, and yes I use surrogate keys as my primary keys in all my tables and add a unique constraint on the entity key.  And until someone shows me a SQL Server 2005 benchmark test that proves that joining two tables using three varchar(10) fields is faster, I'll most likely continue this practice.  For a significant number of database records in one or more tables being joined together, surrogate keys are vital for performance.  The same applies, I'd argue, with a standard SQL Server Analysis Services data source.</description><pubDate>Tue, 08 Sep 2009 13:13:41 GMT</pubDate><dc:creator>kpatrick</dc:creator></item><item><title>RE: Enforcing Data Quality while using Surrogate Keys</title><link>http://www.sqlservercentral.com/Forums/Topic784006-1643-1.aspx</link><description>Thanks for the reference.  Now I understand but I have rarely seen this pattern.  As a DBA with 20 years experience I want as much data integrity in the database as possible but I think some business rules are more understandable in application code than in tables and foreign keys.  I apologize that I do not have more time to study your model in detail but the business rule that only people in a department may access a database owned by the department does not require data to be persisted.   I think it is easier in the application get the department of the user logged into the system and then allow access to database owned by the department.  The information in the lower tables in your diagram have data that does not need to be persisted since it is easily inferred from data in the top tables.</description><pubDate>Tue, 08 Sep 2009 13:13:07 GMT</pubDate><dc:creator>baconm-1145631</dc:creator></item><item><title>RE: Enforcing Data Quality while using Surrogate Keys</title><link>http://www.sqlservercentral.com/Forums/Topic784006-1643-1.aspx</link><description>You point out, with painful illustration, the "achilles heel" of using migrated primary keys in database design:  the keys can multiply rapidly, making it very tiresome to construct "join clauses".But as my article points out, you lose the relational database's capability of enforcing powerful "business rules" such as the Unification Rule, as I illustrate in my article.Here is an article that explains and illustrates the Unification Rule.  The article was written by a data modeling guru, Bert Scalzo (chief architect of the popular TOAD data modeling product):http://www.toadworld.com/LinkClick.aspx?fileticket=hZvoqN6j0js=&amp;tabid=321Marvin Elder</description><pubDate>Tue, 08 Sep 2009 12:21:53 GMT</pubDate><dc:creator>marvin.elder</dc:creator></item><item><title>RE: Enforcing Data Quality while using Surrogate Keys</title><link>http://www.sqlservercentral.com/Forums/Topic784006-1643-1.aspx</link><description>I'm in agreement with G[sup]2[/sup]. Surrogate keys (which my company calls RIDs) are the only way to go for primary keys in entity tables--tables which contain entities. Aside from eliminating the need for cascade updates on primary keys (shudder) RIDs give far higher performance in joins and even RBAR situations. Our RIDs are 4 byte integers but even 8 byte integers would be faster links than long varchar keys.Tables can also have natural keys--which I've always heard referred to as Alternate Keys. AKs are usually a name of some sort. AKs for us are almost always to enforce a unique constraint, sorting, and the like. Nearly every entity will have both a PK and an AK.As for intersection tables the only reason I could see to give an intersection table its own surrogate key would be for auditing purposes in a combined audit table. If you don't care about auditing the intersection then the intersection table can (and should) consist soley of two surrogate keys.I'm somewhat confused by the author's thinking that AKs wouldn't be used as a matter of course in conjunction with surrogate keys. What am I missing?</description><pubDate>Tue, 08 Sep 2009 11:42:36 GMT</pubDate><dc:creator>roger.plowman</dc:creator></item><item><title>RE: Enforcing Data Quality while using Surrogate Keys</title><link>http://www.sqlservercentral.com/Forums/Topic784006-1643-1.aspx</link><description>This article makes it sound like you have to choose between a natural key with a unique index or a surrogate key with a unique index.  Every time I use a surrogate primary key I create a unique key on the columns that comprise the natural key.  Surrogate and natural keys each have their role and work best when used for the purpose they were intended.  Natural keys are the basis of relational database theory and if you do not have them I do not Know why you would use a relational database.  Surrogate keys can be used to simplify joins and allow the natural key to change without having to cascade all the dependent foreign keys. I have to agree that surrogate keys have no role during logical modeling.  All entities should have a unique identifier and at the physical level you can add surrogate keys for all tables that have children and multiple column unique identifiers.  I have seen "designers" that immediately switch to surrogate keys in the logical modal as soon as the unique identifier goes beyond 2 or 3 attributes without ever identifying the full natural key.  It always leads to duplicates and application failure when more than one row is returned.  Yes, I agree, your application is perfect and doesn't allow duplicates but users are devious and will always find that odd navigation path that will allow them to insert duplicates!  It is also difficult to find the surrogate key for a given row if you do not know and are not enforcing uniqueness on the natural key. How often to you SELECT * FROM blahblah WHERE surrogate_key_id = 2468?  Usually your first query is SELECT surrogate_key_id FROM blahblah WHERE COMPANY_NAME = "ACME" AND LOCATION = 'CINCINATTI' to get the surrogate key to use in subsequent queries.   Company name and location are the natural key in BLAHBLAH and should be indexed.  Using my method they are indexed as a result of being in a unique key and the surrogate key would also be indexed as the primary key.As a practical matter, carrying all the unique identifier columns down through all generations can lead to a natural key of 10's of columns.  Here is an extreme example using what I believe are natural keys for a person and a department.  I propose the natural key of a person is first, middle and last name, suffix, date and time of birth, location of birth, father and mother (who also have the same natural identifier) of the person.  In this case the father and mother natural keys are comprised of 8 columns which is 16 columns for two parents plus the other 6 columns for a person makes 22 columns to identify one employee.  I propose that the natural key for a department is company name, company location, date incorporated, incorporation location, (natural key for company) department name, department location and date department created.  That is 6 columns.  Now associate an employee with a department and you have 28 columns in the natural key for the employee department intersection table.  You are looking at 34 predicates in your where clause to join these 4 tables.On the other hand, if you add a surrogate key to the person table (employee and parents) you drop the number of columns in the employee table to 2 surrogate columns for the parents and the 6 others for a total of 8.  Now if the company table has a surrogate key the department table drops to 4 columns and if department has a surrogate key the intersaction table now has 2 columns.  To join the same 4 tables there are only 3 predicates in the join clause.I have a question because I am not a Business Rules person.  What is this Unification Business Rule?  I do not understand the purpose.  I searched the web and didn't get any hits in the first few pages.  Can anyone give me some references so I can understand?</description><pubDate>Tue, 08 Sep 2009 10:43:22 GMT</pubDate><dc:creator>baconm-1145631</dc:creator></item><item><title>RE: Enforcing Data Quality while using Surrogate Keys</title><link>http://www.sqlservercentral.com/Forums/Topic784006-1643-1.aspx</link><description>I'm a big fan of using surrogate keys when designing databases, but they are not intended to enforce data integrity.  They should be used to simplify joins and used solely to establish relationships to other entities in the database.  Example 1 should never happen because any table with a surrogate key should also have a unique constraint placed on the natural key of the table to enforce data integrity.Example 2 should also never happen because it's not a good practice to use a surrogate key on a linking/intersection table.  Those tables should always consist of the combination of the two surrogate keys from the two tables that you are establishing the relationship between.  Thank you for the article.  It was definitely a interesting read.  I've just never encountered either of the issues that you describe because of the practices that I've listed above.Greg</description><pubDate>Tue, 08 Sep 2009 07:20:48 GMT</pubDate><dc:creator>JestersGrind</dc:creator></item><item><title>RE: Enforcing Data Quality while using Surrogate Keys</title><link>http://www.sqlservercentral.com/Forums/Topic784006-1643-1.aspx</link><description>Surrogate keys are still the devil. The problems they prevent pale in comparison to the problems they cause.That being said, this is far and away the best solution if you are going to use them.~BOT</description><pubDate>Tue, 08 Sep 2009 07:12:21 GMT</pubDate><dc:creator>SQLBOT</dc:creator></item><item><title>RE: Enforcing Data Quality while using Surrogate Keys</title><link>http://www.sqlservercentral.com/Forums/Topic784006-1643-1.aspx</link><description>Some comments.The article talks about unique indexes, but formally they are alternake keys or uniquenss constraints.Everybody should be aware that there are situations in which surrogate keys fail to keep RI working. IMO RI is an physicalisation step of a data model and should not be initially modeled at all (in eg a logical model). Only when implementing you could add them to tune the database model.</description><pubDate>Tue, 08 Sep 2009 04:19:37 GMT</pubDate><dc:creator>DM Unseen</dc:creator></item><item><title>RE: Enforcing Data Quality while using Surrogate Keys</title><link>http://www.sqlservercentral.com/Forums/Topic784006-1643-1.aspx</link><description>Thats exactly what Codd intended when he invented the relational model. He wanted the database to protect the integrity of the data, and hence created the relational model to mathematically prove it could actually be implemented. The contructs around RI are not based on a developers whim or choosing to ignore it, but on Codds profound insight in how to maintain data quality.</description><pubDate>Tue, 08 Sep 2009 04:09:19 GMT</pubDate><dc:creator>DM Unseen</dc:creator></item><item><title>RE: Enforcing Data Quality while using Surrogate Keys</title><link>http://www.sqlservercentral.com/Forums/Topic784006-1643-1.aspx</link><description>A DWH is a different beast altogether and should not be confused with constructing OLTP database models. Your approach is good for a DWH but less apropraite for an OLTP system, while the author is focusing on OLTP datamodel design and not on DWH design.</description><pubDate>Tue, 08 Sep 2009 04:04:12 GMT</pubDate><dc:creator>DM Unseen</dc:creator></item><item><title>RE: Enforcing Data Quality while using Surrogate Keys</title><link>http://www.sqlservercentral.com/Forums/Topic784006-1643-1.aspx</link><description>"Example 1. (Introduces risk of record duplication):..."Imo the database should not be making decisions the application should be taking.  Whilst your point is technically valid, it's encouraging application logic into the database, where it is difficult to tell a duplicate record from a valid second entry.</description><pubDate>Tue, 08 Sep 2009 03:31:13 GMT</pubDate><dc:creator>josephmulhall</dc:creator></item><item><title>RE: Enforcing Data Quality while using Surrogate Keys</title><link>http://www.sqlservercentral.com/Forums/Topic784006-1643-1.aspx</link><description>It's an interesting take on surrogate keys.I've designed a few data warehouses over the years and I've not really regarded duplicity as a problem, rather inherent in the data, if it is by nature going to provide historical analysis. Here's why...- I will often include metadata in the design at the physical level to help understand the lifecycle of the data. These fields include created dates, modified dates, current flag and so on...- Multiple versions of the same ID would be identified by the "created" and "modified" dates, as well as a flag indicating it is the current version of the record; in essence, this is a slow changing dimension/entity.- ETL processes check if there is a current record for the incoming data and assign the existing key or create a new one as necessary.- I *always* move away from intelligent keys for the physical structures; many times I have joined an existing ETL project where the model has relied on the business keys, only to find that the data quality/constraint enforcement causes more problems than it solves.I'm liking your work, Marvin. I'll keep an eye out for more of your posts.Cheers</description><pubDate>Tue, 08 Sep 2009 00:46:07 GMT</pubDate><dc:creator>surreydude.</dc:creator></item><item><title>Enforcing Data Quality while using Surrogate Keys</title><link>http://www.sqlservercentral.com/Forums/Topic784006-1643-1.aspx</link><description>Comments posted to this topic are about the item [B]&lt;A HREF="/articles/Data+Modeling/67684/"&gt;Enforcing Data Quality while using Surrogate Keys&lt;/A&gt;[/B]</description><pubDate>Tue, 08 Sep 2009 00:11:52 GMT</pubDate><dc:creator>marvin.elder</dc:creator></item></channel></rss>