Using a Surrogate vs Natural Key

  • The examples left out some very import points.

    1.  "Filter Predicates in the WHERE statements".  If the client applications always use the SS# number to lookup information then they will ALWAYs be required to examine the parent table first to get the surrogate key to reference the child tables.   This could have a major impact to performance where a natural key could be used to directly access the child table.

    2. If the natural key field is to reside in the parent table, then it will require a unique index(Unique Key) as well as to prevent duplication.  Many data models or database miss this very simple requirement and don't create the supporting index.   So your index numbers in the example will be doubled, which will effect the insert and delete numbers reported.

    3.For the above cases I would not recommend surrogate key when a singe column natural key would work.   For a multiple column natural key then I would recommend the surrogate key as in the case of address table.  Here the natural(Unique) key is LINE1, LINE2, CITY, STATE, & ZIP with supporting unique index.  This key is way to big to carry over the child table so support surrogate key in this would preform better, even with better.  Here a join is require and no direct lookup could used on the child table.

     

    So the debate continues.

  • Apart from performance considerations, I prefer surrogate keys because they abstract the relationship of tables (columns used for RI) from the data itself. Then a separate unique index must be created for the natural key to maintain data integrity. I find this the ideal.

  • It often occurs that we need to collect partial information to be completed later in a subsequent step when it becomes available. Haven't we all stood in line behind someone who can't remember what to put into a required field? Surrogate keys help mitigate this annoyance.

    Example. I would like to change my display name on this site, but can't. Is it my static primary key? Although I can imagine other legitimate reasons....

  • I am also in the "surrogate key" camp ... BUT ...

    Always make sure that the column or columns that compose the business key have a unique index put on them, to prevent natural duplicates with different surrogate keys (for instance people from "the business" add the same employee twice).

    If the business key column(s) would become too long or too many, at least from a performance or storage perspective (unique index), alternatively, you could compute a hashvalue on all the business keys, and put a unique index on only the hashvalue column.

    Some additional considerations:

    • make the hashkey preferably a computed column to enforce that it will always be up-to-date.
    • Use MD5 or stronger hashing algorithm (SHA1, SHA256) to make the changes on hash collissions astronomicly small.

    In datawarehousing hash values have become popular when you use Data Vault, then the Hashkey is even becoming primary key of the Hub, for OLTP systems I would not recommend this, because of (as mentioned before) potentially changing business keys.

  • Dr. Codd defined a surrogate key as being generated by the database engine and never exposed to the users. What he had in mind would be more like a hashing algorithm than an IDENTITY table property (please note a table property is not a column) or a GUID that the user has to manipulate.

    Your definition of a surrogate into lack validation and verification properties. These are essentially physical locators, like the pointers we used in network databases before RDBMS. As you pointed out, we need to have the actual key of the table stored and cannot just use the physical locator. However, is another problem. How do we verify the integrity of the physical locator? IDENTITY will not stay the same when I moved the database to another machine and the physical count of the physical records put on the new disk changes. When I'm teaching this, I use the analogy of numbered parking spaces and a garage. The parking space number is certainly much shorter than a full VIN number. But I have to have the VIN in my data to get insurance, licenses, or get on Carmax.

    Many years ago I ran into a cute data integrity problem because of this. If you're familiar with the medallion system of taxis that exist in many cities will appreciate it. The city issues a limited number of taxi medallions that have to be displayed so they can be checked by the police. A taxi company in New York City identified its cabs by a company taxi number stenciled on the fenders, the VIN on the engine block for legal insurance purposes and the medallion number limit

    Checker cab fenders unbolt very easily for bodywork and the company found they could rotate the medallions and the taxi numbers. This let them shuffle their fleet around and play fast and loose with the repair and mileage reports. It wasn't until a sharp police officer noticed that these three identifiers did not agree as to what vehicle they actually belong to. The give-away was mismatched fender numbers. Too many identifiers without any validation or verification!

    Please post DDL and follow ANSI/ISO standards when asking for help. 

  • I see this mistake a lot unfortunately. Using a hashed value based on a multiple column natural key as a primary key.

    Algorithms used for hashing values are never guaranteed to be unique.  NEVER.  Two completely different set of values can create the same hashed value.   Also a hashed value will generally require more storage than that of integer based field Than than that used for surrogate key.

  • doug.partch 82735 wrote:

    I see this mistake a lot unfortunately. Using a hashed value based on a multiple column natural key as a primary key.

    Algorithms used for hashing values are never guaranteed to be unique.  NEVER.  Two completely different set of values can create the same hashed value.   Also a hashed value will generally require more storage than that of integer based field Than than that used for surrogate key.

     

    I worked for a company that did this - row hashes using checksum and binary checksum. - it is evil - we tried to export certain elements of the data to cloud NoSQL and found that we had had a lot more collisions than we expected.

    MVDBA

  • jcelko212 32090 wrote:

    Dr. Codd defined a surrogate key as being generated by the database engine and never exposed to the users. What he had in mind would be more like a hashing algorithm than an IDENTITY table property (please note a table property is not a column) or a GUID that the user has to manipulate.

    Your definition of a surrogate into lack validation and verification properties. These are essentially physical locators, like the pointers we used in network databases before RDBMS. As you pointed out, we need to have the actual key of the table stored and cannot just use the physical locator. However, is another problem. How do we verify the integrity of the physical locator? IDENTITY will not stay the same when I moved the database to another machine and the physical count of the physical records put on the new disk changes. When I'm teaching this, I use the analogy of numbered parking spaces and a garage. The parking space number is certainly much shorter than a full VIN number. But I have to have the VIN in my data to get insurance, licenses, or get on Carmax.

    Many years ago I ran into a cute data integrity problem because of this. If you're familiar with the medallion system of taxis that exist in many cities will appreciate it. The city issues a limited number of taxi medallions that have to be displayed so they can be checked by the police. A taxi company in New York City identified its cabs by a company taxi number stenciled on the fenders, the VIN on the engine block for legal insurance purposes and the medallion number limit

    Checker cab fenders unbolt very easily for bodywork and the company found they could rotate the medallions and the taxi numbers. This let them shuffle their fleet around and play fast and loose with the repair and mileage reports. It wasn't until a sharp police officer noticed that these three identifiers did not agree as to what vehicle they actually belong to. The give-away was mismatched fender numbers. Too many identifiers without any validation or verification!

    Surrogate keys are inherently local to the table they are part of. If you move the table to a different database part of the import process would have to be resetting the identity value to the max surrogate value + 1 of course.

    Now, if you actually meant merging several tables containing (local) surrogate keys into a master table that's another kettle of fish altogether.

    Some while back I was working on a D&D game product with similar needs. It needed the ability for all users (world wide, theoretically) to be able to export their own created items to other users without keys colliding.

    The obvious solution was a GUID, however I hate GUIDs for a number of reasons. One is the huge size--16 bytes does not a useful physical pointer make! My solution was to have a pair of keys per object, one global GUID stored in a table with a local surrogate key being an INT. Thus every player's object could have a surrogate of say 156, but object 156 would be a different object in each database.

    The table containing the GUID surrogate also contained the local surrogate, which was actually a foreign key to the object table. That way I kept all the advantages of an INT identity with the global uniqueness of a GUID without the GUID chewing up massive amounts of performance and space. This was important because back then players couldn't afford lots of either.

     

  • roger.plowman wrote:

    The obvious solution was a GUID, however I hate GUIDs for a number of reasons. One is the huge size--16 bytes does not a useful physical pointer make! My solution was to have a pair of keys per object, one global GUID stored in a table with a local surrogate key being an INT. Thus every player's object could have a surrogate of say 156, but object 156 would be a different object in each database.

    I work with a lot of "multitenancy" databases,  I doubt I would use a GUID in those situations (even using sequentialGUID) .. typically we use AccountCode (int) and oldID (varchar based on the import data) as the clustered index and import from our source systems... but we still put an autonumber PK  field in so that our database design is not constrained by our customers data files.

    this keeps our FK size down and our code neat - it's a lot easier to do "delete from table where id=5" than using multiple columns... even better it plays into the hands of windowing functions where you can  row_number() (partition by x order by id) rather than using datetimes and other non-reliable fields (even in a cursor, datetimes can be duplicated if they are within 3 milliseconds)

     

     

    MVDBA

  • Wrong. Look up "perfect hashing" and "minimal perfect hashing" in the literature, In particular, the work of R. J. Cichelli. Then find someone who has worked with Teradata and talk to them.

    Please post DDL and follow ANSI/ISO standards when asking for help. 

  • jcelko212 32090 wrote:

    Dr. Codd defined a surrogate key as being generated by the database engine and never exposed to the users. What he had in mind would be more like a hashing algorithm than an IDENTITY table property (please note a table property is not a column) or a GUID that the user has to manipulate.

    I agree with the first sentence of that, but not the second.  Please cite the document an paragraph and provide the actual quote where Dr. Codd even alluded to such a mistake.

    jcelko212 32090 wrote:

    Wrong. Look up "perfect hashing" and "minimal perfect hashing" in the literature, In particular, the work of R. J. Cichelli. Then find someone who has worked with Teradata and talk to them.

      You're the one citing things... please provide the links so that we don't have to second guess on your references like we've had to in the past.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • I've worked with terabytes of data for years, so I have to disagree with joe.

    you can often expose your artificial key to the end user if needed, things like "lineitem"

    it often helps also then you need a reference number for support calls.(order number 55621 is a lot easier to identify than "the one I placed at 12:15pm and my address is xxxx")

    MVDBA

  • Database developers would be wise to stop thinking in terms of end users directly accessing data just as .NET developers (or any other high level platform/language)don't assume a user might want to casually peruse their source code.

     

  • Let me quote from my own book (SQL FOR SMARTLIES, 4.1.3). The ACM source is at the end.

    Finally, an appeal to authority, with a quote from Dr. Codd: ".. Database users may cause the system to generate or delete a surrogate, but they have no control over its value, nor is its value ever displayed to them...".

    This means that a surrogate ought to act like an index; created by the user, managed by the system and NEVER seen by a user. That means never used in code, DRI or anything else that a user writes.

    Codd also wrote the following:

    "There are three difficulties in employing user-controlled keys as permanent surrogates for entities.

    (1) The actual values of user-controlled keys are determined by users and must, therefore, be subject to change by them (e.g. if two companies merge, the two employee databases might be combined with the result that some or all of the serial numbers might be changed.).(2) Two relations may have user-controlled keys defined on distinct domains (e.g. one uses social security numbers, while the other uses employee serial numbers) and yet the entities denoted are the same.

    (3) It may be necessary to carry information about an entity either before it has been assigned a user-controlled key value or after it has ceased to have one (e.g. an applicant for a job and a retiree).These difficulties have the important consequence that an equi-join on common key values may not yield the same result as a join on common entities. A solution - proposed in part [4] and more fully in [14] - is to introduce entity domains which contain system-assigned surrogates. Database users may cause the system to generate or delete a surrogate, but they have no control over its value, nor is its value ever displayed to them....." (Codd, E.; "Extending the Database Relational Model to Capture More Meaning", ACM Transactions on Database Systems, 4(4), pp. 397-434; 1979).

    Please post DDL and follow ANSI/ISO standards when asking for help. 

  • Thanks, Joe.  As you've just shown, while Dr. Codd does express that it should be similar to an index, it does NOT, however, say anything explicit about it intending to be a hash, which would be a mistake for many reasons.  Even a GUID would be better than a hash.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 15 posts - 16 through 30 (of 34 total)

You must be logged in to reply to this topic. Login to reply