Overusing Identities

  • Not if you reseed the column or use IDENTITY_INSERT to insert values. If an IDENTITY column doesn't have a unique constraint or index then it may not be unique.

    As my granddaddy used to say,

    You can make things fool-proof, but you can't make 'em dang-fool proof... some danged fool will work at finding a way to break anything you make.

    If you just arbitrarily do things like using IDENTITY_INSERT or reseed the Identity column, then, IMHO, you deserve to have problems. If you don't screw around with the Identity column, then it won't have those problems. 😉

    Ralph D. Wilson II
    Development DBA

    "Give me 6 hours to chop down a tree and I will spend the first 4 sharpening the ax."
    A. Lincoln

  • RalphWilson (3/2/2010)


    If you just arbitrarily do things like using IDENTITY_INSERT or reseed the Identity column, then, IMHO, you deserve to have problems. If you don't screw around with the Identity column, then it won't have those problems. 😉

    You mean you've never done those things? And you've never written buggy code either? In Murphy's world, if you have an IDENTITY column without a unique constraint then you can expect duplicate values sooner or later. IDENTITY is merely a means of generating a default value, not a way to guarantee uniqueness.
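
    For what it's worth, a minimal T-SQL sketch of how that can happen (the table and values here are invented purely for illustration):

    CREATE TABLE dbo.Demo (Id int IDENTITY(1,1) NOT NULL, Val varchar(10) NOT NULL);
    -- Note: no PRIMARY KEY or UNIQUE constraint on Id.

    INSERT INTO dbo.Demo (Val) VALUES ('first');   -- Id = 1
    INSERT INTO dbo.Demo (Val) VALUES ('second');  -- Id = 2

    -- Reseed back below the values already handed out...
    DBCC CHECKIDENT ('dbo.Demo', RESEED, 0);

    INSERT INTO dbo.Demo (Val) VALUES ('third');   -- Id = 1 again: a duplicate

    -- Only a constraint actually guarantees uniqueness:
    -- ALTER TABLE dbo.Demo ADD CONSTRAINT UQ_Demo_Id UNIQUE (Id);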

    As for the email address one, I (and a lot of other people) got dumped by a formerly free email provider without any warning a few years ago. Instantly, no access to the email account. Suddenly I was registered at sites and identified by an email address that had to change.

    This caused me a load of headaches, and a few for those maintaining the websites too. Some places were more helpful than others, and some systems made it easier for me or their administrators than others.

    Either way, I think using a natural key as a PK when it is reasonable to expect its value to change is not a good implementation. Suddenly you either have to duplicate the natural key or lose its value over time, where it might be appropriate to keep it for auditing purposes; I guess it may only need to be duplicated where the historical value is required.

    PS Thanks again Steve. (Yep, this was one of the helpful places!!!)

    Gaz

    -- Stop your grinnin' and drop your linen...they're everywhere!!!

  • David Portas (3/2/2010)


    RalphWilson (3/2/2010)


    If you just arbitrarily do things like using IDENTITY_INSERT or reseed the Identity column, then, IMHO, you deserve to have problems. If you don't screw around with the Identity column, then it won't have those problems. 😉

    You mean you've never done those things? And you've never written buggy code either? In Murphy's world, if you have an IDENTITY column without a unique constraint then you can expect duplicate values sooner or later. IDENTITY is merely a means of generating a default value, not a way to guarantee uniqueness.

    Dunno about Ralph, but *I* never ever do those things. Buggy code won't change identity column seeds, and the only time to use Identity_Insert is when you're porting legacy data to a new system--and it's dead simple to compare the old columns against the new ones to make sure no shenanigans occurred. Oh, and also? Unique constraints on an identity column go without saying. There would be no purpose to an identity column that wasn't used as the PK.

    David and Len seem to be from a mathematics background. From that perspective their points about the relational model are entirely valid.

    However. 🙂

    From an implementation standpoint, using SQL Server, their rigorous approach will produce sub-optimal results. Page fragmentation being the most glaring, along with the performance hit of having to constantly shuffle a clustered index. The whole point of clustering is *speed*, and it's always going to be faster to append a record than insert it.

    In the same vein, a 4-byte key (assuming you can tolerate the roughly 2-billion-value limit of a positive-only int) is the clear winner in both space consumption and processing speed over some multi-column string key.

    Likewise, from an *implementation* perspective the PK is (or should be) application-transparent, unchanging, and unique. Its only purpose is to link rows in different tables, and as a *link* you want those characteristics. Now, for human retrieval purposes you obviously want some kind of natural key, preferably unique. Any given table can of course have multiple alternate keys that identify the record, but these keys will usually be mutable and random (meaning you don't want them clustered), large (meaning you don't want them as FKs), and not always reliably unique (meaning they can't be uniquely constrained). A rough sketch of that kind of layout follows at the end of this post.

    Theory is one thing, but reality always applies additional constraints. 🙂
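
    A minimal sketch of the layout described above, assuming a hypothetical Customer table whose email address serves as the natural (alternate) key:

    CREATE TABLE dbo.Customer
    (
        CustomerId   int IDENTITY(1,1) NOT NULL,
        EmailAddress nvarchar(254)     NOT NULL,
        CustomerName nvarchar(100)     NOT NULL,

        -- Narrow, ever-increasing surrogate: clustered and used for all FK links.
        CONSTRAINT PK_Customer PRIMARY KEY CLUSTERED (CustomerId),

        -- The natural/business key is still enforced, but nonclustered and free to change.
        CONSTRAINT UQ_Customer_Email UNIQUE NONCLUSTERED (EmailAddress)
    );

    CREATE TABLE dbo.CustomerOrder
    (
        OrderId    int IDENTITY(1,1) NOT NULL,
        CustomerId int               NOT NULL,

        CONSTRAINT PK_CustomerOrder PRIMARY KEY CLUSTERED (OrderId),
        CONSTRAINT FK_CustomerOrder_Customer
            FOREIGN KEY (CustomerId) REFERENCES dbo.Customer (CustomerId)
    );

    New rows always append to the end of the clustered index on CustomerId, and the 4-byte surrogate keeps the foreign keys and nonclustered index rows narrow, while the UNIQUE constraint still protects the business key.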

  • roger.plowman (3/3/2010)


    David and Len (Les actually :cool:) seem to be from a mathematics background. From that perspective their points about the relational model are entirely valid.

    Well, initially music on my part, but switched to CS (BIT, MIT, and working on DCS-DSS at present...only took me 30yrs in the trenches to decide what I wanted to be). Still, music and mathematics are so intertwined as to be transparent, and set theory, well... kissing cousins with RA.

    However. 🙂

    From an implementation standpoint, using SQL Server, their rigorous approach will produce sub-optimal results.

    Page fragmentation being the most glaring, along with the performance hit of having to constantly shuffle a clustered index. The whole point of clustering is *speed*, and it's always going to be faster to append a record than insert it.

    Au contraire... a great book I often refer to is "Database Tuning" by Dennis Shasha... a must-read for any aspiring DBA.

    In the same vein, a 4-byte key (assuming you can tolerate the roughly 2-billion-value limit of a positive-only int) is the clear winner in both space consumption and processing speed over some multi-column string key.

    Clearly, long strings rarely make good candidate keys, though not necessarily because of their length (hashing does wonders); usually it's because they simply aren't good candidates and are seldom immutable. In many cases, UIDs are the only realistic option, short of stamping a class of such objects with a VIN, or inserting/implanting an RFID :w00t: (a rough sketch of the hashing trick follows at the end of this post).

    Theory is one thing, but reality always applies additional constraints. 🙂

    Without a doubt...

    Dr. Les Cardwell, DCS-DSS
    Enterprise Data Architect
    Central Lincoln PUD
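
    To illustrate the "hashing does wonders" point above, here is a rough sketch only (the table, column, and variable names are all hypothetical): persist a hash of the long string, index the narrow hash for cheap equality seeks, and compare the full value to weed out collisions.

    CREATE TABLE dbo.Document
    (
        DocumentId  int IDENTITY(1,1) NOT NULL PRIMARY KEY,
        DocumentUrl nvarchar(2000)    NOT NULL,
        -- HASHBYTES is deterministic, so the computed column can be persisted and indexed.
        UrlHash AS CAST(HASHBYTES('SHA1', DocumentUrl) AS binary(20)) PERSISTED
    );

    CREATE INDEX IX_Document_UrlHash ON dbo.Document (UrlHash);

    -- Seek on the narrow hash, then confirm the full string to rule out collisions.
    DECLARE @url nvarchar(2000);
    SET @url = N'http://example.com/some/long/path';

    SELECT DocumentId
    FROM   dbo.Document
    WHERE  UrlHash     = CAST(HASHBYTES('SHA1', @url) AS binary(20))
      AND  DocumentUrl = @url;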

  • roger.plowman (3/3/2010)


    From an implementation standpoint, using SQL Server, their rigorous approach will produce sub-optimal results. Page fragmentation being the most glaring, along with the performance hit of having to constantly shuffle a clustered index. The whole point of clustering is *speed*, and it's always going to be faster to append a record than insert it.

    I'm not sure what "rigorous" approach you are referring to but I said nothing at all about clustered indexes. I have been talking about candidate keys but I would hope you know that keys and clustered indexes are different and unrelated things. I don't believe I have said anything that will produce sub-optimal results. Quite the reverse.

  • I inherited a system where the database designer believed that identity columns should be used as the primary key because joins were faster that way.

    This particular database design was a poster child for a good idea totally mis-applied.

    The database was an operational data store for reporting. Lots of data. Lots of tables.

    A given table might have a hundred columns in it, eighty of which might be codes. That's what the source system was doing, and that's what was required by the customer in this database.

    So, each of those 80 codes in the table was turned into a numeric fk column pointing to the corresponding code table pk column. Each code table had an alphanumeric code column too.

    Of course, being a reporting system, the users didn't want to see 80 columns of meaningless id columns. They wanted to see the code values.

    So, 500,000 data rows in the data table. 80 joins per row. That's 500,000 * 80 = 40,000,000 joins when querying that data table, all to show rows with the 80 alpha codes in it that we started with as part of the initial data load.

    The users didn't need to see the code descriptions; they were long-time military or retired military enlisted staff who could recite the codes and their meanings in their sleep.

    Each of those 40,000,000 joins was individually faster than it would be with an alphanumeric code as the fk; but none of those 40,000,000 joins needed to be done at all with the alphanumeric fks!

    Had the original reporting table just used the 80 alpha codes as fk columns, the system would have been vastly simpler to code and would have run way faster (a rough before-and-after sketch follows this post).

    I know, because that's what I changed it to do.
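
    A stripped-down sketch of the difference described above (the table and column names are invented; the real system had roughly 80 such code columns per row):

    -- Surrogate-key version: every code has to be joined back to its lookup
    -- table just to display the value the source system supplied in the first place.
    SELECT f.ReportDate,
           pg.PayGradeCode,
           br.BranchCode                 -- ...plus ~78 more joins like these
    FROM   dbo.FactRecord AS f
           INNER JOIN dbo.PayGrade AS pg ON pg.PayGradeID = f.PayGradeID
           INNER JOIN dbo.Branch   AS br ON br.BranchID   = f.BranchID;

    -- Alpha-code version: the code itself is the FK column, so no join is
    -- needed at all unless a description or other attribute is actually wanted.
    SELECT f.ReportDate,
           f.PayGradeCode,
           f.BranchCode
    FROM   dbo.FactRecord AS f;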

    I agree with that... reporting tables are typically denormalized, especially when they're rebuilt on a regular basis.

    So far as the 80 joins go, did you need all 80 joins for every report?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (3/10/2010)


    I agree with that... reporting tables are typically denormalized especially when rebuilt on a regular basis.

    I don't think David was suggesting denormalization. He was describing the benefits of using a business key as foreign key instead of a surrogate. What you use for foreign keys has nothing to do with normalization / denormalization.

    I would expect that if the table's contents have to be rebuilt on a regular basis, then it is better not to denormalize, because denormalizing will normally make the data loads more expensive (you have to regenerate the redundant data each time).

    As usual, it doesn't matter what the guidelines and best practices are; what's required is an informed decision about where and when they are suitable and, more crucially, when applying best practice is unsuitable.

    A case of man over machine (or rather man over man's current ability to automate machines).

    Gaz

    -- Stop your grinnin' and drop your linen...they're everywhere!!!

    So, while debugging a report to find a needle in a haystack last week, I was reminded of this topic/thread. One of the LOB apps our PUD uses tracks utility poles and handles billing for the companies that attach their equipment to the poles (fiber, TV cable, etc.). Unfortunately, the app was written using all surrogate keys, and, especially troublesome, all of its attribute (lookup) tables also use surrogate keys (they all start with 'LU' below). So, to trace all the variable dependencies, every query had to join against all the surrogates for the attribute tables, such as...

    SELECT ...
    FROM (((((((((Attachment
        INNER JOIN AttachmentSpace ON Attachment.AttachmentSpaceID = AttachmentSpace.AttachmentSpaceID)
        INNER JOIN LUAttachmentType ON Attachment.LUAttachmentTypeID = LUAttachmentType.LUAttachmentTypeID)
        INNER JOIN Permit ON Attachment.PermitID = Permit.PermitID)
        INNER JOIN Pole ON AttachmentSpace.PoleID = Pole.PoleID)
        INNER JOIN LUAttachmentSpaceSupportEquipType ON AttachmentSpace.LUAttachmentSpaceSupportEquipTypeID = LUAttachmentSpaceSupportEquipType.LUAttachmentSpaceSupportEquipTypeID)
        INNER JOIN LUUnitOfMeasure ON AttachmentSpace.LUUnitOfMeasureID = LUUnitOfMeasure.LUUnitOfMeasureID)
        INNER JOIN ResourceLocation ON Pole.ResourceLocationID = ResourceLocation.ResourceLocationID)
        INNER JOIN LURegion ON ResourceLocation.LURegionID = LURegion.LURegionID)
        INNER JOIN LUSubRegion ON ResourceLocation.LUSubRegionID = LUSubRegion.LUSubRegionID)
        INNER JOIN Agreement ON Permit.AgreementID = Agreement.AgreementID
    WHERE ...

    ...which isn't terribly inconvenient to trace/join/expose for a single view or query, but we were looking for a needle in a haystack (let's just say that this app is short on constraints at the DBMS level, supposedly enforced at the UI level, but not so much :crazy:). For every query we built, we had to create otherwise redundant joins to get at the variable data.

    Because of the numerous ways we were going to have to slice and dice the data to sort through the issue, it was far more efficient to at least remove the surrogates from the attribute tables and refactor the entity tables, to keep us from going blind (a sketch of that recast follows at the end of this post). With a separate copy and a bit of quick work recasting the data and modifying the schema, the query above was reduced to...

    SELECT ...
    FROM ((((Attachment
        INNER JOIN AttachmentSpace ON Attachment.AttachmentSpaceID = AttachmentSpace.AttachmentSpaceID)
        INNER JOIN Permit ON Attachment.PermitID = Permit.PermitID)
        INNER JOIN Pole ON AttachmentSpace.PoleID = Pole.PoleID)
        INNER JOIN ResourceLocation ON Pole.ResourceLocationID = ResourceLocation.ResourceLocationID)
        INNER JOIN Agreement ON Permit.AgreementID = Agreement.AgreementID
    WHERE ...

    ...which only hints at the amount of time it saved and the simplicity it added... far more than just reducing it from 10 JOINS to 5.

    As to the argument for the possibility of 'changing attribute values'... it's moot. Cascade updates on attribute tables such as these have a depth of one, hence are not recursive, nor overly taxing on the system when they are modified, which is rare.

    Simplicity is king.

    AAR, JM2c... 😎

    Dr. Les Cardwell, DCS-DSS
    Enterprise Data Architect
    Central Lincoln PUD
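
    To round this out, a hedged sketch of the kind of recast described above, using the AttachmentType lookup as the example (AttachmentTypeCode is a hypothetical column name; the other names come from the queries above):

    -- 1. Make the natural code on the lookup table a key in its own right.
    ALTER TABLE dbo.LUAttachmentType
        ADD CONSTRAINT UQ_LUAttachmentType_Code UNIQUE (AttachmentTypeCode);

    -- 2. Add the code to the referencing table and back-fill it from the lookup.
    ALTER TABLE dbo.Attachment ADD AttachmentTypeCode varchar(20) NULL;

    UPDATE a
    SET    a.AttachmentTypeCode = lu.AttachmentTypeCode
    FROM   dbo.Attachment AS a
           INNER JOIN dbo.LUAttachmentType AS lu
               ON lu.LUAttachmentTypeID = a.LUAttachmentTypeID;

    ALTER TABLE dbo.Attachment ALTER COLUMN AttachmentTypeCode varchar(20) NOT NULL;

    -- 3. Reference the natural key directly; the cascade depth is one, as noted above.
    ALTER TABLE dbo.Attachment
        ADD CONSTRAINT FK_Attachment_LUAttachmentType
            FOREIGN KEY (AttachmentTypeCode)
            REFERENCES dbo.LUAttachmentType (AttachmentTypeCode)
            ON UPDATE CASCADE;

    -- 4. The surrogate column (and every join that existed only to decode it)
    --    can then be dropped.
    -- ALTER TABLE dbo.Attachment DROP COLUMN LUAttachmentTypeID;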
