Generate UserName from Existing table

  • ScottPletcher - Thursday, March 8, 2018 10:17 AM

    True enough. However, one nitpick. Nothing wrong with using identities as a clustered and primary key. Mind you, my experience is designing "small to medium" OLTP systems (read, around 10 million records per table) so I can't speak to designing the true behemoths (1+ billion records), which I'm sure have far more severe performance constraints.

    Roger.Plowman

    The clustered index is the single most important overall performance factor for that table.  Thus, it is wrong to just slap an identity on (nearly) every table and automatically make that the clustering key.  Major mistake.  And it's often done far too early in the design process, corrupting that process.  Candidate keys are never determined and evaluated -- why bother, when you've already determined the "key"?  But you can't verify that data within a table directly relates to "the key, the whole key and nothing but the key" if the key is just some meaningless number!

    Now, there is nothing wrong with having a "sequential number" in the logical design as the identifying attribute of an entity.  For example, a customer number (since it's easy to see that customer name nor email address is a unique or permanent key).  But it is not an "identity", because that is an implementation method, not a logical design feature.  The seq value could come from a sequence, or a pre-filled table of values, or any other number of ways.  That need not be determined at the logical phase, only that a seq num will be the identifier for that entity.

    Since you agree with "sequence number" as an identifier I'm not sure why you object to an identity as that sequence number. The standard 4 byte integer has all sorts of useful traits as a primary key (and a clustered one). First, it's (physically) short with a large domain, second appends always happen at the end of the table, third (at least in most of the apps I deal with) transactions tend to be sequential anyway (in other words, header/detail transactions tend to one-time creation with little alteration to line counts afterward) so using an identity tends to reduce fragmentation too. Finally, an identity provides a simple mechanism to create the sequence number.

    Concerning your point about candidate keys, I am always careful to identify alternate keys and make sure they have unique constraints where appropriate. This is useful for finding individual records for searches. For any other kind of linking (e.g. reporting) a short primary key and the type of transactions typical in applications I'm involved with (physically contiguous or near contiguous based on sequential serial numbers) makes performance extremely good.

    So, what's not to like?

  • roger.plowman - Thursday, March 8, 2018 11:05 AM

    ScottPletcher - Thursday, March 8, 2018 10:17 AM

    True enough. However, one nitpick. Nothing wrong with using identities as a clustered and primary key. Mind you, my experience is designing "small to medium" OLTP systems (read, around 10 million records per table) so I can't speak to designing the true behemoths (1+ billion records), which I'm sure have far more severe performance constraints.

    Roger.Plowman

    The clustered index is the single most important overall performance factor for that table.  Thus, it is wrong to just slap an identity on (nearly) every table and automatically make that the clustering key.  Major mistake.  And it's often done far too early in the design process, corrupting that process.  Candidate keys are never determined and evaluated -- why bother, when you've already determined the "key"?  But you can't verify that data within a table directly relates to "the key, the whole key and nothing but the key" if the key is just some meaningless number!

    Now, there is nothing wrong with having a "sequential number" in the logical design as the identifying attribute of an entity.  For example, a customer number (since it's easy to see that customer name nor email address is a unique or permanent key).  But it is not an "identity", because that is an implementation method, not a logical design feature.  The seq value could come from a sequence, or a pre-filled table of values, or any other number of ways.  That need not be determined at the logical phase, only that a seq num will be the identifier for that entity.

    Since you agree with "sequence number" as an identifier I'm not sure why you object to an identity as that sequence number. The standard 4 byte integer has all sorts of useful traits as a primary key (and a clustered one). First, it's (physically) short with a large domain, second appends always happen at the end of the table, third (at least in most of the apps I deal with) transactions tend to be sequential anyway (in other words, header/detail transactions tend to one-time creation with little alteration to line counts afterward) so using an identity tends to reduce fragmentation too. Finally, an identity provides a simple mechanism to create the sequence number.

    Concerning your point about candidate keys, I am always careful to identify alternate keys and make sure they have unique constraints where appropriate. This is useful for finding individual records for searches. For any other kind of linking (e.g. reporting) a short primary key and the type of transactions typical in applications I'm involved with (physically contiguous or near contiguous based on sequential serial numbers) makes performance extremely good.

    So, what's not to like?

    I didn't state my objection clearly enough.  I'm not objecting to identity when it's applicable to that specific table, I strongly object to it automatically being applied to (almost) every table.  

    I also just wanted to point out that in the logical design phase, it should not be referred to as identity because that's dependent on the physical implementation, and is thus outside the scope of the logical design.

    As to clustering on it, that's most often a mistake.  Literally.  I've found that more than 50% of the tables clustered on identity should be clustered on something else for best overall performance.  Even when identity does reduce fragmentation somewhat (often it doesn't, and never will if the issues causing splits don't have to do with the sequence of INSERTs).

    Also important is that neither SQL nor you can rely on an identity being time sequential.  I know, you think it is, but anyone can insert any identity value at any time in the table.  SQL has to allow for that.  

    Probably the most common identity-clustering mistakes are log tables and child tables.  Almost always log tables are searched first by datetime.  Thus, they should be clustered first on datetime, period.  If you really feel the need to add an identity to the end of the key, do that, but don't cluster the table on identity.  It's useless for a datetime search.

    Along those same lines, an Order_Items should be clustered first on order_id and only then on identity (to keep it in sequence, while also making it unique).  Inevitably you join from Orders to Order_Items on order_id, but far too often I see the O_I table clustered on item's identity column, which is useless.  Then immediately a nonclus index on order_id is built ... such a waste!  And inevitably more and more INCLUDEd columns get added to that index, and other indexes spring up headed by order_id. 

    When you get multiple nonclus indexes with the same lead column and different second keys / included columns, it's a great signal that you should review the clus index on that table and consider changing it.

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • ScottPletcher - Thursday, March 8, 2018 11:36 AM

    roger.plowman - Thursday, March 8, 2018 11:05 AM

    ScottPletcher - Thursday, March 8, 2018 10:17 AM

    True enough. However, one nitpick. Nothing wrong with using identities as a clustered and primary key. Mind you, my experience is designing "small to medium" OLTP systems (read, around 10 million records per table) so I can't speak to designing the true behemoths (1+ billion records), which I'm sure have far more severe performance constraints.

    Roger.Plowman

    The clustered index is the single most important overall performance factor for that table.  Thus, it is wrong to just slap an identity on (nearly) every table and automatically make that the clustering key.  Major mistake.  And it's often done far too early in the design process, corrupting that process.  Candidate keys are never determined and evaluated -- why bother, when you've already determined the "key"?  But you can't verify that data within a table directly relates to "the key, the whole key and nothing but the key" if the key is just some meaningless number!

    Now, there is nothing wrong with having a "sequential number" in the logical design as the identifying attribute of an entity.  For example, a customer number (since it's easy to see that customer name nor email address is a unique or permanent key).  But it is not an "identity", because that is an implementation method, not a logical design feature.  The seq value could come from a sequence, or a pre-filled table of values, or any other number of ways.  That need not be determined at the logical phase, only that a seq num will be the identifier for that entity.

    Since you agree with "sequence number" as an identifier I'm not sure why you object to an identity as that sequence number. The standard 4 byte integer has all sorts of useful traits as a primary key (and a clustered one). First, it's (physically) short with a large domain, second appends always happen at the end of the table, third (at least in most of the apps I deal with) transactions tend to be sequential anyway (in other words, header/detail transactions tend to one-time creation with little alteration to line counts afterward) so using an identity tends to reduce fragmentation too. Finally, an identity provides a simple mechanism to create the sequence number.

    Concerning your point about candidate keys, I am always careful to identify alternate keys and make sure they have unique constraints where appropriate. This is useful for finding individual records for searches. For any other kind of linking (e.g. reporting) a short primary key and the type of transactions typical in applications I'm involved with (physically contiguous or near contiguous based on sequential serial numbers) makes performance extremely good.

    So, what's not to like?

    I didn't state my objection clearly enough.  I'm not objecting to identity when it's applicable to that specific table, I strongly object to it automatically being applied to (almost) every table.  

    I also just wanted to point out that in the logical design phase, it should not be referred to as identity because that's dependent on the physical implementation, and is thus outside the scope of the logical design.

    As to clustering on it, that's most often a mistake.  Literally.  I've found that more than 50% of the tables clustered on identity should be clustered on something else for best overall performance.  Even when identity does reduce fragmentation somewhat (often it doesn't, and never will if the issues causing splits don't have to do with the sequence of INSERTs).

    Also important is that neither SQL nor you can rely on an identity being time sequential.  I know, you think it is, but anyone can insert any identity value at any time in the table.  SQL has to allow for that.  

    Probably the most common identity-clustering mistakes are log tables and child tables.  Almost always log tables are searched first by datetime.  Thus, they should be clustered first on datetime, period.  If you really feel the need to add an identity to the end of the key, do that, but don't cluster the table on identity.  It's useless for a datetime search.

    Along those same lines, an Order_Items should be clustered first on order_id and only then on identity (to keep it in sequence, while also making it unique).  Inevitably you join from Orders to Order_Items on order_id, but far too often I see the O_I table clustered on item's identity column, which is useless.  Then immediately a nonclus index on order_id is built ... such a waste!  And inevitably more and more INCLUDEd columns get added to that index, and other indexes spring up headed by order_id. 

    When you get multiple nonclus indexes with the same lead column and different second keys / included columns, it's a great signal that you should review the clus index on that table and consider changing it.

    Concerning your objection to identities not being time clustered, well, SET IDENTITY_INSERT is best left to emergency administration. Enforce that by restricting developers to executing stored procedures only (forbidding them direct access to tables) and that issue is pretty much handled.

    As for log files, well, it's entirely possible events might be simultaneous, meaning you can't make the timestamp unique. Also consider having an identity (assuming discipline using SET IDENTITY_INSERT) you can use the identity and the timestamp to authenticate that log entries ARE sequential. Nothing like having an anomalous timestamp in a sequence to make the auditor smell a dead rat. 🙂

    Personally, I tend to be generous when creating indexes. At the scale of my applications having an header/detail item index as well as having a detail line item identity doesn't tax inserts/appends, and having both is useful in different situations.

    Identities are extremely useful for audit tables, for instance. And again, given the nature of our company's  applications, detail items are almost always created sequentially in a single session. This reduces the cost of grouping by header records, since the header records are also created sequentially, at the same time as the line items.

    Finally, I tend to skip the creation of a purely logical model. In our company I find the physical model is nearly identical to the logical model anyway, especially when using a modeling tool like ModelRight or similar. Thus when I speak of a sequential number I am speaking of identities, because in our case, the only sequence number we use is an identity.

    Sequential numbers (no matter what the implementation) are wonderful for creating "true names", an identifier that will never change over the lifetime of the application's data. Every single natural key (name, part number, SSN, etc.) cannot be a "true name" because it is possible they can change over time. A RID will never change (in our company's vocabulary RID is simply short for Record ID, basically the identity field used as the primary key, not RID as SQL Server defines the term).

    Very handy, true names... 🙂

Viewing 3 posts - 31 through 32 (of 32 total)

You must be logged in to reply to this topic. Login to reply