Identity column as Primary Key - good or bad?

Question

Identity column as Primary Key - good or bad?

SSCertifiable

Points: 5969

November 14, 2019 at 1:29 pm

I'm building a web app to maintain some data, which is mostly flat but will benefit from having some static look-up tables. Typically these look-up tables with have at most a dozen rows and are unlikely to change over time.

Consider the two different DDL scripts. The first creates a look-up table CustomerType with an IDENTITY primary key, and the Customer table then has a Foreign Key of the CustomerType's Identity column:

/****** Object:  Table [dbo].[Customer]    Script Date: 14/11/2019 13:19:53 ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE TABLE [dbo].[Customer](
[Id] [int] NOT NULL,
[CustomerName] [nvarchar](50) NOT NULL,
[CustomerTypeId] [int] NOT NULL,
 CONSTRAINT [PK_Customer] PRIMARY KEY CLUSTERED 
(
[Id] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO
/****** Object:  Table [dbo].[CustomerType]    Script Date: 14/11/2019 13:19:54 ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE TABLE [dbo].[CustomerType](
[Id] [int] NOT NULL,
[CustomerType] [nvarchar](50) NOT NULL,
 CONSTRAINT [PK_CustomerType] PRIMARY KEY CLUSTERED 
(
[Id] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO
ALTER TABLE [dbo].[Customer]  WITH CHECK ADD  CONSTRAINT [FK_Customer_CustomerType] FOREIGN KEY([CustomerTypeId])
REFERENCES [dbo].[CustomerType] ([Id])
GO
ALTER TABLE [dbo].[Customer] CHECK CONSTRAINT [FK_Customer_CustomerType]
GO

The second maps the Customer to the CustomerType table on the CustomerType column:

SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE TABLE [dbo].[Customer](
[Id] [int] NOT NULL,
[CustomerName] [nvarchar](50) NOT NULL,
[CustomerType] [nvarchar](50) NOT NULL,
 CONSTRAINT [PK_Customer] PRIMARY KEY CLUSTERED 
(
[Id] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO
/****** Object:  Table [dbo].[CustomerType]    Script Date: 14/11/2019 13:24:40 ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE TABLE [dbo].[CustomerType](
[CustomerType] [nvarchar](50) NOT NULL,
 CONSTRAINT [PK_CustomerType] PRIMARY KEY CLUSTERED 
(
[CustomerType] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO
ALTER TABLE [dbo].[Customer]  WITH CHECK ADD  CONSTRAINT [FK_Customer_CustomerType] FOREIGN KEY([CustomerType])
REFERENCES [dbo].[CustomerType] ([CustomerType])
GO
ALTER TABLE [dbo].[Customer] CHECK CONSTRAINT [FK_Customer_CustomerType]
GO

I wondered if anyone had any strong feelings one way or another about these approaches? The CustomerType table will not be exposed to any Create, Update or Delete operations so integrity should be maintained.

At my last place (a month ago) many devs (we are all full-stack .NET devs) chose not to enforce data integrity in the database, but in the code that surfaced the data. I'm old school and prefer to enforce relational integrity, but I'm wondering whether a relatively meaningless Id IDENTITY PK is nowadays deprecated.

Many thanks

Edward

Viewing 15 posts - 1 through 15 (of 36 total)

You must be logged in to reply to this topic. Login to reply

Grant Fritchey SSC Guru Points: 399418 More actions · Answer 1

Oh boy.

Are there strong opinions on natural keys versus artificial keys?

Yes.

You may get some very strongly worded responses to this question, depending on who sees it and how they respond. If any of it comes off as offensive, first, I apologize for them. For whatever reason, this topic makes people crazy. Second, try to let it slide off your back because, again, this topic makes people crazy.

Let me start by saying this is my opinion, backed up by 30 years of hands-on experience in IT as a developer and DBA. However, there can, and probably will be, honest disagreement.

Go with the first option. What you've got there are artificial keys. In my opinion these are safer mechanisms for designing your databases. They perform better than natural keys. They're easier to maintain than natural keys. They will make dealing with things like the GDPR and the CPPA easier (not an issue here I think, but worth mentionning). Yes, if you're using an artificial key, you should also have a unique constraint on the natural key, CustomerType in the example, which means a little added storage and processing overhead as you have two indexes instead of one, but that's worth all the other benefits. It means that when you change the CustomerType values, and yeah, I know, they'll never, ever change, pinkie swear, until they do, you can change the text without then having to update all the associated records in the table. It's a win all the way.

Further, please, oh please, do enforce foreign key constraints in your database. If for no other reason, it's about belts and suspenders. Good engineering practices dictate backups and safeties. Rather than rely on a single point of failure, the always flawless code written by the development team (and no, not picking on them, just pointing it out, they may screw up), you have a second set of constraints, the foreign keys. However, enforced foreign keys helps ensure clean data which is vital down the road when it gets consumed for reports and analysis. Also, in SQL Server, enforced referential integrity is actually a performance enhancement. The optimizer can take advantage of those constraints in the choices it makes.

Putting on my asbestos undies for the coming flames.

This reply was modified 6 years, 8 months ago by Grant Fritchey. Reason: typo

"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt

Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning

edwardwill SSCertifiable Points: 5969 More actions · Answer 2

Thank you very much Grant. A very useful and comprehensive answer that chimes with my much more limited experience (only 25 years a dev!) and also with my innate caution and feeling that normalisation, foreign key constraints etc. were created and persisted for very good reasons which don't take any notice of the fads of developers (most of the devs I worked with were infatuated with javascript frameworks, most of which I regard as the work of the devil!).

I don't have to make a decision for a few days so I'll see if anyone else answers, but I'm certainly leaning toward the enforced constraint, artificial key pattern.

Grant Fritchey SSC Guru Points: 399418 More actions · Answer 3

I'm weird as a DBA. I'm all in on new stuff. I love Entity Framework and other ORM tools. I'm good with semi-structured and unstructured data management systems. I love the cloud. I especially love the platform as a service offerings. Containers are awesome and rearranging the way I work. Embrace and enhance is the way to go.

However, none of that means that any of the old stuff just gets tossed aside, simply because it's old. Primary and foreign keys and relational storage do exist for a reason. Further, if we're using a relational storage engine, instead of one of the others, then we should use it as if it were, you know, a relational storage engine. Trying to ignore that or force it to behave in other ways, then being surprised when the relational storage engine doesn't do non-relational stuff well just seems a tad ignorant and short-sighted. If we don't need relational storage, let's not use it. If we do need it, let's use it correctly.

Yeah, I know. Weirdo.

"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt

Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning

x SSC-Insane Points: 23660 More actions · Answer 4

I have a slight counterpoint. Obviously the surrogate key will work, and also it looks like your "customer type" is a particularily wide column that will have performance issues in the foreign key tables.

Additionally I'm not at all against the integer surrogate key but I understand the objections, and they're really very subjective but I don't think subjective arguments are all that bad.

What I've found useful now, is when a primary key has the particular attribute of representing "groups", or "statuses" or otherwise being a bit more than an instance of an object, I go for mnemonic codes, which integers are not one. A customer "type" will classify customers in a one to many relationship, ie., many customers will share the same type, so I spend a few bytes to make the customer type "code" mnemonic in nature, ie., I can eyeball the row and have a better guess at what it means. Its surely not as bad performancewise as the 50 char customer type (which I would now call a type "description"), but not meaningless like an integer.

But then again, I'm really a fan of code tables so I like to spend a few bytes in making the codes themselves be a bit more clear than an anonymous integer and again like I said, I understand the terrifically subjective nature of the debate, and therefore I am not totally against working with the anonymous integer sort of thing, even with records at the top of the hierarchy, like a "type table." Integers work well with INSTANCES of objects, like customer rows, but for entire classes that will "contain" or reference larger groups of records, I like the mnemonic type codes.

But yeah Grants right, we here at SSC have historically not been all that great at discussing this stuff, and while I'd rather use mnemonic codes that classify many instances, I do work with the other design and its not as oppressive or defective as some make it out to be.

Jonathan AC Roberts SSCoach Points: 18405 More actions · Answer 5

I think you mean "surrogate" key, as there is no identity column on your table. I would use a surrogate identity column as the PK so create the table like this:

CREATE TABLE [dbo].[CustomerType]
(
    [Id] [int] IDENTITY(1,1) NOT NULL,
    [CustomerType] [nvarchar](50) NOT NULL,
    CONSTRAINT [PK_CustomerType] PRIMARY KEY CLUSTERED ([Id] ASC)
);

It would also be helpful to create a unique index or constraint on the CustomerType column.

I would also make Customer(Id) an identity column.

Grant Fritchey SSC Guru Points: 399418 More actions · Answer 6

Just noticed this.

Minor nit.

I'm all about clarity. The more the better. So, with that in mind, naming two different columns in two different tables, ID, especially when we're going to start associating these tables across to each other through foreign key constraints, doesn't naturally lend itself to clarity. I'd go with something like (and I'm not dogmatic here) CustomerTypeID instead.

Again, a nit.

"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt

Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning

Grant Fritchey SSC Guru Points: 399418 More actions · Answer 7

X,

You're not wrong. It's not the choices I would make, but the way you've laid it out, they're not bad choices.Others may not agree, but I'd say that others can be frequently disagreeable.

The only pushback I'd have is that when it comes time to obfuscate or mask the data, whether through logical deletes in support of a GDPR right to be forgotten request, or while prepping the production data for use in a non-production environment, mnemonic data might, maybe, require masking in and of itself, which will hurt.

Wow, that's a sentence. Where are my editors? Ha!

"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt

Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning

edwardwill SSCertifiable Points: 5969 More actions · Answer 8

Thanks Jonathan AC Roberts. In fact there was originally a key such as you suggest but it disappeared in a cut 'n' paste job. The tables were created purely for illustrative purposes and have since been dropped.

Grant - I absolutely agree about using the generic 'Id' as the name of the Primary Key (as opposed to CustomerId, CustomerTypeId); however in my last place where I was dealing almost entirely with legacy code, where the Primary Key was by convention named Id for no good reason that I could fathom, I went with the flow, feeling that there were other, more crucial hills to die on.

As I'm in my new position, and dealing with greenfield projects, I will start enforcing my own naming conventions!

x SSC-Insane Points: 23660 More actions · Answer 9

Jonathan AC Roberts wrote:

I think you mean "surrogate" key, as there is no identity column on your table. I would use a surrogate identity column as the PK so create the table like this:
CREATE TABLE [dbo].[CustomerType]
(
    [Id] [int] IDENTITY(1,1) NOT NULL,
    [CustomerType] [nvarchar](50) NOT NULL,
    CONSTRAINT [PK_CustomerType] PRIMARY KEY CLUSTERED ([Id] ASC)
);
It would also be helpful to create a unique index or constraint on the CustomerType column.
I would also make Customer(Id) an identity column.

This I would offer a bit of cautionary advice, having a type as an identity column makes it even more arbitrary. I've run across this on a system I work on, there are two organizations that have a type table, with the type as integer identity, and if you aren't careful you can have one value mean different things depending on which server you are looking at, ie., the integer value is entirely dependent on the order its entered so if you then want to report both organizations at once, you need a translation that depends on which organization is sourcing the data, resulting in an additional complication. Whereas with mnemonic codes, its more apparent to what you are doing and serves as a reminder of sorts, ie., the code is again mnemonic in nature.

Heck, if you are sold on the space used by integers, you can oftentimes produce a decent mnemonic with 4 bytes which actually doesn't even cost anything.

Jonathan AC Roberts SSCoach Points: 18405 More actions · Answer 10

It can also make life easier for developers if all the primary keys are integers, it's a bit like having a design pattern that will make coding faster and more consistent. Also, plus 1 to Grant's renaming the column to CustomerTypeId. But if the rest of the system is written and has it as ID it is best to follow the project standards. There is sometimes nothing worse or confusing than someone having a good idea and partially implementing it in a system.

x SSC-Insane Points: 23660 More actions · Answer 11

Grant Fritchey wrote:

X,
You're not wrong. It's not the choices I would make, but the way you've laid it out, they're not bad choices.Others may not agree, but I'd say that others can be frequently disagreeable.
The only pushback I'd have is that when it comes time to obfuscate or mask the data, whether through logical deletes in support of a GDPR right to be forgotten request, or while prepping the production data for use in a non-production environment, mnemonic data might, maybe, require masking in and of itself, which will hurt.
Wow, that's a sentence. Where are my editors? Ha!

Certainly a valid objection, however oftentimes the application really does have to operate based on code tables, say a customer type is "CORP" ie., a perfectly performant 4 character mnemonic. However to your point, if we scramble the corporations name and address, we would still leak that the particular record is corporate in nature. On the other hand, this would also mean that in the OP's case of integers, we'd still have to scramble the 50 char "type" (or my preferred "type description") if we're obfuscating to that level right? And then we'd have to prevent viewing of source code that took active actions based on the value of the key it took, like if "1" meant "CORP", we'd have to prevent developers from discovering that 1 meant "CORP" but this contradicts with the desire to actually EDIT source code in development, so its hard to conclusively prevent this information leakage in any case. I do concede that if hackers got the database but not the source code then 1 would be an anonymous value unless the customer type table was not obfuscated also.

Complete 100 percent anonymization is tough even with obfuscations.

The Dixie Flatline SSC Guru Points: 53353 More actions · Answer 12

I"m going to weigh in on the side of artificial keys. I have wavered on this issue for a long time, but what ultimately convinced me was the annoying tendency of natural keys to change or become redundant. Client company names and abbreviations change with acquisitions.. Street names and addresses, product codes, license plate numbers, the list goes on and on. When they change, it instantly alters the results of any historical reports based on information stored with the old key. With identity integers, client #10006 remains client #10006, no matter what changes you make to the name or abbreviation columns. An impermanent key shouldn't be a key.

__________________________________________________

Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

Mr. Brian Gale One Orange Chip Points: 25026 More actions · Answer 13

I am chiming in just to say I like the artificial keys when it is being used for a foreign key. As The Dixie Flatline says, client #10006 will be client #10006 no matter what other data changes.

That being said, if the ID is being created just for the sake of having an ID on a table, I would recommend excluding it. No point storing useless data. If the ID is just there because "all of the other tables have an ID column", might as well exclude it. In your example, presuming that the customer table ID column is not a foreign key for a different table, I would exclude it. If the customer type and customer name can be duplicated in the table, then it may be good to keep the ID as a uniquifier.

On a different note, it might not hurt to set up some coding standards for your team. Work with the other DBA's and come up with a good set of rules if none exist. If they do exist, come at them as the "new DBA" and question them and get their team to defend their decisions. Our SQL coding standards for the longest time had "don't use CHAR or NCHAR, always use VARCHAR or NVARCHAR" and the reasoning for it was "for performance reasons". At the time, I was a VERY new DBA so I didn't question it. Years later as I grew as a DBA, I questioned all of the coding standards and most of them were that way because the previous DBA's wrote the document and nobody questioned it. So I rewrote them and got the team to agree to them and everyone is happier overall... not 100% following them, but it takes time to break habits.

The above is all just my opinion on what you should do.
As with all advice you find on a random internet forum - you shouldn't blindly follow it. Always test on a test server to see if there is negative side effects before making changes to live!
I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.

The Dixie Flatline SSC Guru Points: 53353 More actions · Answer 14

bmg0002:

Just to clarify, I am not advocating a unique int (or bigint) column simply as a row identifier. It would become the primary key which all foreign keys would refer to.

__________________________________________________

Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills