SQLServerCentral Article

Lookup Table Madness


Lookup Table Madness

Over the years I have had a number of "discussions" about the practice of creating generalized "lookup" tables for storing multiple code types in a single table.I won't keep you in suspense about my view of this practice; the idea stems from a failure to learn and appreciate the fundamentals of good database design. Accordingly, I developed an appropriate name for these tables: Massively Unified Code-Key tables, or MUCK tables.

Let's take a look at the reasons why some are tempted to use this approach, but more importantly I hope to explain why you shouldn't. Many "lookup" tables tend to look something like these examples:

CREATE TABLE EmployeeStatusCode (
  EmployeeStatusCode int IDENTITY(1,1)
  , Description varchar(50))
CREATE TABLE CustomerStatusCode (
  CustomerStatusCode int IDENTITY(1,1)
  , Description varchar(50))
CREATE TABLE FacilityStatusCode (
  FacilityStatusCode int IDENTITY(1,1)
  , Description varchar(50))

*For the record, I don't like the term "Lookup" tables, nor any of its cousins (Code Table, Domain Table, Reference Table, etc...) the term is far too ambiguous to be useful. Exactly how do you define them? Is it just any table that is referenced by a foreign key? Obviously not. How about any table that is referenced by a FK and doesn't reference any other table? Nope... Even though I use the term in this article, I want to point out that these tables are fundamentally no different than any other tables in your database and should therefore receive no special treatment.*

** While I'm clarifying things..., I am not advocating the across-the-board use of IDENTITY columns (or of surrogate keys in general), that subject would require much more detail than I am prepared to go into in this article. I use the example only because it is so common.**

Since the Employee, Customer, and Facility status codes look the same, it is argued that when viewed from a "certain perspective" they are of the same type and should therefore be in the same table. This seems sensible enough until you realize that sharing a set of common attributes is only one of the criterion that should be used to determine such things. Unfortunately that "certain perspective" in which your code tables can appropriately be combined also makes them ineffective for use in their original purpose; ensuring data integrity. Even a cursory examination should tell us that EmployeeStatusCode is not the SAME thing as CustomerStatusCode no matter how SIMILAR they are in appearance.We would never want a Customer status to be "Fired", or an Employee status to be "Vacant" regardless of how appropriate that description might be for some ;-)There are several alternatives available to prevent this from happening, but as I will show, they each have major problems.

Basically, during the process of normalization (and no, normalization is not a dirty word) these codes are removed from the Employee, Customer, and Facility tables and appropriately placed in their own tables. Then, someone notices the similarity among the code tables and decides that the database would be simpler if they were combined into a single table. Normalization eliminates redundant data, thus making the task of enforcing data integrity vastly simpler, but the process of creating a MUCK is something else entirely.MUCK's do not eliminate redundant data, rather they are an elimination of what are PERCEIVED to be redundant tables, but as I will demonstrate, fewer tables does not equal simplicity.

You may be saying: "This guy doesn't know what he's talking about! MUCK's do help eliminate redundant data in those cases where an EmployeeStatusCode shares the same description as a CustomerStatusCode and so on." Nice try, but this just further serves to illustrate why you shouldn't treat these codes as the same type of thing. It shouldn't take too much contemplation to realize that this logic quickly leads us right back into the ugly world of update anomalies and other non-relational ugliness. I.e. What if the business decision is made to change that particular description but only for Employees? Can it be done? Sure, but what a pain.

As mentioned above, once you have a MUCK, the problem that presents itself is: just how do you constrain your data to ensure that codes of type X are only applied to X? The first option that many people think of is triggers. Add a CodeType column to the MUCK and write your triggers appropriately. Simple!,...but before you settle on this approach, read on. If you don't want to use triggers to enforce integrity you could try including a CodeType column in the Employee table with a column constraint that prevents any status codes that are not of the employee type.

Let's look at an example:

  CodeID int IDENTITY(1,1)
  , CodeType int
  , Description varchar(255))

Obviously, we would have a CodeType table...but then again, why not just shove that into our Code table too? After all, it looks the same as all our other codes! There's room for everyone aboard this recursive roller-coaster ride!

Our Employee table would look something like this:

 EmployeeID int IDENTITY(1,1)
 , FirstName varchar(50)
 , LastName varchar(50)
 , CodeType int
 , CodeID int
 , etc... )
 REFERENCES Code (CodeID, CodeType)

It should work, right? Yes, but you have to be willing to ignore the elephant in the living room, our employee table is not properly normalized because CodeID is partially dependent on CodeType which is not part of the key. (and no, you can't just make CodeType part of the key because FirstName, LastName etc... would not be dependent on the whole key).Then think of what a mess this becomes if an employee can have more than one Code/CodeType combination. Further, what if one of those multiple Code/CodeType pairs is dependent on another one? Before you go running back to the "simplicity" of the trigger option you should realize that it has exactly the same problem, but it's hidden and likely hard-coded in the logic of the trigger. In fact it could be argued that trigger option is worse because CodeID has a functional dependency on a column that isn't even there!The picture just gets uglier if you consider the performance overhead of triggers vs. DRI.

The reasoning that leads to MUCK tables, if taken to its logical conclusion would have every entity in the database reduced or "generalized" down to a "Thing." The Thing table would have a key (it would naturally be an IDENTITY column) and a few other generalized attributes like Name, Type, and Description etc...:

  PKey bigint IDENTITY(1,1) --We better use a bigint since there will be a lot of rows in this baby...
  , ThingType int
  , Attribute1 varchar(8000) 
     -- We can't provide any domain integrity so we have to go with what I call the "Garbage Heap" approach.
  , Attribute1Type int
  , Attribute2 varchar(8000)
  , Attribute2Type int
 , etc...)

Congratulations, we have now worked our way right back to managing data in a spreadsheet, albeit one with a handy query language. Oh, wait...better still would be to just use an IDENTITY column and an XML string, that way we don't need to define more than a single table with two columns in the database, and we just won't bother with the whole data integrity thing; "isn't that stuff best handled by the application anyway?" Now that's what I call a flexible database design! (OK, I apologize for the sarcasm, sort of...but those who were thinking "Yeah!" might want to cut your losses and stop reading now.) And, yes I have personally dealt with people who thought that those options were not only viable, but superior to a normalized database.

Some might say that these examples are far-fetched and that no reasonable person would go that far. But that is exactly the point, how do you know when you have gone too far? When you abandon the fundamental principles upon which good database design is predicated, what principles are you using to guide your efforts? The difference between a MUCK and the "Thing" table is one of degree, not of kind, and they are both wrong. Now, before you say "there is no 'right' way to design a database", it is true that given a set of fairly complex requirements, two competent individuals might arrive at superficially different database designs, but rest assured; the variety of incorrect designs that the ignorant will pull out of their hats is virtually limitless!

Generalization is a good thing, in the right context, but it is a tool and should not be used indiscriminately. Generalization is something that can be very useful in the Object Oriented (OO) programming world where you are concerned with what things do, or how they act. If two things act the same way they can be brought together under a single class. Even if they differ significantly, they might be able to inherit their common characteristics from a parent class, relying on subclasses to provide their unique behaviors. In the world of relational database design we really don't care so much about how things act, we care about what they are. In that context, generalization is very seldom useful. In the world of OO programming generalization helps improve code reuse, modularity, and maintainability. By contrast, generalization in a database leads to ambiguity, or loss of meaning. I suspect that this practice got its start with those OO programmers who mistakenly think that a table is analogous to a Class when in reality a table is a variable. Astute readers will recognize this mistake as what C.J. Date called "The First Great Blunder" but that is yet another subject for another day...

Be careful that you don't generalize your database into irrelevance, building in so much flexibility that your system degenerates into chaos (chaos being the ultimate in flexibility). Remember that the primary reason to use a database is not to "persist data"; that can be done more efficiently with file-based systems. The purpose of a relational database is to enforce the rules that govern how data is to be created, maintained and used; in other words the database enforces the rules that give the data meaning.Without those rules your data becomes a meaningless tangle of 1's and 0's. In database design, the first and foremost consideration should always be logical correctness; all other concerns are secondary.After all, just what is a database? Hugh Darwen provided the most useful definition of a database that I have come across:

"A database is a set of axioms. The response to a query is a theorem. The process of deriving the theorem from the axioms is a proof. The proof is made by manipulating symbols according to agreed mathematical rules. The proof (that is, the query result) is as sound and consistent as the rule are."

In other words you need only concern yourself with the logical correctness of your database design if you want correct answers to your database queries. If you just want answers you don't have to bother with all this "theoretical" stuff.

By Don Peterson Wednesday, July 28, 2004


4.48 (27)

You rated this post out of 5. Change rating




4.48 (27)

You rated this post out of 5. Change rating