Lookup Table Madness

  • My personal reflection on this article:

    It is my opinion that the author may, or may not, be right. It is difficult to ascertain due to the pompous arrogance exhibited in the author's writing style. This piece reads like a diatribe -- a political, biased preach session -- when it should read as simple, factual, and direct. In future writings, please exercise brevity; cut the personal reflections and focus on the technology. Also, please be careful to adequately represent competing ideas on issues like this one that fall into debate. Other posters have noted that, in some easily identified cases, your "MUCK tables" can be ideal. The mark of an authority is accuracy. Strive for it.

    [A pompous, arrogant writing style was used in this reply on purpose.]

  • Shakespeare seems to have stated it best,

    "The [gentleman] doth protest too much, methinks."

    --Hamlet (III, ii, 239)

    ====== 

    I have to wonder whether the vehemence of the numerous personal attacks on the author has more to do with the writers feeling the sting of criticism over a practice that has little to recommend it than with the actual style of the original piece.

     

  • I agree, in general, with the author’s premise, as long as the database is accessible directly. When developers are able to access the database directly (i.e. not through a robust object model), the database structure itself becomes the true foundation of the business model, and thus breaking up “lookup” tables into appropriately named tables is the correct design.

     

    However, if you *require* that *all* access to the database goes through an object model, then the structure of the database is wholly secondary to the object model. From the perspective of the developers, all that matters is the API. As far as they know, the state behind the object model might be text files, binary files, a normalized database, a de-normalized database, or all of the above. In this situation, the database cannot exist without the object model to make sense of it. Thus, the term “lookup” only has meaning to the object model, and therefore one-“lookup”-table-to-rule-them-all-and-in-the-darkness-bind-them might very well be the best way to manage that state information. In this context, the database is primarily being used as an efficient organizer and searcher of large amounts of data and only secondarily as an enforcer of business rules. A file system *could* be used, but a database system could be used as well.
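    The architecture this post describes, where callers see only an API and the backing store is interchangeable, can be sketched as follows. This is a hypothetical illustration, not the poster's design; the class and table names (`LookupStore`, `lookup`, etc.) are invented for the example.

```python
import sqlite3
from abc import ABC, abstractmethod

# Developers code against this API and never learn whether the state
# behind it is a file, a dict, or a relational database.
class LookupStore(ABC):
    @abstractmethod
    def get(self, category: str, code: str) -> str: ...

class FileLookupStore(LookupStore):
    """Backing store: a flat structure (a preloaded dict, for brevity)."""
    def __init__(self, data: dict):
        self._data = data
    def get(self, category, code):
        return self._data[category][code]

class DbLookupStore(LookupStore):
    """Backing store: a relational table."""
    def __init__(self, conn):
        self._conn = conn
    def get(self, category, code):
        row = self._conn.execute(
            "SELECT decode FROM lookup WHERE category = ? AND code = ?",
            (category, code)).fetchone()
        return row[0]

# Both stores satisfy the same API, so the caller cannot tell them apart:
file_store = FileLookupStore({"doc_type": {"RPT": "Report"}})
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lookup (category TEXT, code TEXT, decode TEXT)")
conn.execute("INSERT INTO lookup VALUES ('doc_type', 'RPT', 'Report')")
db_store = DbLookupStore(conn)
```

    This is the sense in which, behind such an abstraction, the physical layout of the lookup data becomes an implementation detail of the service rather than a public contract.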

     

    So, while not a Hoyle use of a database, it is nevertheless a valid one. Here’s a question for you: what are modern database systems (SQL Server, Oracle, etc.) best at doing: allowing users to search, store, and update large amounts of data, or maintaining “correctness” of data through constructs such as check constraints? I would argue the former, as there are numerous complicated designs that database rules are wholly inadequate at handling, which is the reason we have object-oriented middle tiers in the first place.

     

    Much as the world is not black-and-white, the manner in which database systems are used should not be confined to a narrow set of precepts developed 30 years ago when people wrote in COBOL. We should strive for normalization but should be open to deviating from those rules when it helps create solutions.

  • Congratulations, Don. Even though your writing style has taken readers a bit of time to get used to, you have created a hurricane among them. I won't be surprised if this article gets the most replies!!

    On the topic, I have a usage:

    We have a document management system. Users can upload documents and classify them however they want. Classification can be based on some of the attributes of the documents (like document type, owner, etc.), for which we have different lookup tables.

    We have an additional piece of functionality: users can create their own classifications and add values to each classification. We have a theoretical maximum of 99 classifications, but I am sure we will never reach it.

    We have accomplished this with "MUCK" tables:

    1. I have a table called Document, which has all the standard foreign keys (OwnerID, DocTypeID, etc.).

    2. I have another table called ClassificationType, which has ClassificationTypeID and ClassificationTypeName.

    3. The Classification table has ClassificationTypeID (referencing ClassificationType), ClassificationID (int), and ClassificationValue varchar(255).

    4. The final table is DocumentClassification, which keeps the many-to-many relationship (DocumentID, ClassificationID).

    To avoid "MOCK" tables either I have to go ahead with 99 Classification tables (out of which many will not be used) or I have to go with dynamic table creation, ad-hoc queries and dynamic table modifications (which will be handled during run time).

    Dear Don, Tell me, what should I do?

     

    Cheers,
    Prithiviraj Kulasingham

    http://preethiviraj.blogspot.com/

  • I mostly object to the use of "good"/"bad"/"correct"/"incorrect" in this instance without a lot of qualifiers (i.e., what are you trying to optimize for?). It can be "better" to combine thousands of lookup tables into one E-A-V table, with no loss of model consistency. That said, I can never see a situation where it is good to combine 2 separate, independent things into one entity.

    As far as object model/db model goes, I can't see a case where they would be different. They may look different (i.e., hierarchical vs. relational), but there should be an exact mapping between the two.

  • I really couldn't care less if you call me pompous and arrogant, just don't call me wrong!

    HELLO, this article IS a biased "preach session."  I'm glad you at least grasped that point.  Having been exposed to the problems caused by MUCK tables I wanted to do my part to discourage their use, and having suffered long exposure to the willfully ignorant who insist that MUCK tables are "ideal", my tone was deliberately intended to shock some who are used to the "I'm OK, You're OK" approach to data modeling articles, and to leave no doubt that MUCK tables are a poor choice with serious consequences.  Generally, the use of MUCK tables is a sure sign that the cart (application) has been placed ahead of the horse (data). 

    I suspect that part of your problem is that I did exactly what you say I didn’t do.  I presented the main arguments that are used to justify MUCK tables, and proceeded to show them to be the meaningless drivel that they are. 

    Of course, I didn't expect to make much headway with the brain-dead "databases are just a persistence mechanism" crowd, but my tone in that case was deliberately calculated to offend them.  Judging from some of the responses, it was a spectacular success in that regard.

    /*****************

    If most people are not willing to see the difficulty, this is mainly because, consciously or unconsciously, they assume that it will be they who will settle these questions for the others, and because they are convinced of their own capacity to do this. -Friedrich August von Hayek

    *****************/

  • You contradict yourself: How can it be OK to "combine thousands of lookup tables into one E-A-V table" but then in the next sentence you say "I can never see a situation where it is good to combine 2 separate, independent things into one entity."  Those two statements are totally incompatible.

    There is a huge difference between a hierarchical model and a relational model, and there cannot exist an exact mapping between the two. 

    A hierarchical model cannot possibly represent the "has a" relationships that are possible in the relational model without a large amount of redundancy.  On the other hand, the relational model is able to accurately represent any hierarchies that actually do exist.


  • Sorry I was not more clear. I can't see where it is ever beneficial to combine, say, employee type and employee code if they are truly different things. The fact that they might have the same code-decode pairs should not lead you to use employee type for employee code (or vice versa).

    How you store these different entities depends on your app. In a high-volume, simply structured transactional app, it would make sense to store these separate code lists as separate tables named, respectively, EMPLOYEE_TYPE and EMPLOYEE_CODE. In a clinical trial system, with thousands of code lists and low transactional volumes, it makes sense to store them in a generic lookup table as:

    'EMPLOYEE_TYPE','CODE1','DECODE1'

    'EMPLOYEE_TYPE','CODE2','DECODE2'

    'EMPLOYEE_TYPE','CODE3','DECODE3'

    'EMPLOYEE_CODE','CODE1','DECODE1'

    'EMPLOYEE_CODE','CODE2','DECODE2'

    'EMPLOYEE_CODE','CODE3','DECODE3'

    That was my point. Separate entities are separate entities. How you store them depends on the app.
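    The generic lookup table sketched in the rows above can be modeled like this, again using SQLite for illustration. The table and column names (CodeList, ListName, Code, Decode) are invented for the example; the post gives only the row values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE CodeList (
    ListName VARCHAR(30)  NOT NULL,   -- e.g. 'EMPLOYEE_TYPE'
    Code     VARCHAR(10)  NOT NULL,
    Decode   VARCHAR(255) NOT NULL,
    PRIMARY KEY (ListName, Code)      -- codes are unique within a list
)
""")
rows = [
    ("EMPLOYEE_TYPE", "CODE1", "DECODE1"),
    ("EMPLOYEE_TYPE", "CODE2", "DECODE2"),
    ("EMPLOYEE_CODE", "CODE1", "DECODE1"),
    ("EMPLOYEE_CODE", "CODE2", "DECODE2"),
]
conn.executemany("INSERT INTO CodeList VALUES (?, ?, ?)", rows)

# Each logical code list is retrieved by filtering on ListName:
emp_types = conn.execute(
    "SELECT Code, Decode FROM CodeList "
    "WHERE ListName = 'EMPLOYEE_TYPE' ORDER BY Code"
).fetchall()
```

    Note the trade-off the thread is debating: with a single table like this, a plain foreign key can no longer guarantee that an employee-type column references only EMPLOYEE_TYPE rows; that check moves into the application, or into a composite foreign key on (ListName, Code) paired with a CHECK constraint pinning ListName.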

  • "How you store them depends on the app" sums up the problem quite nicely.  Applications come and go, data persists.  If you design your database too much around specific application functionality, you are doing your employer, your co-workers, and yourself a huge disservice.

    Data Independence is at least partly defined as the idea that database design is independent of any application that might make use of the data.  This is the heart of the problem of MUCKs. 

    Most of the problems that I have seen with system interoperability come right back to the lack of data independence.  When the database becomes a black box that can only be deciphered by the application, then making use of that data in another application for another purpose becomes a nightmare.  Upgrades to the application become a pain that will frequently involve at least a partial, if not wholesale reworking of the database.


  • And who has the time to design, code and test 300 different User Interfaces under the Setup portion of the Application?  We're talking about values that seldom if ever get changed here folks.

    I've done it both ways.  I like the theoretical idea of each table being separate, but in the real world (at least our real world), the DBA and the person writing the program that will end up using the data are likely both me.  I don't want an endless series of CASE statements in one form (in the EXE) to determine what table to access to get the 3 or 7 values for this particular "Keyed List table".

    When someone makes up an acronym like "MUCK" to talk about a "Keyed List" or "Lookup Table", you know you're never going to convince them anyway.

     


    Student of SQL and Golf, Master of Neither

    > "How you store them depends on the app" sums up

    > the problem quite nicely.  Applications come and go, data persists.

    See, I feel that this statement is too simplistic. If by "application" you mean simply the front-end GUIs, then I'd agree. However, if we are talking about a robust middle tier, then that may not be the case. If developers are exposed to nothing but an API, then how the data is stored should be irrelevant to them. Further, it may very well be the case, because of high-level abstractions, that it is the middle tier that gives the data meaning. For example, let's look at a common scenario: data encryption. If encryption is done at the middle tier, then validation and interpretation of the data *cannot* be done in the database (e.g. via check constraints) and *must* be done in a middle-tier component. To say that "the data must be independent" is to ignore how the data is used. Especially in a service-oriented structure, it is the *services* (or at least sets of services) that should strive for independence, not the data per se. How each service maintains and stores its data should be up to that service.
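    The encryption scenario can be sketched as follows. Everything here is hypothetical illustration: `store_ssn` and its nine-digit rule are invented, and base64 stands in for real encryption purely to keep the example self-contained.

```python
import base64

def encrypt(plaintext: str) -> str:
    # Stand-in for real encryption; base64 is NOT encryption, it just
    # makes the stored value opaque for demonstration purposes.
    return base64.b64encode(plaintext.encode()).decode()

def store_ssn(ssn: str) -> str:
    # Validation the database can no longer perform once the middle tier
    # encrypts: a CHECK constraint never sees the plaintext.
    if len(ssn) != 9 or not ssn.isdigit():
        raise ValueError("SSN must be nine digits")
    return encrypt(ssn)  # only this opaque value ever reaches the database

ciphertext = store_ssn("123456789")
```

    The point being illustrated: once the column holds only ciphertext, the format rule exists solely in the middle-tier component, which is the post's argument for why meaning can live outside the database.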

    Most of the problems I have seen with system interoperability come from generally bad conceptual designs, not from a lack of data independence per se. In fact, I have seen problems when too much independence is given, such that the interpretation of a given abstraction differed between two different systems even though, according to the DBA, it was the same.

    The database is just another service, another cog in the wheel of the greater set of services within an enterprise. If people are connecting to the database directly, then certainly it needs a high degree of independence. However, if people are only connecting through services which, from their perspective, may or may not have databases behind them, then the database is subservient to the public service in function and design.

  • "Data Independence is at least partly defined as the idea that database design is independent of any application that might make use of the data. This is the heart of the problem of MUCKs."

    I guess that this is where I respectfully disagree. For my example, how in the world could you ever make clinical trial data independent of the clinical trial domain (as opposed to a transactional application)? I think that "data independence" is a chimera, similar to the "one programming language to rule them all" chimera (see Java, Ada, any others?). Data is not data is not data. I see it crumbling whenever you hit the real world of infinite complexity.

    Do you have any idea of the people-time it would take to administer literally thousands of 3-column, 3-, 4-, 5-, or 10-row tables? In the clinical trial world, if a query takes 5 or 10 minutes to run, who cares? It's not a credit card approval system.

    It depends on the information you are trying to store.  If the clinical trial data has any meaning at all, it has to be defined, and if it is defined data, then it must be modeled if it is to have any useful persistence.

    If all you have is a generic property-bag collection, then store it there, in unstructured format, in whatever flat-file format or technology you choose. But if you choose a "Relational" DBMS, the only advantage of this particular platform is the relational part of that technology.

    There are a myriad of other data storage mechanisms out there, many much faster, but none equivalent to what an RDBMS can give you, and it can ONLY give you that if you use the foundation upon which it was developed.

    If you throw that away, out of convenience or neglect (laziness?), then you have an overstuffed, overcomplicated system for no useful gain.  If that is your tack, then choose a more appropriate tool.

    Sincerely,

    Anthony Thomas

     

  • "Further, it may very well be the case because of high-level abstrations, that it is the middle-tier that gives the data meaning."

    It is true that the middle tier contains the Business Rules, but, as is so often the case, many incorrectly interpret this to mean ALL the rules.

    The database contains the relationship rules, the integrity rules, the data constraints.  The sole purpose of a Relational System is to DEFINE the data relations and the data values those relations should be CONSTRAINED to.  Those definitions should be modeled after "real world" items that exist, with certain known and knowable characteristics.  These are DESCRIPTIVE, DECLARATIVE statements of fact.

    The middle tier is to the application process what a DBMS is to data storage: an attempt to abstract the process out of the front-end API, to relieve the API of the calculated process characteristics.  DO NOT CONFUSE THIS WITH DATA INDEPENDENCE.  It is similar, and was actually modeled on the success of the DBMS's ability to abstract the data it was built from.

    But this DOES NOT mean they are the same thing.  The DBMS couldn't care less whether, when an attribute is multiplied by two, another must be decremented; it only cares whether or not they are both numbers, each constrained within valid, modeled, acceptable ranges.

    The database is but a description of the world.  A particular process, which MUST SHARE this data, is concerned with how that data is manipulated to construct new data.  It is the path between.

    In fact, when many attempt to embed complicated PROCESS rules within the DBMS, most of it must be constructed in procedural-type languages, as code within stored procedures or triggers, which were NEVER a part of the Relational Model.

    Do not confuse the two.  By equating them, your arguments are as muddled as the MUCK tables this discussion is attempting to refute.

    Sincerely,

    Anthony Thomas

     

  •  

    > The database contains the relationship rules, the integrity

    > rules, the data constraints. The sole purpose of a

    > Relational System is to DEFINE the data relations and the

    > data values those relations should be CONSTRAINED to.

     

    The database may *also* contain rules regarding relationships, integrity rules and data constraints. However, those relationships are merely one aspect of the system as a whole. It is very possible that there are relationships, integrity rules and data constraints that the database simply cannot handle. I am not suggesting that using a database for things like MUCK tables is part of the Hoyle relational system. I am saying that there are times when the relational system impedes the ability to solve problems, and in those circumstances, under careful and deliberate analysis, it is fine to deviate from it.

     

    > Those

    > definitions should be modeled after "real world" items that

    > exist, with certain known and knowable characteristics.

    > These are DISCRIPTIVE, DECLARATIVE statements of fact.

     

    I can think of many elements I have modeled which have no "real world" item outside the context of the system or situations where a given item is very loosely defined. Take the example of a customer. What is a customer? It might be a person. It might be a business. It might be another division or department in the same company. Its "real world" definition is entirely dependent on the modeler and the system in which it plays a part. Without the context of that system, it has no meaning. The problem is even worse if there are numerous types of customers which have the same data elements but different requirements for different systems. System A requires customers to have a first name, last name and address. System B, using the same database as System A, only requires a first name. System C, also using the same database as A and B only requires a last name. The relational model as it applies to maintenance of this type of data begins to break down somewhat.
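    The three-system customer example above can be made concrete. This is an illustrative middle-tier sketch, not anything from the post: the system names, field names, and the rule-table approach are all invented to show why a single set of NOT NULL constraints cannot serve all three systems at once.

```python
# Each system sharing the customer table imposes its own required fields,
# so the "requiredness" rule cannot live in the shared schema.
REQUIRED_FIELDS = {
    "SystemA": {"first_name", "last_name", "address"},
    "SystemB": {"first_name"},
    "SystemC": {"last_name"},
}

def validate_customer(system: str, customer: dict) -> list:
    """Return the required fields this record is missing for the given system."""
    present = {k for k, v in customer.items() if v not in (None, "")}
    return sorted(REQUIRED_FIELDS[system] - present)

# The same record is valid for System B but invalid for Systems A and C:
record = {"first_name": "Ada", "last_name": "", "address": None}
```

    A shared database could only enforce the intersection of the three rule sets (here, nothing), which is the sense in which the relational model "begins to break down somewhat" for this kind of per-consumer validation.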

     

    RE: Data independence…

     

    Perhaps the problem is that we differ in our meaning of “data independence.” To me, “data” is wholly dependent on the context in which it is used and thus dependent on the system to which it belongs and may be something that is entirely intangible to the “real world”.

     

    > But, this DOES NOT mean they are the same thing. The DBMS

    > could care less if when an attribute is multiplied by two,

    > another must be decremented, it only cares whether or not

    > the are both numbers, each constrained within valid,

    > modeled, acceptable ranges.

     

    Now we are splitting hairs as to what it means for the DBMS to “care.” For example, you mentioned that “acceptable” values are something the DBMS might care about. Yet this is a loaded word. Suppose that, for a given value, “acceptable” involves numerous other data elements in the system, and the meaning of “acceptable” changes instance-by-instance in a given table based on complicated relationships. DBMSs are terrible at handling these types of complicated rules, as they can only handle them through stored procedure code or really complicated triggers. Furthermore, the meaning of “acceptable” might change from system to system using the same DBMS. While DBMSs are designed to preserve some degree of data integrity, they are far from a complete solution. We could quibble all night about this; “data integrity” means different things to different developers.

     

    > The database is but a description of the world. A

    > particular process, which MUST SHARE this data, is

    > concerned with how that data is manipulated to construct

    > new data. It is the path between.

     

    I’m not sure what you mean by “the world”. A database stores information applicable to a particular system. Without knowledge of the system, the data is useless. Suppose you see the number 5551234567. Is that a phone number? A security identifier? A storage box number? A transaction number? That number *only* has meaning in the context of the particular system that generated it. Likewise, in an enterprise system, numerous abstractions may exist which have no tangible aspect and only have meaning within the given enterprise.

     

    > Do not confuse the two. By equating them, your arguments

    > are as muddled as the MUCK tables this discussion is

    > attempting to refute.

     

    My argument is actually quite simple: use the appropriate technique and available tools to solve the problem at hand. If the database is entirely subservient to a model or service which itself contains abstractions, then the developer has more leeway in deviating from the Hoyle rules of the relational model. If other systems are going to be allowed to interact with the database directly, then interoperability is clearly more of a concern and thus you might be best served by sticking closer to the relational model.

Viewing 15 posts - 136 through 150 (of 202 total)
