Database Design » Relational Theory » RE: Is having a lookup table that is for multiple entities a common practice?

  • jcelko212 32090 - Wednesday, February 15, 2017 8:34 AM

    Doctor Who 2 - Wednesday, February 8, 2017 9:57 PM

    Let me cut & paste a chapter from my book, THINKING IN SETS 😀

    04.04. OTLT or MUCK Table Problems

    I think that Paul Keister was the first person to coin the phrase "OTLT" (One True Look-up Table) for a common SQL programming technique that is popular with Newbies. Don Peterson (www.SQLServerCentral.com) gave the same technique the name "Massively Unified Code-Key" or MUCK tables in one of his articles.

    The technique crops up time and time again, but I will give Paul Keister credit as the first writer to give it a name. Simply put, the idea is to have one table to do all of the code look-ups in the schema. It usually looks like this:

    CREATE TABLE Lookups
    (code_type CHAR(10) NOT NULL,
    code_value VARCHAR(255) NOT NULL, -- notice size!
    code_description VARCHAR(255) NOT NULL, -- notice size!
    PRIMARY KEY (code_value, code_type));

    So if we have Dewey Decimal Classification (library codes), ICD (International Classification of Diseases), and two-letter ISO-3166 Country Codes in the schema, we have them all in one honking big table.

    Let us start with the problems in the DDL and then look at the awful queries you have to write (or hide in VIEWs). So we need to go back to the original DDL and add a CHECK() constraint on the "code_type" column. Otherwise, we might "invent" a new encoding system by typographical error.

    Notice that we are already in trouble because a data element cannot be both a "<something>_code" and a "<something>_type"; it must be one or the other and it must be the code or type of some specific attribute. One of the nice features of the ISO-11179 rules is that they prevent this mixing of data and meta-data in a way that can be checked mechanically. Ignore this comment and continue heading for the edge of the cliff.

    The Dewey Decimal and ICD codes are digits and have the same format -- three digits, a decimal point and more digits (usually three); the ISO-3166 Country Codes are alphabetic. Oops, we now need another CHECK constraint that will look at the "code_type" and make sure that the string is in the right format. Now the table looks something like this, if anyone attempted to do it right, which is not usually the case:

    CREATE TABLE OTLT
    (code_type CHAR(10) NOT NULL
       CHECK (code_type IN ('DDC', 'ICD', 'ISO3166', ..)),
    code_value VARCHAR(255) NOT NULL,
      CHECK
      (CASE
       WHEN code_type = 'DDC'
        AND code_value
            SIMILAR TO '[0-9][0-9][0-9].[0-9][0-9][0-9]'
       THEN 1
       WHEN code_type = 'ICD'
        AND code_value
            SIMILAR TO '[0-9][0-9][0-9].[0-9][0-9][0-9]'
       THEN 1
       WHEN code_type = 'ISO3166'
        AND code_value SIMILAR TO '[A-Z][A-Z]'
       THEN 1 ELSE 0 END = 1),
    code_description VARCHAR(255) NOT NULL,
    PRIMARY KEY (code_value, code_type));

    The "SIMILAR TO" predicate is the SQL-92 version of a regular expression parser based on the POSIX Standards, if you are not familiar with it. Since the typical application database can have dozens and dozens of codes in it, just keep extending this pattern for as long as required. Not very pretty is it? In fact, there is a good chance that you might exceed the number of WHEN clauses allowed in a CASE expression in a major corporation. That is why most OTLT programmers do not bother with this absolutely vital constraint.

    Now let us consider adding new rows to the OTLT.

    INSERT INTO OTLT (code_type, code_value, code_description)
    VALUES
    ('ICD', '259.0', 'Inadequate Genitalia after Puberty');

    and also

    INSERT INTO OTLT (code_type, code_value, code_description)
    VALUES
    ('DDC', '259.0', 'Christian Pastoral Practices & Religious Orders');

    If you make an error in the "code_type" during insert, update or delete, you have screwed up a totally unrelated value. If you make an error in the "code_type" during a query, the results could be interesting.

    This kind of error can be really hard to find when one of the similarly structured schemes has unused codes in it.
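    The hazard is easy to reproduce. In this sketch using Python's sqlite3 (constraints omitted for brevity), both rows share the code_value '259.0', and a query that names the wrong code_type completes cheerfully, with no error and the wrong description:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE OTLT
    (code_type TEXT NOT NULL,
     code_value TEXT NOT NULL,
     code_description TEXT NOT NULL,
     PRIMARY KEY (code_value, code_type))""")
con.executemany(
    "INSERT INTO OTLT VALUES (?, ?, ?)",
    [('ICD', '259.0', 'Inadequate Genitalia after Puberty'),
     ('DDC', '259.0', 'Christian Pastoral Practices & Religious Orders')])

# You meant the medical diagnosis, but typed the library classification:
wrong = con.execute("""SELECT code_description FROM OTLT
                       WHERE code_type = 'DDC'  -- should have been 'ICD'
                         AND code_value = '259.0'""").fetchone()[0]
print(wrong)  # prints the Dewey description, not the diagnosis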

    The next thing you notice about this table is that the columns are pretty wide VARCHAR(n) or, even worse, NVARCHAR(n), which can store Unicode text in any language. The value of (n) is most often the largest one allowed in that particular SQL product.

    Since you have no idea what is going to be shoved into the table, there is no way to predict and design with a safe, reasonable maximum size. The size constraint has to be put into the WHEN clause of that second CHECK() constraint between "code_type" and "code_value". Or you can live with fixed length codes that are longer (or fatally shorter) than what they should be.

    These large sizes tend to invite bad data. You give someone a VARCHAR(n) column, and you eventually get a string with a lot of white space and a small odd character sitting at the end of it. You give someone an NVARCHAR(255) column and eventually it will get a Buddhist sutra in Chinese Unicode.

    Now let’s consider the problems with actually using the OTLT in a query. It is always necessary to add the "code_type" as well as the value which you are trying to look-up.

    SELECT P1.ssn, P1.lastname, .., L1.code_description
    FROM OTLT AS L1, Personnel AS P1
    WHERE L1.code_type = 'ICD'
     AND L1.code_value = P1.disease_code
     AND ..;

    In this sample query, you need to know the code_type of the Personnel table's disease_code column, and of every other encoded column in the table. If you get a code_type wrong, the query still runs and still returns a result; it is just the wrong one.

    You also need to allow for some overhead for data type conversions. It might be more natural to use numeric values instead of VARCHAR(n) for some encodings to ensure a proper sorting order. Padding a string of digits with leading zeros adds overhead and can be risky if programmers do not agree on how many zeros to use.
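    The sorting-order point is easy to demonstrate outside SQL as well. A small Python sketch, with made-up code values:

```python
# Digit strings sort lexically, not numerically.
codes = ['9.1', '10.2', '100.0']
print(sorted(codes))             # ['10.2', '100.0', '9.1'] -- lexical
print(sorted(codes, key=float))  # ['9.1', '10.2', '100.0'] -- numeric

# Zero padding to a fixed width restores a sensible lexical order,
# but only if every programmer pads to the same width.
padded = [c.zfill(6) for c in codes]
print(sorted(padded))            # ['0009.1', '0010.2', '0100.0']
```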

    When you execute a query, the SQL engine has to pull in the entire look-up table, even if it only uses a few codes. If one code is at the start of the physical storage and another is at the end, the engine can wind up doing a lot of caching and paging. When I update the OTLT table, I have to lock out everyone until I am finished. It is like having to carry an encyclopedia set with you when all you needed was a magazine article.

    Now consider the overhead with a two-part FOREIGN KEY in a table:

    CREATE TABLE EmployeeAbsences
    (..
    code_type CHAR(3) -- min length needed
     DEFAULT 'ICD' NOT NULL
     CHECK (code_type = 'ICD'),
    code_value CHAR(7) NOT NULL, -- min length needed
    FOREIGN KEY (code_type, code_value)
     REFERENCES OTLT (code_type, code_value),
    ..);

    Now I have to convert the character types for more overhead. Even worse, ICD has a natural DEFAULT value (000.000 means "undiagnosed"), while Dewey Decimal does not. Older encoding schemes often used all 9’s for "miscellaneous" so they would sort to the end of the reports in COBOL programs. Just as there is no Magical Universal "id", there is no Magical Universal DEFAULT value. I just lost one of the most important features of SQL!
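    For contrast, here is a minimal sketch of the conventional alternative: one look-up table per encoding, each with its own length, its own format CHECK, and its own DEFAULT where one naturally exists. It uses Python's sqlite3 with GLOB standing in for SIMILAR TO; the table and column names are illustrative, not from any standard schema.

```python
import sqlite3

# One look-up table per encoding: the ICD table gets ICD's size, ICD's
# format check, and ICD's natural DEFAULT ('000.000' = undiagnosed).
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE ICD_Codes
(icd_code CHAR(7) NOT NULL PRIMARY KEY
   CHECK (icd_code GLOB '[0-9][0-9][0-9].[0-9]*'),
 icd_description VARCHAR(255) NOT NULL);

INSERT INTO ICD_Codes VALUES ('000.000', 'Undiagnosed');
INSERT INTO ICD_Codes VALUES ('259.0', 'Inadequate Genitalia after Puberty');

CREATE TABLE EmployeeAbsences
(emp_id INTEGER NOT NULL,
 icd_code CHAR(7) DEFAULT '000.000' NOT NULL
   REFERENCES ICD_Codes (icd_code));

-- The natural DEFAULT works: an absence with no diagnosis yet.
INSERT INTO EmployeeAbsences (emp_id) VALUES (1);
""")
row = con.execute("""SELECT A.emp_id, I.icd_description
                     FROM EmployeeAbsences AS A
                     JOIN ICD_Codes AS I USING (icd_code)""").fetchone()
print(row)  # (1, 'Undiagnosed')
```

    Each encoding now carries exactly the constraints and defaults that belong to it, and the FOREIGN KEY is a plain single-column reference.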

    I am going to venture a guess that this idea came from OO programmers who think of it as some kind of polymorphism done in SQL. They say to themselves that a table is a class, which it is not, and therefore it ought to have polymorphic behaviors, which it does not.

    Please post DDL and follow ANSI/ISO standards when asking for help.