Lookup Table Madness

  • May as well flog this horse skeleton into dust....

    "Good" & "Correct" in terms of data integrity within a relational database system are absolutely not in the eyes of the beholder. It's a Yoda moment when you get down to data integrity, do or do not.

    That's one of the main reasons that MUCK tables can, do, and will, fail so miserably. While they're "flexible" in terms of development, they will make the data in the system completely meaningless in very short order. That's experience from two stand points speaking. First, I build my own MUCK system about seven years ago. It was so cool when it went out the door. Then the maintenance headaches started. Second, I've inherited a MUCK system where I now work that is almost completely disabled. It's so flexible, it can't be touched for fear it will break the code. The data that was supposed to be infinitely flexible has become a maintenance nightmare. The problem is, when anyone can put anything they want into the system in any way that they term "right" you can't maintain data integrity. So again, it's do or do not, there is no MUCK.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Oh, I absolutely agree, the data model comes first. It is the heart of any system. The "application model" comes on top of that. No arguments there.

    The reason why I couldn't show you the data models of the clinical trials is the sheer size of the metadata. 250 * 35, say, distinct table structures--do you want to go through that?

    Within each individual trial, the design is generally fine. It comes when you want to create reports, etc. that cut across trial lines. Maybe I wasn't clear on that point. The maintenance nightmare, along with incredibly onerous and stupid validation requirements, really calls for those otherwise despised MUCK tables.

    I bring up the clinical trial environment because it is an outlier. It's like asking which way a clock turns--it depends on your perspective!

  • It sounds like you've got a pretty good handle on the operatioanl components, but the drudgery of the reporting is what you are finding the MUCK solution for.

    I'm not sure how long you retain this information (I suspect for some time), but what you are describing certainly sounds like a warehouse scenario.

    Do you do distributed queries against all of your operational databases, or do you go through some sort of ETL to a centralized, long-term repository?

    Even in this environment, I have found that the RM is a better methodology.  Part of the headache for a warehouse, however, is the "T" in ETL: trying to define the common data elements.

    Also, you never report off of your warehouse; you create Data Marts to tailor the analysis.

    Sincerely,

    Anthony Thomas

     

  • "I'm not sure how long you retain this information (I suspect for some time), but what you are describing certainly sounds like a warehouse scenario."

    You need to retain clinical data for 20 years, by law. 

    It was indeed for a clinical data warehouse, I believe the very first one ever attempted.

    "Do you do distributed queries against all of your operational databases, or do you go through some sort of ETL to a centralized, long-term repository?"

    Me and my buddy wrote a dynamic ETL tool to move the data to a centralized repository. We read the metadata repository and a metameta configuration table to create gigantic SELECT statements, which we saved as an Oracle view.

    The SELECT statement: the ultimate ETL tool.

     

  • Well, that could certainly spawn off into a whole other argument: using relational or dimensional warehouse structures.

    E-gad.

     

    Anthony Thomas

     

  • "That said, "good", "correct" is in the eye of the beholder. If you read back a ways in this thread, you'd read where yes, development time is much more important optimization than sheer speed in some application spaces. In the clinical trial world, if a routine takes 10 or 20 or 30 minutes, it usually doesn't matter a bit. However, if the developer wastes a day or two optimizing those 20 minutes away, that's not good."
     
    E-gad is right.  You know, the scary thing is that we're talking about software that could concievably end up KILLING PEOPLE, and it's more important that development happens quickly instead of accurately?  'Cause referential integrity is what we're really discussing, not performance.  No personal attacks meant here, but you're freakin' me out, man.
     

    "OK, lemme ask you. There are 250 clinical trials, each heterogensously structured, each with about 30-40 different table structures. The analysis software is run maybe 3 or 4 times in production. The analysis can easily be run in the background as the user does other things."

    Yeah, this really isn't enough information, but as far as I'm concerned a database can have thousands of tables or ten, but it's a pile of crap if referential integrity is not maintained.  If you do want to pursue this design with the community, maybe another thread is in order?

     
     
     
     

    Signature is NULL

  • Once, we had a round-table database design question with the Senior IT leadership, based on this exact topic, data integrity and RI versus the development effort surrounding a MUCK design—yes, the "discussion" escalated that high.

    In that meeting, the Enterprise Architect actually stated, "Speed of delivery is more important than accuracy."  In the same room, at the same table as the Directors of Data Center Operations and Information Services.

    At that point, the discussion ended—what was the point of continuing a "rational" discussion in the face of such obsured stupidity.

    Later, he recanted, of course.  But, this goes to show you just how far the "prevailing" wisdom travels.

    Sincerely,

    Anthony Thomas

     

  • "E-gad is right.  You know, the scary thing is that we're talking about software that could concievably end up KILLING PEOPLE, and it's more important that development happens quickly instead of accurately?  'Cause referential integrity is what we're really discussing, not performance.  No personal attacks meant here, but you're freakin' me out, man."

    Let me be clear about something. Within each clinical trial, referential integrity is extremely important. Most of the code written to support a clinical trial that is not statistical analysis is devoted to data validation. Lots of referential integrity is built into the databases created, then lots more batch applications on top of that. Lots and lots and lots of resources were spent on referential integrity within a single trial. I'm sorry if I conveyed another impression.

    The problem I was dealing with was when the users wanted to run reports over several different trials. Every researcher wanted to think outside the box and they all used different structures to hold the same exact data. Different table and column names, datatypes, and sizes, different code lists, etc. Bringing that all together to run cross trial, let alone cross compound reports, was a monumental task.

  • "Within each clinical trial, referential integrity is extremely important."

    OK, good; it wasn't clear from your previous posts. If you go back over the forum posts for this topic, you'll see the entire discussion was around referential integrity. The assumption was you were continuting this discussion.

    I guess that's incorrect, though; what you are talking about is Data Warehousing. It has it's own set of challenges that have nothing to do with referential integrity, and really don't belong in this discussion about a MUCK table.

    Standardizing dimensional data from multiple sources can produce structures SIMILAR to MUCK, but they are a whole 'nother ball game related to OLAP instead of OLTP. IE: you're OLAP schema may be completely different than your OLTP schema, as their function is completely different.

    But you probably already know this...

    cl

    Signature is NULL

  • I expect articles to be somewhat subjective and enjoyed the author's opinions. I prefer not to mix lookup codes strictly from the FK's perspective.

    The author mentioned a dislike of the term "lookup" and other similar terms. I am curious what the author's opinion and other reader's opinions are of the technically correct term, as well as examples of good "lookup" table names.

    For example, if I have a Cars table and the ColorCode column looks up to another table with the code and description, what would you name the table? I typically go with the much disparaged table qualifier, e.g. ddColor, where dd means "DataDictionary" so that all my lookup tables are grouped together.

    I always experience inner torment when coming up with good table names and was just wondering what others do ...

  • Allow me to say right up front, I don't like this standard. I inherited it. I fought against it. I lost.

    We name our simple lookup tables 'C_WhatEver' and the complicated ones 'R_Whatever.' 'C' stands for Code. 'R' stands for Rule. A simple lookup table, or code table, is defined as an ID, Code, Description set. No other columns such as relationships to other code tables, etc. Rules tables are just code tables that have all the other stuff in them, self-referential constraints, hierarchy relationships with other tables, etc.

    Stupid? Yes. A pain the butt? Yes. Not my choice? Absolutely.

    Hey we also have to use abbreviations from a set list including, my personal favoriate: Deductible = ddltbl. How do we shorten a word and a letter that isn't there or implied? I don't know, but that's how I'm expected to type deductible in table & column names.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • I have yet to hear a truly useful definition for Domain\Lookup\DataDictionary\etc... tables.  Why is it deemed necessary to treat them differently than any other tables?  And by extension, why is it deemed necessary to name them any differently?

    I would call that table PaintColor or maybe CarColor, or even just Color just depending on what best conveys the meaning of the data that is actually held there.

    /*****************

    If most people are not willing to see the difficulty, this is mainly because, consciously or unconsciously, they assume that it will be they who will settle these questions for the others, and because they are convinced of their own capacity to do this. -Friedrich August von Hayek

    *****************/

  • The only answer I have for you is the one that I'm dealing with at this company. They want the "code" tables to be the same across all applications and all environments. So we have 10 or 15 different applications, not counting the EWH & Datamarts, in at least four environments, Dev, QA, Staging & Production(not counting Performance Test or Training environments of the various others), and these tables, unlike all the other tables in the system, are the same everywhere. So any place you go, CoverageId = 52 gets you the same coverage. You may get a different value for Policy on any given system because the data was generated by test plans or by actual users or by developers, but those "code" tables are managed across the enterprise.

    I could still do that calling it Coverage rather than C_Coverage, but I was told I had to designate it that way. There was a fight. I lost.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • It makes perfect sense to have a single set of data available across the enterprise and things like Color, or EmployeeStatus etc... are very likely candidates for this, but in many cases, highly transactional data like PurchaseOrder would be managed on an enterprise wide manner too. 

    In Data Warehousing circles they like to call these tables Dimensions.  And again, I don't like the idea of naming tables like dim_Customer and f_Sales for dimension and fact tables respectively.

    It all depends on the business and the design of the system(s) involved.  But I haven't been able to define, nor have I ever heard anyone else define, a useful generalized definition for "lookup tables". 

    /*****************

    If most people are not willing to see the difficulty, this is mainly because, consciously or unconsciously, they assume that it will be they who will settle these questions for the others, and because they are convinced of their own capacity to do this. -Friedrich August von Hayek

    *****************/

  • I have experimented with distinguishing 'code'/lookup tables from other tables in two different ways: 1) adding text at the beginning of the table name and 2) adding text at the end. The system I liked best was to putting the designation at the end. I inherited a system where lookup tables were 'type' tables and had the text 'Typ' appended to the end. In the example above, it would be CarColorTyp. I've continued that system, though I might have used some other text like 'dd' if I were starting from scratch now.

    The nice thing about a suffix as opposed to prefix text is that related tables tend to be next to each other. For example, if I have tables like Car and CarColorTyp and CarEngineTyp, etc, all my 'car' tables would be next to each other in a list. I don't need all my type tables grouped together.

    But this begs the question of why make the distinction at all. I could just have Car, CarColor and CarEngine as table names. Here's why I add the suffix: I find the suffix text useful from a human perspective when trying to get the big picture of understanding a database. A distinguished table name very quickly helps me to identify what is going on with that table. I quickly get the function at a glance. It doesn't bother me in the least that there is a subjective/human element to deciding when a table is a 'type'/code/lookup table and when it is a 'main' table. Having the distinction is helpful to me.

    Since all my 'typ' tables (with additions when needed) have the exact same columns: code, name, abbreviation and sort order (ex for a PhoneTyp table: phnCd, phnNm, phnAbbr, phnSortOrd), one could easily argue that the 'Typ' suffix is simply not needed. I could look at the structure of the table to understand it. But again, from a human perspective, I get a much quicker understanding of the table by being able to deduce it's purpose from just it's name. I'm looking at a list of table names more often than I'm looking at a data model.

    I will say one nice thing about having 'Typ' as the suffix/text rather than something like 'dd' for data dictionary or 'lu' for lookup. The 'typ' text often seems to create meaningful names in an of itself. Take my Phone table example. I can have a Phone table and a PhoneTyp table. I instantly know two things about the PhoneTyp table without having to know anything at all about it's columns (I won't dare call them fields after reading some other comments in other articles ). I know that PhoneTyp is a table that gives me phone types. I don't have to name it say 'PhoneTypLU'. And I know that this table is nothing more than a simple list of phone types. When I need that simple list, I know right where to go. When I need to get to main data, I know I can skip that table.

    I think of another reason to have a suffix or prefix, but I can't think of a great specific example right now. Here's the idea anyway: if the lookup tables all have the same text, then it would be much easier to make mass changes to those tables using code. Here is an example of what I am trying to say. Suppose you have a WHOLE bunch of code/type tables configured as I explained above with the same basic 4 columns (with some additional columns when needed). Then your business rules change and you need to add a new column to all your code tables. For example, maybe you need to add a 'Stop Date' column. If all the tables have a unique beginning or ending text, then you can quickly write some code to loop through the tables and add the new column to just the tables you care about. It could save you a lot of work. Admittedly I don't know how often this type of need would arise. I haven't needed it yet--especially since I use ERwin to create my databases. So, changing something like field size for all the 'abbreviation' columns would be easy.

Viewing 15 posts - 181 through 195 (of 202 total)

You must be logged in to reply to this topic. Login to reply