Store Key - Storing Misc Data

  • Comments posted to this topic are about the content posted at http://www.sqlservercentral.com/columnists/nraghavendra/storekeystoringmiscdata.asp

  • I don't agree with the content of this article at all. It's a cute idea, and will probably reduce the size of your database schema -- at the expense of performance and data maintainability.

    The database will be forced to perform needless scans of the "store key" table every time any type of operation is done. The author hasn't discussed what happens when the table grows to millions or billions of rows. How will it be maintained? Will you delete items from it as they age? How will items in the global store key table be searched?

    What happens when you're storing redundant names, redundant addresses, redundant names of departments, redundant product codes, and twenty other redundant types in the table, and suddenly your new business requirement is to write a stored procedure to search for customers by address? Do you want your database scanning all of those extra rows every time? Or needlessly joining the entire customers table to the "store key" table first to reduce the text scanning workload a bit?

    And what happens when you have two very large entity types stored in the table and you need to boost performance of searching them both at the same time? Good luck partitioning the data onto different physical discs when you have no identifier of entity type on the table.

    The author discusses "normalization" but fails to realize that the idea of a global "store key table" violates the very important concept of separation of entity types within the database.
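    To make the scanning concern concrete, here is a minimal sketch. The article gives no DDL, so these table and column names are assumptions, not the author's actual schema:

        -- Hypothetical generic "store key" design (names are assumptions)
        CREATE TABLE StoreKey (
            KeyID int IDENTITY PRIMARY KEY,
            Value varchar(200) NOT NULL  -- names, addresses, product codes, all mixed together
        );

        CREATE TABLE Customer (
            CustomerID   int PRIMARY KEY,
            NameKeyID    int REFERENCES StoreKey (KeyID),
            AddressKeyID int REFERENCES StoreKey (KeyID)
        );

        -- "Search customers by address" must sift through a table that also holds
        -- every other kind of string in the system; nothing identifies the entity
        -- type, so no index or partition can be restricted to addresses alone.
        SELECT c.CustomerID
        FROM Customer c
        JOIN StoreKey sk ON sk.KeyID = c.AddressKeyID
        WHERE sk.Value LIKE '%4th Street%';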

    --
    Adam Machanic
    whoisactive

  • Hi Everyone,

    I have been working with this sort of database design for several years now. It is very powerful and can be very useful, but I ***STRONGLY*** recommend AGAINST this concept.

    The problem isn't inherent to the model itself. The problem is that programmers are human beings and timelines are limited. Writing queries and maintaining data is MUCH more expensive AND much more error prone because of the increased complexity.

    Every team has someone whose SQL skills aren't very good. This person is usually invaluable for their other skills. A team working with this sort of database design will:

    A) Quickly steal all of the SQL experts from all the other teams.

    B) Still make more SQL errors than other projects

    C) Lack all of the other skills that the team needs as getting the DB to work becomes the only priority.

    I know that I'm painting the doomsday scenario here, but let me assure you that this little bit of cleverness turns a profitable 1000-hour project into a 7000-hour project that almost kills the project team.

    JUST SAY NO

    Steven

  • I read the article thinking that I might find a technique that could make my database designs either easier, faster, or better in SOME way, but I don't believe this technique accomplishes any of this.

    I wrote an article a while back that had hints of this concept, but had a more structured base and had a definitive use. http://www.sqlservercentral.com/columnists/dsumlin/lookupstrategy.asp

    The main problem I see with this concept is that there is no control over which values exist for a particular domain. (I'm sure there's a complicated technical term for this, but I'll just explain.)

    The technique that he presents appears to be a good place to store lookup-type values, OR values that may be used in other places so as to eliminate redundancy. The problem with this is that there is NO way to determine the boundaries of that data.

    For example: Let's say you have a Customer table with a CustomerType field in it. The allowable values are (Wholesale, Retail, Distributor). You then have another table called Sales with a ReferredBy field. The allowable values are (Internet, Friend, Distributor, etc.).

    How do you (from an application standpoint) limit the values that can be put into a particular field, and how do you display lists of allowable values, for instance? (e.g. a drop-down list of Customer Types?) Do you have to do a DISTINCT/GROUP BY of the Customer table each and every time? And how would you inactivate a selection for one domain while allowing it to remain valid for another? (e.g. Distributor might no longer be a valid value for CustomerType, but still be valid for ReferredBy.)
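    To make the drop-down problem concrete, a minimal sketch, reusing the hypothetical StoreKey table from the first reply's sketch plus an assumed CustomerTypeKeyID column (none of this is from the article):

        -- With no type/domain column on the shared table, the only way to
        -- enumerate "allowable" CustomerType values is to scan the data
        -- that happens to use them:
        SELECT DISTINCT sk.Value
        FROM Customer c
        JOIN StoreKey sk ON sk.KeyID = c.CustomerTypeKeyID;

        -- A brand-new type with no customers yet can never appear in the list,
        -- and there is no way to retire 'Distributor' for CustomerType while
        -- keeping it valid for ReferredBy.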

    The only thing I see this concept accomplishing, besides confusing most developers, making the database more complicated, and making changes to data more difficult, is saving some data space. And the space saved isn't worth it, in my opinion.

    David

  • sumlin,

    Doesn't your technique still violate 1NF? You are representing different entity types in a single column. How is this any better or less complex than maintaining individual lookup tables? Again, your schema may look simplified, but your life down the road will only be more complex.

    --
    Adam Machanic
    whoisactive

  • amachanic,

    Yes, I realize that it's not 1NF. In fact, I mention that in the article.

    quote:

    Now, I know that there are probably a number of you who immediately see that this design breaks the 1st form of normalization. I contend that there are always exceptions to the rule based upon applicability of the situation.

    I may be storing multiple types of entities (e.g. CustomerType, OrderStatus, etc.) in the same table, but I'm logically separating them via the LookTypeID column. Therefore I CAN determine the domains of each value. And I'm not taking a single value and allowing it to have multiple domains, as nraghavendra suggests doing. (e.g. I can have Distributor in my Look table multiple times, but it'll be identified by a different LookGID, LookTypeFID, and LookConst value for each domain it may be in.)
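    For readers who haven't seen that article, a minimal sketch of the shape being described; the column names LookGID, LookTypeFID, and LookConst come from the post above, while the remaining definitions are my assumptions:

        CREATE TABLE LookType (
            LookTypeID int IDENTITY PRIMARY KEY,
            Name       varchar(50) NOT NULL  -- e.g. 'CustomerType', 'OrderStatus'
        );

        CREATE TABLE Look (
            LookGID     int IDENTITY PRIMARY KEY,
            LookTypeFID int NOT NULL REFERENCES LookType (LookTypeID),
            LookConst   varchar(30) NOT NULL,   -- stable code, e.g. 'DISTRIBUTOR'
            LookValue   varchar(100) NOT NULL,  -- display text, e.g. 'Distributor'
            CONSTRAINT UQ_Look_TypeConst UNIQUE (LookTypeFID, LookConst)
        );

        -- 'Distributor' may appear once per domain, each row with its own LookGID,
        -- so the values of any one domain are always recoverable:
        SELECT l.LookGID, l.LookValue
        FROM Look l
        JOIN LookType lt ON lt.LookTypeID = l.LookTypeFID
        WHERE lt.Name = 'CustomerType';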

    As far as being complicated down the road, I have used this technique in literally dozens of applications, ranging from e-commerce shopping carts to financial data warehouses and everything in between. This technique has saved me so much time & effort I can't even imagine building databases without it anymore. I'm not saying there aren't some limitations & caveats to it, but in my development experience, they pale next to the time I've saved.

    David

  • sumlin,

    I said it would cause problems down the road, perhaps not initially; unless you've developed these dozens of applications for one gigantic company, I can only assume you weren't around two years later when your bloated combined lookup tables started causing performance problems and confusion amongst developers due to a not-so-well-maintained data dictionary. But perhaps I should thank you for perpetuating this technique: I have been called in on more than one occasion to undo the damage it ultimately causes. Perhaps sometime soon I'll be earning some additional income thanks to your article!

    Regardless of all of this, I'd still like to know why you feel this technique saves you time. Is the modelling of small lookup tables really all that time-consuming?

    --
    Adam Machanic
    whoisactive

  • amachanic,

    1) This technique is still being used many years after its creation, and is being used by many other developers who may at first balk at the concept for purity's sake, but who realize that for clarity and ease of use it is a good idea. In fact, I contend the value of the technique gets better the longer the application is around, especially with applications that are continually changing.

    2) Why do you assume that there is no well-maintained data dictionary? If a business requirement calls for seeing a separate entity per domain, I'll just give them a view (e.g. v_CustomerType; see the sketch after this list).

    3) And what makes you think there are performance problems? I once had this debate in an architectural design meeting when building a table that had about 25 lookup fields. It was assumed that joining to the Look table 25 times was going to be slower than joining to 25 different tables. That assumption proved wrong. It turned out it was actually faster (not by much) to use the Look concept. That was a project almost 5 yrs ago, so I no longer have the performance data.

    4) You mention damage it eventually causes? I'm curious what kind of damage this technique causes that you have to be called in to repair.

    5) The time is not necessarily saved on table creation; that's obviously an easy task. The time isn't necessarily saved on the creation of INSERT/UPDATE/DELETE procedures for each of those tables, or on the creation of views for those tables, although the addition of these objects does make the list of objects in your database much longer. The time saved is on the application side. Instead of having to create a class for each table, or functions for each data manipulation process, you only do it once, no matter how many fields, domains, or values you add to the database. Yes, you could try to create a generic class that basically took an object name and dynamically determined which procs to call, but that becomes a very fragile piece of code. (I know; I've tried that way too.) A sketch of the single data-access path follows below.
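    Two minimal sketches under the assumed Look/LookType shape from earlier in the thread (the object names and definitions are my assumptions, not from the article): the per-domain view mentioned in 2), and the single data-access path described in 5).

        -- 2) A per-domain view for consumers who want one entity per domain:
        CREATE VIEW v_CustomerType
        AS
        SELECT l.LookGID, l.LookConst, l.LookValue
        FROM Look l
        JOIN LookType lt ON lt.LookTypeID = l.LookTypeFID
        WHERE lt.Name = 'CustomerType';
        GO

        -- 5) One proc (and one application class) serves every domain;
        -- separate lookup tables would need one such proc per table:
        CREATE PROCEDURE usp_Look_GetByType
            @TypeName varchar(50)
        AS
        SELECT l.LookGID, l.LookConst, l.LookValue
        FROM Look l
        JOIN LookType lt ON lt.LookTypeID = l.LookTypeFID
        WHERE lt.Name = @TypeName;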

    For example, I am currently interfacing with a DB2 database developed entirely by another development team, although I have been tasked with reporting functionality. They developed the database with separate lookup tables...all 200 of them. There have actually been problems with their lookup tables (see a question I just posed under a different thread http://www.sqlservercentral.com/Forum/topic.asp?TOPIC_ID=18853&FORUM_ID=23&CAT_ID=2&Topic_Title=How+would+you+do+this%3F&Forum_Title=General) that would have been avoided using my technique. This database has so many lookup tables that just scrolling the table list becomes a pain in the rear. Realize that they have a view for each table and 3 procs for each table. That's 600 procs!!! What did they gain? IMHO, not very much, if anything. Did they add complexity and bloat (your term) to the database structure and application architecture? I think so.

    I think that this technique holds more appeal if you have to do both database architecture and application programming. My .02

    David

  • sumlin,

    Perhaps you have implemented these things in the field in different ways than what I've seen. The performance-draining bloat and data dictionary issues I described are the results that I personally have witnessed.

    In the extreme case, the application was a very large securities data warehouse which used a centralized lookup table. The warehouse was five years old and countless hands had been involved in adding various types of entities to the table. No one had bothered to delete deprecated entities (and there were HUNDREDS), developers had named entities based on a variety of schemes both real and imagined, and no one was left from the time of the original creation of the table. As a result, confusion ruled the day. Do we use the 'CTY' tag to find a city, or the 'CITY' tag? Both have cities! Join performance was horrible. The nightly build also included parallel processes that both needed to write different types of data to the table, causing disc IO issues.

    I solved all of the issues by spending a lot of time sorting through the data, figuring out what belonged where, and breaking everything into individual lookup tables with properly defined constraints. Join performance shot way up (no more needless sifting through unrelated data), I was able to separate the tables onto different discs to optimize the nightly build, and I was able to eliminate a lot of confusion. Yes, had they bothered to maintain documentation the latter wouldn't have happened, but honestly, how many shops have you been to that had even slightly up-to-date documentation (again, years AFTER the initial build phase)?

    As for your situation with the view on each table, that seems a bit unwarranted. What are these views doing, other than selecting straight from the lookup tables?

    --
    Adam Machanic
    whoisactive

    By the way, I just looked at the thread you posted, and I feel that there should NOT be two lookup tables in that situation; statuses of court cases are the same logical type of entity, are they not? "OPEN PENDING TRIAL", "CLOSED CONVICTED", etc. (assuming these are actual possible statuses) could all go into a single table, thereby eliminating your problem, and you would still not need a different lookup tag to sort through the data in the table. I don't know anything about legal data, so perhaps I'm missing something.

    --
    Adam Machanic
    whoisactive

    Good to have discussion, but let's keep this good-natured, ok? I know this technique isn't for everyone, but used to some degree it isn't going to end the world either. I use a variation of it for lookup tables and it works well. Could it be used badly? Sure.

    I'd encourage one or both of you to really make your case in an article. PROVE to me (and everyone else) that your way is better. We're all here to learn!

    Andy

    http://www.sqlservercentral.com/columnists/awarren/

    I agree with Andy. Any issues with performance are not entirely issues with the design; the administration and implementation also play a part. In the case mentioned, the lack of cleanup and standardization can be an issue.

    I've used this technique, more like what sumlin has described, in places where we had sparse storage needs, or different amounts of data per row. I know, perhaps there were other issues, but for modifying designs in real time to deal with changing business requirements it worked well.

    That being said, we controlled what was put into the tables. True lookup tables were separated into their own tables. Sparse lookups are things like a credit limit for a customer: not everyone has one (not a zero limit, but one that doesn't apply), so this type of "attribute" would be stored in a separate table, linking back with an FK to the parent, only for those parents it applied to. Each attribute of this type would be examined: if we thought it was something that would persist, we'd normalize it properly. If it seemed like some half-baked idea (lots of those from the marketing and execs), we'd use it in this type of structure. We'd also examine whether or not this was something that was queried often. We found lots of data that was very rarely queried; it was also a good candidate for sparse storage.
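    A minimal sketch of the sparse-attribute shape described above; the table and column names are hypothetical:

        CREATE TABLE Customer (
            CustomerID int PRIMARY KEY,
            Name       varchar(100) NOT NULL
        );

        -- Only customers that actually have a credit limit get a row here.
        CREATE TABLE CustomerCreditLimit (
            CustomerID  int PRIMARY KEY REFERENCES Customer (CustomerID),
            CreditLimit money NOT NULL
        );

        -- Customers without a row simply have no limit: no NULLs, no magic zeros.
        SELECT c.CustomerID, c.Name, ccl.CreditLimit
        FROM Customer c
        LEFT JOIN CustomerCreditLimit ccl ON ccl.CustomerID = c.CustomerID;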

    You've both brought up good arguments. It would be nice to see an article outlining the good and bad; maybe work together and build two argumentative articles.

    Steve Jones

    sjones@sqlservercentral.com

    http://www.sqlservercentral.com/columnists/sjones

    The Best of SQL Server Central.com 2002 - http://www.sqlservercentral.com/bestof/

    http://www.dkranch.net

  • quote:


    I think that this technique holds more appeal if you have to do both database architecture and application programming. My .02


    I've done both.

    Perhaps more importantly, I've managed application programming, where I've been responsible for having many programmers maintain applications over time, fixing the (real or perceived) sins of their predecessors.

    Whenever someone gets "clever" at the expense of architectural purity, there ends up being a cost. It may be simple confusion (leading to bugs or rework), or it may be that certain situations are not handled thoroughly, usually because the "tricks" end up hiding what is going on underneath.

    That "cost" may sometimes be appropriate to pay. The case being made here is that the overhead of separate tables for each concept is a higher cost than the potential cost of cheating on purity of design. Each case has to be weighed individually.

    But to me the biggest problem with creating this architecture is the order of the decision making. Let me compare it to denormalization in general.

    I have programmers who START database designs saying "I need to denormalize these fields and these tables so this will run fast enough." They say it before they've even figured out cardinality, much less done any experimentation. Most times they are simply wrong -- it isn't significantly faster. By starting off in that direction instead of coming to it only when necessary, they make a mess.

    If you create this infrastructure, then people will tend to go to it first, without (enough) thought of consequence. You lower the "cost" of this step -- or at least the perceived cost -- and it will be used more widely than intended. What this does is hide the architectural (as opposed to implementation) cost of it.

    I've done it both ways, in file systems before RDBMSs and in RDBMSs, and it has always come back to bite us in loss of referential integrity, in difficulty implementing afterthought triggers and functionality specific to individual domains, etc.

    It has its place, but my suggestion is to treat it like surgery - try it only after everything else has failed.

    I've used this technique with success on various large databases. It has helped with the performance of lookups and the physical size of tables. However, on the downside, you do increase the complexity and the time taken to write queries. I would not suggest using this method across the entire database, but rather for splitting tables which may have a lot of duplicated varchar or char fields mixed in with int, float, etc. I refer to this method as partitioning tables rather than using a "store key", but from what I read in the article the concept is the same.
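    To illustrate, a minimal sketch of the kind of split being described; the tables, and the choice of which text to factor out, are hypothetical:

        -- The repeated text is moved to its own table and referenced by key,
        -- leaving a narrow numeric table that stays small and fast to scan.
        CREATE TABLE ItemDescription (
            DescriptionID int IDENTITY PRIMARY KEY,
            Description   varchar(500) NOT NULL
        );

        CREATE TABLE OrderLine (
            OrderLineID   int PRIMARY KEY,
            Qty           int NOT NULL,
            Price         money NOT NULL,
            DescriptionID int NOT NULL REFERENCES ItemDescription (DescriptionID)
        );

        -- Each distinct description is stored once, however many order lines share it.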

    Overall, I think that if you have a lot of repeated data due to a lack of original, thoughtful database design (and it happens), this is a method that may be worth trying. Of course, test everything before implementing it in a production system.

    Cheers,

    Angela

  • Let me start by saying that I agree with most of the concerns expressed in this forum. I think that this is a technique that should be rarely used, and only in places where truly sparse data exists, not in the usage envisioned by the author.

    One problem I see with this method is that the values in the lookup table are "shared by code" rather than by design. In other words, the user or application enters a text string for "address2" and the "instead of INSERT" trigger looks to see if the value already exists. If it does, the existing value is used.
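    A hedged sketch of that dedup-on-insert behavior; the trigger body is paraphrased from the article's description, applied to a StoreKey table like the hypothetical one sketched earlier in this thread:

        CREATE TRIGGER tr_StoreKey_Insert ON StoreKey
        INSTEAD OF INSERT
        AS
        BEGIN
            -- Only insert values that don't already exist; callers then look
            -- up the KeyID by value, so duplicates collapse to one shared row.
            INSERT INTO StoreKey (Value)
            SELECT i.Value
            FROM inserted i
            WHERE NOT EXISTS (SELECT 1 FROM StoreKey sk WHERE sk.Value = i.Value);
        END;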

    Here's the problem:

    Let's say that Joe lives on 4th Street in Seattle. He gets a record with a link to a string for the street address, rather than just storing the address. Now, Mary's record is entered. She lives on 4th Street in Portland. Her record is entered and she gets a link to the same text string as Joe.

    Now, Joe calls up and says that his address has to be modified. He lives on 4th Street Northeast. The application updates the text string.

    Now, Mary's address has changed, possibly to an address that does not exist in her city.
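    In T-SQL terms, the anomaly looks like this (the row number and values are invented for illustration):

        -- Suppose StoreKey row 42 holds the shared string '4th Street',
        -- and both Joe's and Mary's records point at KeyID 42.
        UPDATE StoreKey
        SET Value = '4th Street Northeast'
        WHERE KeyID = 42;

        -- Mary's record still references KeyID 42, so her address silently
        -- changed too. The safe fix (insert a new row and repoint only Joe)
        -- requires every update path to know the value might be shared.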

    I tried to make this discussion tangible so it would make sense, but my concern is that this would occur for a wide array of values.

    I've worked with databases where the developers thought that they were being clever by looking EVERYTHING up, instead of just repeating information that should have been repeated. Maintaining the data was a NIGHTMARE.

    I fear that wide use of this technique would lead to more databases with this problem.

    As a result of this experience, I do not believe that this kind of "automatic lookup" is a valid exercise, and I certainly do not believe that it reduces complexity in the long run.
