How much database normalization is enough?

  • Comments posted to this topic are about the item How much database normalization is enough?

  • It depends.

    I work with data that is highly denormalized. It works great because of columnstores. The simple structure allows for easy queries and suits the use cases of analytics. I can't easily normalize because I work in an MPP database with extremely large datasets; the more joins, the worse it gets. Thus, oddly enough, I introduce redundancy for performance gains, especially since I only get one hashing key per table, so I replicate data on different hashing keys for that performance increase (a rough sketch of the pattern follows below).
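
    To make that concrete, a rough sketch assuming an Azure Synapse-style MPP dedicated SQL pool (the table and column names are made up for illustration): the same denormalized sales data is kept in two copies, each hash-distributed on a different key, so joins and aggregations on either key stay distribution-local.

      -- Denormalized sales table, hash-distributed on CustomerKey so
      -- customer-centric queries avoid data movement.
      CREATE TABLE dbo.FactSales_ByCustomer
      (
          SaleId       bigint        NOT NULL,
          CustomerKey  int           NOT NULL,
          ProductKey   int           NOT NULL,
          CustomerName varchar(200)  NOT NULL,  -- redundant on purpose
          ProductName  varchar(200)  NOT NULL,  -- redundant on purpose
          SaleDate     date          NOT NULL,
          Quantity     int           NOT NULL,
          Amount       decimal(18,2) NOT NULL
      )
      WITH
      (
          DISTRIBUTION = HASH(CustomerKey),
          CLUSTERED COLUMNSTORE INDEX
      );

      -- Second copy of the same data, distributed on ProductKey instead,
      -- because only one hash distribution key is allowed per table.
      CREATE TABLE dbo.FactSales_ByProduct
      WITH
      (
          DISTRIBUTION = HASH(ProductKey),
          CLUSTERED COLUMNSTORE INDEX
      )
      AS SELECT * FROM dbo.FactSales_ByCustomer;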

  • It depends entirely on the use case.
    If you have a dimensional data mart then it's probably heavily denormalised
    For systems such as SAS analytics they thrive on heavily denormalised data.
    In an OLTP system you probably want it normalised because it deals with atomic objects.  A product, a sale, an order, a line item, etc. (both shapes are sketched just after this list)
    In a system that brings together disparate data sources you probably want it highly normalised to ease the data integration.
    In a data lake you probably want someone with undauntable optimism
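
    A minimal sketch of the two shapes above (the names are purely illustrative): a normalised order/line-item pair of the kind an OLTP system wants, and a single wide, denormalised table of the kind a dimensional mart or analytics feed prefers.

      -- OLTP: atomic objects, each fact stored once
      CREATE TABLE dbo.Orders
      (
          OrderId    int  NOT NULL PRIMARY KEY,
          CustomerId int  NOT NULL,
          OrderDate  date NOT NULL
      );

      CREATE TABLE dbo.OrderLines
      (
          OrderId    int           NOT NULL REFERENCES dbo.Orders (OrderId),
          LineNumber int           NOT NULL,
          ProductId  int           NOT NULL,
          Quantity   int           NOT NULL,
          UnitPrice  decimal(18,2) NOT NULL,
          PRIMARY KEY (OrderId, LineNumber)
      );

      -- Mart/analytics: one wide, heavily denormalised row per line item
      CREATE TABLE dbo.SalesFlat
      (
          OrderId      int           NOT NULL,
          LineNumber   int           NOT NULL,
          OrderDate    date          NOT NULL,  -- repeated on every line
          CustomerName varchar(200)  NOT NULL,  -- repeated on every line
          ProductName  varchar(200)  NOT NULL,
          Quantity     int           NOT NULL,
          UnitPrice    decimal(18,2) NOT NULL,
          LineAmount   decimal(18,2) NOT NULL
      );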

  • I agree, Ben.  Small systems usually have limited hardware resources.  Joining 6 tables instead of 10 gives you quite a performance increase.  Also, and I know many people will not agree, foreign keys slow a system down.  Doing everything via stored procedures still gives you control over when and what happens AND THE SPEED.
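
    For what it's worth, when foreign keys get blamed for slowdowns, the usual compromise (a hypothetical sketch below, not a recommendation for any particular system) is to keep the constraint but index the referencing column, since deletes and key updates on the parent otherwise have to scan the child table.

      CREATE TABLE dbo.Customers
      (
          CustomerId int          NOT NULL PRIMARY KEY,
          Name       varchar(200) NOT NULL
      );

      CREATE TABLE dbo.Invoices
      (
          InvoiceId  int NOT NULL PRIMARY KEY,
          CustomerId int NOT NULL
              CONSTRAINT FK_Invoices_Customers
              REFERENCES dbo.Customers (CustomerId),
          Total      decimal(18,2) NOT NULL
      );

      -- Child inserts/updates check the parent's PK (a cheap seek);
      -- parent deletes check Invoices.CustomerId, so index it.
      CREATE INDEX IX_Invoices_CustomerId ON dbo.Invoices (CustomerId);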

    5ilverFox
    Consulting DBA / Developer
    South Africa

  • Most definitely depends on use cases AND common sense.  Take 1NF.  It seems we can always divide something into even smaller units.  A street address has a number, street name, and suffix such as RD, AVE, BLVD, etc.  There can also be a compass direction.  E.g., 3704 E. Main St.  But, why would we ever want to break the street address up into these smaller parts?  We almost never want to query on these parts (Give me a count of addresses grouped by suffix?).  Even a simple string can be broken up into words, then letters, and even vowels, consonants, etc.   But it would be absurd to design a database this way.  
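
    A hypothetical illustration of that point: the decomposed version is "more atomic," but the single column is what almost every real query actually needs.

      -- Over-decomposed street address
      CREATE TABLE dbo.CustomerAddress_Decomposed
      (
          CustomerId   int          NOT NULL PRIMARY KEY,
          StreetNumber varchar(10)  NOT NULL,
          PreDirection varchar(2)   NULL,   -- e.g. E, NW
          StreetName   varchar(100) NOT NULL,
          StreetSuffix varchar(10)  NULL    -- RD, AVE, BLVD, ...
      );

      -- What most systems actually need: the address as a single fact
      CREATE TABLE dbo.CustomerAddress
      (
          CustomerId    int          NOT NULL PRIMARY KEY,
          StreetAddress varchar(150) NOT NULL   -- '3704 E. Main St'
      );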

    Conversely, we can go too far the other way, as this post pointed out.  I'm currently working for a company that recently moved off a mainframe computer.  Everything is now in SQL Server.  For the younger folks, mainframe architecture was largely file based.  Step one was to move the architecture to the new technology.  JCL became Windows .bat files, COBOL became machine-generated Java, and the Adabas/file system was translated, as best they could, into a relational database.  But relational in name only.  As these projects go, we never got to step 2.  Step 2 would have been reworking the architecture into a relational model.  Of course, the original mainframe/COBOL developers remained and continued maintaining the system with little knowledge of relational modeling principles.  We have many tables with 100-200+ columns.  It's truly genius how they got everything to work.  But very inefficient.  I work in BI building data warehouses and creating reports.  I can often, with a few lines of SQL, recreate a legacy report that took a dozen flat files, even more JCL batch files, and 5-10 COBOL "programs".  I just finished one that can be run, on demand, in less than 15 seconds.  The previous version was run weekly because it took almost 3 and a half hours.  This is not an isolated horror story.  Many companies are in the same boat.  It is a testament to the power of a properly designed relational model.

  • definitely an issue of use case, I will agree. 
    My brain boggles at even how I would find a data situation that would require much past 4th normal form, and even that tends to cause mental explosions. Experience levels of the DBA and Architect also affect the final solution. I've met a few folks fresh out of training who are hell-bent on maximum normalization when, truthfully, a completely denormalized table is sometimes the fastest and most optimal solution. As for foreign keys: nice, handy, and very useful, and they have saved my butt several times, but sometimes, based on business logic, you can find a situation where enforcing integrity at the application level is a far better control than relying on the underlying database architecture methods.

    This is where the DBA, the Data Architect (if available), the development team, and the stakeholders need to work together to arrive at the optimal solution.

    As always, in this industry, the answer comes back to "It depends."

    Luther

  • This is a scary editorial, and the preceding comments are also scary.  Normalization is not some wonderful thought experiment that is nice to have but not, strictly speaking, necessary.  It applies a mathematical rigor to the art and science of relational database development.

    Normalization through the third normal form is sufficient for most databases, and most sections of any database.  It is derelict of any database designer to suggest that normalization to this point is "optional."  The normal forms above that level address specific cases that the first three do not adequately cover.  The editorial's suggestion that sixth normal form is something to strive for is silly.  Sixth normal form exists to handle the use of time in a manner that was not foreseen at the time Codd laid out the original ground rules.  For all the database work that I've done, I have not had to deal with this.  So when I start creating a database, I don't think, "I've got to normalize this through six normal forms."  Three almost always suffice.  And frankly the rules are common sense.  If you have each fact in one place, you're probably there, or pretty close.
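
    For example, a minimal, hypothetical illustration of the kind of dependency 3NF removes:

      -- Violates 3NF: DepartmentName depends on DepartmentId, not on the key,
      -- so the same fact is repeated on every employee row.
      CREATE TABLE dbo.Employee_Un3NF
      (
          EmployeeId     int          NOT NULL PRIMARY KEY,
          EmployeeName   varchar(200) NOT NULL,
          DepartmentId   int          NOT NULL,
          DepartmentName varchar(200) NOT NULL
      );

      -- 3NF: the department's name lives in exactly one place.
      CREATE TABLE dbo.Department
      (
          DepartmentId   int          NOT NULL PRIMARY KEY,
          DepartmentName varchar(200) NOT NULL
      );

      CREATE TABLE dbo.Employee
      (
          EmployeeId   int          NOT NULL PRIMARY KEY,
          EmployeeName varchar(200) NOT NULL,
          DepartmentId int          NOT NULL REFERENCES dbo.Department (DepartmentId)
      );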

    A database administrator's first duty is to protect the integrity of the data.  It is not to have the most performant database.  Six joins may be faster than ten, but if the design makes data errors possible, what's the point of going fast?  Is it better to return bad data fast, or good data a little slower?  The rules governing normalization provide the guide for carrying out this first duty.  They are ignored at your peril, and the peril of all those who will have to live with the design that you created.

    My primary concern in expressing the counter view is that all too often, when I've dealt with actual designers and designs, the resistance seems to stem from an unwillingness to take on the rigor required.  And unfortunately I have had to live with the results.  Denormalization has its place.  I have occasionally--and I stress occasionally--denormalized a portion of a design for performance reasons.  But that was after no other solution would work.

    As for OLAP designs, these normalization rules don't apply.  A different set of rules comes into play, such as granularity of fact tables.  These rules are also ignored at your own peril.  I got into BI because I was tasked to determine why the warehouse a consultant had created for us had some bad data.  After having read Ralph Kimball's book to give me some theoretical knowledge, I determined that he had not factored in this very important aspect of design.
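
    To make the granularity point concrete, a minimal, hypothetical fact table whose grain is explicitly one row per order line; every column is either a dimension key or a measure that is additive at that grain.

      CREATE TABLE dbo.FactOrderLine
      (
          DateKey     int           NOT NULL,  -- surrogate key to a date dimension
          CustomerKey int           NOT NULL,
          ProductKey  int           NOT NULL,
          OrderNumber varchar(20)   NOT NULL,  -- degenerate dimension
          Quantity    int           NOT NULL,
          SalesAmount decimal(18,2) NOT NULL
      );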

  • Thimble view of a long story....
    Ten years ago I was designing reports for an 'improved' version of a proprietary app. The developers said they were using 'sixth and seventh level normalization'.  After a lot of searching I found one other system using this same scheme: the Royal Mail Service in the UK.
    The person I talked with had retired from there a couple years before. He described it as the 'worst bloody system ever'.  
    The system I was working on had to be abandoned ( 2 yrs + $4M US ) because the database could only deliver meaningful data in columns and our customers wanted data in rows.

  • RonKyle - Monday, July 23, 2018 7:03 AM

    This is a scary editorial, and the preceding comments are also scary.  Normalization is not some wonderful thought experiment that is nice to have but not, strictly speaking, necessary.  It applies a mathematical rigor to the art and science of relational database development.

    Normalization through the third normal form is sufficient for most databases, and most sections of any database.  It is derelict of any database designer to suggest that normalization to this point is "optional."  The normal forms above that level address specific cases that the first three do not adequately cover.

    Unfortunately, it is an option, and suggesting it's optional is not a poor suggestion. In cases like mine, normalization dramatically decreases the performance of reading the data. If you were to come in as a database designer and tell me that you were going to normalize the 10 TB of data in our data warehouse just because it's what you have always done with other traditional SMP databases, then you're going to be shown the door.

    When it comes to normalization and even denormalization, it just depends...

  • I can't speak for data lakes, data warehouses, and so on but for OLTP 3NF has always been sufficient for me. There are *special* cases where denormalizing can be required, but it's a niche thing. Personally I use it for denomination sub-totals and nothing else.

    The database is where data integrity rules belong, period. Otherwise an application can get around them, either through ignorance or malice, and neither case is good.
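
    A minimal sketch of what I mean (hypothetical tables): declared in the schema, these rules hold no matter which application or ad-hoc session writes the row.

      CREATE TABLE dbo.Account
      (
          AccountId int          NOT NULL PRIMARY KEY,
          Email     varchar(320) NOT NULL CONSTRAINT UQ_Account_Email UNIQUE
      );

      CREATE TABLE dbo.Payment
      (
          PaymentId int NOT NULL PRIMARY KEY,
          AccountId int NOT NULL
              CONSTRAINT FK_Payment_Account REFERENCES dbo.Account (AccountId),
          Amount    decimal(18,2) NOT NULL
              CONSTRAINT CK_Payment_Amount CHECK (Amount > 0)
      );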

    Now, in a read-only environment I can see where denormalization might be helpful and application logic can supply some of the business logic, but absolutely not where users can change the data. Users are endlessly inventive, and the old saw "make something idiot proof and the universe will provide better idiots" is so true it isn't even funny.

  • normalize the 10 TB of data in our data warehouse just

    You must not have read the last paragraph.  As I stated, these rules do not apply to OLAP (data warehouse) designs.  That doesn't make them "optional" for OLTP designs.  If your statement that it depends means it depends on whether it's an OLTP design or an OLAP design, then I would say that normalization is a requirement for the first and a detriment for the second.  In my OLAP designs, snowflaking is the exception.  I've seen cases where an OLAP designer resorts to it only because they can't get past the fact that the normalization rules they've used before do not apply in this design environment.

    //Edited for grammar.

  • This kind of strikes me as on par with asking whether every table should have a clustered identity column as the primary key.  And as others have said, it totally depends on what you're doing with the data.  Also, normalization is a technique that has practical applications; on the other hand, third, sixth, or whatever normal form are just terms for academia, and chances are a real database would never be perfectly in one of those.

  • I think the thing that people tend to forget is that the normalization process for a database is NOT part of the design process; it is part of the analysis stage before design.  The clarification also needs to be made between unnormalized databases and denormalized databases.  Unnormalized ones have not gone through this analysis, or at least not gone through it completely; those tend to be the worst-case scenario for an OLTP system.  Denormalized ones have gone through normalization analysis, and during the design process compromises are made for performance and usability reasons.

  • chances are a real database would never be perfectly in one of those

    I disagree.  I can show you smaller designs that are perfectly in 3NF.  There is nothing about the designs that requires normalization to a higher form.  I can show you larger designs that are in perfect 3NF except for some tables that have a column containing a GUID candidate key, only because the original table uses a GUID primary key instead of the more sensible business key.  Unfortunately I can't purge this issue, but that goes to my original point: someone(s) in the past did not apply the necessary rigor and created a system that is more difficult to use than it should be.

    As the subsequent poster indicates, there is a difference between unnormalized and denormalized.  If you have not applied the design through 3NF, you have an unnormalized database.  If the design was in 3NF but a significant performance issue arose in testing that could only be solved by some form of denormalization and not by better indexing, data type choice, or whatever else, then so be it.  If you have a lot of these, something is probably wrong.
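
    One concrete example of the "whatever else": an indexed view (a hypothetical, SQL Server-specific sketch) can materialise an expensive aggregate while the base tables stay normalised.

      -- Normalised base table (hypothetical)
      CREATE TABLE dbo.SaleLines
      (
          SaleId     int           NOT NULL,
          LineNumber int           NOT NULL,
          Quantity   int           NOT NULL,
          UnitPrice  decimal(18,2) NOT NULL,
          PRIMARY KEY (SaleId, LineNumber)
      );
      GO

      -- The view materialises the per-sale total; the base data stays in 3NF.
      CREATE VIEW dbo.vSaleTotals
      WITH SCHEMABINDING
      AS
      SELECT  sl.SaleId,
              SUM(sl.Quantity * sl.UnitPrice) AS SaleTotal,
              COUNT_BIG(*)                    AS LineCount  -- required with GROUP BY in an indexed view
      FROM    dbo.SaleLines AS sl
      GROUP BY sl.SaleId;
      GO

      -- The unique clustered index is what makes the view "indexed" (materialised).
      CREATE UNIQUE CLUSTERED INDEX IX_vSaleTotals ON dbo.vSaleTotals (SaleId);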

  • I'm firmly for normalization, but there are rare circumstances where ideological purity can reduce performance without any practical gain.

    ...

    -- FORTRAN manual for Xerox Computers --
