Why Normalization is not good

  • Under what circumstances is normalization disadvantageous to a database? Can someone explain with a neat example?

  • Normalisation may be a disadvantage when warehousing data that is going to be read-only.

  • This is quite a vague question. Have you already done any research on the subject? I'm sure a quick Google, Bing, Ask, etc., would yield some good results. Could you try to be more specific?

    Thom~

    Excuse my typos and sometimes awful grammar. My fingers work faster than my brain does.
    Larnu.uk

  • sathwik.em91 - Wednesday, May 30, 2018 5:29 PM

    Under what circumstances is normalization disadvantageous to a database? Can someone explain with a neat example?

    Maybe the best question to ask you is, "Why?" What's the motivation behind asking this question? You'll find that you often learn a great deal more by specifying the "why" behind your question, so that we can at least look at the problem from your point of view.

    Steve (aka sgmunson) 🙂 🙂 🙂
    Rent Servers for Income (picks and shovels strategy)

  • It really depends on the usage of the database. The old saying is: if you want good writes, normalize; if you want good reads, denormalize. There is no real set rule, but both approaches have their pros and cons. For example, in my data warehouse, normalization would be bad because it would kill performance on the analytical reports we generate from the database. This is because the number of complex JOINs we would need in a normalized world would be a killer on performance. However, the moment we start UPDATING the database is the moment all hell breaks loose in the denormalized world.
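
    To make the trade-off concrete, here is a minimal T-SQL sketch (the table and column names are made up for illustration): the same revenue-by-region question asked against a normalized schema and against a single wide, denormalized table.

    -- Normalized: the reporting query has to walk several joins.
    SELECT c.Region,
           SUM(ol.Quantity * ol.UnitPrice) AS Revenue
    FROM dbo.Orders AS o
    JOIN dbo.OrderLines AS ol ON ol.OrderID = o.OrderID
    JOIN dbo.Customers  AS c  ON c.CustomerID = o.CustomerID
    GROUP BY c.Region;

    -- Denormalized: one wide table and no joins for the read, but every
    -- change to a customer's region now has to touch many rows instead of one.
    SELECT Region,
           SUM(Quantity * UnitPrice) AS Revenue
    FROM dbo.SalesWide
    GROUP BY Region;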

  • Normalisation may be a disadvantage when warehousing data that is going to be read-only.

    Data Warehouses aren't denormalized.  They have a different set of design rules which gives them the appearance of being denormalized.  But the design doesn't come about because someone takes an OLTP design and denormalizes it into an OLAP design.
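
    As a rough sketch of those dimensional design rules (all names here are illustrative, not from any particular warehouse): the fact table carries keys and measures, and the descriptive attributes live in a dimension table. That only looks like denormalization when viewed from an OLTP mindset.

    CREATE TABLE dbo.DimCustomer
    (
        CustomerKey  INT IDENTITY(1,1) PRIMARY KEY,
        CustomerName NVARCHAR(100) NOT NULL,
        Region       NVARCHAR(50)  NOT NULL   -- descriptive attributes live in the dimension
    );

    CREATE TABLE dbo.FactSales
    (
        CustomerKey INT            NOT NULL REFERENCES dbo.DimCustomer (CustomerKey),
        OrderDate   DATE           NOT NULL,
        Quantity    INT            NOT NULL,
        UnitPrice   DECIMAL(10, 2) NOT NULL   -- the fact is keys and measures only
    );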

  • xsevensinzx - Thursday, May 31, 2018 7:39 AM

    It really depends on the usage of the database. The old saying is: if you want good writes, normalize; if you want good reads, denormalize. There is no real set rule, but both approaches have their pros and cons. For example, in my data warehouse, normalization would be bad because it would kill performance on the analytical reports we generate from the database. This is because the number of complex JOINs we would need in a normalized world would be a killer on performance. However, the moment we start UPDATING the database is the moment all hell breaks loose in the denormalized world.

    I've seen the opposite. Many developers find it easier to drop everything, denormalized, into XML, JSON or CSV, and then complain because the query takes a long time when they try to filter by a specific value inside those blobs. People thought they were clever by using bitmap columns, but there's a reason they're disappearing. JOINs shouldn't be complex on a properly normalized database.

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • Many developers find it easier to drop everything, denormalized, into XML, JSON or CSV, and then complain because the query takes a long time when they try to filter by a specific value inside those blobs. People thought they were clever by using bitmap columns, but there's a reason they're disappearing. JOINs shouldn't be complex on a properly normalized database.

    I've seen this as well and it drives me nuts. It of course makes the programming easier, because they can just drop these fields in as they are. I tell programmers who also have to be designers that an OLTP database has to not only pull individual records efficiently, but also allow for reporting queries across the data. If it only does the first, it's not a good design.
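
    As a rough illustration of why filtering inside those blobs hurts (the names here are hypothetical, and JSON_VALUE assumes SQL Server 2016 or later): compare pulling a value out of a JSON column with filtering on a proper column. The first has to crack open every document row by row, while the second can be served by an ordinary index.

    -- Filtering inside a JSON blob: JSON_VALUE is evaluated per row,
    -- so this typically scans the whole table.
    SELECT OrderID
    FROM dbo.OrdersBlob
    WHERE JSON_VALUE(Payload, '$.order.status') = N'Shipped';

    -- The same filter against a real column can use an index seek.
    SELECT OrderID
    FROM dbo.Orders
    WHERE OrderStatus = N'Shipped';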

  • sathwik.em91 - Wednesday, May 30, 2018 5:29 PM

    Under what circumstances is normalization disadvantageous to a database? Can someone explain with a neat example?

    As Thom said, the question is too vague, and without knowing the platform and the purpose there is no way to properly answer it.
    😎 

    There are many misconceptions and myths around de-normalisation of data sets, most of which derive from the perceived overhead of relational operators when retrieving data from the set. If you are targeting multi-dimensional OLAP, then denormalisation could possibly be an option; more efficient and modern analytical engines such as Tabular (PowerBI.com, PowerBI desktop, Excel PowerPivot and SSAS Tabular) work much better on sets in third normal form or higher.

  • Eirikur Eiriksson - Thursday, May 31, 2018 9:45 AM

    sathwik.em91 - Wednesday, May 30, 2018 5:29 PM

    Under what circumstances is normalization disadvantageous to a database? Can someone explain with a neat example?

    As Thom said, the question is too vague, and without knowing the platform and the purpose there is no way to properly answer it.
    😎 

    There are many misconceptions and myths around de-normalisation of data sets, most of which derive from the perceived overhead of relational operators when retrieving data from the set. If you are targeting multi-dimensional OLAP, then denormalisation could possibly be an option; more efficient and modern analytical engines such as Tabular (PowerBI.com, PowerBI desktop, Excel PowerPivot and SSAS Tabular) work much better on sets in third normal form or higher.

    Not really. They work on tiny datasets that do very simplistic computations for operational reporting. When you start actually doing analytics, where computational needs increase along with data sizes, things start drastically breaking down on most of these engines. For example, try loading 100 million records into an engine to further aggregate it. The read alone is going to take some time before you even get to the point of aggregating it.

    Denormalization helps with computing that data fast, with little resistance from the usual performance killers. For example, you take a large denormalized fact table and create a new tabular dataset from it that ultimately feeds into your X reporting engine, as opposed to trying to feed the engine directly from N normal form, which could time out or take a long time to remodel the data on the read (normalized -> denormalized).
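
    A minimal sketch of that idea, with made-up names: pre-aggregate the wide fact table once into a smaller dataset and point the reporting engine at the result, instead of asking it to reassemble and aggregate everything on every read.

    -- Build the smaller tabular dataset once from the wide fact table...
    SELECT OrderDate,
           ProductKey,
           SUM(Quantity)             AS TotalQuantity,
           SUM(Quantity * UnitPrice) AS Revenue
    INTO dbo.SalesByDayProduct
    FROM dbo.FactSales
    GROUP BY OrderDate, ProductKey;
    -- ...and feed dbo.SalesByDayProduct to the reporting engine, rather than
    -- having it remodel the data from N normal form on each read.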

  • xsevensinzx - Thursday, May 31, 2018 3:48 PM

    Not really. They work on tiny datasets that do very simplistic computations for operational reporting.

    I don't know where you get this from; the data sets I've worked on are neither tiny nor are the computations simple.

    When you start actually doing analytics, where computational needs increase along with data sizes, things start drastically breaking down on most of these engines. For example, try loading 100 million records into an engine to further aggregate it. The read alone is going to take some time before you even get to the point of aggregating it.


    This is simply not true; the Tabular model can handle hundreds of millions of rows very easily, given that the data set is normalized and properly designed.

    Denormalization helps with computing that data fast, with little resistance from the usual performance killers. For example, you take a large denormalized fact table and create a new tabular dataset from it that ultimately feeds into your X reporting engine, as opposed to trying to feed the engine directly from N normal form, which could time out or take a long time to remodel the data on the read (normalized -> denormalized).


    Denormalizing the data set increases both the storage footprint and the I/O pressure, resources which are far more prone to congestion than modern-day CPU number crunching.
    😎

  • Eirikur Eiriksson - Friday, June 1, 2018 12:56 AM

    Couple of things.

    Handing a business user dozens of tables and telling them to answer their data questions is likely not going to end well. Many will not be level-10 DBA experts who understand how to do this. You're likely going to end up denormalizing the data anyway, which is essentially why you go ahead and do that on disk in the first place. In the denormalized world, where it's already denormalized on disk, we often land at the same conclusion anyway.

    What conclusion? We still need to create views or subsets of the data for the business user. In the normalized world, this likely results in views created to denormalize the data and then aggregate it. Then we feed those views into the in-memory analytical models, data marts, or even data warehouses. In the denormalized world, we can normally skip the first step and just create the view that does the aggregation. Both result in a view feeding into the model.

    The kicker here is that in the denormalized world, we are aggregating from a columnstore that is blazing fast. It's optimized for computation and for extremely fast reads on the specific fields you need. This is going to be faster than your normalized dataset because everything is in one location, compressed, and doesn't require 30 joins to aggregate. If you're on MPP, as you should be, then aggregation happens over N nodes, which only increases the performance. But that's not really a comparison of denormalized versus normalized so much as SMP versus MPP.

    Ultimately, there is a huge difference between having to JOIN a bunch of tables in order to aggregate and just aggregating one single table with NO JOINS, on top of the difference between row-based SMP systems and column-based MPP systems.

    (Note: NO JOINS refers only to having to join other tables to pull in other keys, with their respective metrics, before aggregating. Once aggregated, in both cases you do ultimately JOIN your dimensions to translate those keys into their values.)
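
    To sketch the difference (the object names are purely illustrative): in the denormalized world the fact table sits behind a clustered columnstore index and the view feeding the model aggregates it directly; in the normalized world you first need a view that joins the tables back together before you can aggregate.

    -- Denormalized world: compress the wide fact table with a clustered
    -- columnstore index, then the view feeding the model needs no joins.
    CREATE CLUSTERED COLUMNSTORE INDEX cci_FactSales ON dbo.FactSales;
    GO
    CREATE VIEW dbo.vSalesForModel
    AS
    SELECT CustomerKey, ProductKey, OrderDate,
           SUM(Quantity * UnitPrice) AS Revenue
    FROM dbo.FactSales
    GROUP BY CustomerKey, ProductKey, OrderDate;
    GO

    -- Normalized world: denormalize with a join view first, then aggregate
    -- that view (or another view on top of it) for the model.
    CREATE VIEW dbo.vSalesDenormalized
    AS
    SELECT o.CustomerID, ol.ProductID, o.OrderDate, ol.Quantity, ol.UnitPrice
    FROM dbo.Orders AS o
    JOIN dbo.OrderLines AS ol ON ol.OrderID = o.OrderID;
    GO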

  • xsevensinzx - Friday, June 1, 2018 3:42 AM

    The modern analytical platforms I referred to do not need denormalisation; they actually work much, much better on a normalized set, which means that there is hardly any difference between the underlying data source and the analytical model. There are several things one has to keep in mind when optimising the model for the tabular engine, such as cardinality, unique-value ratio, etc. It is a whole new ballgame and does not have the restrictions of the older multi-dimensional approach!
    😎
    I'm really not trying to patronise or anything like that; I suggest you read up on the tabular analytical engine and the VertiPaq engine under the hood (bonnet).

  • The modern analytical platforms I referred to do not need denormalisation; they actually work much, much better on a normalized set, which means that there is hardly any difference between the underlying data source and the analytical model. There are several things one has to keep in mind when optimising the model for the tabular engine, such as cardinality, unique-value ratio, etc. It is a whole new ballgame and does not have the restrictions of the older multi-dimensional approach!
    😎
    I'm really not trying to patronise or anything like that; I suggest you read up on the tabular analytical engine and the VertiPaq engine under the hood (bonnet).

    With just a little bit of research, you would realize that the tabular model is not some new ballgame that overcomes the restrictions of the "older multi-dimensional approach." The multi-dimensional approach can do more with complex relationships and handles large sets better. Its primary drawback is that it is complicated, and apparently complicated enough that not enough people understand it. Development time is also longer with the MD approach. The tabular model is intended to be a simplified version, and even allows for very simplified models. But it's not better. If you have a good multi-dimensional solution, you do not need to switch to tabular, because that would not be an upgrade. We have looked at our SSAS solutions and concluded that some of the things we do could not be done in the tabular model. If you have a failed multi-dimensional solution, then you should consider the tabular model as a rescue plan. Because I understand the multi-dimensional approach and SSAS very well, none of my BI projects fail.

  • Eirikur Eiriksson - Friday, June 1, 2018 8:33 AM

    Indeed, but that all depends. You must be referring to the fact that you can read N normalized tables in parallel at the same time, versus waiting on one big fact table to load without parallelism. The problem, though, is that you still end up with a pretty complex model in the front end that is going to be difficult to navigate for anyone not specialized in BI, versus working with one denormalized set of data by default.
