Normalization

  • Belatedly I realised that normalization isn't entirely defeated by Codd tables without keys. Given the definition of JDs already discussed, the Codd table T(a) would satisfy the JD * (a,{}) where {} is a projection on the empty set of attributes. This is a trivial join dependency. But if T doesn't have any keys then this trivial JD must be a JD not implied by keys. So T is not in PJ/NF.
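To make that concrete, here is a minimal sketch in Python (toy helper functions of my own invention, not any real library) checking that a one-attribute Codd table T(a) does satisfy the trivial JD *({a}, {}): joining the projection on {a} with the projection on the empty attribute set reconstructs T exactly, so a keyless T satisfies a JD that no key implies.

```python
# Toy relational operators (hypothetical helpers, just for this check):
# a tuple is represented as a frozenset of (attribute, value) pairs.

def project(rel, attrs):
    """Projection: keep only the named attributes, de-duplicating tuples."""
    return {frozenset((a, v) for a, v in t if a in attrs) for t in rel}

def natural_join(r, s):
    """Natural join: merge pairs of tuples that agree on shared attributes."""
    out = set()
    for t1 in r:
        for t2 in s:
            d1, d2 = dict(t1), dict(t2)
            if all(d1[a] == d2[a] for a in d1.keys() & d2.keys()):
                out.add(frozenset({**d1, **d2}.items()))
    return out

T = {frozenset([("a", 1)]), frozenset([("a", 2)])}

# the JD *({a}, {}): join the two projections and compare with T
rejoined = natural_join(project(T, {"a"}), project(T, set()))
# rejoined equals T, so the keyless table satisfies this trivial JD
```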

  • David Portas (5/10/2011)


    I had always assumed that the detection mechanism mentioned by Codd would have to be applied at runtime and evaluated for each tuple involved in a query rather than just be based on static analysis of the query syntax. The reason I thought runtime evaluation was necessary was because of situations like the following one. Using the Codd table:

    z x

    ---- ----

    4 2

    0 a-null

    A restriction based on the condition that z / x = 0 would apparently return no tuples (based on RM2 page 185) even though z = 0 divided by any applicable value is known to be 0. I certainly don't share Codd's curious optimism that this is "not a burning issue" - but for the purposes of this discussion I will assume for the time being that it isn't.

    Well, presumably one of the applicable values that a-null could be is 0, and it surely is not true that 0/0 delivers 0 (it should deliver something like "error: indeterminate value", or "error: zero-divide" if it's a bit less good at distinguishing between errors); so if we look at Codd's true-selection we get no rows, and if we look at his maybe-selection we get 1 row (I can't remember whether the true-op and maybe-op distinction is explicitly mentioned in the RM2 paper, but the earlier papers with nulls and MVL certainly have it). Also, although a-null and i-null appear to be mutually exclusive, the RM2 model does explicitly allow an a-null to become an i-null, so a-null doesn't really mean there is known to be an applicable value; that's another case of a valid substitution for the a-null that would leave no rows delivering true.
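The true-selection/maybe-selection split can be sketched in a few lines of Python (helper names are my own, not from RM2; None stands in for the a-mark, and any arithmetic involving a null evaluates to MAYBE):

```python
# Sketch of Codd-style true- vs maybe-selection in a many-valued logic.
TRUE, MAYBE, FALSE = "true", "maybe", "false"

def eval_z_div_x_eq_0(z, x):
    """Evaluate the predicate z / x = 0 in 3VL; None means an a-mark."""
    if x is None:                  # divisor unknown, so z/x is unknown
        return MAYBE
    return TRUE if z / x == 0 else FALSE

table = [(4, 2), (0, None)]        # the Codd table from the post

true_select  = [r for r in table if eval_z_div_x_eq_0(*r) == TRUE]
maybe_select = [r for r in table if eval_z_div_x_eq_0(*r) == MAYBE]
# true_select  -> []           (no row certainly satisfies z/x = 0)
# maybe_select -> [(0, None)]  (the a-null row might satisfy it)
```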

    I'm prepared to consider some other 3VL system but I'm not sure exactly what you are proposing if your system is not Codd's 4VL and not SQL's 3VL. Just to restate your explanation of relational equality and join dependency. I understand that you think null marks of the same kind for the same attribute should be considered equal for the purposes of relational comparison but not for joins. So where T is a table then T = T is always true (with nulls or without) but T |X| T = T is not necessarily true. In other words a table does not necessarily satisfy even trivial join dependencies. Is that correct?

    If T has any columns that allow NULLs, then it doesn't satisfy T |X| T = T: if T has a nullable column C, and P1 and P2 are both projections of T that contain C, then P1 |X| P2 ≠ T. That doesn't prevent the table from satisfying some trivial join dependencies (if your definition of a trivial join dependency is one in which every projection concerned includes the whole of the primary key, which is the one I remember from long ago), since the set of distinct projections, each of which is on the key columns plus one other column, always delivers a trivial join dependency that is satisfied. If you mean that some things which are trivial join dependencies (and hence are satisfied) in the null-free model are not join dependencies (because they are not satisfied) in a null-containing model, that is absolutely correct. If you mean that no trivial join dependencies are satisfied in a null-containing model, that is incorrect.
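Both halves of that claim can be demonstrated with a small sketch (Python, with a toy join of my own) using SQL-style semantics in which NULL never compares equal to anything, including another NULL:

```python
# Toy join where None (NULL) is unequal to everything, even another None.
def join_on(r, s, attrs):
    """Join lists of dict-tuples r and s on attrs, SQL NULL semantics."""
    out = []
    for t1 in r:
        for t2 in s:
            if all(t1[a] is not None and t1[a] == t2[a] for a in attrs):
                out.append({**t1, **t2})
    return out

T = [{"K": 1, "C": "x"}, {"K": 2, "C": None}]

# projections both containing the nullable column C: the K=2 row is lost
rejoined = join_on(T, T, ["K", "C"])
# rejoined -> [{"K": 1, "C": "x"}]   -- not equal to T

# but a trivial JD whose projections all include the key K still holds
recovered = join_on(T, T, ["K"])
# recovered -> equal to T
```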

    So for example if I have some things (in the DB: some maps in a relation) such that for each thing (in the DB: map - I suppose I have to call it tuple in an RDB context although it's mathematically inappropriate) R it is true that if R.Y has a non-null value then R.Z must be inapplicable and vice-versa, then that constraint in the modelled world is a functional dependency in the model - a functional dependency requiring something to be NULL. If I ignore nulls, normalisation will not eliminate non-trivial functional dependencies like this one, so I can't ignore NULLs for normalisation. If I throw away i-NULL and just have one pure NULL, I can't construct functional dependencies like this; so goodbye i-NULL and 4VL, let's get back to a single pure null and a 3VL.

    Wouldn't you want to define functional dependency in an analogous way to how you've defined join dependency? I mean that an FD A->B is never satisfied by any table where B permits nulls? If you allow the FD A->B to be satisfied when B permits nulls then you would have FDs that were not also JDs. To comply with Fagin's definition of PJ/NF the set of join dependencies satisfied by a Codd table would have to be equivalent to the set of all key dependencies (FDs, MVDs and JDs implied by the keys of the table). So if A->B is a permissible FD where B allows nulls then by your definition of a JD you could have a Codd table which satisfied PJ/NF but not 3NF.

    That's a perfectly reasonable reformulation of my argument (which you quoted) for throwing away the i-null concept; I don't see how a real null (as opposed to an i-null) could reasonably be possible in a column which is the target of a functional dependency (or of an MFD).

    Alternatively, assume the FD A->B is never satisfied if B permits nulls. Now we seem to have a situation even worse than the kludge that Codd suggested for applying normalization with nulls. Codd proposed that an FD could be in effect if you ignored only the rows where nulls occurred. However, if we insist that an FD is also a JD then by your definition of JDs it seems that an FD doesn't permit nullable attributes either. So we would have a strange situation where we can add any nullable attributes we like to a Codd table without any risk of creating a non-key dependency, and so normalization becomes a useless exercise in the face of those nullable attributes.

    I'm more optimistic. I think a straightforward Kleene logic 3VL with a single NULL which means nothing more than "this value is not in the database" works fine (that, unfortunately for SQL, is nothing like the SQL NULL). I don't envisage a vast proliferation of nullable columns, since most of the time nulls aren't needed, and only a fool will permit them where they are unnecessary. And in a system with just the one sort of NULL I don't see any real problem with normalisation: for 2NF, 3NF and 4NF we could define FDs as A->B when the value of A determines the value of B except where B is null, and MFDs as A->>B except that any of the B attributes are permitted to be NULL (which was Codd's original proposal on nulls and normalisation), or go with the rule that a nullable column is never in the target of an FD or an MFD (which is the easy way, but maybe not sensible). I would perhaps also be happy with using a separate calculus (where NULL = NULL delivers true) for defining 2nd, 3rd, and 4th normal forms (and maybe even for defining 5NF, although that feels sort of bizarre) if someone were to demonstrate the need for it, but I can't envisage the need being demonstrated, and without the demonstration I'd be unhappy with such a definition. And I certainly do need that one NULL (or something for practical purposes equivalent) for real world problems where the application needs to know that data is missing and decide how to deal with it, and not be fobbed off with "default values" or the result of some really complex heterogeneous union as a substitute for the natural result delivered by open join (which entails NULL) or by something built from open union (which also entails NULL).
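For reference, the Kleene strong 3VL connectives Tom has in mind look like this (a sketch with Python None as UNKNOWN; the connective tables themselves are the same ones SQL uses - the difference Tom cares about is what the null means, not these tables):

```python
# Kleene strong three-valued logic: True, False, None (= UNKNOWN).

def k_not(a):
    return None if a is None else not a

def k_and(a, b):
    if a is False or b is False:
        return False            # FALSE absorbs even UNKNOWN
    if a is None or b is None:
        return None
    return True

def k_or(a, b):
    if a is True or b is True:
        return True             # TRUE absorbs even UNKNOWN
    if a is None or b is None:
        return None
    return False

# e.g. UNKNOWN OR TRUE -> True, UNKNOWN AND FALSE -> False
```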

    I can believe that it might just be possible to devise some hypothetical system of nulls and tables with a set of derivation rules that makes sense of the normal forms we know. That hypothetical system is certainly not SQL. Nor will it be RM2 if you want 3VL instead of 4VL.

    I guess we agree on that then - it won't be SQL and it won't be RM2. Nor will it be RM-T.

    In the absence of the definition of any such system (please define it if you think you can) I stand by my original comments on nulls, which are borne out when the consequences of RM2 are considered: none of 1NF, 2NF, 3NF, BCNF, 4NF, 5NF, PJ/NF, DKNF allows nulls. A table that permits nulls is not an accurate representation of a relation that satisfies any of those NFs.

    I don't understand how that claim can imaginably be made in regard of 1NF by any intelligent person (although I do know perfectly well that several intelligent people have made it: you, Chris Date, Fabian Pascal... it's a long list). Are you claiming, for example, that the domain consisting of the (fully defined) integers plus "bottom" is not a valid domain? If so, ouch! Then we have a definition of domain for relational algebra that is radically different from the usual mathematical definition - that would be news to me, if true. I don't believe it makes sense for 1NF, 2NF, 3NF, BCNF or 4NF either, and of course DKNF is just a Ron Fagin fantasy (Chris Date has described the pointlessness of the DKNF definition very clearly, surely you've seen that?) which is guaranteed not to be obtainable for almost any real database (not just "not practically obtainable" but "provably unobtainable") - "almost any real database" here means every nontrivial database and most of the trivial ones too. For 5NF and PJ/NF (why list both, when they are the same thing?) I don't see any problem either.

    In the design of a database one does not normalise views. In fact it is utter nonsense to want to talk about 3NF for views.

    It certainly isn't nonsense in the relational model without nulls. In the absence of nulls a view is a derived relation and a derived relation appears the same and has all the same essential properties as any base relation. Normalization is orthogonal to whether relations are base or derived and the database designer ought to be able to apply normalization to any relation schema without concerning himself about whether relations are base or not. If normalization theory is rendered useless (as I think it must be) by the presence of derived Codd tables then that is a problem caused by Codd's model of keyless tables. It is not a problem that exists in the relational model without nulls.

    True. But the relational model without NULL is not a model which is of sufficiently general applicability to be really useful, so even if it were a fact that it can consider normalisation of views that would be rather irrelevant; however, it can't. I have never seen any suggestion for rewriting queries in such a way that the output is guaranteed to be in 3NF (or even 2NF), and none of the databases I've seen that claim to implement the null-free relational model provides any feature to make that happen (a view would have to be a multitude of relations and constraints connecting them, for that to be possible). Suppose there is a base relation R with attributes named A,B,C,D,E,F (all using the integer domain), no nulls allowed, the relation is in 5NF and the primary key is (A,B,C); look at the operation expressed in SQL by

    create view V as select A, B, C, D, E, F from R where D = A+B and E = B-C and F = C*A

    The derived relation V that this produces can't be in 3NF because it violates 2NF at least 3 times: (A,B,C) is the primary key of V, but each of D, E, and F is dependent on only 2 of the 3 key attributes. But that operation is a perfectly valid one in every null-excluding system I have encountered - in fact it is surely a fundamental requirement of the relational model that such a restriction operation can be implemented. Obviously I could write much nastier queries than that, where the output (in order to be in 3NF) would have to be broken into dozens of relations, but it seems reasonable to restrict myself to this extremely simple example (which requires splitting into at least 3 separate derived tables to reach 2NF - and of course almost certainly what the application using the database wants is not several separate relations that it has to join to get the single relation it needs; it wants the single relation and is not interested in 2NF).
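The partial dependencies in V can be checked mechanically. A sketch (in Python rather than SQL, purely illustrative) that materialises rows with the shape V imposes over a small range and verifies each non-key attribute depends on a proper subset of the key (A,B,C):

```python
from itertools import product

# any row of V has this shape, by the view's restriction predicate
V = [(a, b, c, a + b, b - c, c * a) for a, b, c in product(range(3), repeat=3)]

def fd_holds(rel, lhs, rhs):
    """Check the functional dependency lhs -> rhs over column positions."""
    seen = {}
    for row in rel:
        key = tuple(row[i] for i in lhs)
        val = tuple(row[i] for i in rhs)
        if seen.setdefault(key, val) != val:
            return False
    return True

# columns: 0=A 1=B 2=C 3=D 4=E 5=F
assert fd_holds(V, (0, 1), (3,))   # (A,B) -> D : a partial dependency
assert fd_holds(V, (1, 2), (4,))   # (B,C) -> E : a partial dependency
assert fd_holds(V, (2, 0), (5,))   # (C,A) -> F : a partial dependency
```

Each assertion passing confirms a 2NF violation: a non-key attribute functionally dependent on only part of the key.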

    Tom

  • David Portas (5/10/2011)


    Belatedly I realised that normalization isn't entirely defeated by Codd tables without keys. Given the definition of JDs already discussed, the Codd table T(a) would satisfy the JD * (a,{}) where {} is a projection on the empty set of attributes. This is a trivial join dependency. But if T doesn't have any keys then this trivial JD must be a JD not implied by keys. So T is not in PJ/NF.

    Don't forget that Codd required both that every base table have a primary key and that no attribute included in a primary key permit NULL.

    Since there is no key, you are talking about a derived table here, and derived tables are often not in PJ/NF even when NULLs are not allowed - see my post with an example of the restriction operator delivering, from a 5NF table (with no nulls permitted), a derived table (also with no nulls permitted) which isn't even in 2NF, let alone in PJ/NF. The presence of NULLs makes no significant difference to questions of normality for derived tables - unless of course we include the requirement for a primary key in the definition of 1NF and don't include a requirement that that primary key be meaningful, in which case 1NF is a form which the null-free model can guarantee in derived relations while the null-allowing model can't; of course, if we include a requirement for a meaningful primary key, neither model can guarantee it.

    Tom

  • Tom.Thomson (5/10/2011)


    Well, presumably one of the applicable values that a-null could be is 0, and it surely is not true that 0/0 delivers 0 (it should deliver something like "error: indeterminate value" or "error: zero-divide" if it's a bit less good at distinguishing between errors);

    Yes but that's just a neat way to sidestep the real problem we were talking about. If I'd said z * x = 0 instead of z/x = 0 would you be any more likely to agree that MVL without tautology detection is just an accident waiting to happen?

    You have answered my main argument about any hypothetical system which can support both Normalization and nulls at the same time:

    it won't be SQL and it won't be RM2. Nor will it be RM-T.

    In other words it is none of the systems that people are likely to be talking about on SSC. If you have your own definition of such a system then please publish it somewhere so that I can understand and scrutinise it properly. The space in this forum is certainly too small for that - after all Codd wrote 500 pages and still didn't manage it! Until you've done that I'm going to remain as sceptical as anyone else with a scientific mind should. I, and millions of others who know of no such system, are going to stick with what is presently known: that normal forms as originally defined have nothing to do with any table with a null in it.

    For 5NF and PJ/NF (why list both, when they are the same thing?)

    According to some sources (e.g. David Maier, Terry Halpin and this) they are not the same thing. I mentioned PJ/NF to make it clear that I was talking about Fagin's original definition and not any other.

    Are you claiming for example that the domain consisting of the (fully defined) integers plus "bottom" is not a valid domain?

    It's clear to me right from Codd's first papers and I think from everything else I've ever read about the relational model that what RM calls a domain is a set, not a poset. A relational domain as conventionally defined is not the same as what domain theory calls a domain.

  • fourteen, fourteen pages already on this thread!!! :w00t:

    There is an abundant amount of information in the posts of this thread but I'm afraid most people wouldn't read it from start to end, so the same questions, issues, comments and answers come up again and again.

    Don't you think it is time to close this one and start new, more specific threads on data normalization?

    Just my two cents. 🙂

    _____________________________________
    Pablo (Paul) Berzukov

    Author of Understanding Database Administration available at Amazon and other bookstores.

    Disclaimer: Advice is provided to the best of my knowledge but no implicit or explicit warranties are provided. Since the advisor explicitly encourages testing any and all suggestions on a test, non-production environment, the advisor should not be held liable or responsible for any actions taken based on the given advice.
  • PaulB-TheOneAndOnly (6/23/2011)


    fourteen, fourteen pages already on this thread!!! :w00t:

    There is an abundant amount of information in the posts of this thread but I'm afraid most people wouldn't read it from start to end, so the same questions, issues, comments and answers come up again and again.

    Don't you think it is time to close this one and start new, more specific threads on data normalization?

    Just my two cents. 🙂

    Heh... nah... that would normalize the thread. 🙂 Besides, it's only 3 pages long when you have it set to return 50 posts per page. :hehe:

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • David Portas (6/21/2011)


    Tom.Thomson (5/10/2011)


    Well, presumably one of the applicable values that a-null could be is 0, and it surely is not true that 0/0 delivers 0 (it should deliver something like "error: indeterminate value" or "error: zero-divide" if it's a bit less good at distinguishing between errors);

    Yes but that's just a neat way to sidestep the real problem we were talking about. If I'd said z * x = 0 instead of z/x = 0 would you be any more likely to agree that MVL without tautology detection is just an accident waiting to happen?

    No. A better solution to that might be to define the results of operations on NULL more carefully - perhaps so that NULL/0 throws an error and NULL*0 is 0. But the best solution of all is for the programmer to be aware of what he is doing: if he is handling data that may involve nulls, he should know that, and write (x=0) or (z=0) instead of z*x=0, or with more complex queries perhaps resort to maybe-joins, maybe-unions, and maybe-restrictions and use their differences from the true-join, true-union, and true-restrict to indicate where there are cases that he needs to take care of. Of course this means that programming a database system that allows NULL is difficult, but so are many programming tasks, and all the anti-null rhetoric in the world does not detract from the fact that the avoidance of NULLs is not always possible in the real world, because we don't always have complete information. Restructuring for NULL-avoidance will often result in a schema so complex - with any serious application requiring extremely complex joins and extremely complex multiple record-set (aka derived relation) results - that database programming becomes a dreadful nightmare, both at the relational calculus level (whatever replaces SQL) and at the application level (which has to handle the complex result structure), and no currently understood technology could deliver decent performance from such a schema.
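Tom's rewriting advice is easy to illustrate with a sketch (Python, toy evaluators of my own) under a Kleene 3VL: the naive predicate z*x = 0 is UNKNOWN whenever x is NULL, but the programmer's rewrite (z = 0) or (x = 0) yields a definite True when z = 0, whatever x is:

```python
# None plays the role of NULL; comparisons with NULL yield UNKNOWN (None).

def eq_zero(v):
    return None if v is None else v == 0

def naive(z, x):
    """z * x = 0 with NULL propagation: UNKNOWN if either operand is NULL."""
    return None if (z is None or x is None) else (z * x == 0)

def rewritten(z, x):
    """(z = 0) or (x = 0) under Kleene OR: TRUE absorbs UNKNOWN."""
    a, b = eq_zero(z), eq_zero(x)
    if a is True or b is True:
        return True
    if a is None or b is None:
        return None
    return False

# naive(0, None)     -> None  (UNKNOWN: the row is lost by a true-selection)
# rewritten(0, None) -> True  (the row is correctly retained)
```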

    You have answered my main argument about any hypothetical system which can support both Normalization and nulls at the same time:

    it won't be SQL and it won't be RM2. Nor will it be RM-T.

    In other words it is none of the systems that people are likely to be talking about on SSC. If you have your own definition of such a system then please publish it somewhere so that I can understand and scrutinise it properly. The space in this forum is certainly too small for that - after all Codd wrote 500 pages and still didn't manage it! Until you've done that I'm going to remain as sceptical as anyone else with a scientific mind should. I, and millions of others who know of no such system, are going to stick with what is presently known: that normal forms as originally defined have nothing to do with any table with a null in it.

    I'm fed up with hearing that last assertion. Heath and Codd were discussing join operations that included in the output rows that had no match in the partner relation, and the concept of a NULL value to enable these outer joins, and discussing also what was later called "outer union" which also required NULL, back in 1970 and 1971 while they were working on normal forms; Codd makes that quite clear in his "extending the model to capture more meaning" paper. You can't find any work on normalisation earlier than that carried out by Codd and Heath, and they believed that some sort of NULL would be needed for technical (relational model related) reasons when they were doing that work; Codd's later formalisation of NULL was the result of his realising that this technical NULL could also be used to attack the missing data problem.

    And I will state once again, for the record, that nothing in the definitions of 1NF, 2NF, 3NF, EKNF, BCNF, or PJ/NF is influenced in the slightest by permitting NULL values for non-prime attributes (attributes that are not in any candidate key), and point out that I've shown you how this works for PJ/NF and that working for PJ/NF implies that it works for all the others. If you have a refutation of the outline proof I posted before, perhaps you could post it?

    For 5NF and PJ/NF (why list both, when they are the same thing?)

    According to some sources (e.g. David Maier, Terry Halpin and this) they are not the same thing. I mentioned PJ/NF to make it clear that I was talking about Fagin's original definition and not any other.

    Ok, I see that 5NF (like 6NF) has too many definitions. I need to look at the three variants of 5NF that a quick skim of that paper suggests exist and see what implications, if any, the two extra variants have for non-prime attributes allowing NULL.

    Tom

  • Jeff Moden (6/26/2011)


    PaulB-TheOneAndOnly (6/23/2011)


    fourteen, fourteen pages already on this thread!!! :w00t:

    There is an abundant amount of information in the posts of this thread but I'm afraid most people wouldn't read it from start to end, so the same questions, issues, comments and answers come up again and again.

    Don't you think it is time to close this one and start new, more specific threads on data normalization?

    Just my two cents. 🙂

    Heh... nah... that would normalize the thread. 🙂 Besides, it's only 3 pages long when you have it set to return 50 posts per page. :hehe:

    Let's push the concept a bit forward and go back to the "single table with no indexes database model" ... don't complain, everything is in there, hello?! 😀

  • PaulB-TheOneAndOnly (6/23/2011)


    fourteen, fourteen pages already on this thread!!! :w00t:

    There is an abundant amount of information in the posts of this thread but I'm afraid most people wouldn't read it from start to end, so the same questions, issues, comments and answers come up again and again.

    Don't you think it is time to close this one and start new, more specific threads on data normalization?

    Just my two cents. 🙂

    Probably a good idea; this thread was hijacked to present the fundamentalist anti-null view in the 5th message, and has never really recovered - apart from a brief diversion into the definition of 1NF, the whole thread has been about that single pointless debate.

    Tom

  • On the ordering of rows and 1NF...

    Is the disconnect here not between the idea of rows which are ordered when used, and rows which are inherently ordered? If I have a table:

    create table addresses (
        address_line_index tinyint not null
            check (address_line_index between 1 and 3),
        ...
    )

    Then the address_line_index represents an attribute by which I may choose to meaningfully order the rows when selecting from the table, but does not represent an inherent ordering to the rows in the table. The conceptual relation for this table itself is thus "unordered".

    Similarly, would it violate normal form to have a log table, of the form:

    create table event_log (
        event_datetime datetime not null,
        ...
    )

    Where events are inserted into the log in date ascending order (because this is the order of events as they occur)? I don't think anyone would say this violates 1NF but, just as before, we can select the contents of the table in a meaningful order using the event_datetime column.

    I take Gus's point on update atomicity and agree that this poses a problem for the addresses structure. I suppose one could argue that the address_line_index, being a well defined domain, contains meaningful values and it would make no sense to insert a new row between 1 and 2. Having said that, this also suggests that "line 1" really means "the street address" part, and "line 2" means "the suburb part", which begs the question as to why they are being stored on different rows as opposed to separate attributes on a single row (returning us full circle to the null/combinatorial explosion discussion).

    Ultimately, this "row per address line" structure seems like a kind of specialized entity-attribute-value structure. The specialization, whereby all attributes are for one entity and all attributes are related (common use of the term), makes this look less "EAV-like" than a more common, generic EAV table.

    My own personal opinion? This table is in 1NF, just as the event_log table is in 1NF. Our ability to extract rows in a meaningful order does not violate the principle that the relation is unordered. I do not, however, think that the EAV pattern is a good one, and the point Gus made about update atomicity is certainly one reason for that.

    edit: /necro
