Normalization

  • David Portas (1/22/2010)


    Since it is always possible to decompose further and eliminate nulls they are not formally necessary either - you can always create a truly normalized design without them.

    This is an extremely naive view supported by neither the theory nor the practicalities.

In order to support that view you would have to insist that every domain had an entity identifier which meant "I'm not an entity identifier but an indicator that the entity which should be identified when I occur is unknown, although it is certain that there should be such an identifier". That's just a specialised NULL, one of several specialised NULLs that are essential if we want to represent our knowledge of some relations accurately (we can't consider representing the relations themselves accurately unless we assume we have perfect knowledge of them, which will almost always be a really stupid mistake to make; all we can do in practice is represent our knowledge of them).

Consider a two column table WidgetName with columns "widget_id" and "widget_principal_name_id". Now let's say that we always know whether a widget has or does not have a principal name, and we deal with the case where a widget doesn't have a principal name by not letting that widget's identifier occur in the widget_id column of the WidgetName table. Also, there are some widgets which we know do have principal names, and we need to represent that knowledge, but we don't know which principal names they have; we can't represent that by leaving that widget out of the WidgetName table (that would mean it had no principal name), so we represent it by putting the widget_id in the table with the widget_principal_name_id in that row having the value null.

Now we can get the principal names of all the widgets for which we know the principal name using an inner join; discover which widgets don't have principal names by projection and set subtraction (a one column projection of the Widget table has a one column projection of the WidgetName table subtracted from it - in SQL subtraction is done either by selection and projection from a one-sided outer join or by using "not exists"); and we can discover the widgets which have an unknown principal name by restriction of the WidgetName table followed by projection. I'd like to see you explain (a) how you would do this without any sort of NULL and (b) in what sense this two column table is not fully normalised.
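For concreteness, here is a minimal SQL sketch of those three queries under the nullable design. The Widget and PrincipalName tables and their column names are assumptions for illustration, since only WidgetName is defined in the example.

-- (1) Principal names of all widgets whose principal name is known:
--     rows with a null widget_principal_name_id simply drop out of the inner join.
SELECT wn.widget_id, pn.name
FROM WidgetName AS wn
INNER JOIN PrincipalName AS pn
    ON pn.principal_name_id = wn.widget_principal_name_id;

-- (2) Widgets which have no principal name at all: set subtraction,
--     expressed here with NOT EXISTS.
SELECT w.widget_id
FROM Widget AS w
WHERE NOT EXISTS (SELECT 1 FROM WidgetName AS wn WHERE wn.widget_id = w.widget_id);

-- (3) Widgets known to have a principal name which is currently unknown:
--     restriction on the null followed by projection.
SELECT wn.widget_id
FROM WidgetName AS wn
WHERE wn.widget_principal_name_id IS NULL;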

And don't suggest adding a "has_principal_name" attribute to some other table - yes it works, but it is inefficient and introduces exactly the kind of data redundancy and overhead on updates that normalisation is supposed to eliminate; it might be a good trade-off if more than 50% of widgets had unknown principal names, and in that case I would call it a useful denormalisation. Remember that in a fully normalised schema some tables represent our knowledge about relations (that's why we call it a "relational" system) while others represent our knowledge about the entities which populate the domains of the relations, and that deciding to move information about relations into tables representing entities is most certainly denormalisation.

    Tom

  • Tom.Thomson (3/2/2010)


    I'd like to see you explain (a) how you would do this without any sort of NULL and (b) in what sense this two column table is not fully normalised.

    (a)

    CREATE TABLE WidgetWithName (widget_id INT NOT NULL PRIMARY KEY /* REFERENCES Widget ?? */);

    CREATE TABLE WidgetName (widget_id INT NOT NULL REFERENCES WidgetWithName (widget_id), widget_principal_name_id INT NOT NULL);

Using a null in place of a value that exists but is unknown would be a less accurate representation because null in SQL does not accurately represent the unknown case. In my example I can accurately represent a widget with a name attribute, a widget without one, and a widget with a name which is not known.
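For comparison, a sketch of the same three queries under this decomposition (again assuming supporting Widget and PrincipalName tables that are not part of the example):

-- (1) Widgets whose principal name is known.
SELECT wn.widget_id, pn.name
FROM WidgetName AS wn
INNER JOIN PrincipalName AS pn
    ON pn.principal_name_id = wn.widget_principal_name_id;

-- (2) Widgets which have no principal name: absent from WidgetWithName.
SELECT w.widget_id
FROM Widget AS w
WHERE NOT EXISTS (SELECT 1 FROM WidgetWithName AS wwn WHERE wwn.widget_id = w.widget_id);

-- (3) Widgets which have a principal name that is not yet known:
--     present in WidgetWithName but absent from WidgetName.
SELECT wwn.widget_id
FROM WidgetWithName AS wwn
WHERE NOT EXISTS (SELECT 1 FROM WidgetName AS wn WHERE wn.widget_id = wwn.widget_id);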

    (b) No table with a null is normalized because all the classic normal forms are defined only for relations without nulls (see E.F.Codd quoted earlier in this thread). To have a sound basis for saying that a table with a null is normalized you would therefore have to reinvent those definitions, in which case I think you ought to state clearly what definitions you are going to use. For example what is a join dependency on a table with nulls? In SQL "nulls don't join" therefore a SQL table with a null is not the join of the projections on its superkeys and therefore it is not in Fifth Normal Form according to the conventional definition of a join dependency.

  • The main problem with Nulls that I have in normalizing tables is that Null ends up being used for both unknown and no-value. I'd rather have two versions, one for unknown, one for no-value.

    For strings, a zero-length string works for no-value. But not for dates, nor for numbers. 0 works for no-value for cardinal numbers, but not necessarily for ordinals. And there really isn't a no-value that I know of for dates. You can hack that one with Jeff's solution of an arbitrary date (usually either the top value or some value too low to be real), but that's only a valid solution if your date range doesn't include the value designated, and it reduces dates to arbitrary pointers to values, instead of atomic values on their own.
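As a rough illustration of that hack (the table, column and sentinel value are invented here), the sentinel has to sit outside the legitimate date range, and every query has to remember to filter it out:

-- A "magic" date standing in for "no value"; it must never be a real review date.
CREATE TABLE WidgetReview (
    widget_id   INT  NOT NULL PRIMARY KEY,
    review_date DATE NOT NULL DEFAULT '9999-12-31'
);

-- Every query over real dates has to exclude the sentinel explicitly.
SELECT widget_id, review_date
FROM WidgetReview
WHERE review_date <> '9999-12-31';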

Beyond that, normalization by extending optional values into sub-tables works so long as you have a manageable set of options.

I've seen some really "normalized" models for people in data that went too far and became significantly unmanageable. For example, names. I have 3: personal, middle, family. But I also have a nickname, and a "most commonly goes by on forms" name. Most databases model this with "first, middle, last", and miss out on potentially important information. My legal name is "Richard Augustin Gwynne", but nobody calls me "Richard", not even my parents. I go by "Gus". On forms that want a first name and a middle initial, it's "Gus A", which doesn't even make sense (since the "A" is the initial for "Augustin", which "Gus" is short for), but it's what appears on a large number of legally binding documents and is legally acceptable. So I occasionally get two pieces of mail, one addressed to "Richard A", one to "Gus A", both at the same address. None of the usual methods of deduping names will catch that one, since "Gus" isn't normally synonymous with "Richard".

    Some models would split the name vertically and use a sequence number to indicate their order. So, with my "ID" (surrogate key of whatever sort), the data would be sequence 1 = Richard, sequence 1 = Gus... oops, just broke that model, since I have two different valid first names.

So it has to be modeled as not only sequence, but valid combinations, so combo 1, sequence 1 = Richard; combo 1, sequence 2 = Augustin; combo 1, sequence 3 = Gwynne; combo 2, sequence 1 = Richard; combo 2, sequence 2 = A; combo 2, sequence 3 = Gwynne; combo 3, sequence 1 = Richard; etc. All the "valid" combinations add up to quite a few rows of data. Reconstructing those requires extensive code compared to other models, but it does achieve the greatest precision. At the cost of often having precision exceed accuracy, which is a waste of time and effort. It also makes for a very, very complex front-end application and a steep learning curve for end users. After all, you can't just split on spaces or some such, since "Mac Donald" is one name-segment in some cases, if "Mac" is a modifier for "Donald", and is two segments if "Mac" is a nickname or some such and stands on its own.
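A rough sketch of that segment/combination model (table and column names are invented for illustration):

CREATE TABLE PersonNameSegment (
    person_id INT          NOT NULL,
    combo     INT          NOT NULL,  -- which valid name combination this row belongs to
    seq       INT          NOT NULL,  -- position of the segment within the combination
    segment   VARCHAR(100) NOT NULL,
    PRIMARY KEY (person_id, combo, seq)
);

-- Combination 1 = "Richard Augustin Gwynne", combination 2 = "Richard A Gwynne",
-- and so on for "Gus A Gwynne" and the rest of the valid combinations.
INSERT INTO PersonNameSegment (person_id, combo, seq, segment) VALUES
    (1, 1, 1, 'Richard'), (1, 1, 2, 'Augustin'), (1, 1, 3, 'Gwynne'),
    (1, 2, 1, 'Richard'), (1, 2, 2, 'A'),        (1, 2, 3, 'Gwynne');

-- Reconstituting a display name means reassembling the segments in order.
SELECT person_id, combo, seq, segment
FROM PersonNameSegment
ORDER BY person_id, combo, seq;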

    The main advantage to such a system is that it holds "John Smith" just as easily as "Cyrano II de Savignien II de Bergerac III" (I think I misspelled his name badly, but the point is still valid). The main disadvantage is it also makes it just as difficult to enter into the system and to reconstitute from the system.

    Another flawed model is "First, Last, Full". I've seen that quite often. Again, has problems and advantages.

Allowing some null columns makes it much easier to deal with all around.

    Very few things in this world are as cut-and-dry in data modeling as many people would like to think. Even gender is complicated by things like California laws and genetic flaws. (What gender, for example, is someone who is diploidy X haploidy Y, with mixed expression on primary sexual characteristics? That's not a legal oddities question, that's a reproductive-function, biology question, and not necessarily easy to answer.)

    Personally, I work for whatever achieves the greatest efficiency, scalability, speed, and is easiest to work with for devs and end-users. Always, that's been a compromise. It's never been "fully normalized by relationally pure definitions", but it's been as close as I can get within the real constraints of the data.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

Nulls in join relationships are the problem; otherwise any sensible adaptation would do. If it's just tackling name problems, such as first, last, middle initial, etc., use something that does not alter the main join properties of the person entity. As others have pointed out, there are techniques for getting at uniqueness, identifying duplicates, parts of a name, and such. But that issue should not complicate the design of the main relationships using the relational model.

I always took the warnings about nulls from C. J. Date and others to be about the first kind -- nulls in fields participating in relationships undermine the integrity of the relational model and require too many error-prone workarounds.

  • David,

I think that before dealing with your reply I should quote from the eminent authority to whom you appeal, just to dismiss the idea that I am disagreeing with him while you agree with him (perhaps you only read material from 1969, or maybe as late as 1972, and ignored everything he did after that?).

    Edward Frank Codd (14/10/1985)


    Rule 3: Systematic treatment of null values:

    The DBMS must allow each field to remain null (or empty). Specifically, it must support a representation of "missing information and inapplicable information" that is systematic, distinct from all regular values (for example, "distinct from zero or any other number", in the case of numeric values), and independent of data type. It is also implied that such representations must be manipulated by the DBMS in a systematic way.

    I think that puts paid, quite clearly, to the notion that Codd's relational model doesn't permit Nulls. In fact his criteria for a database being relational include a definite requirement that NULL be available.

    David Portas (3/2/2010)


    (a)

    CREATE TABLE WidgetWithName (widget_id INT NOT NULL PRIMARY KEY /* REFERENCES Widget ?? */);

    CREATE TABLE WidgetName (widget_id INT NOT NULL REFERENCES WidgetWithName (widget_id), widget_principal_name_id INT NOT NULL);

Using a null in place of a value that exists but is unknown would be a less accurate representation because null in SQL does not accurately represent the unknown case. In my example I can accurately represent a widget with a name attribute, a widget without one, and a widget with a name which is not known.

    Let's describe what you have done in terms of Codd's RM/T: to avoid using a NULL in a P-relation you have introduced a new E-relation which is a restriction of an existing E-relation. Do you really think that's acceptable? I'm sure Chris Date would accept it (he's an anti-null nutcase), and I'm equally sure that Ted Codd would have rejected it as a confusion of domains.

    David Portas (3/2/2010)


    (b) No table with a null is normalized because all the classic normal forms are defined only for relations without nulls (see E.F.Codd quoted earlier in this thread). To have a sound basis for saying that a table with a null is normalized you would therefore have to reinvent those definitions, in which case I think you ought to state clearly what definitions you are going to use. For example what is a join dependency on a table with nulls? In SQL "nulls don't join" therefore a SQL table with a null is not the join of the projections on its superkeys and therefore it is not in Fifth Normal Form according to the conventional definition of a join dependency.

    Try not to tell me "see Codd" when talking about something introduced by Ron Fagin and never touched by Codd. You just lose credibility by doing that. Codd defined 1NF, 2NF, 3NF and of course (with Ray Boyce) BCNF. Fagin invented 4NF, 5NF and DKNF. And Date invented 6NF. Codd couldn't accept 5NF as a requirement because it forbids the NULLs which he required to be included in the relational model.

Codd's publication of his 12 rules defining a relational database led of course to his exit from IBM, because it was regarded as heretical (he didn't mind being considered heretical - IBM had treated him as a dangerous nut since the early 70s, a fact which today IBM will of course deny, and ensured that he had no influence over the details of its "relational" database project), since it clearly showed that neither System R nor SQL/DS was relational; Fagin of course remained at IBM Almaden.

    Tom

  • The article by Codd that I quoted was talking about normalization of FDs and MVDs generally, not specifically 5NF. Boyce, Codd and Fagin's work on Normal Forms all deal only with relations without nulls. That includes 2NF, 3NF, 4NF, BCNF. I'm not aware of any work on Normal Forms which include nulls. If you know differently then I'd appreciate a reference where I can read about it because it isn't in any of the works I have seen cited before.

    The ideas in RM/T are interesting but I think it's fair to say that they have since been superseded by much more comprehensive work on semantic modelling, such as that by Nijssen and Halpin. 5NF is of fundamental importance in my view because it avoids issues like Connection Traps and other types of ambiguity that demonstrably lead to incorrect results. The only alternative in SQL is to add some complex constraints that may be difficult or impossible to implement in some cases.

    The implementation of nulls in SQL (through no fault of Codd's of course) is so problematic that I find cautious design to be the best policy. So my approach is to design a schema in 5NF and then denormalize or add nulls only where there is some special advantage to doing so or where SQL somehow forces a null on me. Would you suggest a different approach and what design rules do you apply? Something other than 5NF?

    Fundamentally the database is a collection of propositions about the Universe of Discourse. That is how it is viewed by business stakeholders and by data consumers who use it. Since all propositions can be stated as values (ie. without nulls) any nulls are the creation of and the responsibility of the database designer. It has always seemed important to me that when a designer adds a null to a data model he should be clear in his mind about why he is doing so - especially given the obvious ambiguity in SQL where null is used to represent many different and contradictory things in different places. That's why the Normalize First / Add Nulls Later approach works for me. Reasonable people differ of course. 🙂

  • David Portas (3/3/2010)


    The ideas in RM/T are interesting but I think it's fair to say that they have since been superseded by much more comprehensive work on semantic modelling, such as that by Nijssen and Halpin.

    I'm afraid that in my view much (not all, but much) work on the relational model since Codd has been chasing moonbeams.

    5NF is of fundamental importance in my view because it avoids issues like Connection Traps and other types of ambiguity that demonstrably lead to incorrect results. The only alternative in SQL is to add some complex constraints that may be difficult or impossible to implement in some cases.

Way back in 1976 Phil Bernstein proposed three criteria for schema generation: 1. Complete representation: all functional dependencies should be truly represented in the schema. 2. Minimality: no schema with fewer relations represents all functional dependencies. 3. Classical normality: the schema must not violate 3NF. I think those were extremely good criteria. He published an algorithm for generating schemata from functional dependency sets that satisfy these criteria. A later paper with Beeri and Goodman as co-authors proposed an extra principle ("separation") which is a bit stronger than 3NF, and it turned out that schemata generated by Bernstein's algorithm provide much better separation than 3NF.

Bernstein's algorithm has low computational complexity (quadratic in the size of the basis of the functional dependency set), whereas so far as I know there is no useful algorithm for generating schemata with BCNF, 4NF, 5NF or 6NF properties from the dependencies (there are known algorithms, but they all have high computational complexity). The trouble with BCNF and 4NF is that neither of them provides complete representation. This has been known since the late 1970s. Since 5NF is stronger than 4NF, it was known the day it was proposed that it could not provide complete representation.

    The implementation of nulls in SQL (through no fault of Codd's of course) is so problematic that I find cautious design to be the best policy.

    I agree, naturally, since SQL got NULL hopelessly wrong.

    So my approach is to design a schema in 5NF and then denormalize or add nulls only where there is some special advantage to doing so or where SQL somehow forces a null on me. Would you suggest a different approach and what design rules do you apply? Something other than 5NF?

In principle, I start by listing all functional dependencies and using Bernstein's algorithm to generate a schema. In practice I usually list all the functional dependencies, and then work out an Elementary Key Normal Form schema (EKNF is intermediate between 3NF and BCNF; the name was invented by Zaniolo in 1982, but Bernstein's 1976 algorithm already generated schemata conforming to EKNF) without following the algorithm.

Having this first draft schema, I then look for row-generating dependencies (multi-valued dependencies) and join dependencies; each one of these has to be considered on its merits. Maybe complete representation of the functional dependencies can be sacrificed to some extent to alleviate the problems of programming around them, or maybe the right answer is to deal with them using triggers, or to document them loudly and clearly and subsequently ensure that all code takes account of them (it's horses for courses). I also look at all candidate keys that are non-trivial subsets of the table's attribute set (i.e. neither a single column nor all columns) and determine whether I should introduce a new domain to represent that key.

Obviously I can't allow nulls in attributes that form part of a key which is not a proper superset of another key, nor indeed of any candidate key that I wish to use as a key (in SQL terms, as the subject of a uniqueness constraint or of a primary key constraint). There's no imaginable reason for not allowing nulls for other attributes, except that SQL has failed to distinguish between the different sorts of NULL and you have to be very sure you know which kind of null you are allowing for each attribute where null is permitted.

    Fundamentally the database is a collection of propositions about the Universe of Discourse. That is how it is viewed by business stakeholders and by data consumers who use it. Since all propositions can be stated as values (ie. without nulls) any nulls are the creation of and the responsibility of the database designer.

Oh no they aren't. They are a property of the real world. If something is unknown, it is unknown - end of story. Business stakeholders are generally well aware that there will be things that they don't know and don't want to use an approximation for because someone might assume the approximation was the real value. If we changed all the domains so that, for example, instead of the integer domain we used the union of the integer domain with the two-element domain ("I don't know yet", "this information was not collected") and perhaps a few other values (like "input validation found an error in this field but we can't afford to throw the record away") we could do without NULLs (although of course when Codd wanted NULLs that was exactly what he was proposing should be done, so would we then be doing without NULLs? Certainly the constraint I mentioned above about not using NULL in keys applies equally to these values!).
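As a sketch only (the table, codes and constraint here are invented), one common SQL approximation of such an extended domain is a value column paired with a status column, with a CHECK constraint tying the two together:

CREATE TABLE WidgetWeight (
    widget_id     INT     NOT NULL PRIMARY KEY,
    weight_kg     INT     NULL,
    weight_status CHAR(1) NOT NULL DEFAULT 'K',  -- 'K' known, 'U' not known yet, 'N' not collected
    CHECK (weight_status IN ('K', 'U', 'N')),
    CHECK ((weight_status = 'K' AND weight_kg IS NOT NULL)
        OR (weight_status <> 'K' AND weight_kg IS NULL))
);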

    It has always seemed important to me that when a designer adds a null to a data model he should be clear in his mind about why he is doing so

    Indeed he should be clear about what he is doing whenever he adds anything to a data model, or changes anything, or removes anything - nulls are no different.

    - especially given the obvious ambiguity in SQL where null is used to represent many different and contradictory things in different places.

    Too true!

    That's why the Normalize First / Add Nulls Later approach works for me. Reasonable people differ of course. 🙂

Well, as noted above it depends on how far you normalise - in cases where there are complex multi-valued dependencies going to 4NF will often introduce some really nasty problems. But as you say, reasonable people differ on this.

    Tom

  • Tom.Thomson (3/7/2010)


    If something is unknown, it is unknown - end of story. Business stakeholders are generally well aware that there will be things that they don't know and don't want to use an approximation for because someone might assume the approximation was the real value.

    I definitely agree. But Null in SQL is an approximation of unknown - and a pretty poor one at that. The behaviour of null in the SQL language does not accurately reflect that of a value being unknown in the real world. So it is already a compromise with reality to force a null upon the user for that situation. If a value is unknown then it is more accurate to leave that value out of the database altogether rather than try to approximate it with a null or some "magic" value. The user then knows he only has to deal with known values and any ambiguity is avoided.

I'm glad you mentioned dependency preservation as a potential reason to prefer 3NF over higher normal forms. I agree that's a good reason and it's one that doesn't seem to get much attention from database professionals... but then many of them don't pay much attention to NFs greater than 3 anyway. The allure of the Rule of Three is too great 🙂

  • GSquared (3/2/2010)


    Very few things in this world are as cut-and-dry in data modeling as many people would like to think. Even gender is complicated by things like California laws and genetic flaws. (What gender, for example, is someone who is diploidy X haploidy Y, with mixed expression on primary sexual characteristics? That's not a legal oddities question, that's a reproductive-function, biology question, and not necessarily easy to answer.)

    I've snipped this part of your post out to cover it in more detail as it raises an interesting point. Hopefully the forum will not eat my reply this time, naughty forum.

    Rather than this being a complicated question it is more a demonstration of poor requirements capture. People often want to capture a gender field, and they believe that it gives them valuable information. When you think about it more carefully there is very rarely a good reason to capture gender data as it doesn't necessarily tell you what you want to know.

Imagine that you own a clothing shop and want to capture gender information to target clothes to a person. Capturing a default clothing gender is useful; however, capturing a gender preference or legal gender isn't. The reason for this is that if I were to register to buy clothes for my wife (more likely to be the other way round) then I don't want to see clothes based on my gender, but on hers. Giving me the choice to see men's clothes, women's clothes, or both makes a lot more sense than targeting my actual gender.

    There are times that you have to capture legal gender, however expecting to use that data in the future as a basis for decisions (ie which ward to place a patient on) is risky. For the times that you are required to capture legal gender it is probably also a good idea to capture (or at least consider capturing) preferred gender, especially if you are intending your system to be used in multiple areas and for a long time. That way the location could be used to return the gender, allowing compliance with California style legislation as it spreads around the world.

    As such I don't think the problem is all that serious, it just requires people to capture the information that they need rather than assuming that legal gender is linked to preferences or behaviour.

    Sorry for that, now back to the discussion on normalisation of things other than gender.

  • Kevin Butterworth (3/15/2010)


    That way the location could be used to return the gender, allowing compliance with California style legislation as it spreads around the world.

Heh... Avoid needing to be compliant with something out of control to begin with. 🙂 Instead of capturing the gender of the human(s), capture the "gender" of the product, which may be "NA" (example... men and women both buy coffee makers for themselves and others, whereas it's mostly women who buy bras :hehe:).

    So far as I'm concerned, it's no one's business as to which gender someone claims to be. It's almost as bad as asking what their sexual preference is. It's private information that should be excluded much like age and race should be excluded. The only time that information should be included is in a government Census (and since they've been getting everything wrong, maybe not even then :w00t:).

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • I deal with health data, and for some reason people get really upset when you stick a bloke on a bedpan in the middle of a ward full of women. There is just no avoiding the need to use gender information sometimes.

Personally I see three distinct types of gender information: a legal gender (used to decide which ward people go in), a perceived gender (the gender the person associates themselves with), which would be used in California and other places as that kind of legislation spreads, and a required gender (what type of clothing someone wants to view).
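A minimal sketch of carrying those three attributes separately (table, column names and codes are invented for illustration):

CREATE TABLE PatientGender (
    patient_id       INT     NOT NULL PRIMARY KEY,
    legal_gender     CHAR(1) NOT NULL,  -- drives ward allocation
    perceived_gender CHAR(1) NULL,      -- the gender the person associates themselves with
    required_gender  CHAR(1) NULL       -- e.g. which range of products or services to present
);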

    It doesn't matter what my view of perceived gender is, if I'm designing a health information system that will be used for the next decade plus over multiple areas then I need to start considering it in the designs.

  • sjsubscribe (1/20/2010)


    Certainly the current practice is mixed at best or else no DBMS would allow null values as valid entries in a field. That's what the relational theory purists wants. Rejection refers to that specifically. Hyperbole aside, that's what I got out of the debate anyway.

I certainly didn't get the impression that Fotache was saying that. Not at all. He appeared to be saying three things: (a) there are no decent tools to assist in the use of normalisation as a design method; (b) the literature on normalisation is very theoretical, with few attempts to describe normalisation for practical examples instead of as mathematical theory, and many of the few attempts to relate theory to practical examples contain serious errors; and (c) as a consequence of (a) and (b) very few people use normalisation as a mechanism for schema design (although some use some parts of normalisation as a method of checking and perhaps refining a schema design that they have obtained by some other method). And I believe all three of those things to be true.

Incidentally, I'm a relational purist and I can tell you that what you say relational purists want is quite wrong: I don't think primary key attributes should be allowed to be NULL, I think higher order relations are needed to get a really useful form of the relational calculus, and I think that there are trade-offs that have to be considered very carefully before normalising beyond EKNF (even BCNF may be a step too far) because there are pros and cons to be balanced on a case by case basis.

I regard anti-null fanatics (and also people who can't accept that MVLs - multi-valued logics - are essential in the real world) as bigoted fundamentalists who have lost the relational plot. You have no idea what real relational purists want, because you think that "relational purist" means one of those bigoted idiots who wants to ban all nulls, to ban the use of MVL, insists that everything is in 6NF, and insists that relational calculus must be a flat first order thing (claiming that this is the gospel according to Codd - well, it ain't). A real relational purist (someone who has done the mathematics and knows how it all hangs together) disagrees with all that nonsense.

    Tom

  • David Portas (1/22/2010)


    PaulB-TheOneAndOnly (1/21/2010)


    I can see no conflict between data normalization and accepting null values in non-key columns - as all rdbms out there allow and enforce.

    All the conventional Normal Forms apply exclusively to relations with values, never nulls. In fact normalization predates the invention of SQL and nulls. So you can't say that any table with a null represents a relation in Normal Form.

You must explain to us how allowing nulls in non-key fields causes repeating groups to sprout up instead of atomic values (1NF), or causes a column suddenly to have a non-trivial functional dependency on a proper subset of a candidate key (2NF), or causes there to be an attribute which is not directly dependent on every candidate key (3NF), or introduces a non-trivial functional dependency on something other than a superkey (BCNF), or introduces a non-trivial multivalued dependency on something other than a superkey (4NF), and we can all have a good laugh.

You can't defend your statement quoted above by reference to 5NF or DKNF or 6NF, because you are talking about normalisation that predates the invention of SQL and nulls (in fact that probably means you couldn't use 4NF as an argument either, quite apart from the fact that logically nulls in non-key columns have no effect on 4NF, because SQL predates 4NF and is about contemporary with BCNF). Of course allowing nulls in non-key columns has no effect on 5NF or 6NF anyway; in the case of 5NF and 6NF this is because allowing nulls other than in key columns reduces the number of non-trivial join dependencies, it doesn't increase it (and if there is a join dependency - trivial or not - each nullable column is present in exactly one of the projections).

    If you allow an attribute to be null under normalization, you either have to ignore dependencies involving that attribute or you must make an informal (and incorrect) assumption that null can always be treated like a proper value. Formally speaking nulls just can't work with normalization.

    Those two sentences are counterfactual nonsense. Show us the mathematics that supports these statements! Dana Scott and Christopher Strachey would probably be ("would have been" in Christopher's case, sadly) interested to discover that the whole of domain theory (including topological domain theory) and denotational semantics hung upon a totally flawed concept which Codd called "null" and they (and most other mathematicians) called "bottom". Can you find anything in the famous Cardelli and Wegner survey of type theory (http://lucacardelli.name/Papers/OnUnderstanding.A4.pdf) that suggests a domain or a type can't hold a bottom or null value?

    In particular, there is no clear answer as to what nonloss decomposition or join dependency means for tables that have nulls in them.

A join dependency for a table that permits nulls in some non-key columns is a set of projections on superkeys whose equijoin contains exactly the rows that were present in the original table from which the projections were taken. Is that a clear enough definition for you? If not, you must have difficulty understanding the concept of join dependency for a table which has no nullable columns, since the concept and its definition are identical. It should be clear that (as I said above) each nullable column is included in exactly one of the projections, and it should be clear that if each projection is on a superkey then this join dependency is implied by the table's candidate keys. So the 5NF/PJNF definition is easy to understand for tables which allow nulls in non-key columns.

    So no table with a null is truly normalized. Since it is always possible to decompose further and eliminate nulls they are not formally necessary either - you can always create a truly normalized design without them.

Yes, and I can always guarantee that everything I need to record in the database will be there on day one and that no information will be delayed in arriving, or that the distortions to my schema caused by moving potentially delayed data out into separate relations will all be beneficial and the fairy on top of the Christmas tree will ensure that they won't cause enormous coding problems and maintenance problems and performance deterioration through the hundred-way joins needed to eliminate the nulls. Actually, I live in the real world, so I don't believe in that fairy and I do believe in nulls.

    edit:fix quote tags

    Tom

  • David Portas (1/22/2010)


Irrespective of that, the Normal Forms, up to and including 6NF, don't leave any room for the possibility of nulls. Codd himself says so: "The concepts and rules of functional dependence, multi-valued dependence, and join dependence were all developed without considering missing db-values" ("Missing Information in Relational Databases", 1986).

And he meant what he said, not what you claim he meant (which is clearly not what he said). What he meant was "It was necessary to look at these rules again in the light of having nulls". That's been done, and in none of the normal forms which existed by 1986 (ten years after the point at which you say nulls were invented) did the nulls make the slightest difference.

    Consider for example a join dependency JD *({A,B},{A,B,C}). What does this mean if A or B contain nulls? In SQL "nulls don't join" so the join of any projections including null will have fewer rows than you started with. How then can any SQL table with nulls be equal to the join of the projections on its superkeys (in other words 5NF)? It cannot be and therefore no table with nulls can accurately be described as being in 5NF.

Well, first of all, 5NF doesn't require a table to be the join of the projections on its superkeys; it requires that every non-trivial join dependency is implied by the candidate keys (i.e. every one of the projections in the dependency is on a superkey, NOT that every superkey has a projection on it in the dependency). That didn't have to be amended to allow for NULLs; it was already the standard definition of PJNF by 1986.

Second, if A or B is nullable, you can't have a join dependency in which either A or B occurs in more than one of the projections. So your example makes no sense.

Third, it is easy for a table with nullable columns to have join dependencies. Suppose a table has two candidate keys {X} and {Y}, and additional columns A, B and C, all of which are nullable. Then it's absolutely guaranteed that ({X,Y},{X,A},{Y,B,C}) is a join dependency (the join is clearly loss free), as are several others like ({X,Y},{Y,A},{X,B},{Y,C}) (because if they were not join dependencies, X and Y couldn't be candidate keys).
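A small sketch demonstrating the point (all names here are invented): because the join columns are the candidate keys {X} and {Y}, which can never be null, the join of the projections reproduces the table exactly even though A, B and C contain nulls.

CREATE TABLE T (
    X INT NOT NULL PRIMARY KEY,
    Y INT NOT NULL UNIQUE,
    A INT NULL,
    B INT NULL,
    C INT NULL
);

INSERT INTO T (X, Y, A, B, C) VALUES
    (1, 10, 100, NULL, 300),
    (2, 20, NULL, 201, NULL);

-- Join of the projections on the superkeys {X,Y}, {X,A} and {Y,B,C}:
-- returns exactly the two rows inserted above.
SELECT xy.X, xy.Y, xa.A, ybc.B, ybc.C
FROM       (SELECT X, Y    FROM T) AS xy
INNER JOIN (SELECT X, A    FROM T) AS xa  ON xa.X  = xy.X
INNER JOIN (SELECT Y, B, C FROM T) AS ybc ON ybc.Y = xy.Y;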

    Tom

I started looking at this topic again because it was flagged as having new material I hadn't seen before. I don't know why it was flagged like that, because there was nothing new here. Anyway, some of the early comments triggered me into maybe excessive verbosity. I will shut up for a bit now.

    Tom
