Click here to monitor SSC
SQLServerCentral is supported by Red Gate Software Ltd.
 
Log in  ::  Register  ::  Not logged in
 
 
 
        
Home       Members    Calendar    Who's On


Add to briefcase ««12345»»»

Normalization Expand / Collapse
Author
Message
Posted Saturday, January 23, 2010 3:05 PM


SSCrazy Eights

SSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy Eights

Group: General Forum Members
Last Login: Thursday, December 4, 2014 7:52 AM
Points: 9,294, Visits: 9,495
David Portas (1/22/2010)
PaulB-TheOneAndOnly (1/21/2010)
[quote]I can see no conflict between data normalization and accepting null values in non-key columns - as all rdbms out there allow and enforce.


All the conventional Normal Forms apply exclusively to relations with values, never nulls. In fact normalization predates the invention of SQL and nulls. So you can't say that any table with a null represents a relation in Normal Form.


This just is not true David. E. F. Codd himself championed the use of Nulls in Relational DBMS's and if he saw no problem with it, then I sure do not.


-- RBarryYoung, (302)375-0451 blog: MovingSQL.com, Twitter: @RBarryYoung
Proactive Performance Solutions, Inc.
"Performance is our middle name."
Post #852633
Posted Saturday, January 23, 2010 4:19 PM
SSC-Addicted

SSC-AddictedSSC-AddictedSSC-AddictedSSC-AddictedSSC-AddictedSSC-AddictedSSC-AddictedSSC-Addicted

Group: General Forum Members
Last Login: 2 days ago @ 4:01 AM
Points: 453, Visits: 3,498
RBarryYoung (1/23/2010)
David Portas (1/22/2010)

All the conventional Normal Forms apply exclusively to relations with values, never nulls. In fact normalization predates the invention of SQL and nulls. So you can't say that any table with a null represents a relation in Normal Form.


This just is not true David. E. F. Codd himself championed the use of Nulls in Relational DBMS's and if he saw no problem with it, then I sure do not.


I already mentioned that in later years Codd did champion the use of nulls (more precisely he championed the use of I-marks and A-marks and was less happy with SQL-style nulls). See the references in my previous reply. That doesn't alter the fact that the normalization and normal forms predate the invention of nulls. NFs 2,3,BCNF,4,5,6 therefore deal only with relations with values, never with nulls.

I'm not sure what it is you are disagreeing with. I'm always happy to learn where I'm wrong but I think the chronology is pretty indisputable. The fact that the dependency theory on which normalization is based excludes the possibility of nulls is also pretty clear and uncontroversial as far as I know (see my previous quotation for Codd's own view). What am I wrong about?

P.S. Codd did in fact acknowledge and write about some problems with his later attempts to retro-fit null marks and other features into the RM (his "RM version 2"). Arguably he created more problems than he solved. That may partly explain why as far as I know Codd's RM V2 model has never been implemented and it has been neglected by both the scientific and industry communities.
Post #852646
Posted Saturday, January 23, 2010 10:20 PM


SSC-Dedicated

SSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-Dedicated

Group: General Forum Members
Last Login: Yesterday @ 6:12 PM
Points: 35,808, Visits: 32,482
Jeff Moden (1/23/2010)
So, when YOU create an address table, do you have a "second address" line or do you normalize it? Enquiring minds want to know.


I'd still like an answer to the question above, David.


--Jeff Moden
"RBAR is pronounced "ree-bar" and is a "Modenism" for "Row-By-Agonizing-Row".

First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column."

(play on words) "Just because you CAN do something in T-SQL, doesn't mean you SHOULDN'T." --22 Aug 2013

Helpful Links:
How to post code problems
How to post performance problems
Post #852679
Posted Sunday, January 24, 2010 4:21 AM
SSC-Addicted

SSC-AddictedSSC-AddictedSSC-AddictedSSC-AddictedSSC-AddictedSSC-AddictedSSC-AddictedSSC-Addicted

Group: General Forum Members
Last Login: 2 days ago @ 4:01 AM
Points: 453, Visits: 3,498
Jeff Moden (1/23/2010)
Jeff Moden (1/23/2010)
So, when YOU create an address table, do you have a "second address" line or do you normalize it? Enquiring minds want to know.


I'd still like an answer to the question above, David.


You are asking how I would implement optional parts of an address within the limitations of a fixed set of numbered address line attributes? If an address has fewer lines than there are attributes then I'd populate the extra lines with zero-length strings. Null wouldn't be appropriate because all the parts of the address are known and present and using null would give undesirable results. For example if you compare all the parts of an address in a join or subquery you ought to expect two identical addresses to be returned as equivalent, but that wouldn't work if you start adding nulls into addresses.

I'm not much of a fan of the "AddressLine" design pattern. There are alternatives. For example the BS7666 address standard we have in the UK. Even a single attribute for the whole address (except postal code) can make more sense than having numbered lines in an address table.
Post #852712
Posted Sunday, January 24, 2010 9:20 AM


SSC-Dedicated

SSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-Dedicated

Group: General Forum Members
Last Login: Yesterday @ 6:12 PM
Points: 35,808, Visits: 32,482
I was hoping that you'd say that. I agree. All parts of the address are known is it is known that the 2nd address line has a value of nothing rather than being unknown which is what a NULL would indicate.

In an even more practical sense, the use of NULL's for "end dates" can also be even more bothersome and I'll use the proverbial "end of time" date (9999-12-31) instead of a NULL.

The problem with both of those is, or course, duplication of "non data" similar to the "conventional" use of NULL. Some folks could imply that such duplication in those columns I speak of is justification for additional normalization and that may even be true. It just wouldn't be practical in my humble opinion. That's where I thought you guys were going with all of this and I'm relieved that's not the case... Heh... I didn't want to be a renegade for yet another thing in SQL.


--Jeff Moden
"RBAR is pronounced "ree-bar" and is a "Modenism" for "Row-By-Agonizing-Row".

First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column."

(play on words) "Just because you CAN do something in T-SQL, doesn't mean you SHOULDN'T." --22 Aug 2013

Helpful Links:
How to post code problems
How to post performance problems
Post #852736
Posted Tuesday, March 2, 2010 8:23 AM


SSCertifiable

SSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiable

Group: General Forum Members
Last Login: Tuesday, December 23, 2014 7:18 PM
Points: 7,930, Visits: 9,654
David Portas (1/22/2010)
Since it is always possible to decompose further and eliminate nulls they are not formally necessary either - you can always create a truly normalized design without them.

This is an extremely naive view supported by neither the theory nor the practicalities.

In order to support that view you would have to insist that every domain had an entity identifier which meant "I'm not an entity identifier but instead an indicator that which entity should be identified when I occur is unknown, although it is certain that there should be such an identifier". That's just a specialised NULL, one of several specialised NULLs that are essential if we want to represent our knowledge of some relations accurately (we can't consider representing the relations themselves accurately unless we assume we have perfect knowledge of them, which will almost always be a really stupid mistake to make, all we can do in practise is represent our knowledge of them).

Consider a two column table WidgetName with columns "widget_id" and "widget_principal_name_id". Now lets say that we always know whether a widget has or does not have a principal name, and we deal with the case where a widget doesn't have a principla name by not letting that widget's identifier occur in the Widget_id column of the WidgetName table. Also, there are some widgets which we know do have principal names, and we need to represent that knowledge, but we don't know which principal names they have; we can't represnet that by leaving that widget out of the WidgetName table (that would mean it had no principal name), so we represent it by putting the widget_id in the table with the widget_principal_name_id in that row having the value null. Now we can get the principal names of all the widgets for which we know the principal name using an inner join, discover which widgets don't have principle names by projection and set subtraction (a one column projection of the widget table has a one column projection of the widget_name table subtracted from it - in SQL subtraction is done either by selection and projection from a one-sided outer join or by using "not exists") and we can discover the widgets which have an unknown principal name by restriction of the widget name table followed by projection. I'd like to see you explain (a) how you would do this without any sort of NULL and (b) in what sense this two column table is not fully normalised.

And don't suggest adding a "has_principal_name" attribute to some other table - yes it works, but it is inefficient and introduces exactly the kind of data redundancy and overhead on updates that normalisation is supposed to eliminate; it might be a good trade-off if more than 50% of widgets had unknown principal names, and in that case I would call it a useful denormalisation. Remember that in a fully normalised schema some tables represent our knowledge about relations (that's while we call it a "relational" system) while others represent our knowledge about the entities which populate the domains of the relations, and that deciding to move information about relations into tables representing entities is most certainly denormalisation.


Tom
Post #875205
Posted Tuesday, March 2, 2010 9:00 AM
SSC-Addicted

SSC-AddictedSSC-AddictedSSC-AddictedSSC-AddictedSSC-AddictedSSC-AddictedSSC-AddictedSSC-Addicted

Group: General Forum Members
Last Login: 2 days ago @ 4:01 AM
Points: 453, Visits: 3,498
Tom.Thomson (3/2/2010)
I'd like to see you explain (a) how you would do this without any sort of NULL and (b) in what sense this two column table is not fully normalised.


(a)

CREATE TABLE WidgetWithName (widget_id INT NOT NULL PRIMARY KEY /* REFERENCES Widget ?? */);

CREATE TABLE WidgetName (widget_id INT NOT NULL REFERENCES WidgetWithName (widget_id), widget_principal_name_id INT NOT NULL);

Using a null in place of a value that exists but is unknown would be a less accurate representation because null in SQL does not accurately represent the unknown case. In my example I can accurately represent a widget with a name attribute, a widget without and a widget with a name which is not known.

(b) No table with a null is normalized because all the classic normal forms are defined only for relations without nulls (see E.F.Codd quoted earlier in this thread). To have a sound basis for saying that a table with a null is normalized you would therefore have to reinvent those definitions, in which case I think you ought to state clearly what definitions you are going to use. For example what is a join dependency on a table with nulls? In SQL "nulls don't join" therefore a SQL table with a null is not the join of the projections on its superkeys and therefore it is not in Fifth Normal Form according to the conventional definition of a join dependency.
Post #875241
Posted Tuesday, March 2, 2010 10:14 AM


SSChampion

SSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampion

Group: General Forum Members
Last Login: Thursday, December 18, 2014 9:58 AM
Points: 13,872, Visits: 9,600
The main problem with Nulls that I have in normalizing tables is that Null ends up being used for both unknown and no-value. I'd rather have two versions, one for unknown, one for no-value.

For strings, a zero-length string works for no-value. But not for dates, nor for numbers. 0 works for no-value for cardinal numbers, but not necessarily for ordinals. And there really isn't a no-value that I know of for dates. You can hack that one with Jeff's solution of an arbitrary date (usually either the top value or some value too low to be real), but that's only a valid solution if your date range doesn't include the value designated, and it reduces dates to arbitrary pointers to values, instead of atomic values on their own.

Beyond that, normalization by extending optional values into sub-tables works so long as you have a managable set of options.

I've seen some really "normalized" models for people in data that went too far and became significantly unmanageable. For example, names. I have 3: personal, middle, family. But I also have a nickname, and a "most commonly goes by on forms" name. Most databases model this with "first, middle, last", and miss out on potentially important information. My legal name is "Richard Augustin Gwynne", but nobody calls me "Richard", not even my parents. I go by "Gus". On forms that want a first name and a middle initial, it's "Gus A", which doesn't even make sense (since the "A" is the initial for "Augustin", which "Gus" is short for), but it's what appears on a large number of legally binding documents and is legally acceptable. So I ocassionally get two pieces of mail, one addressed to "Richard A", one to "Gus A", both at the same address. None of the usual methods of deduping names will catch that one, since "Gus" isn't normally synonymous with "Richard".

Some models would split the name vertically and use a sequence number to indicate their order. So, with my "ID" (surrogate key of whatever sort), the data would be sequence 1 = Richard, sequence 1 = Gus... oops, just broke that model, since I have two different valid first names.

So it has to be modeled as not only sequence, but valid combinations, so combo 1, sequence 1 = Richard; combo 1, sequence 2 = Augustin; combo 1, sequence 3 = Gwynne; combo 2, sequence 1 = Richard, combo 2, sequence 2 = A; combo 2, sequence 3 = Gwynne; combo 3, sequence 1 = Richard; etc. All the "valid" combinations add up to quite a few rows of data. Reconstructing those requires extensive code compared to other models, but it does achieve the greatest precision. At the cost of often having precision exceed accuracy, which is a waste of time and effort. It also makes for a very, very complex front-end application and a steep learning curve for end users. After all, you can't just split on spaces or some such, since "Mac Donald" is one name-segment in some cases, if "Mac" is a modifier for "Donald", and is two segments if "Mac" is a nickname or some such and stands on its own.

The main advantage to such a system is that it holds "John Smith" just as easily as "Cyrano II de Savignien II de Bergerac III" (I think I misspelled his name badly, but the point is still valid). The main disadvantage is it also makes it just as difficult to enter into the system and to reconstitute from the system.

Another flawed model is "First, Last, Full". I've seen that quite often. Again, has problems and advantages.

Allowing some null columns, makes it much easier to deal with all around.

Very few things in this world are as cut-and-dry in data modeling as many people would like to think. Even gender is complicated by things like California laws and genetic flaws. (What gender, for example, is someone who is diploidy X haploidy Y, with mixed expression on primary sexual characteristics? That's not a legal oddities question, that's a reproductive-function, biology question, and not necessarily easy to answer.)

Personally, I work for whatever achieves the greatest efficiency, scalability, speed, and is easiest to work with for devs and end-users. Always, that's been a compromise. It's never been "fully normalized by relationally pure definitions", but it's been as close as I can get within the real constraints of the data.


- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread

"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
Post #875345
Posted Tuesday, March 2, 2010 1:29 PM
SSC Veteran

SSC VeteranSSC VeteranSSC VeteranSSC VeteranSSC VeteranSSC VeteranSSC VeteranSSC Veteran

Group: General Forum Members
Last Login: Thursday, December 20, 2012 1:03 PM
Points: 265, Visits: 589
Nulls in join relationships are the problem, otherwise any sensible adaptations would do. If it's just tackling name problems, such as first, last, mi, etc., use something that does not alter the main join properties of the person entity. As others have pointed out, there are techniques for getting at uniqueness, identifying duplicates, parts of a name, and such. But that issue should not complicate the design of the main relationships using the relational model.

I always took the warnings about nulls from C. J. Date and others was about the first kind -- nulls in fields participating in relationships undermines the integrity of the relational model and requires too many error-prone workarounds.
Post #875498
Posted Wednesday, March 3, 2010 3:35 AM


SSCertifiable

SSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiable

Group: General Forum Members
Last Login: Tuesday, December 23, 2014 7:18 PM
Points: 7,930, Visits: 9,654
David,

I think that before dealing with your reply I should quote from the eminent authority to whom you appeal just to dismiss the idea that I am disagreeing with him while you agree with him (perhaps you only read material from 1969, or maybe as late as 1972, and ignored everything he did after that?)
Edward Frank Codd (14/10/1985)
Rule 3: Systematic treatment of null values:

The DBMS must allow each field to remain null (or empty). Specifically, it must support a representation of "missing information and inapplicable information" that is systematic, distinct from all regular values (for example, "distinct from zero or any other number", in the case of numeric values), and independent of data type. It is also implied that such representations must be manipulated by the DBMS in a systematic way.

I think that puts paid, quite clearly, to the notion that Codd's relational model doesn't permit Nulls. In fact his criteria for a database being relational include a definite requirement that NULL be available.
David Portas (3/2/2010)

(a)

CREATE TABLE WidgetWithName (widget_id INT NOT NULL PRIMARY KEY /* REFERENCES Widget ?? */);

CREATE TABLE WidgetName (widget_id INT NOT NULL REFERENCES WidgetWithName (widget_id), widget_principal_name_id INT NOT NULL);

Using a null in place of a value that exists but is unknown would be a less accurate representation because null in SQL does not accurately represent the unknown case. In my example I can accurately represent a widget with a name attribute, a widget without and a widget with a name which is not known.

Let's describe what you have done in terms of Codd's RM/T: to avoid using a NULL in a P-relation you have introduced a new E-relation which is a restriction of an existing E-relation. Do you really think that's acceptable? I'm sure Chris Date would accept it (he's an anti-null nutcase), and I'm equally sure that Ted Codd would have rejected it as a confusion of domains.

David Portas (3/2/2010)
(b) No table with a null is normalized because all the classic normal forms are defined only for relations without nulls (see E.F.Codd quoted earlier in this thread). To have a sound basis for saying that a table with a null is normalized you would therefore have to reinvent those definitions, in which case I think you ought to state clearly what definitions you are going to use. For example what is a join dependency on a table with nulls? In SQL "nulls don't join" therefore a SQL table with a null is not the join of the projections on its superkeys and therefore it is not in Fifth Normal Form according to the conventional definition of a join dependency.


Try not to tell me "see Codd" when talking about something introduced by Ron Fagin and never touched by Codd. You just lose credibility by doing that. Codd defined 1NF, 2NF, 3NF and of course (with Ray Boyce) BCNF. Fagin invented 4NF, 5NF and DKNF. And Date invented 6NF. Codd couldn't accept 5NF as a requirement because it forbids the NULLs which he required to be included in the relational model.

Codd's publication of his 12 rules defining a relational database led of course to his exit from IBM, because it was regarded as heretical (he didn't mind being considered heretical - IBM had treated him as a dangerous nut since the early 70s, a fact which today IBM will of courrse deny, and ensured that he had no influence over the details of its "relational" database project) because it clearly showed that neither System R nor SQL/DS was relational; Fagin of course remained at IBM Alameda.


Tom
Post #875789
« Prev Topic | Next Topic »

Add to briefcase ««12345»»»

Permissions Expand / Collapse