First Normal Form

  • Tom.Thomson (6/30/2011)


    KaplanDBA (6/30/2011)


    Setting aside the fact that I'd rather normalize this further than just 1NF, I want to understand your argument better:

    Well, NULL should always mean "The database doesn't contain this value", so it would be used in the case where the value is applicable but unknown for one reason or another as well as in the case where there is no value because it's not applicable; if you want to be able to tell for certain that the value is inapplicable, you can't use NULL unless you can guarantee that there are no cases where the value is applicable but unknown.

    What you're suggesting is that in the case where "Person A: Cannot possibly have a phone number" it would be better to use a garbage value like 000-000-0000,

    but in the case where "Person A: Could have a phone number but we don't know it" you suggest to use NULL?

    No, that's not at all what I'm suggesting. See below.

    What is the fundamental reasoning behind that decision? Is it simply based off the original definitions of NULL from Codd et al.?

    Codd's best known NULL definition in fact defines two distinct NULL values, and he uses a 4-valued logic instead of a 3-valued logic, and I wouldn't dream of using his 4VL for anything so I am certainly not using that definition!

    It's really a question of whether you want to distinguish various cases - and of whether the people who take the database over from you in the future will perhaps want to make distinctins that you currently believe are unimportant. If you have allowed your schema to have some values that can be inapplicable, you need to have some way of expressing "this is inapplicable". If you have a real world situation where some data may have to be entered before all the attributes are known, you need some way of expressing "we haven't put this value in the database". You can't use NULL to represent both if you want to be able to distinguish the two cases - but if you don't want to distinguish the two cases and are sure that no-one in future will then of course you can use NULL for both (but it's a good idea if you do so to document it very clearly in very many places). Since real world situations always seem to result in values being absent for one silly reason or another, it's a good idea to reserve NULL for that case - which means you need something other than NULL for the "known to be inapplicable" case. That's really all there is to it - no amazing mathematical basis or complex logic, just something simple and obvious.

    Chris Date suggested that ALL nulls should be replaced by special values; I don't believe that's possible, but some people do.

    Ahhh, ok. So this answers my original query regarding usage of NULLs vs. arbitrary values. Makes sense now. For my purposes though, I think I will stick with NULLs unless the specific situation calls for a distinction between unknown value and non-applicative value.

    Thanks for explaining and looking forward to next article (and the lively debate that is sure to follow!)

    πŸ˜‰

    - George

  • When I first read this article I was surprised by the statement that having a list in a column could be 1NF if the database was never asked to process that list. This seems to fly in the face of how normalization is taught.

    However, after thinking about it I understand the point. If, for example, I have a table that stores user settings that looks like:

    USER SETTINGS

    ----------------------------

    BOB ASDFGHJKL

    TOM QWERTYUI

    The fact that each character in the settings column might mean something to the application doesn't affect the database at all. This is, in fact, no different than the phone number list, it just looks different.

    Regarding Primary Keys (Candidate keys) it might help people to think that the word Primary here is referencing Prime, not First. The Primary Key is not the "most import candidate key" All Candidate keys are Primary: They are a relatively prime (having no duplication) set of columns in the relation. The use of the term Primary Key for the "Most important Index" in SQL Engines is an unfortunate ambiguity.

    Regarding Nulls, I haven't followed the theoretical literature enough to notice that there is a raging debate here, but from a practical standpoint I have noticed that people getting started in SQL tend to get burned by NULL and then avoid it for a while. Then they start doing OUTER joins and are forced to become more familiar with accommodating NULLs. It would be nice if there was a more clean SQL syntax than (X is NULL OR X <> 'Value'), but that is implementation, not theory.

    --

    JimFive

  • James Goodwin (7/1/2011)


    When I first read this article I was surprised by the statement that having a list in a column could be 1NF if the database was never asked to process that list. This seems to fly in the face of how normalization is taught.

    However, after thinking about it I understand the point. If, for example, I have a table that stores user settings that looks like:

    USER SETTINGS

    ----------------------------

    BOB ASDFGHJKL

    TOM QWERTYUI

    The fact that each character in the settings column might mean something to the application doesn't affect the database at all. This is, in fact, no different than the phone number list, it just looks different.

    You know, we could take what you said even further... What happens to a phone number when the government decides to change phone area codes? Isn't a phone number simply a 3 part identifier. Area Code, Locale, and Phone Line identifier (names may vary). So shouldn't we create a whole table that is nothing but area codes, and a table for the locale, and a table for the phone line identifiers? Or should we just leave a single phone number? If you go to the length of decomposing the phone number inside the DB, you need to split it out. If you only store the value and place no meaning on it, you don't care what it is.

  • Thanks Tom. This is a topic that needs to be covered time and again. Good work.

    Jason...AKA CirqueDeSQLeil
    _______________________________________________
    I have given a name to my pain...MCM SQL Server, MVP
    SQL RNNR
    Posting Performance Based Questions - Gail Shaw[/url]
    Learn Extended Events

  • David Portas (7/1/2011)


    For example if you think that 1NF disallows certain types of value in a relation then that's something that isn't required by any other NF.

    I think we need a <joke>....<ekoj> delimiter in this forum, but I guess I can do it with smiley variants.

    Oh David, I hope you realise what you are saying! :hehe: :hehe: :hehe: :w00t:

    If NULL is a type of value (which is certainly the domain-theoretic interpretation of it) then none of the normal forms (apart from possibly 1NF) can ban it! πŸ˜€ πŸ™‚ πŸ˜€

    A relation can be in 5NF and allow NULLs ( :Wow: presumably even in prime attributes, which even I would never accept) because you can't ban values in the higher normal forms? You appear to have gone from solidly anti-null to extremely :alien: pro-NULL!

    Have you really come over to the side of the just and ditched the anti-null nonsense? πŸ˜› 😎

    It's a pity that you didn't really mean it.

    Tom

  • Now for the serious answer to the interesting points in David's post.

    David Portas (7/1/2011)


    All the normal forms are concerned with dependencies satisfied by relations (or relation variables). If 1NF is understood to be exactly equivalent to the definition of a relation then 2NF, 3NF, BCNF, etc certainly also imply 1NF, just because relations are the only things that satisfy those normal forms.

    Actually definitions of 2NF that state that the reation has to be in 1NF in order to be in 2NF are extremely common. But in fact I agree with you: 1NF is essentially the property of being a relational database theory relation.

    However, if you understand (misunderstand in my view) 1NF to stipulate other conditions over and above the basic requirement that a schema consists only of relations then the higher normal forms don't have to satisfy 1NF.

    I'm going to base what follows on the idea that you have picked this idea up from reading the 3rd manifesto (or reading about it).

    Actually, I think you have misunderstood the third manifesto here. Data and Darwen explicitly retain the atomicity principle. Unfortunately they thought they understood type systems, so didn't bother to learn how to describe type systems that permit existential types, and believed that CLOS's intent for a firm separation of the implementation of a method from its interface definition meant that no part of the method was part of an instantiation of the type and that this would be generalisable to all possible type systems, so they make a complete pigs breakfast of suggesting how support for existential types (if I remember rightly they don't even call them existential types, so even a type theorist can have trouble seeing what they are trying to get at) and unless things have changed in another edition of the book they have no concrete proposal at all for a type system, just a few general principles for which there is no certainty that they will actually work and which seem to imply that the type system will be undecidable (not necessarily a bad thing, in fact I was trying to go there in the late 80s but I came to the conclusion that it was too dangerous; but point someone like Cardelli at it and something good might eventually come out). The point here is that despite their confused and somewhat incoherent verbiage about the type system, they retain the principle of atomicity: the only domains allowed are those all of whose interfaces are made available in the language that the database system uses for it's relational calculus.

    For example if you think that 1NF disallows certain types of value in a relation then that's something that isn't required by any other NF.

    You are wrong. All normal forms disallow non-atomic values. Atomic values are those all of whose interfaces are directly implied by their type, which is available in the database system, and may have pretty complex structures (just as atoms in physics have pretty complex structures of electrons, protons, neutrons, gluons,... but are still atomic).

    Or if you think 1NF means a relation must have a "primary" key and that "primary" means something other than a candidate key then you'd also be defining something that was not required by any other NF.

    I hope no-one thionks that "a primary key" means something other than "a candidate key"; of course "a primary key" is the traditional terminology that has been used in relational theory for 41 years, and "a candidate key" was presumably introduced because lingustically challenged people couldn't understand the difference between "a primary key" and "the primary key".

    This is why I don't think it's productive to say that 1NF disallows certain types of value - even complex values like XML that the DBMS may later deconstruct into other values; or relation values; or other "exotic" objects like multimedia. Tom's article suggests a criteria for excluding certain types of value from 1NF based on how you intend to use them in the database. I disagree. If it is useful to put some value in a relation and potentially subject it to the same dependencies as any other relation then it's perfectly reasonable to want to apply the principles of normal forms to it. In Codd's day databases were concerned with numbers and strings and little else. That's not the world we are in today. As soon as we start placing limits on the type of data values a relation is "allowed" to include then we weaken the relational model and make it less useful.

    I guess we will continue to disagree then. But a definition of 1NF based on the third manifesto is not going to be any use at all to anyone working with (or planning to work with) real live databases, who are the target audience for the article, and that's why I'm sticking with the original definition.

    Another restriction that Codd insisted on was that the relational calculus was a first order calculus: a relation could not be the value of an attribute. The third manifesto has ditched this - and that ensures that no relational calculus for a 3rd manifesto database can be based on first order predicate calculus (:hehe: that's the sort of mess you get into when you let a Cambridge mathematician, like Date, mess about with a nice clean theory developed by an Oxford man, like Codd :-D). It will have to use higher order predicate calculus. Ouch! I'll stick with tradition and continue to assert that 1NF explicitly excludes that possibility. Of course given that an RDB-theoretical relation is a function on time to equivalence classes (so that column order can be permuted) of set theoretic maps (since you have to know which attribute is which somehow, and attribute order is not significant so you can't use a mathematical relation) on a fixed bag (not a set, since the elements are named) of named domains I find it a bit difficult to understand just what was intended by that 3rd manifesto statement.

    Tom

  • Tom, I think that overall this is a good primer on first normal form. You brought up the point that what constitutes an atomic value depends on the context with which it is used within the database. For example, even zip codes and phone numbers can potentially be deconstructed into separate key values, and the same for phone and even SSN. However, for the majority of databases (outside the USPS and AT&T), these codes are not decomposed by either the database or the application, so they can be considered atomic within that context.

    zip_code

    --------

    47722

    4 = Indiana

    77 = Vanderburgh County

    22 = University of Evansville

    You also pointed out that relations between something like a customer and their phone numbers(s) or spouse(s) would be best contained in separate table. Philosophical debates aside, these points, in addition to having primary key (preferably non-surrogate), are what's important to get across when introducing 1NF.

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • In addition to the phone number and zip code decomposition I'd like to mention another is-it-atomic data: email address. You can't really do much without the whole thing in one piece. However if you store the username in one field and the domain in another you get the benefit of tight index scans finding all the users at a particular domain (assuming you have the domain indexed).

    Similarly the storage of IP addresses is pretty straightforward, right? varchar(15) is sufficient for period-separated 3 digit strings. Use hex and we only need varchar(11). Or is it 4 tinyint columns? Wait, isn't that the same as an int? You mean we can index every IP address on the IPv4 Internet with a single Int field? Yup. πŸ™‚ Maybe it wouldn't be the most human-readable form but it would be pretty efficient in both space-use and indexability.

    You can see that the only sure answer is the often quoted "it depends."

  • A couple of people have suggested that the normalised forms of the tables used as examples should have been included in the article (some in comments here, others in personal messages and email). I think that’s a pretty fair comment on the article, so here I have provided (a) SQL Code to generate the tables normalised to 1NF assuming that the structures of the two messy fields do have to be understood by the database system, (b) the rows that would be in those new tables to hold the data contained in the rows shown in the original un-normalised example, and (c) a couple of extra comments that people may find useful.

    Here is the SQL code:

    CREATE TABLE Employee (

    PersonName nvarchar(64) NOT Null Primary Key,

    OfficePhone varchar(18) NOT Null, -- everyone has an office phone

    HomePhone varchar(18) NOT Null, -- everyone has a home phone

    )

    CREATE TABLE FaxPhone (

    PersonName nvarchar(64) NOT Null Primary Key,

    FaxNumber varchar(18) NOT Null,

    CONSTRAINT FK_FaxPhone_Person Foreign Key

    (PersonName) references Employee(PersonName),

    )

    CREATE TABLE Spouse (

    PersonName nvarchar(64) NOT Null Primary Key,

    SpouseName nvarchar(64) NOT Null,

    SpouseDoB date NOT Null,

    Constraint FK_Spouse_Employee Foreign Key

    (PersonName) references Employee(PersonName),

    )

    And here are the representations of the original two rows in the three tables of the normalised version.

    table Employee

    PersonName OfficePhoneHomePhone

    Bill 121 643 4321121 641 3159

    Frank 121 643 4324119 861 8183

    table FaxPhone

    PersonNameFaxNumber

    Bill 121 643 4497

    Frank 121 643 4497

    Table Spouse

    PersonNameSpouseNameSpouseDoB

    Bill Mary 19951108

    An important extra point here: the the SpouseDoB column uses the date type not varchar because of course another part of normalisation to 1NF was to ensure that this date is held as an atomic item whose interface is available in the databases language; here SQL is assumed, and the language provides an interface for the date type which allows the year, month, and day components of the atomic date object to be obtained as well as allowing easy comparison and equality testing for dates. The display form I used above is of course (one of) the international standard(s), and is fully supported as an input format by SQL.

    Another extra point: if enough employees share fax numbers (as might be expected from the two shown) it might be a useful optimisation to have a separate table for Fax numbers allowing a surrogate primary key to be assigned, thus saving a few bytes per row in the FaxPhone table (the FaxNumber column in the FaxPhone table would be replaced by an integer – maybe even a tinyint - FaxNumberId column). This is not part of normalisation for the structure shown above, but could be a useful optimisation if the properties of the real data were such that it was more efficient than the schema shown above. This does sometimes happen, for example when a company provides only one or two Fax numbers for an office block occupied by more than a thousand employees, but this seems to be pretty unusual in the real world so doing this will often deliver worse performance than not doing it, although there are those who claim it is always an optimisation and never a pessimisation.

    Tom

  • Tom,

    I share JJ B's sentiments. Whether I agree with the content of the article or not (and, considering the zealous and sometimes unprofessional retorts in this discussion by others, is a fact that shall remain a mystery to all but me for I am not as well prepared to defend a position on the subject as you πŸ˜› ), I very much appreciate the extremely professional manner in which you're dealing with sometimes hostile and/or opinionated posts.

    I'll also state that you've brought some concepts and perceptions to light through your fine article which I've not previously considered and that, too, is very much appreciated.

    Very well done on all accounts, Sir. I very much look forward to your upcoming articles on this hotly contested subject.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (7/2/2011)


    Tom,

    I share JJ B's sentiments. Whether I agree with the content of the article or not (and, considering the zealous and sometimes unprofessional retorts in this discussion by others, is a fact that shall remain a mystery to all but me for I am not as well prepared to defend a position on the subject as you πŸ˜› ), I very much appreciate the extremely professional manner in which you're dealing with sometimes hostile and/or opinionated posts.

    I'll also state that you've brought some concepts and perceptions to light through your fine article which I've not previously considered and that, too, is very much appreciated.

    Very well done on all accounts, Sir. I very much look forward to your upcoming articles on this hotly contested subject.

    Jeff,

    Thank you very much for your comments. As you know, your opinion is one that I greatly respect, so your kind remarks are very welcome, very much appreciated. To know that I've brought new concepts and perceptions to your attention makes me very proud.

    Actually, I wouldn't go so far as to call any of the comments so far unprofessional. Some are certainly not well thought out, but there my response is of course aimed at enabling people to think things out more clearly - and I'm not sufficiently foolish to believe that all my opinions are correct, so I'm perfectly happy for people to challenge them as long as they do it without introducing personal attacks, which I don't think have been introduced in the comments here.

    Tom

  • Tom.Thomson (7/2/2011)


    Actually, I wouldn't go so far as to call any of the comments so far unprofessional.

    There was one in particular (not directed at you) that seemed it would derail this thread (as so many threads have been in the past) as I read through it. You and some of the other good folks did such a good job at keeping things on track that it's no wonder that you have missed it. Like I said, very well done.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • James Goodwin (7/1/2011)


    It would be nice if there was a more clean SQL syntax than (X is NULL OR X <> 'Value'), but that is implementation, not theory.

    There is an alternative to that, which I use quite a lot:

    http://sqlblog.com/blogs/paul_white/archive/2011/06/22/undocumented-query-plans-equality-comparisons.aspx

  • James Goodwin (7/1/2011)

    ...

    Regarding Nulls, I haven't followed the theoretical literature enough to notice that there is a raging debate here, but from a practical standpoint I have noticed that people getting started in SQL tend to get burned by NULL and then avoid it for a while. Then they start doing OUTER joins and are forced to become more familiar with accommodating NULLs. It would be nice if there was a more clean SQL syntax than (X is NULL OR X <> 'Value'), but that is implementation, not theory.

    --

    JimFive

    Speaking of NULL implementations, when I first started coding in Oracle SQL, the biggest snag I ran into was that an empty string '' is NULL.

    'Smith' > '' : False

    '' = '' : False

    '' is null : True

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Tom,

    I'm well aware that the relational model and TTM require relations that consist of atomic attributes. What I see as a problem is that your article uses another criterion for 1NF:

    Is this or isn't it in 1NF? That depends on what the database is expected to do with it

    You define 1NF based on intended usage. This doesn't make any sense to me. Suppose I create a relation variable using my RDBMS and call it T. T happens to have a string attribute that encodes multiple telephone numbers. That's OK because a string is by definition an atomic value. According to your definition T is in 1NF as long as "the database doesn't know anything about that encoding". I don't have any problem with that so far because I don't ever intend to split out the phone numbers in the database. T is in 1NF.

    Later on however, I change my mind. I want to split out the phone numbers using some code in my database. As you put it: "There will have to be code in queries to decode that complex string". Now you say that T is not in 1NF. But it's the same relation variable as before! The atomicity of the string and the structure of the relation has not changed. Just because I intend to use it differently T cannot suddenly stop being a relation. So your suggested criteria is an additional requirement for 1NF over and above the requirement for T to be a relation - a relation being a thing which is defined by structure alone and never by intended usage. According to you, the relation variable T can change from moment to moment from being in 1NF to not being in 1NF depending on what code exists in my database.

    The relations which according to your definition aren't in 1NF could also still satisfy higher normal forms like BCNF or 5NF even though you say they violate 1NF. If I've misunderstood your criterion for deciding 1NF then I think you need to clarify this point. To me it is fundamental that 1NF means nothing more than the definition of a relation, which is simply a data structure in a database. 1NF does not depend in any way on the intended usage of the data but purely on its structure.

    Tom.Thomson (7/1/2011)


    If NULL is a type of value (which is certainly the domain-theoretic interpretation of it)

    However Codd says that Null marks are not values and Date and pretty much everyone else agrees. Codd never used the the domain theory definition of a domain in any work of his that I'm aware of. In fact he specifically requires that null marks be independent of domain - by which I understand him to mean they are orthogonal to the concept of a domain. If you are putting forward a different definition of nulls and null behaviour (as you have done in another SSC thread) then you ought to make it clear that your 1NF definition isn't generally applicable to other systems using different things called "nulls".

Viewing 15 posts - 46 through 60 (of 69 total)

You must be logged in to reply to this topic. Login to reply