"Real" copy of Codd's 12 (13) Rules for RDBMS

  • Jeff Moden (8/9/2010)


    mtillman-921105 (8/9/2010)


    Actually, this breaks a cardinal rule in database design - we are using one column for more than one kind of data. It answers two questions: Was there a test taken? and What were the results of the test? Ideally, there should be one column signifying if the blood pressure was taken and another for the results.

    So... let me ask the question... have you actually implemented two columns like that and for the reason you gave when only one is needed? Be honest now...

    Yes, I have, usually you can do more with it than add a bit column. Maybe a date for when the BP was measured for example (and no date = no test, still no need for a NULL there). Most often, this step isn't that important. Or, I will use a special code in the column such as "-1" to designate odd results. But if the data is as sensitive as PaulB claims, I still think the best solution is a child table - in which case the NULLS in that column become superfluous anyway.

    But let me ask a quesion to you Jeff, that I beleive that David Portas brought up, why are there no NULLs in the business world if they're so important? Have you ever seen a spreadsheet with them showing from a non-programmer? (Of course a file originally from a progammer doesn't count - sometimes we leave them in there. I usually try to hide them.)

    Now I may be wrong about all this - I am a self-taught programmer, so it is possible that most people don't have the issues I do with NULLs.

    The greatest enemy of knowledge is not ignorance, it is the illusion of knowledge. - Stephen Hawking

  • mtillman-921105 (8/10/2010)


    Jeff Moden (8/9/2010)


    mtillman-921105 (8/9/2010)


    Actually, this breaks a cardinal rule in database design - we are using one column for more than one kind of data. It answers two questions: Was there a test taken? and What were the results of the test? Ideally, there should be one column signifying if the blood pressure was taken and another for the results.

    So... let me ask the question... have you actually implemented two columns like that and for the reason you gave when only one is needed? Be honest now...

    Yes, I have, usually you can do more with it than add a bit column. Maybe a date for when the BP was measured for example (and no date = no test, still no need for a NULL there). Most often, this step isn't that important. Or, I will use a special code in the column such as "-1" to designate odd results. But if the data is as sensitive as PaulB claims, I still think the best solution is a child table - in which case the NULLS in that column become superfluous anyway.

    But let me ask a quesion to you Jeff, that I beleive that David Portas brought up, why are there no NULLs in the business world if they're so important? Have you ever seen a spreadsheet with them showing from a non-programmer? (Of course a file originally from a progammer doesn't count - sometimes we leave them in there. I usually try to hide them.)

    Now I may be wrong about all this - I am a self-taught programmer, so it is possible that most people don't have the issues I do with NULLs.

    As more of an end user than a theorist or dba, I'll chime in on that one. The absence of data is not equivalent to zero, it means the data is not there. A measured value of zero is not the same thing. ADDING the word NULL to the spreadsheet isn't necessary, however in the context of data retrieval, the system needs to show that it did in fact try to retrieve the value, it just wasn't there. (hence the NULL value)

    That just tends to confuse the end users who aren't working directly with the database, so we remove it again before we show it to them, but we DON'T replace it with zeroes.

    ---------------------------------------------------------
    How best to post your question[/url]
    How to post performance problems[/url]
    Tally Table:What it is and how it replaces a loop[/url]

    "stewsterl 80804 (10/16/2009)I guess when you stop and try to understand the solution provided you not only learn, but save yourself some headaches when you need to make any slight changes."

  • mtillman-921105 (8/10/2010)


    Maybe a date for when the BP was measured for example (and no date = no test, still no need for a NULL there)

    :blink: I'm getting so confused... wouldn't "no date" be stored as Null? 😀

    _____________________________________
    Pablo (Paul) Berzukov

    Author of Understanding Database Administration available at Amazon and other bookstores.

    Disclaimer: Advice is provided to the best of my knowledge but no implicit or explicit warranties are provided. Since the advisor explicitly encourages testing any and all suggestions on a test non-production environment advisor should not held liable or responsible for any actions taken based on the given advice.
  • PaulB-TheOneAndOnly (8/10/2010)


    mtillman-921105 (8/10/2010)


    Maybe a date for when the BP was measured for example (and no date = no test, still no need for a NULL there)

    :blink: I'm getting so confused... wouldn't "no date" be stored as Null? 😀

    You're completely right PaulB. But that's missing my point. I just said that to show that it would not be needed. Yes, good old motherly SQL would put one there for us, wouldn't it? :satisfied:

    The greatest enemy of knowledge is not ignorance, it is the illusion of knowledge. - Stephen Hawking

  • mtillman-921105 (8/10/2010)


    PaulB-TheOneAndOnly (8/10/2010)


    mtillman-921105 (8/10/2010)


    Maybe a date for when the BP was measured for example (and no date = no test, still no need for a NULL there)

    :blink: I'm getting so confused... wouldn't "no date" be stored as Null? 😀

    You're completely right PaulB. But that's missing my point. I just said that to show that it would not be needed. Yes, good old motherly SQL would put one there for us, wouldn't it? :satisfied:

    Exactly! you got it spot on.

    _____________________________________
    Pablo (Paul) Berzukov

    Author of Understanding Database Administration available at Amazon and other bookstores.

    Disclaimer: Advice is provided to the best of my knowledge but no implicit or explicit warranties are provided. Since the advisor explicitly encourages testing any and all suggestions on a test non-production environment advisor should not held liable or responsible for any actions taken based on the given advice.
  • David Portas (8/6/2010)


    [Sorry for the delayed follow up]

    Delays are not a problem - I don't think we are workiong to a project deadline in this discussion!

    No I do not want to "stick some values in there". I expect the user to define what results she wants from the data and/or come to me with the requirement and I will write the query for her. If there are no nulls in the database then there is no obvious reason to put them in queries against that database either.

    If there are no nulls in the database that means either that the database doesn't have any slots for attributes where it doesn't have the attribute value, or that it has moved every column that could have some values missing into a separate table so that the cabsence of a value can be indicated by the absence of a row, and you have to decide what to do about outer joins. The former case does occur, but nor very frequently, but with your insistence on no nulls the latter case will occur too, leading to two problems (as indicated by the silly example I provided in a previous comment, where (1) you either define 10**34 tables before you start inserting data, or accept a situation where you may have to make several schema changes every hour for the next couple of decades as new examples of incomplete data arrive and (2) you still have to decide what to do instead of use null when you do your outer join and a needed value is absent; "silly" [meaning "extreme case"] examples may be silly, but reduction ad absurdum is still a useful technique for testing the validity of an idea).

    Tom.Thomson (8/1/2010)


    If you believe that nulls generated by outer join always get removed before any user sees them you have lead a very sheltered life!

    What I meant was that nulls are not part of any business requirement so they are not required as the end product of any database solution. Nulls are a feature of the technical implementation and not part of the universe of discourse or the concern of end users - most of whom don't even know what a null is.

    I have worked with end users who needed to know when some datum was not avaialable. I don't know why you don't seem to recognise this as something that occurrs in the real world. Maybe you just don't like "NULL", so let's follow Levene and others and call it something else (Levene uses "unk") - are you familiar with work on functional dependencies in incomplete relations, eg http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=615575E030D6E7D8FCAA63FE8EED3CB3?doi=10.1.1.31.153&rep=rep1&type=pdf and would you accept that approach to the problem?

    SQL doesn't implement either of Codd's versions on NULL

    Absolutely. Do you agree with me that SQL's version of null is seriously deficient, leads to incorrect results and that SQL nulls should be avoided or minimised in SQL database design? For the benefit of everyone who might read your posts it would help if you could be clear about whether you mean your arguments in favour of nulls to apply to SQL nulls as well. I suspect you would not advocate SQL nulls but others reading this may have a different impression.

    I think I generally make it clear that I prefer the Codd version(s) of NULL to the SQL version. SQL has some of it horribly wrong (eg in SQL all aggregates over an empty set are NULL where some shouldn't be, and aggregates and NULLs generally interact incorrectly in SQL).

    My very strong impression is that MVL must always lead to incorrect results in the end (I mean results that don't accurately match the real world), but since I have no system I can test and no complete definition I can read there is only guesswork and the bad example of SQL to go by.

    Classical 2-valued logics will also lead to incorrect results (ie results which don't match the real world) because they are unable to represent correctly any case where we don't have complete knowledge (let's leave Brouwerian objections to 2VL out of this discussion, since with database systems we are concerned only with finite systems). In the real (non-database) world the Scottish criminal legal system has been working extremely well for a long time (working better than the English or American systems, which I understand deliver more miscarriages of justice than the Scottish one) and uses a 3VL in a very obvious manner - a jury can return 3 different verdicts: Guilty, Not Guilty, and Not Proven. So using 3VL for finite systems is there in the real world, it is not a bizarre aberration of a bunch of database people but something embedded in one of the best respected systems of law on the planet.

    Tom

  • David Portas (8/6/2010)


    Matt Miller (#4) (8/1/2010)


    A set with no members is what is called a NULL set.

    That's true. The empty set is sometimes called the null set. However, the null set (=empty set) has absolutely nothing to do with nulls in SQL or nulls in Codd's database model. The empty set is a value and just like any other value it is equal to itself. Nulls are markers in a database to indicate the absence of values. It is very unfortunate that the word null has gained these two totally different meanings.

    the addition operation is defined as an operation between two values.

    Not exactly. The sum of a single value is equal to that value. The sum of the empty set of values is 0. The product of the empty set of values is 1. These results follow from the basic principles. Think about the set of numbers {2,4}. The sum is 6. Take away 2 you have 4. Take away 4 you have 0. Proper support for the empty set is fundamentally important in a database but SQL fails this simple test.

    I agree with David on the result of applying SUM or PRODUCT to the empty set, but I've seen some of the follow up comments and don't think he's getting the point over. So as I'm a mathematician I'm going to go mathematical on you.

    First of all, Matt is absolutely correct - the + operator has exactly two arguments, and it can't be applied to any other number of arguments, in particular it can't be applied to 1 argument (I assume we are using non-Curried operators here). The SUM operator is a different operator, which has exactly one argument and can't be applied to two (or more) arguments. They are NOT the same operator (even as primitive a type system as the one in Russel and Whitehead's Principia Mathematica about a century ago would prevent that mistake if anyone was paying attention).

    The SUM operator is defined over sequences, bags (sets with repetitions), lists (ordered bags), sets, and other similar mathematical objects. The high-school approach defines it over sequences, and gives it a recursive definition which delivers the usual properties we want for SUM, provided we base our recursion soundly. Normally we base our recursion on the empty sequence, and this requires the result of applying the SUM operator to the empty sequence to be 0 if we want the result of applying SUM to the two element sequence <1,2> to be 3 (and all the other common sense results). In set theory (and function theory and lambda calculus) we define SUM of a list (which allows us to derive the meaning of SUM of a sequence, bag, set) as the effect of reducing the '+' operator over the list using a particular base value, and give the REDUCE operator (with three arguments list, base, op) a nice recursive definition based from the empty list. That leaves us again with determining the base value to be used in the definition of SUM, but if we want SUM[1:2:<empty>] to be 3 (and we certainly do) the base value has to be 0. (In fact the definition based on sequences and the definition based on reduction both require the '+' operator to be commutative and associative if a definition for SUM for sets and bags is to be derived from the definition for list or sequence; but '+' has those properties; and the reason that 0 comes up as the value for the empty set is that the base of the reduction has to be a unit of the reductions operator parameter - so it will be 0 for a reduction using '+' and 1 for a reduction using '*' [multiplication]).

    Of course So you can say that the choice of 0 as the sum of the empty set is a convention if you want to, but if you want to replace that convention with something else you are going to have to change our concept of the SUM operator; the recursive definitions based on the empty sequence or the empty list would go out of the window for a start. And well-known facts like

    SUM(A UNION B) = SUM(A)+SUM(B) if (A INTERSECTION B) is empty

    would cease to be facts (they would no longer be true). Now the loss of that particular fact may not worry you (if you are not a mathematician) but how about this one

    SUM(A SYMMETRIC DIFFERENCE B) = SUM(A)+SUM(B)-2*SUM(A INTERSECTION B)

    This too would cease to be true, and that's pretty frightening.

    So if you want to have a different convention for what the SUM of the empty set is, you are going to have to enforce a major change in our concept of SUM - and I don't think the mathematicians of the world will accept that.

    This might make you wonder about what happens with other aggregates: for example what is AVG of the empty set?

    Well, the count, mean, variance, standard deviation functions can all be given definitions based on the reduce operator (the operator argument to REDUCE tends to be pretty complex, though) so since REDUCE requires a 'base' argument we have to see whether those definitions need to specify what the base argument is in order to return sane results. For count, it turns out that we are pretty well forced to use 0 as the base. For the others we could use any base we like for the empty set (we could pretend there is a required value and try to derive it: this ends up discovering that if the derived value is X we have a constraint on it that X*0 = 0, and that's the only constraint we can derive) so we (mathematicians) tend to say that the function is not defined. So if the database is asked for the mean of a set which just happens to be empty the only sensible response is one that means "the database does not have that value" (in this case the reason the value is absent is that functions like mean, variance, and so on can't have any meaningful value for the empty set). SQL can use NULL here, I can't fault that; but I've no idea why it uses NULL for SUM (it can't have been a general rule about aggregates, else COUNT of the empty set would also have returned NULL).

    Tom

  • David Portas (8/8/2010)


    The count of members in a set is not the same as memberwise addition. Taking two 2 balls from a set with two balls in it results in a set with a COUNT of balls =0.

    Quite right. But if you write numbers on the balls then the sum of those numbers is just as valid whether there are 2, 1 or 0 balls. You don't need 2 balls to make a sum. The sum of any set is equal to the sum of the sums of its subsets - that is simply the definition of the addition operation. It follows that the summation of set A less set A must equal zero. There is no other possible result that preserves the commutativity of the sum operation and there is certainly no other possible result that matches the physical reality of sets of things.

    Actually that argument doesn't quite work. For any set, clearly the sum of the sums of its subsets is going to be bigger that the sum of itself if the set has more than 1 non-zero member (or even has just one nonzero member if you don't restrict it to exclude the set itself) whatever the sum of the empty set might be, but I imagine that was just careless phrasing (although I can't think quite what it might have been careless phrasing of). Also, remember that

    SUM(A UNION B) = SUM(A)+SUM(B)-SUM(A INTERSECTION B) and see if you can see how if you use that as your basic argument (instead of omitting the subtracted term) even for discrete partitions that you don't have a problem with partitioning a set and doing the sum that way even if the sum of the empty set is non-zero (that's why in my earlier post I said that you might ignore the loss of naive partitioning, but the effect of symmetric difference is a bit nastier).

    Tom

  • Jeff Moden (8/9/2010)


    mtillman-921105 (8/9/2010)


    Actually, this breaks a cardinal rule in database design - we are using one column for more than one kind of data. It answers two questions: Was there a test taken? and What were the results of the test? Ideally, there should be one column signifying if the blood pressure was taken and another for the results.

    So... let me ask the question... have you actually implemented two columns like that and for the reason you gave when only one is needed? Be honest now...

    I take a slightly different view. It is irresponsible to assume that NULL means the test was not taken. It may mean that the test was taken but failed to deliver a result (for one of a several reasons, perhaps because the test apparatus was broken); or that someone forgot to enter the data in the database; the system was down and there is a data entry backlog; the test results were lost before they could be entered; the test result was entered but subsequently overwritten with a NULL by a bug - or by malice - or by human error.

    All you can say when you see NULL is that the data is not there (of course people design schemata in which NULL has a far more specific meaning, and sometimes they get away with it; but sometimes they don't, and a responsible designer usually doesn't do that).

    But I absolutely agree with Jeff. If I don't need to (or can't in practise, which is often true) distinguish between the dozens of possible reasons why the data is absent all I need to know tis that it is absent - so I can use a NULL mark and don't need another column.

    Tom

  • PaulB-TheOneAndOnly (8/10/2010)


    I know you proposed to create an actual-pickup-date and an actual-return-date tables to hold that information but that would end with tables having a 1-to-1 relationship with Reservations table and, I do not like 1-to-1 relationships - those only add complexity and overhead so I opt to have all the attributes of a given reservation in a single row.

    Well done! I'll second Jeff's @paul-2 for President@ vote. I don't see him having 10**34 tables in his schema or doing 100-way outer joins just to avoid allowing nulls in a few columns in his tables.

    Tom

  • jcrawf02 (8/10/2010)


    As more of an end user than a theorist or dba, I'll chime in on that one. The absence of data is not equivalent to zero, it means the data is not there. A measured value of zero is not the same thing. ADDING the word NULL to the spreadsheet isn't necessary, however in the context of data retrieval, the system needs to show that it did in fact try to retrieve the value, it just wasn't there. (hence the NULL value)

    That just tends to confuse the end users who aren't working directly with the database, so we remove it again before we show it to them, but we DON'T replace it with zeroes.

    Actually it depends a lot on the end users - some end users want it drawn to their attention that we don't have a (measured) value. SO rather than saying we "remove" it I would say that we transform it (in the application, not in the DB, of course) to something the end user will recognise (perhaps the string "???????" in red on a grey background and slowly flashing, or maybe something a bit less extreme).

    Tom

  • Tom.Thomson (8/11/2010)


    as indicated by the the silly example I provided in a previous comment, where (1) you either define 10**34 tables before you start inserting data, or accept a situation where you may have to make several schema changes every hour for the next couple of decades as new examples of incomplete data arrive and (2) you still have to decide what to do instead of use null when you do your outer join and a needed value is absent; "silly" [meaning "extreme case"] examples may be silly, but reduction ad absurdum is still a useful technique for testing the validity of an idea

    In this extreme edge case your argument that the null model can do any better depends on there being a common key for all types. It seems like a very convenient assumption that such a complex problem could ever have one key for everything in it. Take away that key and suddenly your null model is no better off than the no null model. Even Codd's model doesn't allow nulls in keys (at least in base relations). So to model screeks relationally without a screekid you need to ensure that every relation has a non-nullable set of key attributes and you too will end up with vast numbers of different relations.

    What does this example prove? It proves that if you have a vastly complex problem then your schema will also be very complex. I don't find that surprising or worrying. You might want to consider using a schema-free data model instead of a relational one.

    are you familiar with work on functional dependencies in incomplete relations, eg http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=615575E030D6E7D8FCAA63FE8EED3CB3?doi=10.1.1.31.153&rep=rep1&type=pdf and would you accept that approach to the problem?

    As I suggested before, I am happy to consider any approach to the problem of missing values and MVL provided I can read an explanation of how it avoids returning incorrect results for even the most straightforward queries. If as Codd suggests the system can avoid errors only by trying to detect tautologies then I think the user needs to be provided with a list of what those supported tautologies are, so that he knows which queries give right answers and which may give wrong ones. Without that information I don't want to commit to any hypothetical MVL null system because I don't know what you are asking me to commit to. Right now it seems like you are just asking me to sign a blank cheque to MVL and to take it on trust that everything will be alright in the end.

    Classical 2-valued logics will also lead to incorrect results (ie results which don't match the real world) because they are unable to represent correctly any case where we don't have complete knowledge

    Clearly that is untrue. Not only can you represent the incomplete information in a 2VL system, you can just as easily represent very detailed information about the missing data - for example why it is missing or what the extent of knowledge about the missing data is (maybe a range of possible values). This is something no MVL system is any help with. 3VL and 4VL only represent extremely limited information about a datum that is missing and that's why they do not solve the problem of missing information.

    So using 3VL for finite systems is there in the real world, it is not a bizarre aberration of a bunch of database people but something embedded in one of the best respected systems of law on the planet.

    Why is the Scottish legal system an example of 3VL or missing information? I don't think it's anything of the kind. It's a question with three very specific answers. When the jury gives its verdict the answer is known. Before the jury gives its verdict then that information is "missing" but a missing or unknown verdict is not the same as a Not Proven verdict. If you wanted to represent the verdicts of Scottish trials as a truth value in a Codd-style model with I and A marks then you would presumably need FIVE valued-logic wouldn't you? (shudder)

    PS. thanks for the fine explanation of SUM and empty sets. Much better than my muddled attempts to explain it.

  • Tom.Thomson (8/11/2010)


    PaulB-TheOneAndOnly (8/10/2010)


    I know you proposed to create an actual-pickup-date and an actual-return-date tables to hold that information but that would end with tables having a 1-to-1 relationship with Reservations table and, I do not like 1-to-1 relationships - those only add complexity and overhead so I opt to have all the attributes of a given reservation in a single row.

    Well done! I'll second Jeff's @paul-2 for President@ vote. I don't see him having 10**34 tables in his schema or doing 100-way outer joins just to avoid allowing nulls in a few columns in his tables.

    I certainly don't let those ugly things happen on my databases 🙂

    Thank you Tom, really appreciate your support.

    By the way, excelent posts!

    _____________________________________
    Pablo (Paul) Berzukov

    Author of Understanding Database Administration available at Amazon and other bookstores.

    Disclaimer: Advice is provided to the best of my knowledge but no implicit or explicit warranties are provided. Since the advisor explicitly encourages testing any and all suggestions on a test non-production environment advisor should not held liable or responsible for any actions taken based on the given advice.
  • David Portas (8/11/2010)


    Why is the Scottish legal system an example of 3VL or missing information? I don't think it's anything of the kind. It's a question with three very specific answers. When the jury gives its verdict the answer is known. Before the jury gives its verdict then that information is "missing" but a missing or unknown verdict is not the same as a Not Proven verdict. If you wanted to represent the verdicts of Scottish trials as a truth value in a Codd-style model with I and A marks then you would presumably need FIVE valued-logic wouldn't you? (shudder)

    Ouch. I think you are confusing two very different things: the NULL marker needed to indicate an absent datum and the extra truth value needed for 3VL. It's probably because you thought I was conflating those two things - I didn't intend any such conflation, but perhaps I expressed myself badly.

    3VL arises in systems quite independently of NULL markers (as exemplified by the logic used in constructive, as opposed to classical, mathematics). I would not dream of suggeting that a "not proven" verdict corresponds to a NULL; it clearly corresponds to the truth value "unknown", not to a marker on a datum - and while you are absolutely correct in saying that a databse which has to represent the verdict attribute of a trial in Scotland would need a 3V domain for that attribute the real point is that that domain is the set of truth values in a particular logic (X is guilty of Y has 3 possible truth values: True(guilty), False(not guilty), and unknown(not proven). This conflation happens too often, and that's one of the reasons that I object (strongly) to the idea that "NULL" (a marker indicating the absence of the value for a datum in a database, although if I were to look at the Latin maybe I'd have to invent theterm "non-datum") can be called "unknown" (a truth value in a logic, which is certainly not absent because "unknown" is what it is and there it is).

    PS. thanks for the fine explanation of SUM and empty sets. Much better than my muddled attempts to explain it.

    Thanks for the compliment. I thought I was a bit verbose, but the discussion on that topic did seem to be missing the point (or what I thought should be the point) a bit so I let loose.

    Tom

  • I think you are confusing two very different things: the NULL marker needed to indicate an absent datum and the extra truth value needed for 3VL. It's probably because you thought I was conflating those two things - I didn't intend any such conflation

    Then I don't know what your example has to do with MVL in the Codd-style system. It seems like two Not Proven verdicts are just as equal to each other as two Guilty verdicts and could therefore be handled perfectly well by a 2VL system in the database. I don't have any problem with MVL data types of that kind in my 2VL database. What I have a problem with is MVL used as the basis for data models and query languages.

Viewing 15 posts - 76 through 90 (of 106 total)

You must be logged in to reply to this topic. Login to reply