The Logic, Mathematics, and Utility of NULLs

  • I see NULL mostly as a "we need to get this data" marker. We don't know why it's not there, don't know what's there, but we need something.

  • Matt Miller (#4) (8/18/2010)


    True except that in practice, actually being able to make that distinction is an awfully big step.What do you do WHILE you actually decipher whether the machine got unplugged, or became self-aware and stopped measuring that data you were looking for. Do you then need 3 NULLS? (the original definition AND the two latter ones?)

    Perhaps I deal in messy environments, but I find a lot of settings where the only real distinction needed is "is there data there" or "there's no data available". Putting placeholders in data when you just plain don't know and can't know WHY, just to make it "look" prettier, is like putting earrings on a pig....

    I agree that the step from 1 to 2 NULLs is a big one but there are possible reasons for wanting it. I see it as assistance for a particular performance engineering bodge (representing several distinct entity types either in a single relation, whether a base relation or a derived relation to save having to write a lot of code and/or run a lot of overcomplex queries - this is a bodge that we sometimes have to live with unless we can afford to write and run ridiculously large amounts of code) plus getting some extra functionality out of NULLs when they crop up in some of the aggregate functions (ideally NULLi contributes nothing to SUM, AVG,MIN, MAX, VAR, STDEV,...; while NULLa causes them to return NULLa) plus a bit of support for applications which care whether something is known to be inapplicable or just not available for some unspecified reason. But all this can be done by adding an applicability bit column for each column that could take NULLi without too much pain in most cases, so I would rarely choose to use two NULLs instead of just one even in a language that allowed me to do so (with SQL I can't use two nulls anyway).

    I could see having the option for either definition, but if I had to pick just one, I'd go with the original definition for the reason above.

    In the two NULL system you have the option of using the one-NULL system, you simply define your schema so that NULLi is nowhere allowed. One of the "fun" things about Codd's two-null system is realising that the two NULLs do, in a sense, overlap. Nulla means the value is not here and the database doesn't say why; Nulli means the value is not here and the database says that it's not here because the column is inapplicable. So an inapplicable case could be represented by NULLa instead of NULLi - either because whoever entered it didn't know whether it was applicable or not or because the database designer decided to keep things simple and not represent that information about the reason for absence. The designer can choose not to use NULLi either in all cases (because of a preference for the single null system) or on a horses for courses basis (only adds NULLi when it's convenient and useful). So a schema designer can work in the two NULLs system and use only NULLa so that he gets the same result as he would in Codd's one NULL system, or use only NULLi (if he's so lucky that he doesn't find he needs NULLa, which rarely happens to me) and get a different single NULL system, or use the two-nulls system where appropriate. Of course in SQL the designer is stuck in a botched one-null system where NULL sometimes behaves as NULLa and sometimes as NULLi (when MAX or SUM is used on a nonempty set, for example) and is sometimes introduced instead of a correct result ("select sum(T.x) from T where T.x>T.x" should return 0 on any rational system, but returns NULL).

    Tom

  • David Webb-200187 (8/17/2010)


    Yep, that pretty well sums it up. I've been reading Hugh Darwen for a couple of days and working through Tutorial D and the Rel implementation. 6th normal form brings its own bag of headaches, and while I tend to avoid NULLs if I can help it, I'm not sure which set of headaches I'd prefer to deal with, if given an either-or choice.

    I'm a little surprised that this thread hasn't generated more traffic. It could be that everyone has already staked out their positions and they feel no need to re-defend them.

    It's because they hi-jacked a previous thead of mine. Everyone exhausted themselves there.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • @matt-2 - you're assuming pigs in earrings aren't pretty....

    @paul-2 - "In theory, theory and practice are the same. In practice, they are not" ~ Einstein

    ---------------------------------------------------------
    How best to post your question[/url]
    How to post performance problems[/url]
    Tally Table:What it is and how it replaces a loop[/url]

    "stewsterl 80804 (10/16/2009)I guess when you stop and try to understand the solution provided you not only learn, but save yourself some headaches when you need to make any slight changes."

  • I'll simply agree with statements from two prior posts, slightly reworded:

    1) In practice, with normal humans using the data, NULL must have an explicit operational definition to be meaningful. Essentially, some set of people must agree on what NULL means in a given column, and then document that, and others must use that document.

    2) In practice, I try to avoid allowing NULLs as much as possible, because most of the time, they don't actually have an operational definition, and most of the time, the tri-valued logic serves only to generate incorrect results some of the time.

    e.x. For a Quantity field, almost everyone assumes 0 means 0, and 1 means 1. Different people assume NULL means something different. Further, when you actually need more than one distinct non-numeric "meaning", then you still need to fall back on defining specific values, such as "-1 means UNKNOWN" and "-2 means N/A" and "-3 means Will Be Known Later" and so on.

    The real danger of NULLs is that some people assume that their preferred definition must be the one that is being used, without asking. I would suggest that obviously invalid values raise questions more of the time.

  • Tom.Thomson (8/17/2010)


    Various parts of mathematics use a NULL-like concept. The ones which to me seem to be the closest to the NULL of relational databases crop up in Chris Strachey and Dana Scott's denotational semantics for programmes where one can have values like "bottom of D" for each domain D, and of course "bottom" for "bottom of everything". Basically "bottom of D" is a value in D of which we know nothing but that it is in D, while "bottom" is a value of which we know nothing, not even what domain it is a member of. Essentially all values exist in a lattice, and each domain is a sublattice, with the < relation in the lattice meaning "is less well defined". There is also a top value, which is overdefined (ie is contradictory), and potentially "top of D" values which are also overdefined but slightly less so (non-contradictory information on which domain they are in is available). Of course bottom and top occur naturally in lattice theory, even before it is applied to the denotational semantics of programmmes. But if one considers the natural topology of the lattices used in denotational semantics you see that the meaning of a recursive programme is the limit of the meanings of its partial evaluations (eg the meaning of a programme that computes factorials is the limit of the series of meanings of the programmes F(N) that produce factorials for numbers up to N but don't deliver a result for numbers > N, as N increases), which is quite a strong constraint on the lattices.

    Despite the closeness of the NULLs of RDBMS with the bottoms of Scott-Strachey semantics, and the fact that both indicate an absence of knowledge about a value, they are conceptually quite different.

    edit:correct the English

    Maybe I'm generalizing too far here, but if we define NULL as essentially an unknown, then aren't the variables in algebra NULLs in a sense? As you say above, it is true that they would be "conceptually quite different." We don't try to "solve for Null" for example. But as a generalized notion, isn't an unknown an unknown?

    The greatest enemy of knowledge is not ignorance, it is the illusion of knowledge. - Stephen Hawking

  • Variables in algebra can be. That's what I thought, but if I have X and Y in two equations, it's not unknown. It's unsolved, perhaps, but not unknown. It's more like having x, y, and z in two equations. One of those can't be solved for, or maybe 2 or 3, because you don't have the information.

  • Steve Jones - Editor (8/19/2010)


    Variables in algebra can be. That's what I thought, but if I have X and Y in two equations, it's not unknown. It's unsolved, perhaps, but not unknown. It's more like having x, y, and z in two equations. One of those can't be solved for, or maybe 2 or 3, because you don't have the information.

    You're right in distinguishing between solvable and unsolvable. But sometimes NULLs can represent solvable equations too. For instance, NULLs can be in columns where a calculation hasn't taken place yet (Hours worked * Pay Rate maybe). Until the actual calc is run, there can be NULLs there. So at times, we are solving for NULL, just as in algebra we solve for x, we just don't think of it that way in RDBMS.

    So why does this matter? If this is correct, then in a real way, mathematics has been using NULLs in algebra for a long, long time. Algebra just looks at one one equation at a time rather than in sets.

    I'll treat this conclusion as tentative, but will update the original post above (since this is an ongoing project).

    The greatest enemy of knowledge is not ignorance, it is the illusion of knowledge. - Stephen Hawking

  • I theory, it seems there might be a case for 2 or more NULLs simply to provide clarity and eliminate different people guessing at the intended usage - after all, we have tinyint, smallint, int, and bitint each with clearly defined boundaries, so why not NULLa and NULLi.

    In practice, I haven't found a definite need for multiple NULLs - I interpret them either as unknown or not-applicable based on my knowledge of the wider schema (NULL for MPG could mean unknown for a petrol car, and not-applicable for an electric one). Perhaps I take too simple a view, and I certainly can't claim to know all the theory, but at the end of the day I find NULL begets NULL whether it is unknown or not-applicable.

    Chris

  • jcrawf02 (8/19/2010)


    @Matt - you're assuming pigs in earrings aren't pretty....

    @paul-2 - "In theory, theory and practice are the same. In practice, they are not" ~ Einstein

    No - just assuming the pretty ones won't be any prettier WITH earrings:)

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • Nadrek (8/19/2010)


    I'll simply agree with statements from two prior posts, slightly reworded:

    1) In practice, with normal humans using the data, NULL must have an explicit operational definition to be meaningful. Essentially, some set of people must agree on what NULL means in a given column, and then document that, and others must use that document.

    2) In practice, I try to avoid allowing NULLs as much as possible, because most of the time, they don't actually have an operational definition, and most of the time, the tri-valued logic serves only to generate incorrect results some of the time.

    e.x. For a Quantity field, almost everyone assumes 0 means 0, and 1 means 1. Different people assume NULL means something different. Further, when you actually need more than one distinct non-numeric "meaning", then you still need to fall back on defining specific values, such as "-1 means UNKNOWN" and "-2 means N/A" and "-3 means Will Be Known Later" and so on.

    The real danger of NULLs is that some people assume that their preferred definition must be the one that is being used, without asking. I would suggest that obviously invalid values raise questions more of the time.

    Nadrek, I think that you've given a very reasonable answer. You raise an issue that I want to draw out here. If NULL can have different meanings according to the context, then it looses some of its functionality. I'm going to go out on a limb and suggest that we create standards for what should be a NULL and different kinds of them. (Which reminds me of one of my favorite quotes:

    "I love standards, that's why I have so many of them." :w00t:)

    So maybe we can use these symbols:

    "NULL" -> Unknown (and not including "N/A")

    "-" (one dash) -> to show "nothing has been entered into this column and row yet."

    "N/A" -> not applicable

    Wouldn't this be worthwhile? Why or why not?

    The greatest enemy of knowledge is not ignorance, it is the illusion of knowledge. - Stephen Hawking

  • I guess that's a good question, phrased a different way, WHY would that be worthwhile?

    ---------------------------------------------------------
    How best to post your question[/url]
    How to post performance problems[/url]
    Tally Table:What it is and how it replaces a loop[/url]

    "stewsterl 80804 (10/16/2009)I guess when you stop and try to understand the solution provided you not only learn, but save yourself some headaches when you need to make any slight changes."

  • jcrawf02 (8/20/2010)


    I guess that's a good question, phrased a different way, WHY would that be worthwhile?

    I suppose that I'm answering my own question here but...

    It would be for removing the ambiguity and in some cases, preventing errors.

    Also, consider this: "N/A" would be useful when the information for that column doesn't apply and the front end could even display that value "as is" without code interpreting the answer.

    The greatest enemy of knowledge is not ignorance, it is the illusion of knowledge. - Stephen Hawking

  • mtillman-921105 (8/20/2010)


    jcrawf02 (8/20/2010)


    I guess that's a good question, phrased a different way, WHY would that be worthwhile?

    I suppose that I'm answering my own question here but...

    It would be for removing the ambiguity and in some cases, preventing errors.

    Also, consider this: "N/A" would be useful when the information for that column doesn't apply and the front end could even display that value "as is" without code interpreting the answer.

    Ok, can you give a real world example where a user would need to know expressly whether something was unknown, not entered, or not applicable? Not following what problem we solve with this?

    ---------------------------------------------------------
    How best to post your question[/url]
    How to post performance problems[/url]
    Tally Table:What it is and how it replaces a loop[/url]

    "stewsterl 80804 (10/16/2009)I guess when you stop and try to understand the solution provided you not only learn, but save yourself some headaches when you need to make any slight changes."

  • jcrawf02 (8/20/2010)


    mtillman-921105 (8/20/2010)


    jcrawf02 (8/20/2010)


    I guess that's a good question, phrased a different way, WHY would that be worthwhile?

    I suppose that I'm answering my own question here but...

    It would be for removing the ambiguity and in some cases, preventing errors.

    Also, consider this: "N/A" would be useful when the information for that column doesn't apply and the front end could even display that value "as is" without code interpreting the answer.

    Ok, can you give a real world example where a user would need to know expressly whether something was unknown, not entered, or not applicable? Not following what problem we solve with this?

    There is an example above where "N/A" would come in handy. What if there is a table about cars and one of the numeric columns is for mile per gallon, but along comes an electric car. A regular old NULL would do, but an "N/A" would show exactly the right information.

    The greatest enemy of knowledge is not ignorance, it is the illusion of knowledge. - Stephen Hawking

Viewing 15 posts - 16 through 30 (of 67 total)

You must be logged in to reply to this topic. Login to reply