• L' Eomot Inversé (8/22/2013)


    Hugo Kornelis (8/22/2013)


    So where does all the confusion with "unknown" come from? (To my surprise, I see that even Tom, a man I normally agree with on most points, is confused by this. Understandable!)

    I'm not at all confused by this, but I have the impression that you are.

    Now why am I not surprised? 😉

    Thanks for your well thought out reply and the interesting example. I'll respond to a few points.

    (...)

    4) return NULL because the record with that primary key has NULL in the tv column, perhaps because the test was not, for some reason, carried out that day although it was scheduled.

    (...)

    In case 4, the database does not know what the truth value is; it's not known. However, it is NOT 'Unknown'.

    That's three statements in a single line, and they are true, false, and true. 🙂

    - "the database does not know what the truth value is" - true, the database does not know what it is, because it is missing. I prefer to stress the "missing" part and leave the "the database doesn't know part" to the imagination of the reader, because that tends to cause confusion.

    - "it's not known" - false, you can reword "the database does not know what the truth value is" to "it's not known to the database", but you cannot simplfy that to "it's not known". It is missing, and I have no idea why. I could speculate about the reason why it's missing - but any such speculation would include knowledge of the context where this NULL is found. I'll expand on this "context" idea later.

    - "However, it is NOT 'Unknown'." - obviously true, and I don't think anyone here needs an explanation.

    Your exposition further on in the message containing the above quotation claims to prove that NULL can't mean unknown because it isn't the truth value "Unknown".

    No, you are misunderstanding me. I never said that NULL can't mean unknown. This is where that "context" thing comes in.

    Would you say that the number 42 can't mean "just the right size for large women"? I hope not, because you would be wrong - it can mean exactly that - within a specific context (US standard confection sizes for women's blouses). But without the context, you do not know whether that specific 42 does actually mean that. It might, or it might mean something completely else. So outside of all context, you would not say "42 means 'just the right size for large women'", nor "42 can't mean 'just the right size for large women'"; you would say "42 might mean 'just the right size for large women', or it might mean something else, depending on the context where it's used".

    So I don't say that NULL can't mean unknown. I only say that NULL might mean unknown, or it might mean something else, depending on the context where it's used.

    However, it is certainly possible that the real world value of some attribute is not known, and this is indicated by marking its database representation NULL.

    Oh yeah, definitely. Just as it is also certainly possible that there is an attribute is not applicable, and this is indicated by NULL (birthdate of a company). Or that the real world value of an attribute is "indefinitely", and this in indicated by NULL (ValidTo date of a price). Or that the real world value is "no matching row found", and this is indicated by a NULL (result of an outer join). Or that the real world value is "you have no access to this data", and this is indicated by a NULL (possible result of some types of row-level security implementation). NULL can mean all those things, and many more.

    And if a value in the database is marked NULL, it is quite clear that so far as the database is concerned the value of the real world attribute isn't known - and "it isn't known" is just another way of saying "it is unknown".

    Now you're doing the same as you did in the start - simplyfying "it isn't known to the database" to "it isn't known". You are far from alone in that confusion (it is, in fact, one of the two major reasons why so many people insist that NULL means unknown; the confusion with the truth value unknown being the other major reason).

    The value is missing. There is no data value at that point in the database. So yeah, you are right that if the database has to do something with that value, it doesn't know what value to use - so I'll accept "unknown to the database" - even though that formulation is risky, because it tends to lead to confusion. But simplifying that to just "unknown" is NOT correct - that is exactly the confusion I am so wary of.

    Now one important point I do need to add is that all the queries we write operate in a context. So the context I so vehemently stripped off in this message is very much present in the queries, stored procedures, and other database code. That is true for 42, and it is true for NULL. If there are two orders for 42 units of the same article, our query logic will add them up and request the manufacturer to send us 84 units. But if there are two orders for a size 42 shirt, the query will NOT add those two numbers together and send a request for a size 84 shirt to the factory. When writing the query, we know that the 42 in one column is a quantity that can be added, and the 42 in the other quantity is a confection size unit that does not lend itself to arithmetic.

    And again, it's the same for NULL. When I write "WHERE ValidFrom >= @OrderDate AND (ValidTo <= @OrderDate OR ValidTo IS NULL)", I'm not querying for prizes with an unknown ValidTo date, but for prizes that are "indefinitely" valid. When I wrote "FROM OrderDetails AS od LEFT OUTER JOIN Orders AS o ON o.OrderID = od.OrderID WHERE o.OrderID IS NULL", I am not trying to find details of orders with an unknown OrderID value; I'm looking for orphaned OrderDetails (though I would in reality code that in a different way - just sayin'). And when I write "WHERE is_smoker IS NULL", I am indeed looking for people for whom we do not know whether they smoke or not. All because I know the context that applies to the NULL values in the rows and columns I am querying.

    The context-less meaning of NULL (or of 42) is relevant for how the database handles such values. The result of an expression should always be the same, regardless of context. When I add two values of 42, the result will always be 84 - even when I am foolish enough to do this to values from the "ShirtSize" column. The database engine does not know the context and provides a well-defined treatment of the value 42 stripped of any context - a treatment that works well in most contexts, but might need special treatment in specific contexts.

    Similarly, when I concatenate 'Hugo' and NULL, the result will always be NULL - even when I am foolish enough to do this in a context where I know that NULL means 'is a nutter'. The database engine does not know the context and provides a well-defined treatment of the NULL marker stripped of any context - a treatment that works well in most contexts, but might need special treatment in specific contexts.

    I suspect that your attempt to say that a NULL mark doesn't mean the value is unknown arises from a desire to stop people confusing the truth value unknown with the meaning of the NULL mark

    No, it's way more than just that. The false notion that NULL means unknown simply causes too much trouble. I've seen people invent their own methods to handle missing values that were not unknown but missing for another reason, because they are convinced that NULL means unknown. I have seen people expect logic on NULL to return values that would be appropriate if NULL actually means unknown, and then complain that NULLs are not handled correctly and should be avoided. And I've seen people that simply struggle with how NULLs are handled and how 3VL works. I am pretty sure that Dave will no longer suffer from that - he only needs to think back of that night when my Uncle Luigi (sorry, Italian readers - don't blame me for the stereotype, blame Hollywood) disrupted his peace twice in a single evening and he'll immediately see the logic of how NULL is handled. (Well, mostly - not all NULL handling is logical, for some situations the designers of the SQL standard simply had to make a choice and no logic could favor one choice over the other).

    Incidentally, I reckon Date's "proof" that going to 2 sorts of NULL forced you to 3, 4 and so on ad infinitum was complete hogwash, so that appears to be something else we disagree on; but I think Codd was wrong in advocating two different NULLs in some of his papers, so that is something we agree on.

    Just for the record, I may not have called Date's argument "hogwash", but I did call it a fallacy - so apart from you using a stronger word than me, we do not disagree on that. And as far as I know, the only reason Codd embraced the notion of two types of NULL is because he fell for Date's fallacy. I am convinced that if he had lived longer, he would eventually have seen the fallacy and retracted his writing that include the two NULL types. SO we agree on that too.

    (Oh, one more thing - the example af the start of your post painfully reminded me of the fact that the current version of the ANSI standard for SQL actually defines a data type for truth values. They call it boolean (shudder!), and even explicitly specify that the truth value "unknown" and the situation where a truth value is missing should both be represented by the null mark (shudder even more!). I am sure that there are situations where this can be used in a very practical way (just as NULL_CONCAT_YIELDS_NULL can be practical in some situations) - but from the point of relational theory, it just makes me want to cry).


    Hugo Kornelis, SQL Server/Data Platform MVP (2006-2016)
    Visit my SQL Server blog: https://sqlserverfast.com/blog/
    SQL Server Execution Plan Reference: https://sqlserverfast.com/epr/