• Hugo Kornelis (8/23/2013)


    Now you're doing the same as you did in the start - simplyfying "it isn't known to the database" to "it isn't known". You are far from alone in that confusion

    ....

    ....

    And when I write "WHERE is_smoker IS NULL", I am indeed looking for people for whom we do not know whether they smoke or not.

    And there you fall into exactly the simplification that I use. So if it is indeed a confusion, I have you for company.

    It's all very well to talk about the context; but often when you come to look at some SQL you don't have that context - you just have the code, including if you are lucky the scripts that generate the schema and all its constraints, and of course you have the data. Given the general behaviour of developers, if you have any documentation of the code and/or the schema, the documentation will describe a version that never worked, in fact never even succeeded in its unit tests, not the version that finally emerged from development through QA, so any context the documentation gives you is as likely to mislead as to guide you to the correct conclusion. Yes, people like you and me will often provide decent documentation; but the people who maintain and enhance our code probably won't. So the only reasonable view of the SQL language is that when we say "this is unknown" or "that is known" we are talking about the database content, not about something that we individually know because that's the way we designed it - except in the rare case that the system we are looking at is our own design and has not been altered so that the context we believe in is no longer the actual context in which the database and the data in it exist.

    Yes, in an ideal world we would always know the context - for example table definitions with nullable columns would be commented to explain why the column is nullable, table definitions for child or auxiliary tables would be commented to say why a key in the parent or main table would not be referenced from any row in the child or auxiliary, and we know exactly the significance of every null. But in the real world all we can safely say is that NULL means the data isn't present, so ithe value of the data is unknown in the database, and lacking the non-available higher level context to say why the value is unknown in the database it's reasonable when talking about the database and the data in it, since when talking about "something" we are talking in the context of that something, to cut the excess verbiage and just say "the value is unknown".

    edit: I think I should also comment on this:

    (Oh, one more thing - the example af the start of your post painfully reminded me of the fact that the current version of the ANSI standard for SQL actually defines a data type for truth values. They call it boolean (shudder!), and even explicitly specify that the truth value "unknown" and the situation where a truth value is missing should both be represented by the null mark (shudder even more!).

    I agree with your first shudder. Boolean logic is not 3-valued, so calling this type boolean is just plain silly. Your "shudder even more" I can't agree with - what else should one do when the value of a so-called Boolean is not known in the database? Apart from the name, there's nothing wrong with this type.

    Of course the name misuse just followed a long tradition of screwing up; back in the early days, there were some Fortran compilers which provided a bool type with 64 possible values, and other Fortran compilers that provided the same type with 256 values; you probably know that these were single character values on machines with 6 bit chararacters or 8 bit characters; usually the Boolean operators operated bitwise, so that an array of bools was a viable way to represent a string that you wanted to manipulate as opposed just to have a literal string that you wanted to output (the only viable way, in those early versions of Fortran). But the logical if statement only recognised two vales, and some compilers had 0 for false and non-zero for true, at least one had 255 for true and any other value for false, in another even numbers were false and off true - a complete nightmare. ANS Fortran IV cleared up that mess, but got much else wrong.

    Tom