I've been building databases since I started working with PFS File in the 1980's.
It seems like there is a literal meaning for NULL, which is "unknown", and a practical meaning, which is "data is not present in this field".
This is actually not correct.
The literal meaning of NULL is not "unknown", it is "data is not present in this field". Or, to quote the definitions section of the ANSI standard for SQL:
"188.8.131.52 null value
special value that is used to indicate the absence of any data value"
(Older versions of the standard used a slightly different wording: "r) null value (null): A special value, or mark, that is used to indicate the absence of any data value". I liked that wording better, because it presents "mark" as an alternative to "value". I don't like how any version of the ANSI standard insists on using the term "null value" instead of just "null" or "null mark", since by its very own definition, NULL denotes the absence
of a data value).
This distinction is very relevant. If you have a table with product prices, including columns ValidFrom and ValidTo, a NULL mark in the ValidTo column does not mean that the end date of that price is unknown. It means it is not applicable, because the price is still valid. (Okay, you could argue that it is not yet
If you have a table with customers and they can be either legal bodies or natural persons, then every row will have a NULL in either the Birthdate column or the FoundingDate column. That does not mean those values are unknown. The birthdate of Microsoft Inc. is unknown because it is not applicable - companies have no birthdates. Likewise, the FoundingDate of Hugo Kornelis is NULL, not because it's unknown but because I was born, not founded. (Ohh, this is SO going to attract witty remarks...)
Finally, in the table that represents my friends, the birthdate for Jane is NULL - and in this case, it is because it IS indeed unknown. I am pretty sure that Jane has been born, and that must have happened on some date - sho just close not to tell me her age, so I am stuck with not knowing her birthdate,
For each of these examples, when you look at the data in the table, you know what the NULL represents - because you know the context. But taken out of the context, just by itself, a NULL can mean any of those things, and a whole bunch of other stuff. And unless you like unwarranted speculation, you best steer clear of all those possible interpretations and stick to the one thing you do know - that NULL represents the absence of any data value.
Note that the above argument, about different things that can be represented by NULL, has been used in a famous argument ("much ado about nothing") between Date and Codd. Date, well known for his aversion of NULLs, used this argument to convince Codd that there should be not one but at least two types of NULLs, and when Codd admitted to that Date went on to prove that this resulted in the need to upgrade three-valued logic to four-valued logic, with all the accompanying consequences. There have then also been publications that went on to expose even more types of NULLs - I believe one even went on to 17 types! All this was intended to, eventually, show that avoiding NULL is the only sensible way to do.
But what Codd apparently never got (and I'm not sure about Date) is that the whole argument is based on a fallacy. To show this, let's replace NULL with 42. When presented without context, you can make all kind of assumptiions about this number. It can represent all kinds of stuff. Maybe it's a counter (42 children in a school bus). Or a measurement of elapsed time (42 minutes spent on typing this reply - okay, I'll admit that it's actually less). It could be just a meaningless number chosen to represent something else (product code 14 - for a small package of grey staples). Or it could be the answer to a very important question. So do we need to implement different kinds of 42 in the database, just to be able to distinguish all these possible interpretations? No, of course not! When you look at 42 in its context, you will know the meaning from that same context. If you see 42 in the ProductNumber column of your OrderLine table, you know that it represents that box of staples, not a time measurement. And when you then see that same number 42 in the NumberOrdered column of the same table, you know that it's a counter of the number of those boxes some customer is ordering. In the context, the meaning of any "42", like the meaning of any "NULL" is clear. Without context, the only safe thing is to reduce it to the minimum guaranteed meaning - for 42, that is "the decimal number exactly between 41 and 43"; for NULL, that is "the absence of any data value".
But what if we have a table where a NULL can represent multiple things? For instance, the customers table can have NULL in the birthdate column for Microsoft Inc (because that customer was founded, not born) as well as for Ms Davies (because she didn't want to disclose her age) - surely, that would be a great example for why Date was right and we do need multiple types of NULL, right? Again, no. Wrong. And again, I'll use 42 to prove it. Let's imagine a table with data about clothing - women's blouses and sweaters to be precise. One of the columns is called "Size", and I see the value "42" in one of the rows. What does it tell me? Yup, size 42 - but there are three different systems for sizes of women's blouses and sweaters (EU, UK, and US); all of them include a size "42" - and all of them represent a different size. (See http://en.wikipedia.org/wiki/Clothing_sizes). Now there are two possibilities. Either the company using the data knows what they
mean with this 42. Maybe because they are strictly UK based and hence use only UK clothing sizes. In that case, it is clear what this 42 means. Just as the NULL in the original example, where only companies could have a NULL birthdate. And then there is the other possibility - that the company does use different size systems interchangably. Now the value "42" in this column is suddenly a lot less informative. like the NULL in the birthdate column that was used for both Microsoft and Ms Davies. Problem? Well ... maybe. It depends.
If the company using this table is a transport company, and the only reason for storing this information is to have a double check that the information on the box that is delivered matches the information on the delivery report, they are still good. They don;t care about the actual size of the blouse that's in the box, they just want to make sure thaht Ms Davies gets the correct package. So for this case, where the distinction between size 42 (EU) or size 42 (US) is irrelevant for the owner of the data, just "42" will still do. The same might be the case for the NULL (n/a) or NULL (unknown) example - neither customer will get a birthday present, and that's the only thing I use the column for.
This leaves us with just one possible case. The company DOES care about the actual size of the blouse. They do NOT want to recommend a size 42 (EU) blouse to a customer who ordered a size 42 (US) blouse. So ... did we now finally find a case for needing several different types of 42 in the database? No ... we still didn't. We might want to feel a deep desire to store 42 (EU) or 42 (US) in the same column, and we might want to do that with two different types of 42 - but if we did, we would be violating First Normal Form. One of the requirements of this most basic normal form is that the domain of each attribute has to be atomic. A domain that includes both 42 (US) and 42 (EU) is not atomic. The attribute should be split. Or, in database terms, we should use two columns instead of one - a column to represent the size of a blouse (42), and a second column to represent the size system used for this particular blouse (EU).
Cycling back to NULL - if the company owning the data has an interest in knowing why the birthdate for a specific customer is NULL (because, apparently, they want the response of their system to an unknown birthdate to be different from the response to a not applicable birthdate), they should indeed ensure that this is stored in the database. But not by inventing two tpes of NULL - that would violate First Normal Form. Instead, they should create a second column - one that is only populated for rows with a NULL birthdate, and that contains an atomic value representing the reason why there is no data value in the birthdate column.
Oh boy, I guess I got sidetracked a little. Sorry for that. I'll wrap up now - not only because, by now, the time spent typing this IS indeed approaching the 42 minute mark, but also because my cats are hungry, and demanding to be fed. And when the cats demand, I obey,
Hugo Kornelis, SQL Server MVP
Visit my SQL Server blog: http://sqlblog.com/blogs/hugo_kornelis