Who Likes NULL?

• Lynn Pettis - Tuesday, July 17, 2018 10:30 AM

I hope you weren't thinking that I actually was advocating for 6th normal form designs, because I wasn't. 🙂

I haven't had to design highly complex databases, the few I have written went to 3rd normal form then denormalized where appropriate.

Heh... I've actually had to design "highly complex databases" before and I find that making them highly complex is the wrong thing to do. 😀  KISS it and to me, KISS is "Keep It Super Simple".

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

• YOU'RE ALL WRONG!!!!  My reasons are the only good ones!  Read-and-take-heed!  (I know, pretty conceited, but as good as any other reason in this thread.)  Pick and choose your logic.
By the way, is anyone recording all the reasons and their logic so if this question is asked again, they can just cut a paste this whole discussion and save the rest of us a lot of time and trouble?

• Perhaps we need to go back to the theory, to deal with NULLs. I pologize for a bit longer post, perhaps even boring. If you are afraid of being bored, skipp to teh end and review my contribution to number threory 🙂

If we accept definition of a relation as "Set of true propositions about something, derived from the same predicate" then NULLs simply do not fit there. let's define 'proposition' and 'predicate'

Proposition = a declarative sentence S, for which we can ask "Is it true the S?" and get answer YES or NO. There is no third value, it must be either YES or NO. Mathematically strictly, there is no 3 valued logic. We can invent it, but it does not make it right. Just like division by zero does not make sense, involving third value into mathematical logic makes no sense.

Predicate = a parametrized sentence where parameters may be replaced by constants that produce a proposition.

Here is an example of a predicate:

P = Person named {Fname] lives in [City] and is born on [DOB]

Propositions derived by supplying values for bracketed parameters:

S1 = Person named [Bob] lives in [New York] and is born on [June 16 1867].
S2 = Person named [Anna] lives in [Madrid] and is born on [May 25 1989].
S3 = Person named [Thor] lives in [Oslo] and is born on [May 25 1997].
...

which can be written in a familiar tabular form:

Fname    City            DoB
----------------------------------
'Bob'    'New York'    'June 16 1867'
'Thor'    'Oslo'        'May 25 1997'

Rows in our table represent some propositions derived from given predicate.

In order for our table to be a relation, some conditions must be met:
1) Propositions stored in the table are all TRUE
2) Each proposition must be unique, (no identical rows in the table)
3) Values in each 'column' are from the same domain - we don't store name of the persons pet or first cousin in column 'Fname', just as we do not store persons name in City column.

We can now define a relation in a shorter for,:

Relation is a set of TRUE propositions derived from common predicate.

No tables mentioned, although it is very convenient to present relations in tabular format. This definition covers all 3 conditions.

The first one, "Relation stores only TRUE propositions" is the key in understanding why NULLs generally do not fit into relations. Simply, what is unknown cannot be true. Maybe, but not 100% sure.

Just read aloud following sentence: " Person named [Bob] lives in [New York] and is born on [NULL]". It does not sound right, does it? regardless of how it sound, we cannot declare it TRUE since NULL does not represent a valid date. "maybe" is not an answer. "maybe" does not tell us if a sentence is TRUE or FALSE. What we are actually saying by using NULL is:

" Person named [Bob] lives in [New York] and we do not know their DOB".

This is quite different meaning than other sentences we stored in our relation:

S2 = Person named [Anna] lives in [Madrid] and is born on [May 25 1989].
S3 = Person named [Thor] lives in [Oslo] and is born on [May 25 1997].

It is obvious that sentences about Anna and Thor have the same meaning. If we argue that the meaning is the same, that means we believe that words
"is born on some date" is the same thing (identical) to "we don't know their DOB". in that case, I have a mountain and a bridge to sell...

Is there a solution? There is. If we expect that DOB will not be known at the moment of inserting a row in the table, we can create a new table/relation to say
"Person [Fname] is born on [DOB] and we learned this on [InsertDate] when user [UserID] made this record.". Technically, it is not 6nf, we have more than one non-key column. But then, if something is not available immediately, we might as well record when we obtained the information and who did it.

As for situations as "active/Inactive", "StartDate/EndDate", we are forcing more than one proposition at the time. For example, predicate
"Contractor [ContactorID] signed contract [ContractID] which has status 'A' from [FromDate] to [ToDate]".
With use of NULL we may want to represent open ended contract.

Table 1::
ContractorID     ContractID         SignedDate        Status     FromDate         ToDate
---------------------------------------------------------------------------------------
125                'CNT125/1'         'May 10 2018    'A'        'July 16 2018'    'July 31, 2019'
207                '207/1DBL'         'Sept 1 2015'    'A'        'July 16 2015'    'July 31, 2017'
125                'CNT125/2'         'April 1 2018'    'A'        'July 16 2018'    NULL

Try to read aloud the predicate, by supplying values from the table. Again, row with NULL sounds awkward. There is nothing wrong with substituting NULL with '9999-01-01'. After all, it is a valid date and certainly it belongs to the domain of dates. that makes it different than, say 'N/A' or 'unknown' or 'OPEN' or - NULL.

But there is more serious problem with the design. Obviously, the table designer wants to track contracts and their active dates. Predicate talks about TWO things - a contract, and period of validity of the contracts. Perhaps better design would be:
P1: "Contractor [ContractorID] signed contract [ContractID] on date [SignedDate]."
P2: "Contract [ContractID] has status [Status] from date [StatusDate]."

Tables representing would be relations:

Table 2:
ContractorID     ContractID         SignedDate
----------------------------------------------
125                'CNT125/1'         'May 10 2018
207                '207/1DBL'         'Sept 1 2015'
125                'CNT125/2'         'April 1 2018'

Table 3::
ContractID     Status             FromDate
------------------------------------------
'CNT125/1'    'Active'        'July 16,2018'
'207/1DBL'    'Active'        'July 16,2015'
'CNT125/2'    'Active'        'July 16,2018'
'CNT125/1'    'Inactive'        'July 31, 2019'
'207/1DBL'    'Inactive'        'July 31, 2017'

there is no row for end date for contract 'CNT125/2', since we don't know it.

Not a single NULL, nor a wild-card token, no magic code, and yet we can derive same information displayed by Table 1. To display data like TAble 1 we would write a query that would return exact data-set as Table 1, NULL included. Table 1 becomes output from our database, a report if you want, so we can display NULL which can have exactly one meaning - [ToDate], expiry date fro contract 'CNT125/2' is unknown. It can be unknown only because there is no record for it in Table 2. Everything is clear. Again, it would not be wrong to add a row with a future date:

ContractID     Status             FromDate
------------------------------------------
'CNT125/1'    'Active'        'July 16,2018'
'207/1DBL'    'Active'        'July 16,2015'
'CNT125/2'    'Active'        'July 16,2018'
'CNT125/1'    'Inactive'        'July 31,2019'
'207/1DBL'    'Inactive'        'July 31,2017'
'CNT125/2'    'Active'        'June 30,2022' '<-- date 'January 1, 9999' is as valid as any date

Problem with NULL is multiple meaning 'missing, but should be there','we don't have the data yet', 'deleted maybe','we have no clue why is this empty' So we try to avoid NULLs when storing data. However, in reports NULL is kind of desirable thing because it conveys exact meaning 'this value is missing'.

If we accept definition of a relation as "Set of true propositions about something, derived from the same predicate" then NULLs simply do not fit there. let's define 'proposition' and 'predicate'

Proposition = a declarative sentence S, for which we can ask "Is it true the S?" and get answer YES or NO. There is no third value, it must be either YES or NO. Mathematically strictly, there is no 3 valued logic. We can invent it, but it does not make it right. Just like division by zero does not make sense, involving third value into mathematical logic makes no sense.

Predicate = a parametrized sentence where parameters may be replaced by constants that produce a proposition.

Here is an example of a predicate:

P = Person named {Fname] lives in [City] and is born on [DOB]

Propositions derived by supplying values for bracketed parameters:

S1 = Person named [Bob] lives in [New York] and is born on [June 16 1867].
S2 = Person named [Anna] lives in [Madrid] and is born on [May 25 1989].
S3 = Person named [Thor] lives in [Oslo] and is born on [May 25 1997].
...

which can be written in a familiar tabular form:

Fname    City            DoB
----------------------------------
'Bob'    'New York'    'June 16 1867'
'Thor'    'Oslo'        'May 25 1997'

Rows in our table represent some propositions derived from given predicate.

In order for our table to be a relation, some conditions must be met:
1) Propositions stored in the table are all TRUE
2) Each proposition must be unique, (no identical rows in the table)
3) Values in each 'column' are from the same domain - we don't store name of the persons pet or first cousin in column 'Fname', just as we do not store persons name in City column.

We can now define a relation in a shorter form:

Relation is a set of TRUE propositions derived from common predicate.

No tables mentioned, although it is very convenient to present relations in tabular format. This definition covers all 3 conditions.

The first one, "Relation stores only TRUE propositions" is the key in understanding why NULLs generally do not fit into relations. Simply, what is unknown cannot be true. Maybe, but not 100% sure.

Just read aloud following sentence: " Person named [Bob] lives in [New York] and is born on [NULL]". It does not sound right, does it? regardless of how it sound, we cannot declare it TRUE since NULL does not represent a valid date. "maybe" is not an answer. "maybe" does not tell us if a sentence is TRUE or FALSE. What we are actually saying by using NULL is:

" Person named [Bob] lives in [New York] and we do not know their DOB".

This is quite different meaning than other sentences we stored in our relation:

S2 = Person named [Anna] lives in [Madrid] and is born on [May 25 1989].
S3 = Person named [Thor] lives in [Oslo] and is born on [May 25 1997].

It is obvious that sentences about Anna and Thor have the same meaning. If we argue that the meaning is the same, that means we believe that words
"is born on some date" is the same thing (identical) to "we don't know their DOB". in that case, I have a mountain and a bridge to sell...

Is there a solution? There is. If we expect that DOB will not be known at the moment of inserting a row in the table, we can create a new table/relation to say
"Person [Fname] is born on [DOB] and we learned this on [InsertDate] when user [UserID] made this record.". Technically, it is not 6nf, we have more than one non-key column. But then, if something is not available immediately, we might as well record when we obtained the information and who did it.

As for situations as "active/Inactive", "StartDate/EndDate", we are forcing more than one proposition at the time. For example, predicate
"Contractor [ContactorID] signed contract [ContractID] which has status 'A' from [FromDate] to [ToDate]".
With use of NULL we may want to represent open ended contract.

Table 1::
ContractorID     ContractID         SignedDate        Status     FromDate         ToDate
---------------------------------------------------------------------------------------
125                'CNT125/1'         'May 10 2018    'A'        'July 16 2018'    'July 31, 2019'
207                '207/1DBL'         'Sept 1 2015'    'A'        'July 16 2015'    'July 31, 2017'
125                'CNT125/2'         'April 1 2018'    'A'        'July 16 2018'    NULL

Try to read aloud the predicate, by supplying values from the table. Again, row with NULL sounds awkward. There is nothing wrong with substituting NULL with '9999-01-01'. After all, it is a valid date and certainly it belongs to the domain of dates. that makes it different than, say 'N/A' or 'unknown' or 'OPEN' or - NULL.

But there is more serious problem with the design. Obviously, the table designer wants to track contracts and their active dates. Predicate talks about TWO things - a contract, and period of validity of the contracts. Perhaps better design would be:
P1: "Contractor [ContractorID] signed contract [ContractID] on date [SignedDate]."
P2: "Contract [ContractID] has status [Status] from date [StatusDate]."

Tables representing would be relations:

Table 2:
ContractorID     ContractID         SignedDate
----------------------------------------------
125                'CNT125/1'         'May 10 2018
207                '207/1DBL'         'Sept 1 2015'
125                'CNT125/2'         'April 1 2018'

Table 3::
ContractID     Status             FromDate
------------------------------------------
'CNT125/1'    'Active'        'July 16,2018'
'207/1DBL'    'Active'        'July 16,2015'
'CNT125/2'    'Active'        'July 16,2018'
'CNT125/1'    'Inactive'        'July 31, 2019'
'207/1DBL'    'Inactive'        'July 31, 2017'

there is no row for end date for contract 'CNT125/2', since we don't know it.

Not a single NULL, nor a wild-card token, no magic code, and yet we can derive same information displayed by Table 1. To display data like TAble 1 we would write a query that would return exact data-set as Table 1, NULL included. Table 1 becomes output from our database, a report if you want, so we can display NULL which can have exactly one meaning - [ToDate], expiry date fro contract 'CNT125/2' is unknown. It can be unknown only because there is no record for it in Table 2. Everything is clear. Again, it would not be wrong to add a row with a future date:

ContractID     Status             FromDate
------------------------------------------
'CNT125/1'    'Active'        'July 16,2018'
'207/1DBL'    'Active'        'July 16,2015'
'CNT125/2'    'Active'        'July 16,2018'
'CNT125/1'    'Inactive'        'July 31,2019'
'207/1DBL'    'Inactive'        'July 31,2017'
'CNT125/2'    'Active'        'June 30,2022' '<-- date 'January 1, 9999' is as valid as any date

Number Theory in Practical Way, Ignoring theRules of Mathematics
------------------------------------------------------------------------------------------

Using NULLs when storing data therefore is going against mathematics and that usially does not end well. I will show what happens when basic mathematical rules are ignored. one of the most fundamental rules in mathematics is "Thou shall not divide by zero, like 2/0 or 3/0"
If we ignore 'no division by zero' rule, we can prove anything with numbers. Let's start from obviously true proposition: every number squared minus the same nuber squared equals thatsame number squared minus the same number squared:
X^2 - X^2 = X^2 - X^2
We can factor both sides, but in a different way (both ways correct), as we know from highs school. (If you do not know this much of math, to understand what follows, then we have nothing to talk about, sorry.)

By factoring we can write:
X (X - Y) = (X - X)*(X + X)
We can now reduce both sides of the equation by common factor (X-X) to get
X = 2X
Further reduce by X and we get finally
1 = 2
By some other elementary operations (+,-,*,/) it is possibe to prove that any number = any other number. Add 3 to both sides and get:
3+1 = 2+3 => 4=5. Multiply by 4 and add 1 to both sides 1+4*4 = 5*4 + 1 => 17 = 21. tha sky is the limit 🙂

How can this be? Because we had devided by zero. When we reduced both sides by factor (x-x) we actually divided both sides of the expression by 0 (zero) which is not allowed in mathematics. We broke the rule, and we came to a dumb conclusion that any x is equal to any y, even when they are different. Saying that something is equal to something different from it is the same as saying 'NULL is a special value that tells us there is no value'.

tThere 🙂

• mark.edwards-1115881 - Tuesday, July 17, 2018 12:04 PM

YOU'RE ALL WRONG!!!!  My reasons are the only good ones!  Read-and-take-heed!  (I know, pretty conceited, but as good as any other reason in this thread.)  Pick and choose your logic.
By the way, is anyone recording all the reasons and their logic so if this question is asked again, they can just cut a paste this whole discussion and save the rest of us a lot of time and trouble?

No Need to - it was already recorded some time back.  All kidding aside - if you can read past the insults for the actual debate (not including the various ad-hominem attacks), it was a fairly comprehensive discussion, starting with a solid article on the topic.

https://www.sqlservercentral.com/Forums/339807/NULL-Equals-NULL

As far as wading any further into this particular thread, I'm out.  No doubt an interesting topic, just a well-trodden one, and not likely to change anyone's mind.. Enjoy!

----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

• Matt Miller (4) - Tuesday, July 17, 2018 6:42 PM

mark.edwards-1115881 - Tuesday, July 17, 2018 12:04 PM

YOU'RE ALL WRONG!!!!  My reasons are the only good ones!  Read-and-take-heed!  (I know, pretty conceited, but as good as any other reason in this thread.)  Pick and choose your logic.
By the way, is anyone recording all the reasons and their logic so if this question is asked again, they can just cut a paste this whole discussion and save the rest of us a lot of time and trouble?

No Need to - it was already recorded some time back.  All kidding aside - if you can read past the insults for the actual debate (not including the various ad-hominem attacks), it was a fairly comprehensive discussion, starting with a solid article on the topic.

https://www.sqlservercentral.com/Forums/339807/NULL-Equals-NULL

As far as wading any further into this particular thread, I'm out.  No doubt an interesting topic, just a well-trodden one, and not likely to change anyone's mind.. Enjoy!

Heh... me too.  I'm out of popcorn and I have to make a beer run. 😀  It was entertaining to say the least.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

• Zidar - Tuesday, July 17, 2018 3:23 PM

Proposition = a declarative sentence S, for which we can ask "Is it true the S?" and get answer YES or NO. There is no third value, it must be either YES or NO. Mathematically strictly, there is no 3 valued logic. We can invent it, but it does not make it right. Just like division by zero does not make sense, involving third value into mathematical logic makes no sense.

This all boils down to this, which is an example of a logical fallacy called "Begging the Question".  You're assuming the very thing that you are trying to prove.  There is absolutely no reason to make this assumption and a lot to be gained by assuming that, in fact, there are three possible answers to any proposition: YES, NO, and UNKNOWN.

Drew

J. Drew Allen

• erikb 90350 - Friday, July 6, 2018 5:43 AM

From a programming perspective (i.e. the typical consumers of the database data), NULLs are a massive source of defects. NULL generally doesn't play well with code so it has to be trapped all over the place, leading to fragile code and insidious bugs.  There's a reason that Tony Hoare referred to NULL as a billion dollar mistake.

Tony Hoare's NULL wasn't about values in a relational database (after all, it antedates Codds first stab at a relational model by 6 years), it was about NULL pointers in OO languages (or so-called OO languages - specifically it was about Algol W) where reference was important.  His NULL could never be an integer NULL, the nearest it could get was to that was a null reference that would have been a reference to an integer if the reference had not been null. If I remember correctly (I last time I looked at his work on Algol W was in 1968) he proposed NULLs only for references, and didn't put enough effort into defining behaviour of computations that encountered these nulls (the programmer had to do checks, because the system didn't do any for him).  That's very different from null in SQL.  The general handling of NULL in SQL (apart from some idocies like the sum of an empty set of integers being defined as NULL instead of as zero) doesn't create the sort of problems that null addresses and references did in procedural languages, so when CARH called his NULLs a billion dollar mistake he was saying absolutely nothing about SQL's NULLs.

Tom

• really nice editorial.  But one fault: instead of calling those stupidly dangerous defaults advocated by the anti-NULL brigade "magic values" they should be called "black magic values" because they are usually clearly evil, and are almost never beneficial.

Tom

• Just to use a different spoon to stir the pot a little, it's amazing how bad the use of NULL is when you work with indexes.  For example, you can have a perfectly formed clustered index that meets all of the rules of being static, narrow, unique, and ever increasing and be guaranteed that you'll never ever need to defragment it... until someone updates a variable width column that contained a NULL and then you have massive fragmentation.

One of the greatest problems occurs when people use "poor man's auditing".  You know... a column for Created_Date, Created_By, Modified_Date, and Modified_By.  The two date columns aren't a problem because they're fixed width.  The Created_By column usually is never a problem because it's usually populated during the INSERT and suffers no expansion.  But, that Modified_Date column... most people leave it NULL because they don't consider the initial INSERT to be a modification.  Some do it to "save space" because a row may not ever be modified.  When they change the NULL to something, then BOOM!  Lots of page splits due to the "Expansive" updates and, generally speaking, there isn't a Fill Factor that's low enough to prevent the splits.

One way to help prevent such actions from causing page splits, especially on tables where rows are almost always INSERTed and then UPDATEd is to prefill things like Modified_BY with either an appropriate number of spaces (representing the 80 or 90 percent solution) or make it the same as the CREATED_BY during the INSERT.  Of course, the best thing is to not use such "poor man's" auditing techniques but the same problem holds true for any variable width column that starts out as NULL and is later updated to be something.

So, on the subject of "black magic values", I agree that they're a PITA compared to NULLs if you only consider criteria and the like.  But, if you step back and look at other bigger pictures, they actually can be quite beneficial.

And, no... I'm not even coming close to suggesting that NULLs should never be used but they shouldn't be used as often as they usually are.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.