Database Design Follies: NULL vs. NOT NULL

Question

Database Design Follies: NULL vs. NOT NULL

Viewing 15 posts - 91 through 105 (of 137 total)

You must be logged in to reply to this topic. Login to reply

ScottPletcher SSC Guru Points: 100949 More actions · Answer 1

Zidar (3/2/2016)
I am not done 🙂
In T-SQL it takes two CHECK constraints to enforce the rule:
C1: CHECK (NOT [Job] = 'DBA" OR [CertificateNum] IS NOT NULL)
C2: CHECK ([CertificateNum] IS NOT NULL OR [Job] = 'DBA')
Funny, in MS Access, which has a crippled SQL compared to MS SQL T-SQL, we can write a simple table level validation rule:
[Job] = 'DBA' EQV [CertificateNum] IS NOT NULL

Two CHECKs? Besides, those are the most convoluted CHECK conditions for something so simple I've ever seen. Why not the CHECK below, assuming Job cannot be NULL (as you did):

CHECK((Job = 'DBA' AND CertificateNum IS NOT NULL) OR (Job <> 'DBA' AND CertificateNum IS NULL))

SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

rstone SSCertifiable Points: 6209 More actions · Answer 2

How about defining the null to mean that this column should exist in a separate table where there would be no record for this value? Then construct the T-SQL logic of the null handling of the "null" column to match the results where the column is truly in a separate table?

RandyHelpdesk: Perhaps Im not the only one that does not know what you are doing. 😉

Zidar SSChasing Mays Points: 651 More actions · Answer 3

And processing 100K rows would take minutes with every nullable column is in its own table. That's just crazy. Normalization is not the ultimate goal -- enough normalization, say to BCNF, to prevent most data anamalies yet provide a practical system. SAP does that, and there are thousands and thousands and thousands of tables. Good luck with that!

There is assumption in relational theory, called Closed World assumption, saying "Whatever is in database, we assume is true, whatever is NOT in database we assume FALSE". If we have a relation to store american presidents, defined by predicate:

P = (Person [Name] was president of USA from year [BeginPres] to year [EndPres])

and there is no row for Bill Clinton, then as far as database is concerned Bill Clinton was never a president of USA, even if there are no rows for years 1993-2001. So a row is either in a table, or is not. If you allow something to be unknown, then it is not really important.

SAP is not good example of relational engine, and I agree, it is horrible design. I put my kids through university by developing databases to make up what SAS deficiences.

Actually, the real question is not use or not NULLs or tokens and defaults. The real question is "Do we think deep enough of constraint that follow from business rules, stated or unstated". Most of the people see only normalization as 'database design'. Normalization is fine, necessary but not sufficient for good design. The goal is good design = database that truly represents part of business we are trying to support. Even when normalizing, we often forget the goal of normalization. Is it to reduce redundancy as much as possible, or to enforce referential integrity, or both, up to some 'optimal' degree, where criterion for optimization is enforcing as many business rules as we can in declarative way. Speed of execution of queries is certainly not the goal of database design or normalization. Once we start thinking beyond ER diagrams (not part of relational mathematics anyways) the need for using NULLs and tokens is greatly reduced, and need for programming on all levels is greatly reduced too.

Time permitting, I hope to demonstrate how it can be done.

rstone SSCertifiable Points: 6209 More actions · Answer 4

Grasshopper, if the "null" columns are stored in separate tables (6NF?), then a join (left) to build the original table would return nulls for the missing value. I see the difference in the logical design, but not in the physical. Would a proper design not try to build the "original" table?

I have to ask, do left joins fall into the same category as nulls? We either do an inner join when the "null" column must be present or we don't do the join?

PS How about a joke? Q: How does a developer pronounce 6NF? A: WTF

RandyHelpdesk: Perhaps Im not the only one that does not know what you are doing. 😉

Zidar SSChasing Mays Points: 651 More actions · Answer 5

Grasshopper, if the "null" columns are stored in separate tables (6NF?), then a join (left) to build the original table would return nulls for the missing value. I see the difference in the logical design, but not in the physical. Would a proper design not try to build the "original" table?
I have to ask, do left joins fall into the same category as nulls? We either do an inner join when the "null" column must be present or we don't do the join?

Yes, LEFT (RIGHT, too) join falls into the same category as nulls. All operations that involve relations must return relations. Outer joins do not return relations - because of nulls. Original relational operators are restrict, project, product, union, intersection, difference, natural join, relational division. Even INNER JOIN is not considered relational operator. Up to recently, SQL products simply did not support most of the operators, and almost all of them introduced non-relational operators, often proprietary. Some operators have been implemented in disguised form:

restrict - WHERE

project - SELECT some columns but not all

product - JOIN without ON condition

union - it does exist, together with UNION ALL which returns duplicates, thus, not relational

intersection and difference were introduced into T-SQL very recently, as INTERSECT and EXCEPT

natural join - not implemented, but combining INNER JOIN and PROJECTION we could achieve similar effect

How to replace OUTER JOIN with truly relational operators? Something like this perhaps

-- Table Parent {ParentPK int NOT NULL UNIQUE }

-- Table Child {ParentPK int NOT NULL UNIQUE, NullColumn as varchar }

SELECT P.ParentPK, C.NullColumn

FROM ParentTable AS P

NATURAL JOIN ChildTable AS C ON P.PareentPK = C.ParentPK

UNION

SELECT ParentPkVAlue, NullColumn = 'MISSING'

FROM ParentTable

EXCEPT

SELECT ParentPkVAlue FROM ChildTable

All operators except NATURAL JOIN exist in T-SQL as of today. Since ParentPK is the only column that both ParentTable and ChildTable have in common, then INNER JOIN returns same thing as NATURAL JOIN.

So, it is possible to avoid OUTER JOINs even in T-SQL. It was not possible until recently (was it MS SQL 2005 or 2008 that brought set operators INTERSECT and EXCEPT ?). At early days of SQL, we did not have FOREIGN KEYs, and now we have them. Is this query more deficient than LEFT JOIN? I don't know, it depends on implementation, may be faster in ORACLE, slower in T-SQL or vice versa. That is not the point. the point is - we avoided NULLs, in table design and in our query.

Done for tonight, will be back tomorrow 🙂

x SSC-Insane Points: 23660 More actions · Answer 6

Zidar (3/2/2016)
Grasshopper, if the "null" columns are stored in separate tables (6NF?), then a join (left) to build the original table would return nulls for the missing value. I see the difference in the logical design, but not in the physical. Would a proper design not try to build the "original" table?
I have to ask, do left joins fall into the same category as nulls? We either do an inner join when the "null" column must be present or we don't do the join?
Yes, LEFT (RIGHT, too) join falls into the same category as nulls. All operations that involve relations must return relations. Outer joins do not return relations - because of nulls. Original relational operators are restrict, project, product, union, intersection, difference, natural join, relational division. Even INNER JOIN is not considered relational operator. Up to recently, SQL products simply did not support most of the operators, and almost all of them introduced non-relational operators, often proprietary. Some operators have been implemented in disguised form:
restrict - WHERE
project - SELECT some columns but not all
product - JOIN without ON condition
union - it does exist, together with UNION ALL which returns duplicates, thus, not relational
intersection and difference were introduced into T-SQL very recently, as INTERSECT and EXCEPT
natural join - not implemented, but combining INNER JOIN and PROJECTION we could achieve similar effect
How to replace OUTER JOIN with truly relational operators? Something like this perhaps
-- Table Parent {ParentPK int NOT NULL UNIQUE }
-- Table Child {ParentPK int NOT NULL UNIQUE, NullColumn as varchar }
SELECT P.ParentPK, C.NullColumn
FROM ParentTable AS P
NATURAL JOIN ChildTable AS C ON P.PareentPK = C.ParentPK
UNION
SELECT ParentPkVAlue, NullColumn = 'MISSING'
FROM ParentTable
EXCEPT
SELECT ParentPkVAlue FROM ChildTable
All operators except NATURAL JOIN exist in T-SQL as of today. Since ParentPK is the only column that both ParentTable and ChildTable have in common, then INNER JOIN returns same thing as NATURAL JOIN.
So, it is possible to avoid OUTER JOINs even in T-SQL. It was not possible until recently (was it MS SQL 2005 or 2008 that brought set operators INTERSECT and EXCEPT ?). At early days of SQL, we did not have FOREIGN KEYs, and now we have them. Is this query more deficient than LEFT JOIN? I don't know, it depends on implementation, may be faster in ORACLE, slower in T-SQL or vice versa. That is not the point. the point is - we avoided NULLs, in table design and in our query.
Done for tonight, will be back tomorrow 🙂

When you come back tomorrow, I'm wondering why

NullColumn = 'MISSING'

is not what you call a token?

kenambrose SSCrazy Points: 2513 More actions · Answer 7

Is there some thinking that there is no difference between a REPORT (extracting, analyzing, formatting, and displaying datato a user) and RELATIONAL DATA PERSISTENCE?

They are not the same thing!

No matter how old you are 😉

rstone SSCertifiable Points: 6209 More actions · Answer 8

The SELECT clause can arbitrarily format the results. If it is a token, I would consider it an application token, not a database token.

Back to an earlier question. Can we consider the NULL implementation as an equivalent alternative? For example, consider the perfect non-null database design that avoids nulls everywhere, including in the queries where UNION/EXCEPT are used. Then consider something similar with nulls. Is using the where clause to filter out nulls, if required, and using ISNULL(NULLCOLUMN, 'IsMissing') as [NULLCOLUMN] in the select clause similar - the same logically? Is the real issue that nulls can be used in other ways that would make the two not logically the same or lead to mistakes?

RandyHelpdesk: Perhaps Im not the only one that does not know what you are doing. 😉

kenambrose SSCrazy Points: 2513 More actions · Answer 9

one problem with that thinking is that sql server does not handle "null" flag consistently. it screws up grouping, screws up string handling, sorting, even keys etc. etc.

Just one example of the silliness. You can make a nullable column a unique constraint. In that case the sql server engine will allow ONE null value in the column. How irrational, silly, and unexplainable is that?

If you don't know the value of something, how can you say one occurence of an unknown value is unique! LOL

there is more info out there about the inconsistencies in how sql server "treats" null flag (and I believe it has even changed over time!)

this one is sort of ok

https://www.simple-talk.com/sql/t-sql-programming/how-to-get-nulls-horribly-wrong-in-sql-server/

Mr. Pascal of databasedebukings does a better job of explaining the silliness, in my opinion.

Zidar SSChasing Mays Points: 651 More actions · Answer 10

When you come back tomorrow, I'm wondering why
NullColumn = 'MISSING'
is not what you call a token?

I believe that Old Hand provided adequate answer in his post 1766165. NullColumn = 'MISSING' is simply an expression used in a query, and that is OK with both relational theory and SQL standard. Whether we call it token or not, does not matter much. Unlike NULL which has at least 3 meanings (unknown, missing, not applicable) this expression is clear - the data is missing. In different context, we might have chosen NullColumn = 'Not Applicable'. In any case, it is under our control - depending on context and particular problem we can explain NULL like data any way we see it fit.

As for observation that T-SQL treats allows only one NULL per column in UNIQUE constraints, true, it does not follow the standard, which allows infinite number of rows with NULL in unique constraints. Contrary to common belief (?), NULL uniqueness can be a very useful thing. People who have read Defensive Database Programming by Alex Kuznetsov will know what am I talking about. Story is too long to be explained in a single post, and it would drift away from the article's topic - to NULL or not to NULL. NULLs can be put to useful work, if handled in a wise way - if we treat them as well defined token, with clear meaning, or, as in T-SQWL case, if we allow only one NULL per column.

Again, the real problem is not to NULL or not to NULL. The problem is how we think about constraints. If we make several constraints work together then we will achieve much more than thinking of constraints independent of each other, which is unfortunately the case most of the time. Problem with NULLs is that we cannot figure out their meaning in particular situations - NULLs are uncontrollable. Once we harness NULLs with constraints and logic, then, yes, we can safely use them.

Long time ago, at the time of punched cards, there was a science 'Cybernetics', which was explained to us as 'science about information' (and had nothing to do with computers). We learned then that information is a piece of data that decreases uncertainty of some problem - information increases our knowledge about the problem in question. NULLs, if left uncontrolled, do not increase our knowledge about something, therefore they do not meet definition of 'information' or definition of data. Remember, that a database is a set of TRUE facts, and relations happened to be just a convenient way to manage the data in rigorous and consistent way. NULLs are neither TRUE nor FALSE, they have no value, therefore, they are not data at all. If we manage to control NULLs (by means of constraints), they become tokens with meaning and we can try to take advantage of them. If there is no meaning, there is nothing to talk about, eh?

Cheers 🙂 ?

roger.plowman SSChampion Points: 10265 More actions · Answer 11

ScottPletcher (3/1/2016)
roger.plowman (3/1/2016)
patrickmcginnis59 10839 (2/29/2016)
roger.plowman (2/29/2016)
I still don't see why relational tables can't have records for TBD, N/A, and UNK.
If you need to know why a value is missing, you're going to have to code it in anyways. So why not allocate a column to it? Yes its an extra column, but trying to read anything like "unknown" into the null / lack of a value is just going to leave you mystified about what null actually is. Null means a value isn't in your database column. It doesn't mean its unknown, missing, due to be filled in, it only means a value has not been stored in that column, it doesn't matter if its a column in a table or a column in a missing row in a left join.
Heck, if the other column says TBD, N/A, or UKN, you don't even need to see if your nullable column has a value, your status column says as much already!
Its up to the application / use case to determine what having a null in the column means. I've seen it mean "truck is parked in loading dock."
If you have multiple reasons why a column lacks a value, stick another column in there and encode your TBD, N/A, UNK, because again, in the most fundamental sense, null by default doesn't mean any of those things. But for the sanity of folks who use database software, DON'T HAVE YOUR RDBMS VENDOR HARD WIRE MEANINGS INTO NULLS.
Because for our application TBD/NA/UNK was necessary for (amost) every relation--over 400 of them. By standardizing the values of the TBD/NA/UNK records it vastly simplified both the DB design and the code.
Using magic values in other contexts (like dates) does have coding implications, but on balance those complexities are handled outside SQL Server (for example, creating a function for formatting dates to include them on a report) that made the code MUCH MUCH MUCH simpler.
Business rules inside SQL Server were also simplified. Trivalue logic quickly becomes very messy at scale.
Besides, no one said nulls should have hard-coded meaning (other than "data is missing"). The whole argument is about better options than allowing nulls in the first place.
I happened to pick a dates starting at 1900 and going back for my flags (since birthdates weren't an issue), but I can certainly see the advantage of using a date like 12/31/9999 or whatever too.
The point is, allowing nulls in data is inviting inconsistencies and worse, ambiguities into it. A null says data is missing, it doesn't give a clue as to why. Flag values may be evil but Null (outside a very specific usage) is the Devil Incarnate...
No, there really should not be any magic values. Instead, you need a separate column(s) to properly map the data/information.
Btw, what specifically is the difference between "TDB" and "UNK"??

UNK means "verified unknown", as in the data is forever unknowable. For example, if an item was purchased at some point in the past but all records of the purchase have been lost, then the purchase date would be UNK. If the purchase date information is available and just hasn't been entered yet, then it's TBD.

Likewise, a "warehoused date" would be when the object was stored (ie it had been in use originally, but is no longer needed, but might be needed in the future). If the item is in current use the warehoused date would be N/A as it's not applicable to an item in use. TBD in this case would mean the status hasn't been entered yet, but the data can be found and entered. UNK in this case wouldn't make sense because that would mean you couldn't figure out if the item was warehoused or not--which should be physically impossible. 🙂

ScottPletcher SSC Guru Points: 100949 More actions · Answer 12

In the real world, there can be unknown values at a given point in time in a data store. For example, birth date. If you need to create the row now, but don't know the person's birth date, the only reasonable value to use is NULL.

'MISSING', like any other magic value, is wrong on its face. Add an appropriate indicator for that. If the missing status relates to the keys, then the indicator will relate to the keys: trying to say that using an indicator instead of a magic value violates 3NF is just bizarre. Yes, you could model separate tables every time, but that "cure" is worse than the disease. At some point you have to be practical. If dbms creators had waited until they could meet every one of Codd's rules to create a db engine, we'd still be waiting!

SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

x SSC-Insane Points: 23660 More actions · Answer 13

Zidar (3/3/2016)
When you come back tomorrow, I'm wondering why
NullColumn = 'MISSING'
is not what you call a token?
I believe that Old Hand provided adequate answer in his post 1766165. NullColumn = 'MISSING' is simply an expression used in a query, and that is OK with both relational theory and SQL standard. Whether we call it token or not, does not matter much. Unlike NULL which has at least 3 meanings (unknown, missing, not applicable) this expression is clear - the data is missing.

I think this is the root of my dissatisfaction with this, NULL is only supposed to mean one thing, the column that is NULL has no value stored in it. Now whether this is a column persisted in a table or a column in a result set in a left join with missing right hand row, it still means there is no value there.

In my understanding, its the programmers choice on what conclusions to make when the column lacks a value, not any sort of relational theory. If you instead replace this NULL with a value subtracted from the types available range of values (by a NOT NULL column, a COALESCE in a left join, or in your case the UNION), this in my understanding of what you're saying is a "token" saying the same thing except this token is actually being stored into the column position but it represents the same condition.

Once you now use the "token" instead of SQL's NULL facility, you now have to manually code tests to see if values from your problem domain have been stored, whereas NULL propogates through expressions saving code.

Sure, I'll accept that maybe this "token" is more relationally pure, but I'll have to add the caveat that it adds code complexity, and you lose the advantage that SQL offers in the first place, less bugs. Every time you add code complexity without an increase in complexity of the problem you're solving, in my opinion it is a net loss.

akljfhnlaflkj SSC Guru Points: 76202 More actions · Answer 14

akljfhnlaflkj

SSC Guru

Points: 76202

March 9, 2016 at 4:58 pm

#1864157

Thanks for the article.

kenambrose SSCrazy Points: 2513 More actions · Answer 15

[[that it adds code complexity]]

Nah,

try rating 1000 reports where you have to "if isnull blah, blah, blah " a pile of columns for null because it's the "programmer's choice" what null should mean.