Database Design Follies: NULL vs. NOT NULL

  • Zidar (3/2/2016)


    I am not done 🙂

    In T-SQL it takes two CHECK constraints to enforce the rule:

    C1: CHECK (NOT [Job] = 'DBA" OR [CertificateNum] IS NOT NULL)

    C2: CHECK ([CertificateNum] IS NOT NULL OR [Job] = 'DBA')

    Funny, in MS Access, which has a crippled SQL compared to MS SQL T-SQL, we can write a simple table level validation rule:

    [Job] = 'DBA' EQV [CertificateNum] IS NOT NULL

    Two CHECKs? Besides, those are the most convoluted CHECK conditions for something so simple I've ever seen. Why not the CHECK below, assuming Job cannot be NULL (as you did):

    CHECK((Job = 'DBA' AND CertificateNum IS NOT NULL) OR (Job <> 'DBA' AND CertificateNum IS NULL))

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • How about defining the null to mean that this column should exist in a separate table where there would be no record for this value? Then construct the T-SQL logic of the null handling of the "null" column to match the results where the column is truly in a separate table?

    RandyHelpdesk: Perhaps Im not the only one that does not know what you are doing. 😉

  • And processing 100K rows would take minutes with every nullable column is in its own table. That's just crazy. Normalization is not the ultimate goal -- enough normalization, say to BCNF, to prevent most data anamalies yet provide a practical system. SAP does that, and there are thousands and thousands and thousands of tables. Good luck with that!

    There is assumption in relational theory, called Closed World assumption, saying "Whatever is in database, we assume is true, whatever is NOT in database we assume FALSE". If we have a relation to store american presidents, defined by predicate:

    P = (Person [Name] was president of USA from year [BeginPres] to year [EndPres])

    and there is no row for Bill Clinton, then as far as database is concerned Bill Clinton was never a president of USA, even if there are no rows for years 1993-2001. So a row is either in a table, or is not. If you allow something to be unknown, then it is not really important.

    SAP is not good example of relational engine, and I agree, it is horrible design. I put my kids through university by developing databases to make up what SAS deficiences.

    Actually, the real question is not use or not NULLs or tokens and defaults. The real question is "Do we think deep enough of constraint that follow from business rules, stated or unstated". Most of the people see only normalization as 'database design'. Normalization is fine, necessary but not sufficient for good design. The goal is good design = database that truly represents part of business we are trying to support. Even when normalizing, we often forget the goal of normalization. Is it to reduce redundancy as much as possible, or to enforce referential integrity, or both, up to some 'optimal' degree, where criterion for optimization is enforcing as many business rules as we can in declarative way. Speed of execution of queries is certainly not the goal of database design or normalization. Once we start thinking beyond ER diagrams (not part of relational mathematics anyways) the need for using NULLs and tokens is greatly reduced, and need for programming on all levels is greatly reduced too.

    Time permitting, I hope to demonstrate how it can be done.

  • Grasshopper, if the "null" columns are stored in separate tables (6NF?), then a join (left) to build the original table would return nulls for the missing value. I see the difference in the logical design, but not in the physical. Would a proper design not try to build the "original" table?

    I have to ask, do left joins fall into the same category as nulls? We either do an inner join when the "null" column must be present or we don't do the join?

    PS How about a joke? Q: How does a developer pronounce 6NF? A: WTF

    RandyHelpdesk: Perhaps Im not the only one that does not know what you are doing. 😉

  • Grasshopper, if the "null" columns are stored in separate tables (6NF?), then a join (left) to build the original table would return nulls for the missing value. I see the difference in the logical design, but not in the physical. Would a proper design not try to build the "original" table?

    I have to ask, do left joins fall into the same category as nulls? We either do an inner join when the "null" column must be present or we don't do the join?

    Yes, LEFT (RIGHT, too) join falls into the same category as nulls. All operations that involve relations must return relations. Outer joins do not return relations - because of nulls. Original relational operators are restrict, project, product, union, intersection, difference, natural join, relational division. Even INNER JOIN is not considered relational operator. Up to recently, SQL products simply did not support most of the operators, and almost all of them introduced non-relational operators, often proprietary. Some operators have been implemented in disguised form:

    restrict - WHERE

    project - SELECT some columns but not all

    product - JOIN without ON condition

    union - it does exist, together with UNION ALL which returns duplicates, thus, not relational

    intersection and difference were introduced into T-SQL very recently, as INTERSECT and EXCEPT

    natural join - not implemented, but combining INNER JOIN and PROJECTION we could achieve similar effect

    How to replace OUTER JOIN with truly relational operators? Something like this perhaps

    -- Table Parent {ParentPK int NOT NULL UNIQUE }

    -- Table Child {ParentPK int NOT NULL UNIQUE, NullColumn as varchar }

    SELECT P.ParentPK, C.NullColumn

    FROM ParentTable AS P

    NATURAL JOIN ChildTable AS C ON P.PareentPK = C.ParentPK

    UNION

    SELECT ParentPkVAlue, NullColumn = 'MISSING'

    FROM ParentTable

    EXCEPT

    SELECT ParentPkVAlue FROM ChildTable

    All operators except NATURAL JOIN exist in T-SQL as of today. Since ParentPK is the only column that both ParentTable and ChildTable have in common, then INNER JOIN returns same thing as NATURAL JOIN.

    So, it is possible to avoid OUTER JOINs even in T-SQL. It was not possible until recently (was it MS SQL 2005 or 2008 that brought set operators INTERSECT and EXCEPT ?). At early days of SQL, we did not have FOREIGN KEYs, and now we have them. Is this query more deficient than LEFT JOIN? I don't know, it depends on implementation, may be faster in ORACLE, slower in T-SQL or vice versa. That is not the point. the point is - we avoided NULLs, in table design and in our query.

    Done for tonight, will be back tomorrow 🙂

  • Zidar (3/2/2016)


    Grasshopper, if the "null" columns are stored in separate tables (6NF?), then a join (left) to build the original table would return nulls for the missing value. I see the difference in the logical design, but not in the physical. Would a proper design not try to build the "original" table?

    I have to ask, do left joins fall into the same category as nulls? We either do an inner join when the "null" column must be present or we don't do the join?

    Yes, LEFT (RIGHT, too) join falls into the same category as nulls. All operations that involve relations must return relations. Outer joins do not return relations - because of nulls. Original relational operators are restrict, project, product, union, intersection, difference, natural join, relational division. Even INNER JOIN is not considered relational operator. Up to recently, SQL products simply did not support most of the operators, and almost all of them introduced non-relational operators, often proprietary. Some operators have been implemented in disguised form:

    restrict - WHERE

    project - SELECT some columns but not all

    product - JOIN without ON condition

    union - it does exist, together with UNION ALL which returns duplicates, thus, not relational

    intersection and difference were introduced into T-SQL very recently, as INTERSECT and EXCEPT

    natural join - not implemented, but combining INNER JOIN and PROJECTION we could achieve similar effect

    How to replace OUTER JOIN with truly relational operators? Something like this perhaps

    -- Table Parent {ParentPK int NOT NULL UNIQUE }

    -- Table Child {ParentPK int NOT NULL UNIQUE, NullColumn as varchar }

    SELECT P.ParentPK, C.NullColumn

    FROM ParentTable AS P

    NATURAL JOIN ChildTable AS C ON P.PareentPK = C.ParentPK

    UNION

    SELECT ParentPkVAlue, NullColumn = 'MISSING'

    FROM ParentTable

    EXCEPT

    SELECT ParentPkVAlue FROM ChildTable

    All operators except NATURAL JOIN exist in T-SQL as of today. Since ParentPK is the only column that both ParentTable and ChildTable have in common, then INNER JOIN returns same thing as NATURAL JOIN.

    So, it is possible to avoid OUTER JOINs even in T-SQL. It was not possible until recently (was it MS SQL 2005 or 2008 that brought set operators INTERSECT and EXCEPT ?). At early days of SQL, we did not have FOREIGN KEYs, and now we have them. Is this query more deficient than LEFT JOIN? I don't know, it depends on implementation, may be faster in ORACLE, slower in T-SQL or vice versa. That is not the point. the point is - we avoided NULLs, in table design and in our query.

    Done for tonight, will be back tomorrow 🙂

    When you come back tomorrow, I'm wondering why

    NullColumn = 'MISSING'

    is not what you call a token?

  • Is there some thinking that there is no difference between a REPORT (extracting, analyzing, formatting, and displaying datato a user) and RELATIONAL DATA PERSISTENCE?

    They are not the same thing!

    No matter how old you are 😉

  • The SELECT clause can arbitrarily format the results. If it is a token, I would consider it an application token, not a database token.

    Back to an earlier question. Can we consider the NULL implementation as an equivalent alternative? For example, consider the perfect non-null database design that avoids nulls everywhere, including in the queries where UNION/EXCEPT are used. Then consider something similar with nulls. Is using the where clause to filter out nulls, if required, and using ISNULL(NULLCOLUMN, 'IsMissing') as [NULLCOLUMN] in the select clause similar - the same logically? Is the real issue that nulls can be used in other ways that would make the two not logically the same or lead to mistakes?

    RandyHelpdesk: Perhaps Im not the only one that does not know what you are doing. 😉

  • one problem with that thinking is that sql server does not handle "null" flag consistently. it screws up grouping, screws up string handling, sorting, even keys etc. etc.

    Just one example of the silliness. You can make a nullable column a unique constraint. In that case the sql server engine will allow ONE null value in the column. How irrational, silly, and unexplainable is that?

    If you don't know the value of something, how can you say one occurence of an unknown value is unique! LOL

    there is more info out there about the inconsistencies in how sql server "treats" null flag (and I believe it has even changed over time!)

    this one is sort of ok

    https://www.simple-talk.com/sql/t-sql-programming/how-to-get-nulls-horribly-wrong-in-sql-server/

    Mr. Pascal of databasedebukings does a better job of explaining the silliness, in my opinion.

  • When you come back tomorrow, I'm wondering why

    NullColumn = 'MISSING'

    is not what you call a token?

    I believe that Old Hand provided adequate answer in his post 1766165. NullColumn = 'MISSING' is simply an expression used in a query, and that is OK with both relational theory and SQL standard. Whether we call it token or not, does not matter much. Unlike NULL which has at least 3 meanings (unknown, missing, not applicable) this expression is clear - the data is missing. In different context, we might have chosen NullColumn = 'Not Applicable'. In any case, it is under our control - depending on context and particular problem we can explain NULL like data any way we see it fit.

    As for observation that T-SQL treats allows only one NULL per column in UNIQUE constraints, true, it does not follow the standard, which allows infinite number of rows with NULL in unique constraints. Contrary to common belief (?), NULL uniqueness can be a very useful thing. People who have read Defensive Database Programming by Alex Kuznetsov will know what am I talking about. Story is too long to be explained in a single post, and it would drift away from the article's topic - to NULL or not to NULL. NULLs can be put to useful work, if handled in a wise way - if we treat them as well defined token, with clear meaning, or, as in T-SQWL case, if we allow only one NULL per column.

    Again, the real problem is not to NULL or not to NULL. The problem is how we think about constraints. If we make several constraints work together then we will achieve much more than thinking of constraints independent of each other, which is unfortunately the case most of the time. Problem with NULLs is that we cannot figure out their meaning in particular situations - NULLs are uncontrollable. Once we harness NULLs with constraints and logic, then, yes, we can safely use them.

    Long time ago, at the time of punched cards, there was a science 'Cybernetics', which was explained to us as 'science about information' (and had nothing to do with computers). We learned then that information is a piece of data that decreases uncertainty of some problem - information increases our knowledge about the problem in question. NULLs, if left uncontrolled, do not increase our knowledge about something, therefore they do not meet definition of 'information' or definition of data. Remember, that a database is a set of TRUE facts, and relations happened to be just a convenient way to manage the data in rigorous and consistent way. NULLs are neither TRUE nor FALSE, they have no value, therefore, they are not data at all. If we manage to control NULLs (by means of constraints), they become tokens with meaning and we can try to take advantage of them. If there is no meaning, there is nothing to talk about, eh?

    Cheers 🙂 ?

  • ScottPletcher (3/1/2016)


    roger.plowman (3/1/2016)


    patrickmcginnis59 10839 (2/29/2016)


    roger.plowman (2/29/2016)


    I still don't see why relational tables can't have records for TBD, N/A, and UNK.

    If you need to know why a value is missing, you're going to have to code it in anyways. So why not allocate a column to it? Yes its an extra column, but trying to read anything like "unknown" into the null / lack of a value is just going to leave you mystified about what null actually is. Null means a value isn't in your database column. It doesn't mean its unknown, missing, due to be filled in, it only means a value has not been stored in that column, it doesn't matter if its a column in a table or a column in a missing row in a left join.

    Heck, if the other column says TBD, N/A, or UKN, you don't even need to see if your nullable column has a value, your status column says as much already!

    Its up to the application / use case to determine what having a null in the column means. I've seen it mean "truck is parked in loading dock."

    If you have multiple reasons why a column lacks a value, stick another column in there and encode your TBD, N/A, UNK, because again, in the most fundamental sense, null by default doesn't mean any of those things. But for the sanity of folks who use database software, DON'T HAVE YOUR RDBMS VENDOR HARD WIRE MEANINGS INTO NULLS.

    Because for our application TBD/NA/UNK was necessary for (amost) every relation--over 400 of them. By standardizing the values of the TBD/NA/UNK records it vastly simplified both the DB design and the code.

    Using magic values in other contexts (like dates) does have coding implications, but on balance those complexities are handled outside SQL Server (for example, creating a function for formatting dates to include them on a report) that made the code MUCH MUCH MUCH simpler.

    Business rules inside SQL Server were also simplified. Trivalue logic quickly becomes very messy at scale.

    Besides, no one said nulls should have hard-coded meaning (other than "data is missing"). The whole argument is about better options than allowing nulls in the first place.

    I happened to pick a dates starting at 1900 and going back for my flags (since birthdates weren't an issue), but I can certainly see the advantage of using a date like 12/31/9999 or whatever too.

    The point is, allowing nulls in data is inviting inconsistencies and worse, ambiguities into it. A null says data is missing, it doesn't give a clue as to why. Flag values may be evil but Null (outside a very specific usage) is the Devil Incarnate...

    No, there really should not be any magic values. Instead, you need a separate column(s) to properly map the data/information.

    Btw, what specifically is the difference between "TDB" and "UNK"??

    UNK means "verified unknown", as in the data is forever unknowable. For example, if an item was purchased at some point in the past but all records of the purchase have been lost, then the purchase date would be UNK. If the purchase date information is available and just hasn't been entered yet, then it's TBD.

    Likewise, a "warehoused date" would be when the object was stored (ie it had been in use originally, but is no longer needed, but might be needed in the future). If the item is in current use the warehoused date would be N/A as it's not applicable to an item in use. TBD in this case would mean the status hasn't been entered yet, but the data can be found and entered. UNK in this case wouldn't make sense because that would mean you couldn't figure out if the item was warehoused or not--which should be physically impossible. 🙂

  • In the real world, there can be unknown values at a given point in time in a data store. For example, birth date. If you need to create the row now, but don't know the person's birth date, the only reasonable value to use is NULL.

    'MISSING', like any other magic value, is wrong on its face. Add an appropriate indicator for that. If the missing status relates to the keys, then the indicator will relate to the keys: trying to say that using an indicator instead of a magic value violates 3NF is just bizarre. Yes, you could model separate tables every time, but that "cure" is worse than the disease. At some point you have to be practical. If dbms creators had waited until they could meet every one of Codd's rules to create a db engine, we'd still be waiting!

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • Zidar (3/3/2016)


    When you come back tomorrow, I'm wondering why

    NullColumn = 'MISSING'

    is not what you call a token?

    I believe that Old Hand provided adequate answer in his post 1766165. NullColumn = 'MISSING' is simply an expression used in a query, and that is OK with both relational theory and SQL standard. Whether we call it token or not, does not matter much. Unlike NULL which has at least 3 meanings (unknown, missing, not applicable) this expression is clear - the data is missing.

    I think this is the root of my dissatisfaction with this, NULL is only supposed to mean one thing, the column that is NULL has no value stored in it. Now whether this is a column persisted in a table or a column in a result set in a left join with missing right hand row, it still means there is no value there.

    In my understanding, its the programmers choice on what conclusions to make when the column lacks a value, not any sort of relational theory. If you instead replace this NULL with a value subtracted from the types available range of values (by a NOT NULL column, a COALESCE in a left join, or in your case the UNION), this in my understanding of what you're saying is a "token" saying the same thing except this token is actually being stored into the column position but it represents the same condition.

    Once you now use the "token" instead of SQL's NULL facility, you now have to manually code tests to see if values from your problem domain have been stored, whereas NULL propogates through expressions saving code.

    Sure, I'll accept that maybe this "token" is more relationally pure, but I'll have to add the caveat that it adds code complexity, and you lose the advantage that SQL offers in the first place, less bugs. Every time you add code complexity without an increase in complexity of the problem you're solving, in my opinion it is a net loss.

  • Thanks for the article.

  • [[that it adds code complexity]]

    Nah,

    try rating 1000 reports where you have to "if isnull blah, blah, blah " a pile of columns for null because it's the "programmer's choice" what null should mean.

Viewing 15 posts - 91 through 105 (of 137 total)

You must be logged in to reply to this topic. Login to reply