Database Design Follies: NULL vs. NOT NULL

  • patrickmcginnis59 10839 (2/29/2016)


    roger.plowman (2/29/2016)


    I still don't see why relational tables can't have records for TBD, N/A, and UNK.

    If you need to know why a value is missing, you're going to have to code it in anyways. So why not allocate a column to it? Yes its an extra column, but trying to read anything like "unknown" into the null / lack of a value is just going to leave you mystified about what null actually is. Null means a value isn't in your database column. It doesn't mean its unknown, missing, due to be filled in, it only means a value has not been stored in that column, it doesn't matter if its a column in a table or a column in a missing row in a left join.

    Heck, if the other column says TBD, N/A, or UKN, you don't even need to see if your nullable column has a value, your status column says as much already!

    Its up to the application / use case to determine what having a null in the column means. I've seen it mean "truck is parked in loading dock."

    If you have multiple reasons why a column lacks a value, stick another column in there and encode your TBD, N/A, UNK, because again, in the most fundamental sense, null by default doesn't mean any of those things. But for the sanity of folks who use database software, DON'T HAVE YOUR RDBMS VENDOR HARD WIRE MEANINGS INTO NULLS.

    Because for our application TBD/NA/UNK was necessary for (amost) every relation--over 400 of them. By standardizing the values of the TBD/NA/UNK records it vastly simplified both the DB design and the code.

    Using magic values in other contexts (like dates) does have coding implications, but on balance those complexities are handled outside SQL Server (for example, creating a function for formatting dates to include them on a report) that made the code MUCH MUCH MUCH simpler.

    Business rules inside SQL Server were also simplified. Trivalue logic quickly becomes very messy at scale.

    Besides, no one said nulls should have hard-coded meaning (other than "data is missing"). The whole argument is about better options than allowing nulls in the first place.

    I happened to pick a dates starting at 1900 and going back for my flags (since birthdates weren't an issue), but I can certainly see the advantage of using a date like 12/31/9999 or whatever too.

    The point is, allowing nulls in data is inviting inconsistencies and worse, ambiguities into it. A null says data is missing, it doesn't give a clue as to why. Flag values may be evil but Null (outside a very specific usage) is the Devil Incarnate...

  • roger.plowman (3/1/2016)


    patrickmcginnis59 10839 (2/29/2016)


    roger.plowman (2/29/2016)


    I still don't see why relational tables can't have records for TBD, N/A, and UNK.

    If you need to know why a value is missing, you're going to have to code it in anyways. So why not allocate a column to it? Yes its an extra column, but trying to read anything like "unknown" into the null / lack of a value is just going to leave you mystified about what null actually is. Null means a value isn't in your database column. It doesn't mean its unknown, missing, due to be filled in, it only means a value has not been stored in that column, it doesn't matter if its a column in a table or a column in a missing row in a left join.

    Heck, if the other column says TBD, N/A, or UKN, you don't even need to see if your nullable column has a value, your status column says as much already!

    Its up to the application / use case to determine what having a null in the column means. I've seen it mean "truck is parked in loading dock."

    If you have multiple reasons why a column lacks a value, stick another column in there and encode your TBD, N/A, UNK, because again, in the most fundamental sense, null by default doesn't mean any of those things. But for the sanity of folks who use database software, DON'T HAVE YOUR RDBMS VENDOR HARD WIRE MEANINGS INTO NULLS.

    Because for our application TBD/NA/UNK was necessary for (amost) every relation--over 400 of them. By standardizing the values of the TBD/NA/UNK records it vastly simplified both the DB design and the code.

    Using magic values in other contexts (like dates) does have coding implications, but on balance those complexities are handled outside SQL Server (for example, creating a function for formatting dates to include them on a report) that made the code MUCH MUCH MUCH simpler.

    Business rules inside SQL Server were also simplified. Trivalue logic quickly becomes very messy at scale.

    Besides, no one said nulls should have hard-coded meaning (other than "data is missing"). The whole argument is about better options than allowing nulls in the first place.

    I happened to pick a dates starting at 1900 and going back for my flags (since birthdates weren't an issue), but I can certainly see the advantage of using a date like 12/31/9999 or whatever too.

    The point is, allowing nulls in data is inviting inconsistencies and worse, ambiguities into it. A null says data is missing, it doesn't give a clue as to why. Flag values may be evil but Null (outside a very specific usage) is the Devil Incarnate...

    /shrug I'm just not seeing what's so hard about coding for NULL values, they work consistently and predictably and are standard across all datatypes. I find using a field for multiple uses, in your case a status and a value, to be much more irritating and complex to code for in the long run especially when you start trying to use magic values...

  • ZZartin (3/1/2016)


    roger.plowman (3/1/2016)


    patrickmcginnis59 10839 (2/29/2016)


    roger.plowman (2/29/2016)


    I still don't see why relational tables can't have records for TBD, N/A, and UNK.

    If you need to know why a value is missing, you're going to have to code it in anyways. So why not allocate a column to it? Yes its an extra column, but trying to read anything like "unknown" into the null / lack of a value is just going to leave you mystified about what null actually is. Null means a value isn't in your database column. It doesn't mean its unknown, missing, due to be filled in, it only means a value has not been stored in that column, it doesn't matter if its a column in a table or a column in a missing row in a left join.

    Heck, if the other column says TBD, N/A, or UKN, you don't even need to see if your nullable column has a value, your status column says as much already!

    Its up to the application / use case to determine what having a null in the column means. I've seen it mean "truck is parked in loading dock."

    If you have multiple reasons why a column lacks a value, stick another column in there and encode your TBD, N/A, UNK, because again, in the most fundamental sense, null by default doesn't mean any of those things. But for the sanity of folks who use database software, DON'T HAVE YOUR RDBMS VENDOR HARD WIRE MEANINGS INTO NULLS.

    Because for our application TBD/NA/UNK was necessary for (amost) every relation--over 400 of them. By standardizing the values of the TBD/NA/UNK records it vastly simplified both the DB design and the code.

    Using magic values in other contexts (like dates) does have coding implications, but on balance those complexities are handled outside SQL Server (for example, creating a function for formatting dates to include them on a report) that made the code MUCH MUCH MUCH simpler.

    Business rules inside SQL Server were also simplified. Trivalue logic quickly becomes very messy at scale.

    Besides, no one said nulls should have hard-coded meaning (other than "data is missing"). The whole argument is about better options than allowing nulls in the first place.

    I happened to pick a dates starting at 1900 and going back for my flags (since birthdates weren't an issue), but I can certainly see the advantage of using a date like 12/31/9999 or whatever too.

    The point is, allowing nulls in data is inviting inconsistencies and worse, ambiguities into it. A null says data is missing, it doesn't give a clue as to why. Flag values may be evil but Null (outside a very specific usage) is the Devil Incarnate...

    /shrug I'm just not seeing what's so hard about coding for NULL values, they work consistently and predictably and are standard across all datatypes. I find using a field for multiple uses, in your case a status and a value, to be much more irritating and complex to code for in the long run especially when you start trying to use magic values...

    That may be where the misunderstanding is. We don't consider TBD etc to be a status, we consider it a value. In the court example TBD would be a judge, just not a judge that can hear cases... 🙂

  • roger.plowman (3/1/2016)


    ZZartin (3/1/2016)


    roger.plowman (3/1/2016)


    patrickmcginnis59 10839 (2/29/2016)


    roger.plowman (2/29/2016)


    I still don't see why relational tables can't have records for TBD, N/A, and UNK.

    If you need to know why a value is missing, you're going to have to code it in anyways. So why not allocate a column to it? Yes its an extra column, but trying to read anything like "unknown" into the null / lack of a value is just going to leave you mystified about what null actually is. Null means a value isn't in your database column. It doesn't mean its unknown, missing, due to be filled in, it only means a value has not been stored in that column, it doesn't matter if its a column in a table or a column in a missing row in a left join.

    Heck, if the other column says TBD, N/A, or UKN, you don't even need to see if your nullable column has a value, your status column says as much already!

    Its up to the application / use case to determine what having a null in the column means. I've seen it mean "truck is parked in loading dock."

    If you have multiple reasons why a column lacks a value, stick another column in there and encode your TBD, N/A, UNK, because again, in the most fundamental sense, null by default doesn't mean any of those things. But for the sanity of folks who use database software, DON'T HAVE YOUR RDBMS VENDOR HARD WIRE MEANINGS INTO NULLS.

    Because for our application TBD/NA/UNK was necessary for (amost) every relation--over 400 of them. By standardizing the values of the TBD/NA/UNK records it vastly simplified both the DB design and the code.

    Using magic values in other contexts (like dates) does have coding implications, but on balance those complexities are handled outside SQL Server (for example, creating a function for formatting dates to include them on a report) that made the code MUCH MUCH MUCH simpler.

    Business rules inside SQL Server were also simplified. Trivalue logic quickly becomes very messy at scale.

    Besides, no one said nulls should have hard-coded meaning (other than "data is missing"). The whole argument is about better options than allowing nulls in the first place.

    I happened to pick a dates starting at 1900 and going back for my flags (since birthdates weren't an issue), but I can certainly see the advantage of using a date like 12/31/9999 or whatever too.

    The point is, allowing nulls in data is inviting inconsistencies and worse, ambiguities into it. A null says data is missing, it doesn't give a clue as to why. Flag values may be evil but Null (outside a very specific usage) is the Devil Incarnate...

    /shrug I'm just not seeing what's so hard about coding for NULL values, they work consistently and predictably and are standard across all datatypes. I find using a field for multiple uses, in your case a status and a value, to be much more irritating and complex to code for in the long run especially when you start trying to use magic values...

    That may be where the misunderstanding is. We don't consider TBD etc to be a status, we consider it a value. In the court example TBD would be a judge, just not a judge that can hear cases... 🙂

    His honorable judge Tobias Beauford Delaney would probably wonder why he is getting all the work.

    I think a strict interpretation of 1NF would not allow the use of nulls. We can alter the domain to avoid NULL by using a token. On the other hand, we can add the "NULL token" to any domain via the infrastructure by allowing the column to have a NULL value. I see these as the same, except one has a well defined infrastructure.

    Is it wrong to allow NULL, TBD, or UNKNOWN tokens and judge names in the same domain? Why not define the domain to be either the name of a judge, judge initials, a token, an array of judges, or a phone number? I can't seem to recall any limitation on the domain other than type. The SQLXML type allows almost anything.

    I would think that the domain should be defined in such a way that operations on values do not depend upon the value. If the code treats "TDB" as special, then perhaps this means the domain is not well defined. At the same time, if the normal NULL rules are not enough to avoid extra NULL handling, then perhaps it's a case where null should not be used as part of the domain. On the other hand, is a little special handling better than the alternative - as form of denormalization?

    Does anybody have a link re proper domain design?

    RandyHelpdesk: Perhaps Im not the only one that does not know what you are doing. 😉

  • [proper domain design]

    sure do. check out databasedebunkings

    warning: "procedural programmers" will detest just about everything recommended there, claim it is all impractical theory, won't perform acceptably,is impossible in the real world, etc. etc. etc. 😛

  • kenambrose (2/29/2016)


    [[You'll have up to 8 bits to designate just "how" the data is missing. You certainly do NOT want to store 'TBD' | 'NA' | etc. in any column(s).]]

    Oh well, have to disagree again.

    And what happens if the reasons become > 8 ? Never happen right 🙂

    And now we are encoding hidden meaning that has to be parsed out of numeric data?

    How about a "reason" table and make a column in the missing data table with a foreign key? Kind of more relational way to persist meaning?

    ken

    SQL Server has a data type called "smallint" that has 16 bits, just so you're aware of it.

    If you need more than a few "non-value" values, then yeah, take another approach. But to me, as a practical matter, it's ridiculous to set up a separate table just for TBD|NA|NONE.

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • roger.plowman (3/1/2016)


    patrickmcginnis59 10839 (2/29/2016)


    roger.plowman (2/29/2016)


    I still don't see why relational tables can't have records for TBD, N/A, and UNK.

    If you need to know why a value is missing, you're going to have to code it in anyways. So why not allocate a column to it? Yes its an extra column, but trying to read anything like "unknown" into the null / lack of a value is just going to leave you mystified about what null actually is. Null means a value isn't in your database column. It doesn't mean its unknown, missing, due to be filled in, it only means a value has not been stored in that column, it doesn't matter if its a column in a table or a column in a missing row in a left join.

    Heck, if the other column says TBD, N/A, or UKN, you don't even need to see if your nullable column has a value, your status column says as much already!

    Its up to the application / use case to determine what having a null in the column means. I've seen it mean "truck is parked in loading dock."

    If you have multiple reasons why a column lacks a value, stick another column in there and encode your TBD, N/A, UNK, because again, in the most fundamental sense, null by default doesn't mean any of those things. But for the sanity of folks who use database software, DON'T HAVE YOUR RDBMS VENDOR HARD WIRE MEANINGS INTO NULLS.

    Because for our application TBD/NA/UNK was necessary for (amost) every relation--over 400 of them. By standardizing the values of the TBD/NA/UNK records it vastly simplified both the DB design and the code.

    Using magic values in other contexts (like dates) does have coding implications, but on balance those complexities are handled outside SQL Server (for example, creating a function for formatting dates to include them on a report) that made the code MUCH MUCH MUCH simpler.

    Business rules inside SQL Server were also simplified. Trivalue logic quickly becomes very messy at scale.

    Besides, no one said nulls should have hard-coded meaning (other than "data is missing"). The whole argument is about better options than allowing nulls in the first place.

    I happened to pick a dates starting at 1900 and going back for my flags (since birthdates weren't an issue), but I can certainly see the advantage of using a date like 12/31/9999 or whatever too.

    The point is, allowing nulls in data is inviting inconsistencies and worse, ambiguities into it. A null says data is missing, it doesn't give a clue as to why. Flag values may be evil but Null (outside a very specific usage) is the Devil Incarnate...

    No, there really should not be any magic values. Instead, you need a separate column(s) to properly map the data/information.

    Btw, what specifically is the difference between "TDB" and "UNK"??

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • heck why not use bigint?

    and then someone will ask you to write a report to see all the records that have certain of those encoded values.

    So then are you going to do a join on the parsing algorithm?

    And then someone wants to "add" new reasons. Using a relationship table, you simply add a row to the table.

    Using encoded parsing, now you have to go in and hack the parsing code every time someone thinks up a new reason to be persisted.

    at what point does it start to be more work, and more error prone, and a maintenance horror show parsing numeric values with encoded meaning vs. using a table in a relation?

    almost immediately...

  • The debate reminds me of a quote, "I'm a mathematical optimist: I deal only with positive integers."

    I'm glad I came across this discussion. 6NF with only one non-key column removes the need for nulls. I've seen these in a vendor database. At the time, I thought the design was just a very effective form of obfuscation. I really don't think we are ready for non-nulls.

    https://www.dcs.warwick.ac.uk/~hugh/TTM/Missing-info-without-nulls.pdf

    http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.128.9243&rep=rep1&type=pdf

    RandyHelpdesk: Perhaps Im not the only one that does not know what you are doing. 😉

  • By definition, relations can not 'store' NULLs. A relation is simply a shorthand notation for set of logical propositions, derived from the same logical predicate. Heavy words, indeed. Propositions? "Some cat is named[Princ] and was born i [2001] and is of [male] gender ". We can write many sentences like this one, by supplying various values for bracketed columns. If we supply various values to something, then that 'something' can be called a function with parameters. In relational theory, such function is called logical predicate. For mentioned cats, the predicate (function) would look like this:

    P(catname, BirthYear, gender) = Some cat is named [<CatName>], was born in [<BirthYear>] and is of [<gender>] gender.

    All mathematical functions can be specified as tables of values. That is how we got to wrong belief "tables ARE relations". In relational theory, we call variables of predicates 'attributes'. Further, we define domains for attributes, as sets of possible values to be possibly supplied to attributes. Thus, domain for <gender> can be defined for example as {male, female}. For <BirthYear> we can require, say, that domain is set of positive integers in interval [1900,2099]. With name, we can allow non empty string of characters, letters and digits. Regardless of domain specification, attribute CatName must contain exactly that - a name of a cat. Any string we use must have the same meaning - name of a cat. Name of a dog is not allowed in our relation, even if it looks the same as name of some cat. Hence, if Garfield is a cat, it is OK, dog name Garfield is not acceptable. Otherwise, given proposition would not be true.

    Thus, if we accept that all values in given 'column' represent similar things, belonging to the same domain, follows that we cannot accept NULLs nor tokens for TBD, UNK, NA etc. When system or user read something from the relation, they msut assume that the values belong to their domains. Proposition that odes not represent our predicate about cats may look like this:

    Some cat is named [TBD], was born in [9999] and is of [UNK] gender.

    TBD means 'To be Determined', which is a statement, a declaration, certainly not a name of a cat. If we choose to represent unknown year with 9999 we are wrong again, because 9999 does not belong to defined domain. Same for UNKnown gender. Same for marker NULL which means 'unknown or missing or to be determined'.

    I hope it is clear, mathematically speaking, NULLs and tokens do not represent domain values for attributes. Thus, they must not be used in relational design. In real life, we do break rules of relational theory here and there. It is not as bad as ignoring Pythagoras's Theorem, but it is bad. Relational theory is based on mathematical logs, and ignoring logic is generally a bad thing. Just like points in geometry - have no dimensions, yet we draw them on the paper, lines have no thickness, yet we do draw them with thickness. But we do not ignore logic of geometry, yet we choose to ignore logic in relational theory.

    Do as you please, you are on your own. Good luck.

  • I hope it is clear, mathematically speaking, NULLs and tokens do not represent domain values for attributes. Thus, they must not be used in relational design.

    Then how do we represent that we have not stored a value in a row? I know we're all willing to do the right thing, if we just know what it is!

  • [represent having no value]

    it was in earlier post. put nullable column into it's own table, enforce 1 to 1 relationship to parent table.

    no data for a row in parent table? don't create a row in child table.

    have data for a row in related parent table? put related row in child table.

    could not be simpler, is 100% relational, and will solve every, repeat every, logical conundrum regarding missing data.

    don't listen to the procedural programmers here, they just will never get it.

  • kenambrose (3/2/2016)


    [represent having no value]

    it was in earlier post. put nullable column into it's own table, enforce 1 to 1 relationship to parent table.

    no data for a row in parent table? don't create a row in child table.

    have data for a row in related parent table? put related row in child table.

    could not be simpler, is 100% relational, and will solve every, repeat every, logical conundrum regarding missing data.

    don't listen to the procedural programmers here, they just will never get it.

    And processing 100K rows would take minutes with every nullable column is in its own table. That's just crazy. Normalization is not the ultimate goal -- enough normalization, say to BCNF, to prevent most data anamalies yet provide a practical system. SAP does that, and there are thousands and thousands and thousands of tables. Good luck with that!

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • kenambrose (3/2/2016)


    [represent having no value]

    it was in earlier post. put nullable column into it's own table, enforce 1 to 1 relationship to parent table.

    no data for a row in parent table? don't create a row in child table.

    have data for a row in related parent table? put related row in child table.

    could not be simpler, is 100% relational, and will solve every, repeat every, logical conundrum regarding missing data.

    don't listen to the procedural programmers here, they just will never get it.

    All I see is increased overhead and complexity for no gain, it has nothing to do with procedural or non procedural. If a data element doesn't have a value and someone tries to see that column it doesn't matter whether they're trying to see it through an outer join or just pulling the row from the main table, they're not getting the data.

  • I am not done 🙂

    Not ignoring the logic can be very difficult in practice, indeed. Somebody mentioned adding status columns to describe what is happening with NULLable columns. Here we break normalization requirement for BCNF- each attribute (column) must depend only on a unique key, whole key and nothing else. So what to do?

    If you look carefully, many examples in favor of using NULLs or tokens for missing/unknown/not applicable data deal with time-sensitive transactions. Account balance cannot be 0 if no transaction hes been recorded. In theory, balance is computed value, and as such should not be kept in tables. If we follow this, zero balance problem disappears altogether. However, in practice we must do exactly that - store current balance somewhere. Good, but not in the table where we keep account number, client number, and such. Alex Kuznetsov, in his book "Defensive Database Programming", offers a solution to keeping track of inventory running balance, in chapter 7, "Constraints and Rock Solid Inventory Systems". Constraints is the key word. It is worth reading and once understood (not easy) it will open a whole new world of possibilities and decrease need for NULLs and tokens. A post on a forum is not enough space to explain what Alex does and why, it would require a full article. So, read it and hope to understand. Then read articles by Joe Celko, on this site and Simple Talk, about constraints, in particular 'state transition constraints', that should help.

    There is one case when NULLs (not tokens) are the solution - nulls that mean "NA", not applicable, exactly that and nothing more. Say we have a relation with to store following predicate:

    P = Employee [EmpID] holds [Job], and if [Job] = 'DBA' then employee is required to posses certificate[CertificateNum].

    This means: if Job = 'DBA' then attribute CertificateNum MUST NOT be NULL, in all other cases it MUST be NULL. If you can write a constraint to enforce this rule, then it is safe to use NULL, with meaning "This guy is not 'DBA' so CertificateNum IS NULL"

    We know exact meaning of NULL, and we control it completely by a CHECK constraint.

    This is the constraint, mathematically speaking:

    Job = 'DBA' <=> CertificateNum IS NOT NULL

    In T-SQL it takes two CHECK constraints to enforce the rule:

    C1: CHECK (NOT [Job] = 'DBA" OR [CertificateNum] IS NOT NULL)

    C2: CHECK ([CertificateNum] IS NOT NULL OR [Job] = 'DBA')

    Funny, in MS Access, which has a crippled SQL compared to MS SQL T-SQL, we can write a simple table level validation rule:

    [Job] = 'DBA' EQV [CertificateNum] IS NOT NULL

Viewing 15 posts - 76 through 90 (of 137 total)

You must be logged in to reply to this topic. Login to reply