Database Design Follies: NULL vs. NOT NULL

  • rstone (2/29/2016)


    I think nulls are used more often than required, but they have their place. If we start with a rule that forbids nulls in the database, then we are doing physical design before logical design. If nulls are not allowed, then another means is required to avoid or allow the same option without null. How much extra design and code justifies not using a null? Is such a design is really less complex?

    My point exactly.

  • j-1064772 (2/29/2016)


    rstone (2/29/2016)


    I think nulls are used more often than required, but they have their place. If we start with a rule that forbids nulls in the database, then we are doing physical design before logical design. If nulls are not allowed, then another means is required to avoid or allow the same option without null. How much extra design and code justifies not using a null? Is such a design is really less complex?

    My point exactly.

    Is there anything wrong with a rule like

    A Natural Key should be not null.

    Attributes should be nullible except in lookup tables.

    Catch-all queries done right [/url]
    Gail Shaw's Performance Blog[/url]

  • I still don't see why relational tables can't have records for TBD, N/A, and UNK.

    This A) gets rid of nulls, which are just messy to code for and B) doesn't require any extra work.

    In the example the hearing table is related to the judge table, right? So if the judge hasn't been decided yet it's TBD (a record in the Judge table). If the hearing doesn't need a judge (?) it's N/A (ditto). And if you can't know who the judge is (??!) it's UNK (ditto yet again).

    No extra 1:1 tables needed. No tri-value logic code needed.

    The whole surrogate/natural key is another argument for another day. 🙂

    Simple, clean, relational design. Done and dusted. Simple to enter, simple to choose, simple to report.

  • roger.plowman (2/29/2016)


    I still don't see why relational tables can't have records for TBD, N/A, and UNK.

    This A) gets rid of nulls, which are just messy to code for and B) doesn't require any extra work.

    In the example the hearing table is related to the judge table, right? So if the judge hasn't been decided yet it's TBD (a record in the Judge table). If the hearing doesn't need a judge (?) it's N/A (ditto). And if you can't know who the judge is (??!) it's UNK (ditto yet again).

    No extra 1:1 tables needed. No tri-value logic code needed.

    The whole surrogate/natural key is another argument for another day. 🙂

    Simple, clean, relational design. Done and dusted. Simple to enter, simple to choose, simple to report.

    Default values still require special code because you still have to account for the data not existing and oftentimes it makes more sense just to use NULL. I would rather see no data than a default value that may or may not make sense.

  • ZZartin (2/29/2016)


    roger.plowman (2/29/2016)


    I still don't see why relational tables can't have records for TBD, N/A, and UNK.

    This A) gets rid of nulls, which are just messy to code for and B) doesn't require any extra work.

    In the example the hearing table is related to the judge table, right? So if the judge hasn't been decided yet it's TBD (a record in the Judge table). If the hearing doesn't need a judge (?) it's N/A (ditto). And if you can't know who the judge is (??!) it's UNK (ditto yet again).

    No extra 1:1 tables needed. No tri-value logic code needed.

    The whole surrogate/natural key is another argument for another day. 🙂

    Simple, clean, relational design. Done and dusted. Simple to enter, simple to choose, simple to report.

    Default values still require special code because you still have to account for the data not existing and oftentimes it makes more sense just to use NULL. I would rather see no data than a default value that may or may not make sense.

    This wouldn't be default values, they would be the actual values. TBD, N/A, or UNK might not be able to sit on the bench but they are valid values none the less. How they are handled by automation would, of course, be up to the developers, but in terms of code they do not require special processing the way Nulls do and they're a lot easier to use in a where clause.

    Besides, as values they tell you WHY the data is missing, not just that it is. I've always found that information invaluable.

  • roger.plowman (2/29/2016)


    I still don't see why relational tables can't have records for TBD, N/A, and UNK.

    If you need to know why a value is missing, you're going to have to code it in anyways. So why not allocate a column to it? Yes its an extra column, but trying to read anything like "unknown" into the null / lack of a value is just going to leave you mystified about what null actually is. Null means a value isn't in your database column. It doesn't mean its unknown, missing, due to be filled in, it only means a value has not been stored in that column, it doesn't matter if its a column in a table or a column in a missing row in a left join.

    Heck, if the other column says TBD, N/A, or UKN, you don't even need to see if your nullable column has a value, your status column says as much already!

    Its up to the application / use case to determine what having a null in the column means. I've seen it mean "truck is parked in loading dock."

    If you have multiple reasons why a column lacks a value, stick another column in there and encode your TBD, N/A, UNK, because again, in the most fundamental sense, null by default doesn't mean any of those things. But for the sanity of folks who use database software, DON'T HAVE YOUR RDBMS VENDOR HARD WIRE MEANINGS INTO NULLS.

  • "But for the sanity of folks who use database software, DON'T HAVE YOUR RDBMS VENDOR HARD WIRE MEANINGS INTO NULLS."

    Speaking from bitter experience perhaps ?

  • roger.plowman (2/29/2016)


    ZZartin (2/29/2016)


    roger.plowman (2/29/2016)


    I still don't see why relational tables can't have records for TBD, N/A, and UNK.

    This A) gets rid of nulls, which are just messy to code for and B) doesn't require any extra work.

    In the example the hearing table is related to the judge table, right? So if the judge hasn't been decided yet it's TBD (a record in the Judge table). If the hearing doesn't need a judge (?) it's N/A (ditto). And if you can't know who the judge is (??!) it's UNK (ditto yet again).

    No extra 1:1 tables needed. No tri-value logic code needed.

    The whole surrogate/natural key is another argument for another day. 🙂

    Simple, clean, relational design. Done and dusted. Simple to enter, simple to choose, simple to report.

    Default values still require special code because you still have to account for the data not existing and oftentimes it makes more sense just to use NULL. I would rather see no data than a default value that may or may not make sense.

    This wouldn't be default values, they would be the actual values. TBD, N/A, or UNK might not be able to sit on the bench but they are valid values none the less. How they are handled by automation would, of course, be up to the developers, but in terms of code they do not require special processing the way Nulls do and they're a lot easier to use in a where clause.

    Besides, as values they tell you WHY the data is missing, not just that it is. I've always found that information invaluable.

    I would see that information being stored in some kind of a status field used in the process that fills that information. NULL is a universal value that means the same thing in any field or column it's used in and I really don't see why adding IS NULL or IS NOT NULL checks is messy. And while in your example having a pseudo NULL value might work there's other cases where it really doesn't like number or date fields.

    For example let's say I have an order record with a shipment date on it what makes more sense? Leaving the shipment date null and setting some kind of status to pending shipping or setting the shipment date to say 9999-12-31 and then trying to code for that value everywhere?

  • Using a single record in a related table to define "missing" or "unknown" is an option that I have seen in use. I thought it resulted in some very complicated and confusing code. If there are many states possible for the data, then a state table might be something to include. I don't think the added complexity is worth avoiding null just for the sake of avoiding a null. While a null can be a little odd to work with at times, it has fixed and stable rules. If the default null rules work, the result will probably be better than the alternative database design - at least judging from what I've come across.

    Handling the unknown is a complexity that will have to be handled someplace. There is no magic that will replace the need to check something, be it null or a magic value defined to be unknown. WHERE value IS null or value = "unknown"? Some developers like to not have to deal with null values in code at the expense of the database design. (And some think UDFs are great for code encapsulation.) A database designer should not want to deal with using a column for more than one thing - it is either a value or the state of the value, not both. A purest might say that null falls into this category and should be avoided. However, the SQL spec allows for nulls for this one very special and very common case. Using "unknown" as a value is harder to justify.

    Even if the client defines the domain to include "unknown" in the spec, the physical design does not have to use it as a value in the column. If there are other business rules that make the "unknown" value more of a value than a truly unknown value, then perhaps it is a value.

    RandyHelpdesk: Perhaps Im not the only one that does not know what you are doing. 😉

  • Ed, I think you're right. The only exception I would like to mention what occurred is in my "previous life" as a mainframe (DB2) DBA. Our data tables were large (multi-billion row tables). At design time, if you specified VARCHAR or NULL datatypes, you were introducing a potential problem when performing mass updates or adding new columns. I would describe a scenario like this: your data is laid out on database pages. You have a lot of columns with a NULLABLE datatype. At design time you calculate how many rows per page you will have for this large table with a generous amount left over to add new columns. Over time though, the "free space" in each page for you given number of rows diminishes. Now you attempt to add a new column, but don't realize that you are now dropping down in number of rows per page. Instead of an expected immediate action, your ALTER TABLE ADD column takes eight hours to reformat all the rows in the table. Customers and management are most unhappy with you, the hapless DBA.

    If you have a QA database that contains production data, you will probably catch this. But for very large tables, especially ones with financial or health-related data, you probably won't because of regulatory compliance restrictions. I

    When designing for these larger tables, as a rule I would use dummy null data. However, instead of 1900-01-01, I would do something more obvious as 9999-12-31.

    Even though it is a rare problem, I wanted to mention it to your readers to keep in mind in the future in case they might find themselves in this situation.

  • j-1064772 (2/29/2016)


    "But for the sanity of folks who use database software, DON'T HAVE YOUR RDBMS VENDOR HARD WIRE MEANINGS INTO NULLS."

    Speaking from bitter experience perhaps ?

    heh vendors *rolls eyes*

  • If you want to use TBD/NA/etc., you might as well just a one-byte status column. You'll have up to 8 bits to designate just "how" the data is missing. You certainly do NOT want to store 'TBD' | 'NA' | etc. in any column(s).

    Edit: And you certainly don't want any "magic" values, such as 19000101 or 99990101 (except as "unlimited end date", its natural meaning).

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • [[You'll have up to 8 bits to designate just "how" the data is missing. You certainly do NOT want to store 'TBD' | 'NA' | etc. in any column(s).]]

    Oh well, have to disagree again.

    And what happens if the reasons become > 8 ? Never happen right 🙂

    And now we are encoding hidden meaning that has to be parsed out of numeric data?

    How about a "reason" table and make a column in the missing data table with a foreign key? Kind of more relational way to persist meaning?

    ken

  • MadAdmin (2/29/2016)


    j-1064772 (2/29/2016)


    rstone (2/29/2016)


    I think nulls are used more often than required, but they have their place. If we start with a rule that forbids nulls in the database, then we are doing physical design before logical design. If nulls are not allowed, then another means is required to avoid or allow the same option without null. How much extra design and code justifies not using a null? Is such a design is really less complex?

    My point exactly.

    Is there anything wrong with a rule like

    A Natural Key should be not null.

    Attributes should be nullible except in lookup tables.

    Can you decide now which should be used by everybody in every case? I can't think of why a natural key would ever be null. If an attribute can't ever be missing, then it might need to be "not null" for performance and data consistency. (Please define C1 as not null in table xray_treament as a null results in failure to place the target foil in front of the electron beam resulting in patient death.) If a price lookup table includes a date range for which the price is valid, perhaps a null price is okay for some future date ranges as part of a price change workflow. (Might be nice to buy stuff from a site where 0 is used rather than null for future prices.)

    RandyHelpdesk: Perhaps Im not the only one that does not know what you are doing. 😉

  • kenambrose (2/29/2016)


    [[You'll have up to 8 bits to designate just "how" the data is missing. You certainly do NOT want to store 'TBD' | 'NA' | etc. in any column(s).]]

    Oh well, have to disagree again.

    And what happens if the reasons become > 8 ? Never happen right 🙂

    And now we are encoding hidden meaning that has to be parsed out of numeric data?

    How about a "reason" table and make a column in the missing data table with a foreign key? Kind of more relational way to persist meaning?

    ken

    The business rules would determine if the "reason" column is being abused and storing more than one real value. Is it just a value that is returned as a single unit of information? A relation table would make more sense if the individual reasons are used separately. Even so, perhaps deformalizing to a byte might make life easier - at least until the next report is required.

    RandyHelpdesk: Perhaps Im not the only one that does not know what you are doing. 😉

Viewing 15 posts - 61 through 75 (of 137 total)

You must be logged in to reply to this topic. Login to reply