Who Likes NULL?

  • Jeff Moden - Friday, July 13, 2018 1:57 PM

    Heh... always fun to watch discussions concerning NULL.  😀

    Agreed - I couldn't help but recall a whole month long debacle on this topic from 10 years ago :hehe:

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • Matt Miller (4) - Friday, July 13, 2018 2:27 PM

    Jeff Moden - Friday, July 13, 2018 1:57 PM

    Heh... always fun to watch discussions concerning NULL.  😀

    Agreed - I couldn't help but recall a whole month long debacle on this topic from 10 years ago :hehe:

    Heh... that's the one I was thinking of.  It'll still never top the two "Banker's Rounding" discussions, though. 😉  Remember those?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden - Friday, July 13, 2018 6:27 PM

    Matt Miller (4) - Friday, July 13, 2018 2:27 PM

    Jeff Moden - Friday, July 13, 2018 1:57 PM

    Heh... always fun to watch discussions concerning NULL.  😀

    Agreed - I couldn't help but recall a whole month long debacle on this topic from 10 years ago :hehe:

    Heh... that's the one I was thinking of.  It'll still never top the two "Banker's Rounding" discussions, though. 😉  Remember those?

    oh yeah... That one was fun too !

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • drew.allen - Friday, July 13, 2018 12:49 PM

    Dennis Q Miller - Friday, July 13, 2018 12:30 PM

    I couldn't agree more.  We debate incessantly about how to represent "unknown" with NULL or a magic number in a datetime column, but we neglect considering schema designs that don't require it in the first place.

    Right..."Unknown" is not valid for a date datatype. And a date datatype does not support a domain that includes "unknown"  as one of its legitimate cases.  Using either NULL or magic number is a trick to pretend it does.  Domains are not defined by the values of a datatype; they are defined by the business requirements at a logical level and, in most cases, can only be approximated by a datatype.

    If you're going to make such a bold statement, you better be prepared to back it up with facts, specifically, you should be able to provide an example of such a schema that doesn't require representing UNKNOWN.  I think that such a schema is untenable, because there will always be gaps in our knowledge where we have to represent "unknown" values.

    Drew

    Consider this EMPLOYEE, non-example which attempts to represent an unknown date with NULL:

    EMPLOYEE

    employee_no (pk)
    employee_name
    termination_date (nullable)
    termination_reason (nullable)

    Here's some alternatives that do not need the NULLable date:

    (1) EMPLOYEE

    employee_no (pk)
    employee_name

    termination_date (nullable)

    termination_reason (nullable)

       EMPLOYEE_TERMINATION

    employee_no (pk, fk->EMPLOYEE.employee_no)
    termination_date (not nullable)
    termination_reason (not nullable)

    --------------------------------
    (2) EMPLOYEE

    employee_no (pk)
    employee_name
    employee_status
    termination_date (not nullable)

    termination_reason (not nullable)

    ------------------------------------------
    (3) EMPLOYEE

    employee_no (pk)
    employee_name

    termination_date (varchar, not nullable)

    --------------------------------------------------------------
    (4) EMPLOYEE

    employee_no (pk)
    employee_name
    termination_date_id (not nullable, fk -> EXTENDED_DATE->date_id)
    termination_reason (varchar, not nullable)

        EXTENDED_DATE
    date_id (pk) 
    date_name (varchar not nullable)
        CALENDAR_DATE
              date_id (pk, fk->SPECIAL_DATE, not nullable)

    ------------------------------
    (5) EMPLOYEE

    employee_no (pk)
    employee_name
    termination_date_id (not nullable, fk -> EXTENDED_DATE->date_id)
    termination_reason (varchar, not nullable)

        EXTENDED DATE
              date_id  (pk, datetime)
              date_name (varchar)


               

  • roger.plowman - Friday, July 13, 2018 12:37 PM

    We're quibbling about semantics. Logical domains can and must be modeled physically with ill-matching physical representations--but that's true even for DATES, which are encoded as the number of days since a milestone date. 🙂 Try storing March 3, 1610 in a DATE or DATETIME datatype for instance.

    Also, the "verified unknown" value, whatever trick you use to represent it, can and in fact must be part of the date domain. For example a purchase date for a piece of equipment that you have lost the purchase order for because it was bought 25 years ago.

    I don't see much difference between using 1/1/9999 to represent "unknown" and using 3/3/9610 to represent 3/3/1610.

    I think we agree. Perhaps this is semantics, but I do want to be careful with the term "date domain".  It suggests assigning a domain to "date" in the general sense, as if all dates have the same domain or as if the datatype is the domain. The impetus for using NULL is often because the datatype does not intrinsically support the domain.

  • Dennis Q Miller - Sunday, July 15, 2018 10:18 AM

    I don't see much difference between using 1/1/9999 to represent "unknown" and using 3/3/9610 to represent 3/3/1610.

    I think we agree. Perhaps this is semantics, but I do want to be careful with the term "date domain".  It suggests assigning a domain to "date" in the general sense, as if all dates have the same domain or as if the datatype is the domain. The impetus for using NULL is often because the datatype does not intrinsically support the domain.

    No - the impetus for using NULL is because you need to have access to the metadata that a value is not known.  No matter what technology or model you happen to be using, unknown isn't actually a value, so it's not data, it's metadata.  It may be expedient or useful for a number of reasons (among others -  easy of querying) to approximate or conglomerate certain "unknowns" as a specific value, but that's really based on a choice to not keep your domain modeling pure. 

    From my own experience though - there are a LOT of real-life cases where all you know for sure that some data isn't known, isn't relevant yet, or just not present:  which one of the three isn't easy to determine and frankly it isn't useful to what it is that you're representing (or worth slowing down the user) to demand higher purity on the data.

    The third manifesto was a good read, and a cute academic discussion of how to expunge NULLs.  As an architect, though I can tell you it falls into the "white tower" anti-pattern, in that it's so pure it can't actually handle any real-life scenarios.

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • Dennis Q Miller - Sunday, July 15, 2018 10:18 AM

    I don't see much difference between using 1/1/9999 to represent "unknown" and using 3/3/9610 to represent 3/3/1610.

    I think we agree. Perhaps this is semantics, but I do want to be careful with the term "date domain".  It suggests assigning a domain to "date" in the general sense, as if all dates have the same domain or as if the datatype is the domain. The impetus for using NULL is often because the datatype does not intrinsically support the domain.

    Heh... there is no real difference except that '9999' is a lot easier to remember. 😉

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Dennis Q Miller - Sunday, July 15, 2018 9:10 AM

    Consider this EMPLOYEE, non-example which attempts to represent an unknown date with NULL:

    EMPLOYEE

    employee_no (pk)
    employee_name
    termination_date (nullable)
    termination_reason (nullable)

    Here's some alternatives that do not need the NULLable date:

    (1) EMPLOYEE

    employee_no (pk)
    employee_name

    termination_date (nullable)

    termination_reason (nullable)

       EMPLOYEE_TERMINATION

    employee_no (pk, fk->EMPLOYEE.employee_no)
    termination_date (not nullable)
    termination_reason (not nullable)

    --------------------------------
    (2) EMPLOYEE

    employee_no (pk)
    employee_name
    employee_status
    termination_date (not nullable)

    termination_reason (not nullable)

    ------------------------------------------
    (3) EMPLOYEE

    employee_no (pk)
    employee_name

    termination_date (varchar, not nullable)

    --------------------------------------------------------------
    (4) EMPLOYEE

    employee_no (pk)
    employee_name
    termination_date_id (not nullable, fk -> EXTENDED_DATE->date_id)
    termination_reason (varchar, not nullable)

        EXTENDED_DATE
    date_id (pk) 
    date_name (varchar not nullable)
        CALENDAR_DATE
              date_id (pk, fk->SPECIAL_DATE, not nullable)

    ------------------------------
    (5) EMPLOYEE

    employee_no (pk)
    employee_name
    termination_date_id (not nullable, fk -> EXTENDED_DATE->date_id)
    termination_reason (varchar, not nullable)

        EXTENDED DATE
              date_id  (pk, datetime)
              date_name (varchar)


               

    Heh... that brings up another hated example for me... "Date_ID".  If you're forced into using the damned things, at least try to convince them that the Date_ID should be the actual "Date Serial Number" from the given date.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Dennis Q Miller - Sunday, July 15, 2018 9:10 AM

    Consider this EMPLOYEE, non-example which attempts to represent an unknown date with NULL:

    EMPLOYEE

    employee_no (pk)
    employee_name
    termination_date (nullable)
    termination_reason (nullable)

    Here's some alternatives that do not need the NULLable date:

    (1) EMPLOYEE

    employee_no (pk)
    employee_name

    termination_date (nullable)

    termination_reason (nullable)

       EMPLOYEE_TERMINATION

    employee_no (pk, fk->EMPLOYEE.employee_no)
    termination_date (not nullable)
    termination_reason (not nullable)

    --------------------------------
    (2) EMPLOYEE

    employee_no (pk)
    employee_name
    employee_status
    termination_date (not nullable)

    termination_reason (not nullable)

    ------------------------------------------
    (3) EMPLOYEE

    employee_no (pk)
    employee_name

    termination_date (varchar, not nullable)

    --------------------------------------------------------------
    (4) EMPLOYEE

    employee_no (pk)
    employee_name
    termination_date_id (not nullable, fk -> EXTENDED_DATE->date_id)
    termination_reason (varchar, not nullable)

        EXTENDED_DATE
    date_id (pk) 
    date_name (varchar not nullable)
        CALENDAR_DATE
              date_id (pk, fk->SPECIAL_DATE, not nullable)

    ------------------------------
    (5) EMPLOYEE

    employee_no (pk)
    employee_name
    termination_date_id (not nullable, fk -> EXTENDED_DATE->date_id)
    termination_reason (varchar, not nullable)

        EXTENDED DATE
              date_id  (pk, datetime)
              date_name (varchar)


               

    And when do the following

    SELECT *
    FROM Employee e
    LEFT OUTER JOIN Employee_Terminate et
        ON e.employee_no = et.employee_no

    you're right back to having NULL values that represent unknown data.  You haven't gotten rid of the NULL values, you've just buried them to make it appear that you've gotten rid of them.

    Drew

    J. Drew Allen
    Business Intelligence Analyst
    Philadelphia, PA

  • drew.allen - Monday, July 16, 2018 7:45 AM

    And when do the following

    SELECT *
    FROM Employee e
    LEFT OUTER JOIN Employee_Terminate et
        ON e.employee_no = et.employee_no

    you're right back to having NULL values that represent unknown data.  You haven't gotten rid of the NULL values, you've just buried them to make it appear that you've gotten rid of them.

    Drew

    I provided an example of how to get rid of NULL; re-introducing NULL with an outer join is counter productive. My instinctive responses is: don't use that query.
    With alternative (1), it's trivial to identify "unknown" termination_dates and it's also trivial to identify the known ones.  Perhaps I should have offered the solution where ACTIVE_EMPLOYEES and TERMINATED_EMPLOYEES are in separate tables--at least a join would not be so tempting.

  • Dennis Q Miller - Monday, July 16, 2018 8:50 PM

    I provided an example of how to get rid of NULL; re-introducing NULL with an outer join is counter productive. My instinctive responses is: don't use that query.
    With alternative (1), it's trivial to identify "unknown" termination_dates and it's also trivial to identify the known ones.  Perhaps I should have offered the solution where ACTIVE_EMPLOYEES and TERMINATED_EMPLOYEES are in separate tables--at least a join would not be so tempting.

    You demonstrated how to eliminate storing NULL values, but sometimes you still have NULL values being reported.  Some reports may require the use of OUTER JOINs to satisfy the requested report.  For instance, wanting a list of all employees that worked for the company in 2017 including those that no longer work for the company and when their employment was terminated.

    It should also be noted that doing so, which if I remember correctly is what 6th normal form accomplishes, may not be the best idea as you also increase the number tables that may need to be joined to satisfy reporting requirements.

  • Dennis Q Miller - Monday, July 16, 2018 8:50 PM

    I provided an example of how to get rid of NULL; re-introducing NULL with an outer join is counter productive. My instinctive responses is: don't use that query.
    With alternative (1), it's trivial to identify "unknown" termination_dates and it's also trivial to identify the known ones.  Perhaps I should have offered the solution where ACTIVE_EMPLOYEES and TERMINATED_EMPLOYEES are in separate tables--at least a join would not be so tempting.

    The business needs are the determining factor, and sometimes the business needs require mixing known and unknown data together.  You can do all the contortions you want to fit the business needs into your ideal model, but if you are unable to efficiently meet the business needs, you won't last long at your job.

    Drew

    J. Drew Allen
    Business Intelligence Analyst
    Philadelphia, PA

  • Lynn Pettis - Tuesday, July 17, 2018 9:15 AM

    Dennis Q Miller - Monday, July 16, 2018 8:50 PM

    drew.allen - Monday, July 16, 2018 7:45 AM

    Dennis Q Miller - Sunday, July 15, 2018 9:10 AM

    drew.allen - Friday, July 13, 2018 12:49 PM

    Dennis Q Miller - Friday, July 13, 2018 12:30 PM

    I couldn't agree more.  We debate incessantly about how to represent "unknown" with NULL or a magic number in a datetime column, but we neglect considering schema designs that don't require it in the first place.

    Right..."Unknown" is not valid for a date datatype. And a date datatype does not support a domain that includes "unknown"  as one of its legitimate cases.  Using either NULL or magic number is a trick to pretend it does.  Domains are not defined by the values of a datatype; they are defined by the business requirements at a logical level and, in most cases, can only be approximated by a datatype.

    If you're going to make such a bold statement, you better be prepared to back it up with facts, specifically, you should be able to provide an example of such a schema that doesn't require representing UNKNOWN.  I think that such a schema is untenable, because there will always be gaps in our knowledge where we have to represent "unknown" values.

    Drew

    Consider this EMPLOYEE, non-example which attempts to represent an unknown date with NULL:

    EMPLOYEE

    employee_no (pk)
    employee_name
    termination_date (nullable)
    termination_reason (nullable)

    Here's some alternatives that do not need the NULLable date:

    (1) EMPLOYEE

    employee_no (pk)
    employee_name

    termination_date (nullable)

    termination_reason (nullable)

       EMPLOYEE_TERMINATION

    employee_no (pk, fk->EMPLOYEE.employee_no)
    termination_date (not nullable)
    termination_reason (not nullable)

    --------------------------------
    (2) EMPLOYEE

    employee_no (pk)
    employee_name
    employee_status
    termination_date (not nullable)

    termination_reason (not nullable)

    ------------------------------------------
    (3) EMPLOYEE

    employee_no (pk)
    employee_name

    termination_date (varchar, not nullable)

    --------------------------------------------------------------
    (4) EMPLOYEE

    employee_no (pk)
    employee_name
    termination_date_id (not nullable, fk -> EXTENDED_DATE->date_id)
    termination_reason (varchar, not nullable)

        EXTENDED_DATE
    date_id (pk) 
    date_name (varchar not nullable)
        CALENDAR_DATE
              date_id (pk, fk->EXTENDED_DATE, not nullable)

    ------------------------------
    (5) EMPLOYEE

    employee_no (pk)
    employee_name
    termination_date_id (not nullable, fk -> EXTENDED_DATE->date_id)
    termination_reason (varchar, not nullable)

        EXTENDED DATE
              date_id  (pk, datetime)
              date_name (varchar)


               

    And when do the following

    SELECT *
    FROM Employee e
    LEFT OUTER JOIN Employee_Terminate et
        ON e.employee_no = et.employee_no

    you're right back to having NULL values that represent unknown data.  You haven't gotten rid of the NULL values, you've just buried them to make it appear that you've gotten rid of them.

    Drew

    I provided an example of how to get rid of NULL; re-introducing NULL with an outer join is counter productive. My instinctive responses is: don't use that query.
    With alternative (1), it's trivial to identify "unknown" termination_dates and it's also trivial to identify the known ones.  Perhaps I should have offered the solution where ACTIVE_EMPLOYEES and TERMINATED_EMPLOYEES are in separate tables--at least a join would not be so tempting.

    You demonstrated how to eliminate storing NULL values, but sometimes you still have NULL values being reported.  Some reports may require the use of OUTER JOINs to satisfy the requested report.  For instance, wanting a list of all employees that worked for the company in 2017 including those that no longer work for the company and when their employment was terminated.

    It should also be noted that doing so, which if I remember correctly is what 6th normal form accomplishes, may not be the best idea as you also increase the number tables that may need to be joined to satisfy reporting requirements.

    This was the root problem for any application of the 3rd manifesto:  you can get around nulls by creating oodles upon oodles of tables for each and every single status or type of element that might be null otherwise., 6th normal proliferates tables so quickly you can't scale the pattern to any business realistic application. 

    By the time you're done enumerated the thousands of tables needed you wouldn't be able to locate your data anyway.  Looks great on paper until you think about ALL of the consequences (90% of the NULLs you're trying to avoid just don't demand that level of accuracy or immediate fix, so you're demanding data purity for stuff that just doesn't need it).

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • Matt Miller (4) - Tuesday, July 17, 2018 9:53 AM

    Lynn Pettis - Tuesday, July 17, 2018 9:15 AM

    Dennis Q Miller - Monday, July 16, 2018 8:50 PM

    drew.allen - Monday, July 16, 2018 7:45 AM

    Dennis Q Miller - Sunday, July 15, 2018 9:10 AM

    drew.allen - Friday, July 13, 2018 12:49 PM

    Dennis Q Miller - Friday, July 13, 2018 12:30 PM

    I couldn't agree more.  We debate incessantly about how to represent "unknown" with NULL or a magic number in a datetime column, but we neglect considering schema designs that don't require it in the first place.

    Right..."Unknown" is not valid for a date datatype. And a date datatype does not support a domain that includes "unknown"  as one of its legitimate cases.  Using either NULL or magic number is a trick to pretend it does.  Domains are not defined by the values of a datatype; they are defined by the business requirements at a logical level and, in most cases, can only be approximated by a datatype.

    If you're going to make such a bold statement, you better be prepared to back it up with facts, specifically, you should be able to provide an example of such a schema that doesn't require representing UNKNOWN.  I think that such a schema is untenable, because there will always be gaps in our knowledge where we have to represent "unknown" values.

    Drew

    Consider this EMPLOYEE, non-example which attempts to represent an unknown date with NULL:

    EMPLOYEE

    employee_no (pk)
    employee_name
    termination_date (nullable)
    termination_reason (nullable)

    Here's some alternatives that do not need the NULLable date:

    (1) EMPLOYEE

    employee_no (pk)
    employee_name

    termination_date (nullable)

    termination_reason (nullable)

       EMPLOYEE_TERMINATION

    employee_no (pk, fk->EMPLOYEE.employee_no)
    termination_date (not nullable)
    termination_reason (not nullable)

    --------------------------------
    (2) EMPLOYEE

    employee_no (pk)
    employee_name
    employee_status
    termination_date (not nullable)

    termination_reason (not nullable)

    ------------------------------------------
    (3) EMPLOYEE

    employee_no (pk)
    employee_name

    termination_date (varchar, not nullable)

    --------------------------------------------------------------
    (4) EMPLOYEE

    employee_no (pk)
    employee_name
    termination_date_id (not nullable, fk -> EXTENDED_DATE->date_id)
    termination_reason (varchar, not nullable)

        EXTENDED_DATE
    date_id (pk) 
    date_name (varchar not nullable)
        CALENDAR_DATE
              date_id (pk, fk->EXTENDED_DATE, not nullable)

    ------------------------------
    (5) EMPLOYEE

    employee_no (pk)
    employee_name
    termination_date_id (not nullable, fk -> EXTENDED_DATE->date_id)
    termination_reason (varchar, not nullable)

        EXTENDED DATE
              date_id  (pk, datetime)
              date_name (varchar)


               

    And when do the following

    SELECT *
    FROM Employee e
    LEFT OUTER JOIN Employee_Terminate et
        ON e.employee_no = et.employee_no

    you're right back to having NULL values that represent unknown data.  You haven't gotten rid of the NULL values, you've just buried them to make it appear that you've gotten rid of them.

    Drew

    I provided an example of how to get rid of NULL; re-introducing NULL with an outer join is counter productive. My instinctive responses is: don't use that query.
    With alternative (1), it's trivial to identify "unknown" termination_dates and it's also trivial to identify the known ones.  Perhaps I should have offered the solution where ACTIVE_EMPLOYEES and TERMINATED_EMPLOYEES are in separate tables--at least a join would not be so tempting.

    You demonstrated how to eliminate storing NULL values, but sometimes you still have NULL values being reported.  Some reports may require the use of OUTER JOINs to satisfy the requested report.  For instance, wanting a list of all employees that worked for the company in 2017 including those that no longer work for the company and when their employment was terminated.

    It should also be noted that doing so, which if I remember correctly is what 6th normal form accomplishes, may not be the best idea as you also increase the number tables that may need to be joined to satisfy reporting requirements.

    This was the root problem for any application of the 3rd manifesto:  you can get around nulls by creating oodles upon oodles of tables for each and every single status or type of element that might be null otherwise., 6th normal proliferates tables so quickly you can't scale the pattern to any business realistic application. 

    By the time you're done enumerated the thousands of tables needed you wouldn't be able to locate your data anyway.  Looks great on paper until you think about ALL of the consequences (90% of the NULLs you're trying to avoid just don't demand that level of accuracy or immediate fix, so you're demanding data purity for stuff that just doesn't need it).

    I hope you weren't thinking that I actually was advocating for 6th normal form designs, because I wasn't. 🙂

    I haven't had to design highly complex databases, the few I have written went to 3rd normal form then denormalized where appropriate.

  • Understood 😛

    Was actually recalling the outcome of the infamous thread from 10 years that Jeff and I were reminiscing about on the same topic, at least - before it completely devolved into ugliness 😀

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

Viewing 15 posts - 121 through 135 (of 143 total)

You must be logged in to reply to this topic. Login to reply