Four Rules for NULLs

  • Paul Nicklin (3/28/2008)


    If you knew the person was unemployed you'd have to put NULL in current_employer_id

    The alternative is "special" values in your table "Not Employed".. apply that to every relation and you've got a nightmare

    Or have a separate table with PERSON_ID and CURRENT_EMPLOYER_ID which would eliminate the need for NULL values.

  • I agree with the main thrusts of this article. However, I do not consider the NULL to be "missing data". The NULL must simply be considered as either "no value", or "value not [yet] assigned".

    It must be emphasised that the use of "null string" is confusing: a "zero length string" is definitely not the same as a NULL, because a zero length string is a string, and is also identical to every other zero length string. On the other hand, no two NULLs are ever equal, as an object with no value cannot be evaluated. Therefore, the correspondent who asks "Is 'some_object = NULL' more or less efficient then 'some_object IS NULL' " is asking for trouble! Generally, in both SQL Server and Access, Microsoft have improved their syntax checking of this kind of misplaced logic so it is becoming less common. Just remember, NULLs cannot be compared because they cannot be evaluated.

    Finally, I think that the parts of the article covering NULL propagation in logic and maths were helpful. But remember that NULLs occur in query results as well - that's an area for another article. Many may regard the NULL as an almost-illegal object to encounter or use, but it is not.

    Kenneth Spencer

    You never know: reading my book: "All about your computer" might just tell you something you never knew!
    lulu.com/kaspencer

  • ..yes I thought about that as I wrote it - however there are many places where you don't want to create a link table as the relationship is genuinely 1:N not N:N (I renamed my "employer_id" to "current_employer_id" to try to force 1:n but I admit it was a little contrived)

    Also you still have a similar issue - is the absence of an entry in your link table knowledge about the absence of employment or absence of knowledge about employment... well it can be either..

  • Paul Nicklin (3/28/2008)


    ..yes I thought about that as I wrote it - however there are many places where you don't want to create a link table as the relationship is genuinely 1:N not N:N (I renamed my "employer_id" to "current_employer_id" to try to force 1:n but I admit it was a little contrived)

    Also you still have a similar issue - is the absence of an entry in your link table knowledge about the absence of employment or absence of knowledge about employment... well it can be either..

    I'm sure the 1:N/N:N issue could be handled by how you define the primary key on the link table (i.e. just on EMPLOYEE_ID for 1:N or on both columns for N:N)? Not sure though as I'm a little hungover. ๐Ÿ˜€

    Second part is a good point.

  • (assume for now person:emp is N:1)

    ..it could (by defining the PK as person_id) , but it still creates an extra unnecessary table

    ..and the second point about absence of a row still applies

  • Excellent article that has stood the test of time.

    My confusion comes in trying to apply the statement

    Simply put, [when] we try to compare NULLs to data values[] or other NULLs [per] the ANSI-92 SQL Standard, [the] comparisons with NULL always result in Unknown.

    to the SQL three-valued logic table for AND/OR just above the statement. I would interpret โ€œalwaysโ€ in the statement to require "p AND q" or "p OR q" to return Unknown if p and/or q were Unknown. The table shows discrepancies from this interpretation in the following cases:

    1) True OR Unknown should return Unknown (table says it's True)

    2) False AND Unknown should return Unknown (table says it's False)

    3) Unknown OR True should return Unknown (table says it's True) and

    4) Unknown AND False should return Unknown (table says it's False).

    In writing up this reply, I was trying to explain these discrepancies based on assuming it had something to do with SQL's short-circuited logic evaluation. I just couldn't come up with a "rule" for this short-circuited evaluation based on order of evaluation to explain all the discrepancies.

    I wonder if it's just that we need a more detailed statement. How about the following:

    When we compare NULLs to data values or other NULLs per the ANSI-92 SQL Standard, the comparisons with NULL will result in Unknown unless the non-NULL value can unilaterally determine the result of the expression. Knowing that either "p" or "q" is False lets us know that the result of "p AND q" is False without knowing whether the other is True, False, or Unknown. Likewise, knowing that either "p" or "q" is True lets us know that "p OR q" is True.

    Is this proposed statement accurate and complete? Is there a more succinct (and still complete) way to express this?

  • Kudos to Micheal for writing a good article......

    ๐Ÿ™‚

  • Good article... and I realize it's a couple years old...

    I normally don't work NULLs against "TRUE/FALSE" values even in WHERE clauses... I normally work them against some form of data... so, here's what my truth table looks like...

    DECLARE @T INT -- True

    DECLARE @F INT --False

    DECLARE @N INT --Unknown or NULL

    SELECT @T = 1, --True

    @F = 0, --False

    @N = NULL

    --==== No unknown values

    SELECT 1 AS RowNum, @T AS p, @T AS q, @T & @T AS [p AND q], @T | @T AS [p OR q] UNION ALL

    SELECT 2 AS RowNum, @T AS p, @F AS q, @T & @F AS [p AND q], @T | @F AS [p OR q] UNION ALL

    SELECT 3 AS RowNum, @F AS p, @T AS q, @F & @T AS [p AND q], @F | @T AS [p OR q] UNION ALL

    SELECT 4 AS RowNum, @F AS p, @F AS q, @F & @F AS [p AND q], @F | @F AS [p OR q] UNION ALL

    --==== True vs unknown values

    SELECT 5 AS RowNum, @T AS p, @N AS q, @T & @N AS [p AND q], @T | @N AS [p OR q] UNION ALL

    SELECT 6 AS RowNum, @N AS p, @T AS q, @N & @T AS [p AND q], @N | @T AS [p OR q] UNION ALL

    --==== False vs unkown values

    SELECT 7 AS RowNum, @N AS p, @F AS q, @N & @F AS [p AND q], @N | @F AS [p OR q] UNION ALL

    SELECT 8 AS RowNum, @F AS p, @N AS q, @F & @N AS [p AND q], @F | @N AS [p OR q] UNION ALL

    --==== All unknown values

    SELECT 9 AS RowNum, @N AS p, @N AS q, @N & @N AS [p AND q], @N | @N AS [p OR q]

    ... and here's the resulting truth table...

    RowNum p q p AND q p OR q

    ----------- ----------- ----------- ----------- -----------

    1 1 1 1 1

    2 1 0 0 1

    3 0 1 0 1

    4 0 0 0 0

    5 1 NULL NULL NULL

    6 NULL 1 NULL NULL

    7 NULL 0 NULL NULL

    8 0 NULL NULL NULL

    9 NULL NULL NULL NULL

    Just a couple of personal thoughts about NULL which, of course, are not directed at the author (Micheal Cole)...

    I hope they never do away with NULLs because I use the tri-state logic of NULL to my advantage all the time. For example, I frequently run into requirements that say something like "if any field is null, critical data is missing and you need to ...". So, if it's all numeric data or all text data, where most people would have "OR" out the wazoo in a CASE statement in the WHERE clause, I just have...

    WHERE colA+colB+colC+colD+... IS NULL

    Yeah, I know... probably no chance of using an index... who wants a a super wide covering index for something like this anyway? If I can, I'll add a calculated column to the table and index that, instead. No chance of that with a bazillion OR's in the WHERE clause.

    Here's another one of my favorites... and it just shows how little some folks know about NULLS... the requirement is "return everything that is not null and not blank" and people immediately spit out something like...

    WHERE COALESCE(somecol,' ') <> ' '

    ... not realizing that WHERE somecol > ' ' works faster and will allow an index seek to be used and still skip the NULLs because NULL cannot be compared using relational operators even to another NULL (unless you've messed with the default server settings which means you screwed everything up for tri-state logic ๐Ÿ˜‰ ) .

    There's a lot of other places where I use NULL for just exactly what it was designed for... to easily consider missing data. No chance of that with simple binary logic.

    There's a lot of good and understandable examples about what a pain in the patooti NULL values sometimes are... thats fine... if you don't understand tri-state logic, find it difficult to use, or simply don't see the value of it, then don't use it... use blanks or 0's instead of NULLs (and wonder, did the user simple skip that field... or not?)... and use '99991231' as an end date for unknown end dates instead of a NULL. Add the NOT NULL constraint to every column in every table and you'll never have to worry about pains of tri-state logic ever again...

    But, don't punish the rest of us that do understand the high value of tri-state logic by getting rid of NULLs (or changing the server settings) or requiring us to use binary logic in our code. Tri-state logic is much too valueable to some of us.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Great article and entertaining read! Thanks!

  • mohler (3/28/2008)


    Excellent article that has stood the test of time.

    My confusion comes in trying to apply the statement

    Simply put, [when] we try to compare NULLs to data values[] or other NULLs [per] the ANSI-92 SQL Standard, [the] comparisons with NULL always result in Unknown.

    to the SQL three-valued logic table for AND/OR just above the statement. I would interpret โ€œalwaysโ€ in the statement to require "p AND q" or "p OR q" to return Unknown if p and/or q were Unknown. The table shows discrepancies from this interpretation in the following cases:

    1) True OR Unknown should return Unknown (table says it's True)

    2) False AND Unknown should return Unknown (table says it's False)

    3) Unknown OR True should return Unknown (table says it's True) and

    4) Unknown AND False should return Unknown (table says it's False).

    Hi mohler thanks for the kind words. I think the confusion comes from considering OR and AND to be comparison operators. The "always" refers to comparing NULLs to data values or other NULLs; the truth table shows the combination of the results of multiple comparison expressions--i.e., compound predicates.

    In writing up this reply, I was trying to explain these discrepancies based on assuming it had something to do with SQL's short-circuited logic evaluation. I just couldn't come up with a "rule" for this short-circuited evaluation based on order of evaluation to explain all the discrepancies.

    Short-circuiting is more of a physical implementation detail. AFAIK SQL vendors can optimize in any way they want, so long as the results produced are the same as if they hadn't applied the optimizations. I think it's a matter of differentiating between the result of a single comparison in a simple predicate and multiple comparison results combined with AND and OR in a compound predicate.

    I wonder if it's just that we need a more detailed statement. How about the following:

    When we compare NULLs to data values or other NULLs per the ANSI-92 SQL Standard, the comparisons with NULL will result in Unknown unless the non-NULL value can unilaterally determine the result of the expression. Knowing that either "p" or "q" is False lets us know that the result of "p AND q" is False without knowing whether the other is True, False, or Unknown. Likewise, knowing that either "p" or "q" is True lets us know that "p OR q" is True.

    Is this proposed statement accurate and complete? Is there a more succinct (and still complete) way to express this?

    In the sentence "When we compare NULLs to data values or other NULLs per the ANSI-92 SQL Standard, the comparisons with NULL will result in Unknown unless the non-NULL value can unilaterally determine the result of the expression" it sounds like a comparison with NULL can result in a non-Unknown value in some situations. In a comparison (barring use of the deprecated SET ANSI_NULLS OFF statement) I don't believe this is the case. The remainder of the statement appears to be a correct description of short-circuit evaluation of compound predicates, but there's no guarantee that any implementation will follow short-circuit rules in all cases.

  • Jeff Moden (3/28/2008)


    Good article... and I realize it's a couple years old...

    I normally don't work NULLs against "TRUE/FALSE" values even in WHERE clauses... I normally work them against some form of data... so, here's what my truth table looks like...

    Thanks jeff. That's an interesting method of addressing 3VL - it looks similar to the SQL:1999 Boolean data type model. I'll have to play around with it when I get a chance.

    Just a couple of personal thoughts about NULL which, of course, are not directed at the author (Micheal Cole)...

    I hope they never do away with NULLs because I use the tri-state logic of NULL to my advantage all the time. For example, I frequently run into requirements that say something like "if any field is null, critical data is missing and you need to ...". So, if it's all numeric data or all text data, where most people would have "OR" out the wazoo in a CASE statement in the WHERE clause, I just have...

    I doubt they'll do away with SQL NULLs in our lifetime. It would break far too much existing code all over the planet. Of course at some point SQL might be replaced by some newer query language like "D" or something, but they'll need to get it out of the classroom and build a demand for it in the real world first.

  • Jeff Moden (3/28/2008)


    Good article... and I realize it's a couple years old...

    I normally don't work NULLs against "TRUE/FALSE" values even in WHERE clauses... I normally work them against some form of data... so, here's what my truth table looks like...

    the thing though is that the handling of the value of criteria is handled differently in a "binary operation" OR like you showed here, than it is when use in the OR statement of a query.

    For example:

    select top(1) *

    from Tally

    where

    (1=1) --TRUE

    OR

    (1=NULL) --unknown

    the query returns a row, since the first predicate is TRUE and is sufficient in making the OVERALL condition TRUE.

    You will note that this is different from the "truth table" in the binary operation. There's a bit of confusion because the two scenarios are a little different. Meaning, BINARY OR is not the equivalent of CLAUSE OR (so to speak.)

    Also - note that in the tri-state logic you describe, as apply to a WHERE clause, rows are returned only when the OVERALL operation evaluates to TRUE. So - I'm not sure that has the same bearing on AND in a WHERE clause, since both FALSE and UNKNOWN are excluded using WHERE.

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • csaba.marko (3/28/2008)


    Hi!

    You could also write about how NULL values effect the aggregate functions: count, sum, avg etc.

    MCs

    Hi csaba. See http://www.sqlservercentral.com/articles/Advanced+Querying/gotchasqlaggregatefunctionsandnull/1947/[/url].

  • Paul Nicklin (3/28/2008)


    Hi

    I'm not sure I agree exactly with your rule 1- or at least how you've used the idea of "unknown" to say what NULL is.

    Consider when using foreign keys..

    table EMPLOYER

    id

    name

    table person

    id

    NAME

    CURRENT_EMPLOYER_ID

    you'd constrain CURRENT_EMPLOYER_ID to EMPLOYER(ID)

    If you knew the person was unemployed you'd have to put NULL in current_employer_id

    The alternative is "special" values in your table "Not Employed".. apply that to every relation and you've got a nightmare

    In this case null doesn't mean "Unknown employer" it means we know there isn't an employer.

    Maybe this is wrong in pure terms, but I suspect that it's the more common use - especialy given the way that constraints work with NULL

    Paul

    PS yes, I agree with 4, and if ever I port to oracle I have to change isnull to nvl.. but I think that will be the least of my problems...!

    I'd agree that this is probably a common usage, and as long as you're comfortable with using NULL to represent multiple states that might be adequate. The problems come when you want to run a data quality report: for instance, you need to know how many employer id's weren't entered in the system because the employer id was not known at the time? How do you separate that from the "unemployed" entries using the same NULL to represent two statuses? And if you decide to expand the scope of your data collection to include the reason for unemployment - "unemployed: quit", "unemployed: fired", "unemployed: laid off", etc. All of these states would be stored using NULL as well? Seems like there should be a better way to represent these types of attributes without overloading NULL.

  • Excellent article and feedback.

    I did find one point that I disagree with in rule #1:

    Rule #1: Use NULLs to indicate unknown/missing information only. Do not use NULLs in place of zeroes, zero-length strings or other "known" blank values. Update your NULLs with proper information as soon as possible

    Everything works except the last sentence. NULLs are "proper" information. It only makes sense to update to non-NULL values if the data becomes known. I realize this sounds a bit nit-picky but I have actually had programmers ask me if they should populate Nullable columns with zero or blank values to avoid NULL issues. My answer has always been that if the column is Nullable then NULL is a perfectly reasonable value and should not be avoided. This does assume that the original architect set Nullability appropriately.

    Thank you for the excellent article.

Viewing 15 posts - 76 through 90 (of 152 total)

You must be logged in to reply to this topic. Login to reply