Four Rules for NULLs

  • The main reason I can think of is for questionnaires and the like, where you need to generate statistical analyses after they have been filled out by your users.  Marketing firms use this type of analyses to determine the types of questions to ask people.  As an example, if 10% of people surveyed decline to answer a single question because they 'found it offensive', it might be a good indicator that the question should be changed or discarded.  Likewise, if 80% of your customers in a survey responded 'N/A' to a particular question, it might clue you in to change your marketing focus away from the areas they consider "Not Applicable" to them.

    For the example in the article, middle name, it's probably not all that important - except possibly in some very specialized environments such as credit bureaus, law enforcement, etc., where every piece of identifying information is very important.

  • Not surprisingly, I would argue that FirstSQL succeeds rather than merely attempts to provide a comprehensive implementation of 3VL. It fixes the EXISTS problem (discovered by Date) and provides a fundamentally sound implementation where 3VL is thoroughly and optimally integrated into the processing. AFAIK, it is the only RDBMS that supports both A-Marks and I-Marks as required by Ted Codd (the true father of the Relational Model), thus avoiding the logical problems of using special values for Inapplicable as recommended by some here.

    Also, to correct some other comments made here: Chris Date (as well as, Fabian Pascal, Hugh Darwen, David McGoveran ... all contributors to http://www.dbdebunk.com) opposes database Nulls but does not suggest a viable alternative, leading to the flawed practice of using special values. Celko supports the use of Nulls, and Codd never wavered in his support of Nulls, in fact, supporting two types of Nulls (A-Mark, I-Mark). Late in life, Codd and Date had the "Great Nulls Debate" in which Codd argued for Nulls.

    Check http://www.firstsql.com for white papers on all the topics touched on above.

  • Rule #1: Use NULLs to indicate unknown/missing information only. Do not use NULLs in place of zeroes, zero-length strings or other "known" blank values. Update your NULLs with proper information as soon as possible.

    First thanks for the Article. It was great. Thanks guys for your posts. I learned a lot.

    Based on the posts, however, the above first rule appears to be only TEXTBOOK logic, but there is no standardized implementation of it. Therefore how you implement this logic is up to you.

    Further if implementation is up to you, then there seems to be no sin in violating this rule in regards to the zls. Only when there is a reason to know why Nulls are present is this rule applicable.

    Basically i'm trying to sum up what i've learned to make sure my conclusions are correct. I'm not throwing away this info. I'm just verifying that adhering to this rule is not always necessary in all situations. The knowledge broadens your horizons and expands your understanding, thereby giving you more options, but is by no means useless.

    But i could be completely wrong, so that i why i'm posting here.

    Thanks!

  • I think you're on the right track. In truth, the only hard-and-fast rule should be "Be Consistent". If you are deviating from common standards it is good to document them but most importantly, "Be Consistent".

    Everything in moderation and all that...

    [font="Tahoma"]Bryant E. Byrd, BSSE MCDBA MCAD[/font]
    Business Intelligence Administrator
    MSBI Administration Blog

  • First off, thanks for the comments!

    Don't be fooled by the A-mark/I-mark discussion.  This solution unfortunately leaves us with only two reasons "why" data is missing.

    FIPS 127-2 (http://www.itl.nist.gov/fipspubs/fip127-2.htm) defines a different solution:  "Multiple Null States" which would give the user the ability to define as many different NULL meanings as they wish ("Unknown", "Not Applicable", "Pending", "Missing", etc.)

    There are many reasons why you should avoid NULLs when possible:

    • "IS NULL", "IS NOT NULL" in the WHERE clause can create non-sargable conditions which prevent SQL Server from using indexes
    • queries become more complex with "IS NULL", "IS NOT NULL", "COALESCE", etc.
    • nullable columns restrict your ability to create some constraints (e.g., Primary Key)

    To accurately model your data (obviously using the current set of tools given), NULL use should be restricted to 'missing' values; not 'non-existent' values.

    As an example, let's say you run a call center and tell your employees that when a customer calls they should ask for all 'missing' data for that customer.  Around the third or fourth time that "Joe (no middle name) Johnson" calls up and is asked once again for his middle name, he'll probably start getting a little upset.  If we reserve NULL usage for the 'missing' data, and populate Joe (no middle name) Johnson's middle name with a ZLS we'll know not to keep asking that same question every time he calls in.  That's why a ZLS is a more accurate way to represent non-existent data than a NULL.

  • Thanks again guys. I will take a look at the link too Mike. Greatly appreciated!

  • "...flawed practice of using special values"?  And NULL isn't a special value?  OK, by definition, NULL isn't a value at all, it is more accurate to say that it is a special mark.  At least with the special values approach you have the ability to define the MEANING of those values, NULL is meaningless because of its ambiguity.  BTW, you don't explain why you think using special values is so flawed particularly compared to NULL.  But I'll save you the trouble by explaining it below.

    The whole a-mark/i-mark proposal misses a key point.  If an attribute is truly not applicable to a record then you have done a very poor job of designing your table(s).  The whole idea of NULL's was to indicate missing information any attempt to expand that to not applicable information is folly.

    Now, there are really two cogent lines of thought surrounding handling missing information.  The simplest and easiest to implement is to use special values to indicate something about WHY the data is missing.  The more logically correct approach (but much more difficult) is to realize that the reason why data is missing is NOT actually data, rather it is meta data (this is the primary reason why the special value approach is flawed, but I still think that it is far less flawed than NULL) and record it in the appropriate meta data repository.  Building, maintaing, and then using such a repository is a difficult proposition given that there is no support for it within any of the available products.  Fabian Pascal gives examples of this approach in his book Practical Issues in Database Management. 

    Then you have some special cases like middle name.  The absence of a middle name is not likely to change the type of person you are any more than the existence or absence of a prefix or postfix.  Contrast this to the absence of a salary for an employee or weight for a part...  Cases like middle name are very easy to handle with special values that indicate their non-existence.  Unlike using NULL you can logically assert that a person has no middle name, or has declined to provide it with special values.  Of course there are some other difficulties that arise from special values particularly when dealing with numerical data, namely you have to select your special values very carefully to fall outside the range of possible real-life values.

    Incidentally, I recommend against the idea of using zero length strings for special values because of their similarity to NULL.  ZLS's and spaces have caused me more problems over the years than NULLs have.

    /*****************

    If most people are not willing to see the difficulty, this is mainly because, consciously or unconsciously, they assume that it will be they who will settle these questions for the others, and because they are convinced of their own capacity to do this. -Friedrich August von Hayek

    *****************/

  • First of all, thanks for a nice article, Michael. I liked it.

    I have two comments. In the "ANSI Versus Microsoft" section, you give the example

    SET ANSI_NULLS OFF
    SELECT * FROM [Northwind].[dbo].[Employees]
    WHERE [Region] = NULL

    and write "This allows you to perform basic equality comparisons to NULL". Correct. column = NULL is true if ANSI_NULLS is OFF. One could easily get the impression that column = column is always true (also for null values) if ANSI_NULLS is OFF. But this is not the case, as can be seen from the following example, which returns no rows:

    SET ANSI_NULLS OFF

    SELECT * FROM [Northwind].[dbo].[Employees]

    WHERE [Region] = [Region] AND [Region] = NULL

    This caught me by surprise once when maintaining a database in which all stored procedures were compiled with ANSI_NULLS OFF. NULL = NULL is true when you compare variables (or columns to variables) and unknown when you compare columns to columns. But you were probably aware of this already, and ANSI_NULLS OFF wasn't the main subject of your paper, which justifies not mentioning this... I definitely agree that one should always use ANSI_NULLS ON.

     

    My second comment applies to the "NULL math" section. String concatenation with null depends on "concat_null_yields_null", as can be seen by running the following:

    set concat_null_yields_null off

    select 'Joe' + null

    set concat_null_yields_null on

    select 'Joe' + null

    The first select returns Joe, the second returns NULL (as you said in your article).

  • Maybe a little off-topic, but you started the quote :

    "As we know,

    There are known knowns.

    There are things we know we know.

    We also know

    There are known unknowns.

    That is to say

    We know there are some things

    We do not know.”- Donald Rumsfeld

    Shouldn't we add (as we know he knows that we know he wants us not to know...   )

    "There are Knowns I know, That You all will never know!"

  • I respect your comments and research on using NULLS but can't understand why you just don't use the correct querying method with NULLS.

    Your comment: "Because of the inherent issues surrounding NULLs and data integrity, three-valued logic, scalar arithmetic and performance, DBA's and Database Designers tend to try to minimize the use of NULLABLE columns in their tables." is an assumption of the worst degree.

    NULL is perfectly adequate for descerning unassigned column data. If you want to compare data using NULL use:

    SELECT *

    FROM t1

    WHERE NOT Col1 IS NULL

    OR

    SELECT *

    FROM t1

    WHERE Col1 IS NULL

    Amit Lohia responded that "COALESCE is fair better than isnull and I make my DBAs use COALESCE" because of overflow considerations using a tinyint and assigning 256, really! Since when was returning a value out of bounds a good thing! What is happening to the standard of SQL professionals?

    Max

  • imho Nulls can be avoided at all times, and should be, except in the case of an empty foreign key. For example, in a tree the rootnode has no parent_id.

    Robbert

  • Yes, well, some of us started feeling a little queasy when the name Donald Rumsfeld popped up on our periphary but managed to relegate it to the suppressed memory partition. Thanks Oli for making sure that we read it again.

    Max

  • Now on-topic,

    You say we use NULL for the 'known unknowns', yet in some cases we don't. We can also use NULL for the 'not applicables'.

    Say you have //silly example // a table with vehicles. One of the columns is 'engine_power'.

    Some records hold company cars, and hold the horsepower amount in this field.

    Some other records hold the company bicycles. And what do you enter as 'engine_power'? Yes, you don't enter anything, and leave it NULL...

    Say you need to make a calculation in which a multiplication by engine_power is done. Although the idea is that you only multiply by it if there is an engine, you want the calculation also to include the bicycles. In that case, you'll use 'COALESCE(engine_power,1)' (or ISNULL(...,1) to neutralise the NULL.

     

    In this use of NULL, you never update the NULL values and they keep giving you headaches

  • At least we seem to share thoughts here  (or better )

  • The two the most common bugs that I have had to deal with are:

    where a nullable column is tested for <>

    eg If there is a Status column and one tests

    Select/Update/Delete TableName Where Status <> 'A'

    The rows with null values are not included.

    The other is where joins are done that involve nullable columns.

Viewing 15 posts - 31 through 45 (of 152 total)

You must be logged in to reply to this topic. Login to reply