Four Rules for NULLs

  • COALESCE is fair better than isnull and I make my DBAs use COALESCE.

    Example

    DECLARE @a TinyInt

    SELECT ISNULL(@A,256)

    This will lead into Arithmetic overflow error for data type tinyint, value = 256.

    But not with

    DECLARE @a TinyInt

    SELECT COALESCE(@A,256)

    but if you want your query to fail then obviously use isnull

     

     


    Kindest Regards,

    Amit Lohia

  • Thanks for the suggestion!  You can't go wrong with Date, Celko and the rest of the gurus.

    I am with you on eliminating NULLs as much as possible; but as you also point out, sometimes they are the best/only tool SQL gives us to deal with missing information.  I think the key is to use NULLs with caution and keep in mind that they do affect your logic and overall results.

    As for your suggestion about storing salary information in a separate table, it makes sense; but when you do a LEFT JOIN on the salary table, you'll still have to deal with NULLs popping up in the result set.  One method I've used to deal with "Not Applicable" and other specific statuses is to use a CHAR(1) column as a status code flag to indicate the specific reason the data is missing.  It requires a little more programming logic on the front end, but it tends to be well worth it if the specific status information is important to your situation.

    I'm submitting a second piece to go along with this article that describes how NULLs affect results in ANSI aggregate functions - that seemed to be a big question on another discussion board here at SQL Server Central.

    Thanks again!

  • Good points as presented. One might however also note that SQL as it exists is not an especially logically "clean" implementation of  >2VL i.e.(arguably, it may be better to avoid nulls or go to a more comprehnsive implementaion of >2VL as for example firstsql attempts).

    Specifically, SQL rather "mixes" A Marks and I Marks

    I Marks - (Inapplicable - for the tuple in question - the column does not adhere to the table predicate). (Zero length string "user tokens" are suggested for I Marks in the article.) 

    A Marks - ("Applicable - value at present unknown")

    Such mixing can mean lost data. A good example of this muddling from the article is the example of "John Jones":

    In the case of John Jones, however, we have set the Middle_Name column to NULL. This indicates that we do not know whether John has a middle name or not. We may discover, in the future, that John has a middle name; likewise, we may discover he has no middle name. In either case, we will update his row accordingly. So, NULLs represent our "known unknowns"; the only thing we know for a fact is that we do *not* know whether John has a middle name or not."

    Consider if it was known by the data entry technician that "John Jones" did have a middle name e.g.(but refused to divulge it) - basically such information is lost with a null - but not with an A Mark.

    Also, some problems with the suggestion of implementing Zero-length strings, i.e.(instead of proper I Marks which are not available in SQL), include the issues of what to do about numeric columns; and the issue of distinguishing between when a zero length string "user token" - I Mark - is intended to be represented, and when a zero length string is intended to be represented. HTH

  • Thank you for the feedback!

    ANSI SQL doesn't provide differentiation for 'states' of missing data; only a mark indicating that the data is indeed missing.  In order to differentiate between different states of missing data, I would use another column (possibly a CHAR(1)) indicating the state/reason for the missing data.

    The name table is a simplistic example, and the reason why a middle name is missing is probably not compelling enough to go through all the effort to track it (excepting possibly some very specialized businesses).  A more likely scenario might be when tracking answers from a questionnaire from which you will be building statistical analyses.  Some questions might not be applicable to your respondents; others might be disinclined to answer certain questions.  For these types of situations, I find that an additional column indicating the reason for missing data works fairly well.

    To my mind there are far too many possible reasons that data could be missing to try to implement a separate 'NULL' type 'mark' for each situation, in the same column that contains your actual data .  At that point, I think you're headed back towards COBOL, where you place '999999999' in a row for 'N/A' and '88888888' for 'Refusal to answer', etc.

    Thanks again!

  • "To my mind there are far too many possible reasons that data could be missing to try to implement a separate 'NULL' type 'mark' for each situation, in the same column that contains your actual data .  At that point, I think you're headed back towards COBOL, where you place '999999999' in a row for 'N/A' and '88888888' for 'Refusal to answer', etc."

    Perhaps, though I Marks and A Marks were Codd's invention /  suggestion, one point of which was presumably to suggest such different DBMS provided NULL marks or tokens specifically so that users would not have to resort to a proliferation of user contrived marks such as as in the Zero length string "user mark" suggested in the article, or with the '999999999' in a row for 'N/A' and '88888888'for 'Refusal to answer' of COBOL.

    By this logic, supporting marks in the SQL standard may arguably be "better". Both the article and COBOL examples mentioned would seem to support implementing additional DBMS supported NULL marks rather than user implemented marks and single kind of NULL (to avoid heading back towards COBOL where many user encoded marks were the norm)?

  • Agreed on NULL avoidance!

    Unfortunately, due to the way OUTER JOINs work, all DBAs still must be educated on the proper use of NULL and 3VL, even if the database has no nullable columns...

    --
    Adam Machanic
    whoisactive

  • Technically speaking, zero-length strings are suggested in the article as the actual physical CHAR/VARCHAR representation of a middle name consisting of exactly zero characters.  I think that a ZLS is a very accurate representation of a middle name consisting of zero characters, don't you?

    I also see the state/reason for missing data as a separate attribute from the data itself.  That's why I believe it should be separated into its own column from the outset.  The attribute 'state/reason for missing middle name' should not be combined with the attribute 'middle name'.

    Going further, even the I and A mark system is far from complete from a business perspective.  Let's look at the "A marks".  Now management wants to know why a particular question was not answered by a particular person.  Here are some possibilities:

    - Federal law prohibits the asking of the question.

    - State law prohibits the asking of the question.

    - Local law/ordinance prohibits the asking of the question.

    - Of course if one of the above is true we may need to know which law and which jurisdiction.  (i.e., U.S. Federal Privacy Act law?  Alabama State Employment Rights Act law?  Canadian Medical Privacy Law?)

    - It was not required to be answered per company policy.

    - It was required to be answered by company policy, but the user decided not to answer anyway.

    - The user left the question incomplete, but will return to it later.

    And there are plenty more; so now we have dozens of marks, many of which might not apply to your particular business - but they need to be included in the system for the business next door where they might be needed.  Next year when the new laws are passed, you'll need to upgrade you system to get the newest system 'marks', or your data will not be compatible with some other businesses out there.  After all, that 'Z Mark' you have in the middle_name column of some of your rows will not be intelligible to another company you are trying to share data with if they haven't upgraded their system.

    Additionally, this could turn into a real mess, since people who are presented with too many codes tend to use a lot of incorrect/inaccurate codes in their implementations.  Then others they share data with have to propagate the mistakes for compatibility, and so on.  ANSI X12 EDI implementations are proof enough of that.

    I think this is why ANSI SQL-92 decided to leave it to the DBA's and developers to define their own methods of tracking this particular information; i.e., creating a CHAR(1) column and placing "I" for "I Mark", "A" for "A Mark", and "P" for "Incomplete due to Federal Privacy Act", or whatever other codes apply to your business logic.  This narrows the scope of this attribute to your particular business rules, but leaves plenty of room for future expansion when the "Johnson-Johnson Medical Privacy Act of 2010" is passed.

  • "I also see the state/reason for missing data as a separate attribute from the data itself.  That's why I believe it should be separated into its own column from the outset.  The attribute 'state/reason for missing middle name' should not be combined with the attribute 'middle name'."

    Fair enough, but then why have any mark at all (SQL NULL or otherwise) - would it not make more sense to simply represent the missing data separately as suggested (and avoid NULL marks amongst recorded data)?

    "I think that a ZLS is a very accurate representation of a middle name consisting of zero characters, don't you?"

    No, for the logical reason previously indicated. Consider if someone's middle name happens to be defined to be a ZLS as suggested... how is one to the distinguish between the representation for those who actually have no middle name, and someone else who has a middle name that also happens to be defined to be a ZLS? The example may be trivial - the point is that user marks cannot easily guarantee that the representation chosen will not inadvertently corrupt data for those tuples for which the chosen "mark" represents actual data.

  • The fundamental problem with ANY kind of NULL scheme is that a record in a table (at least a relational table) represents an assertion of fact.  It is impossible to assert an unknown.  You can assert that you don't know something, but that is completely different than asserting the unknown.

    Unfortunately, the real world circumstances and products that we have to deal with, force us to live with NULLs.  As much as I dislike them, I have yet to find a method or product that allows me to eliminate them.  So I do the best I can to minimize them and then educate myself on how not to get tripped up by their existence.

    At the same time, I seldom pass up an opportunity to influence others to my way of thinking in the hopes that eventually we'll be able to put enough pressure on the vendors to do something about it.  The same goes for XML and all other wrong-headed, non-Relational "solutions" to data management.  I seldom pass up an opportunity to provide feedback to MS and other vendors on these areas. 

    /*****************

    If most people are not willing to see the difficulty, this is mainly because, consciously or unconsciously, they assume that it will be they who will settle these questions for the others, and because they are convinced of their own capacity to do this. -Friedrich August von Hayek

    *****************/

  • The logic describing the "fundamental problem" is fairly sound.

    "This illustrates a major problem with SQL NULLS and the way they are used.  There are essentially two reasons for missing information.  The data is unknown, or the data is not applicable.  SQL NULLs lump both reasons together."

    And, those would be Codd's A Marks and I Marks as mentioned in earlier posts, (again they might make more sense than SQL NULLS and are implemented into at least one DBMS).

      In reality, if an attribute is not applicable to a given record then that record (and all others to which the attribute is not applicable) is actually of a different type than the records to which the attribute does apply and should logically be placed in a different table."

    And this is logically correct at least as far as a relational table i.e.(one with a coherent predicate) is concerned.

  • A Marks and I Marks do address some of the issues with SQL NULLs but before the end of his life, Dr. Codd had changed his mind about the acceptability of NULLs in a relational database.  C.J Date and Hugh Darwen have gone further in their research and have proved fairly conclusively that NULLs of any sort do not belong in relational databases. 

    However, as I stated, we currently have little choice than to deal with them, but we should, first and foremost, try to minimize their use.

    /*****************

    If most people are not willing to see the difficulty, this is mainly because, consciously or unconsciously, they assume that it will be they who will settle these questions for the others, and because they are convinced of their own capacity to do this. -Friedrich August von Hayek

    *****************/

  • Hi,

    What about values "known but hidden" What we would use for that? I think there is a need for that too. To develop our middle name example, someone does not want to enter her middle name, but she has to make sure that other people know that the name exists. With NULL we say that we don't know what is the middle name or if it exists at all. What we should use when we want to say: yes, we know the name exists, we just would not like to show what are the characters in this name.

    The opposite example is when we know for sure that someone  does not have the middle name.

    Many websites that deal with profiles come up with some funny defaults for similar situation to define the income for example. I was just wondering about mathematical definitions (logical values). I think I have to hit the books again

    Yelena

    Regards,Yelena Varsha

  • Just a quick question... In what situation would you want to know that information (exists but doesn't wanna say, or we kow it doesn't exists). Why not just go, we don't have the value and that's just it (assuming null is acceptable)?

  • Fair enough, but then why have any mark at all (SQL NULL or otherwise) - would it not make more sense to simply represent the missing data separately as suggested (and avoid NULL marks amongst recorded data)?

    The way SQL is designed, you would still need to place 'something' in the column that contains the data.  In this instance you're placing a NULL, although you could also place garbage of any kind in it to hold it's place.  NULLs just provide a more consistent method of placeholding.  For instance, going with the suggestion to eliminate NULLs, let's say we decide to place random 'garbage characters' in the column when we have an "I" or "A" mark indicator in the state/reason column.  We could end up with "ZxUrcFASs" in one row and "uUiO29D!-+" in the next row with missing data.  Of course by looking at our state/reason column first we know not to use the values in these columns...  but since eliminated NULLs, we had to populate it with 'something'...  NULLs give us a little more consistency, since we can do a one query to determine exactly which rows are missing data in a column.  If we populate the column with garbage data we lose that consistency.  The next level is the state/reason attribute column that tells us, out of the rows that are missing data, which ones are for reason A and which ones are for reason I.

    No, for the logical reason previously indicated. Consider if someone's middle name happens to be defined to be a ZLS as suggested... how is one to the distinguish between the representation for those who actually have no middle name, and someone else who has a middle name that also happens to be defined to be a ZLS?

    As pointed out in the article, ZLS should be reserved for when you know the value is blank or empty.  If you absolutely know that Jim Jones has no middle name, you populate the middle_name column for his row with a ZLS.  A ZLS is an accurate representation of a middle name consisting of zero characters.  If, on the other hand, you don't know if he has a middle name or not, then you have no business putting any known values in his middle_name column.  Hence the NULL value for 'unknown'.  If I wasn't sure what George Bush's middle name was, it wouldn't make sense to populate it in my database with "Allen", "L", or a ZLS.  Since I don't know what his middle name is, I populate it with a NULL until the day I find out what it actually is... 

    The example may be trivial - the point is that user marks cannot easily guarantee that the representation chosen will not inadvertently corrupt data for those tuples for which the chosen "mark" represents actual data.

    I think that falls more on training data entry operators and having solid business processes in place.  Data input is not a function of the database itself, or of the DBA's and database designers; it is a function of the system operators and end users.  If they are not properly trained as to when different columns should be marked 'unknown' and when they should be marked 'blank', they will completely destroy our ability to generate business intelligence information from even the best and most robust systems we can create.  It's just like the old saying 'Garbage In, Garbage Out' - and without proper training, you'll get a lot more Garbage In than can be reasonably dealt with.

  • For this, the additional CHAR(1) column with your status/reason attribute will work quite nicely.  If the name does exist, populate your CHAR(1) column for that row with a designated indicator; i.e., we might choose "K" for "Known but Hidden".  Again, using this method you can designate as many reasons for missing/incomplete data as you feel are appropriate to your particular business needs.

Viewing 15 posts - 16 through 30 (of 152 total)

You must be logged in to reply to this topic. Login to reply