Null Defaults

  • patrickmcginnis59 10839 - Friday, March 3, 2017 7:42 AM

    I like the thinking that null is an indicator that a particular referenced area for storing values has not had a value stored. This even applies if that area represents a column sourced by a left joined table in a query. I like to think it doesn't mean "unknown", it doesn't  mean "missing." It means a value has not been stored or alternatively, a value is not available to be returned. Any meaning attached to nulls further than that is up to the design using that referenced information storage area, but because of that, SQL Server is completely agnostic on any further meaning other than the lack of a value being stored in the referenced storage area.

    I have no problem with people thinking otherwise, but its gotta be a pretty persuasive argument to change MY thinking on this.

    I get exactly what you mean and I agree with this distinction at the low level of processing. However, application software needs to handle this condition, and "has not been stored" or persisted is a condition that should be transparent to client software.

  • I've seen 2 practices; using so called "magic numbers" or using nulls. When I first went to work at my previous job, I had very little experience with any SQL DBMS. And we were writing Windows applications for Windows 95/98 using controls that didn't understand the concept of a null for a datetime field. So I came up with the idea of using the date of 1800-01-01 to represent a null datetime. That way we could check for that  value in code and force the control to show a blank. But over time the programming environment got better, controls got smarter and the whole point of using some arbitrary value like 1/1/1800 to represent a null became ridiculous. I now look back at that and think it was just a stupid idea. As time went by, and other developers came and went, we always had to go through the process of telling them that 1/1/1800 wasn't a "real" date, that it really meant a null date. It became impossible to get rid of, because of all of the software and reports that we built up around the idea of looking for 1/1/1800 (or 1800-01-01) and replacing that with a null or blank. I look back on that as one of the most boneheaded ideas I ever came up with.

    However, I've noticed that I wasn't the only one who came up with using magic numbers. At that old job we participated in several Federal grants, where we had to supply the Feds with data to satisfy the grant. In all cases the Feds specified that if any value was unknown, it had to be replaced with 9's. How long the string of 9's it took was dependent entirely upon the length of the field of data the Feds were looking for. Sometimes it was as short as just 99. Other times it was like you said Steve, 99999. As far as I know the Feds still have this thing for 9's.

    Kindest Regards, Rod Connect with me on LinkedIn.

  • Nulls are an abomination.

    The problem is their ambiguity. Not in the literal sense of "data is missing" but in the sense of "WHY is this data missing".

    If you're importing data into a database through an API allowing nulls means allowing unknown (read: erroneous) data into your system. It also means opening your validation defenses to allow users to enter erroneous or incomplete data--and that's just begging for trouble down the road.

    Religious wars aside, flag values (magic values) serve a valid purpose. Assuming they are A) universal (for a given domain), B) unambiguous, and C) outside a sub-domain's allowed range of values (such as using 1900-01-01 etc.) when dealing with dates that CANNOT be that early in time), is a solid way around the null issue, because with Null you only get one answer. Personally, I define 3 magic values, "to be determined" (not known yet), "not applicable", and "verified unknowable" (i.e. the data can never be determined, the data is permanently lost to humanity).

    The argument against flag values is usually the arbitrary values chosen.  Another is that not all domains are large enough to allow flag values (Booleans being the extreme case).

    Someone argued that correct coding can make nulls more tractable, but that sets aside the added complexity of the necessary code. Code complexity = increased attack surface, increased potential for bugs, increased size of testing, and on and on. Code complexity is BAD.

    I hadn't realized the originators of SQL had intended multiple null values (in other words, an "official" version of extra-domain flags). That would have rendered a lot of the current null value arguments null 🙂--except for code complexity.

    One major advantage magic values have is they are (at least canonically) part of the domain. Therefore the same logic applies to them as any other domain member, no additional code is required. At worst, an additional WHERE clause might be needed to explicitly exclude them, or perhaps a function to translate them into words instead of values on reports, but over all I find them superior to nulls on just about every front.

    The single place I use nulls is for human-only notes in a (N)VARCHAR(MAX) field. Nulls take up no space, so they are wonderful in that limited case.

    Otherwise, kill them with fire!

  • I am always focused on reportablity (is that a word) and providing the least amount of code as well as complete 100% referential integrity.  When migrating data from OLTP system to OLAP it is so much easier, faster and more efficient when the OLTP system deals in a true relational database methods. 

    That being said, nulls are a no-no!

    There is no way to use a null in a Primary Key and frankly, when dealing with data nothing should always be something-that-means-nothing if a code/number/date.

    In the long run when coding a OLTP application, this makes thing simple, and when ETLing to a data warehouse it makes things simple....

    One thing I have found is that simple is consistent, reliable and stable. 

    Just my perspective.

  • You should provide a Stored Procedure interface for the application.

    App Devs don't expose the internals of their classes, and neither should DB Devs expose the internals of their persistence layers.

  • Nulls are an abomination.

    The problem is their ambiguity. Not in the literal sense of "data is missing" but in the sense of "WHY is this data missing".

    I know I'm probably going to become a pest in this thread, but SQL Server isn't there to tell you why it is missing. Its like your storing an integer value of 200 for instance and yelling at SQL for not telling you 200 of WHAT. You named the column, you need to document the cases that could cause data not to be stored. If you want to know WHY, add a WHY column and fill THAT in.

  • Also a point that I missed earlier about the use of NULLs - in SQL referential integrity terms they are vital.

    Table A is a set of some data. Table B is a set of of some other data that may optionally refer to Table A. Enforcing referential integrity for the column in Table B is nothing more than a case of

    CREATE TableA (
      ID INTEGER NOT NULL PRIMARY KEY
    );

    CREATE TableB (
      ID INTEGER NOT NULL PRIMARY KEY,
      TableARefID INTEGER NULL
    );

    By setting the column TableARefID in TableB as NULLable we can add a FOREIGN KEY constraint referencing TableA however keeping the value optional. In other words, if a value is specified in TableARefID then it must be valid however not recording a value at all is equally valid.

  • Mark Dalley - Friday, March 3, 2017 3:35 AM

    Terje Hermanseter - Friday, March 3, 2017 2:51 AM

    I like NULL as the deault choice. All other values have a tendency to create confusion at some point in the future in my experience. Most people have a good understanding of what NULL values mean.

    I think NULL is the logical choice too, but I would also say that "knowing what NULL means" is pretty context-dependent! Take an employment termination date in a personnel record. While the employee is still employed, the termination date should (I suppose) be null, but that doesn't mean it's unknown, it means that because it hasn't yet happened yet it doesn't exist at all! Perhaps this is too much of a straw-splitting distinction.

    Sorting NULL fields can be a pain though. Often you want them to sort before any other date, as SQL Server does by default, but other times (as in the case of the termination date above) you want it to sort after. Hence the unwelcome appearance of magic values (one system I use uses a date in the year 2500 for this sort of thing).

    What might be cool would be to take a leaf from PostgreSQL's book and introduce two new values: positive infinity (i.e not NULL but it sorts after everything else) and negative infinity (not NULL but sorts before everything else). I've never actually used PostgreSQL so can't speak from experience, but it seems like a good idea. Some purists would no doubt disagree and I respect that. I have no interest in starting a religious war.

    I think that the root of the problem is that when you have several concepts (unknown, doesn't exst, not defined, ...) which are easily confused, context dependent and generally slippery to work with, trying to use just one thing (i.e. NULL) for all of them is necessarily going to involve a few compromises. And some extra work.

    MarkD

    Mark, I'm trying to wrap my head around the positive/negative infinity idea.  I admit it's an interesting comment, but somewhat worrisome.  I accumulated 42 years in IT, and can say I never faced any problem whatsoever with the concepts of NULL, except for the problems of inexperienced developers having to learn to understand and accommodate the concept.  And I can't come up with any reason to mix the concepts of UNKNOWN and INFINITE.  I guess by definition, if something is truly infinite, we think it's big or small, but we don't know HOW big or HOW small, so it's still UNKNOWN.  And if it is UNKNOWN, then can we safely make any assumption as to whether it is big or small ?  I think the concepts of BIG and SMALL assume a comparison to some other known value. 

    I have to say I am concerned about trying to attach significance to positive infinity or negative infinity as further complicating understanding of a truly unknown value.

    Rick
    Disaster Recovery = Backup ( Backup ( Your Backup ) )

  • patrickmcginnis59 10839 - Friday, March 3, 2017 8:31 AM

    Nulls are an abomination.

    The problem is their ambiguity. Not in the literal sense of "data is missing" but in the sense of "WHY is this data missing".

    I know I'm probably going to become a pest in this thread, but SQL Server isn't there to tell you why it is missing. Its like your storing an integer value of 200 for instance and yelling at SQL for not telling you 200 of WHAT. You named the column, you need to document the cases that could cause data not to be stored. If you want to know WHY, add a WHY column and fill THAT in.

    In one way I agree, but in the case of null it crosses the line. The problem is data integrity, which is very much a SQL Server/persistence layer issue.

    Null means "missing". But in terms of database design/validation/data integrity/consistency "missing" is unacceptably ambiguous.

    Consider a field that needs to be filled in to let a process occur. For example, a parts order. Clearly if you order a part leaving the part number null ("missing") you cannot process the order.

    From a simple "bucket" perspective you are correct, a missing part number concerns SQL Server not at all. But from an architectural perspective an order without a part number is utterly unacceptable.

    That's the most extreme case, missing data that prevents correct operation, but even more subtle issues rear their ugly heads.

    For example, an employee's termination date is null. Does that mean the employee is still employed with us? Or does it mean they've been terminated and somebody forgot to fill in the date? Or did it happen so long ago we no longer know the termination date?

    Typically a null termination date would mean the employee is still employed--but if they had been terminated then clearly some kind of error occurred. But given the data we have no way of determining what kind of error.

    Using magic values for the conditions "to be determined (TBD)", "not applicable" (N/A) or "verified unknown" (UNK) instead of allowing nulls we can now say that an employee with a termination date of:

    N/A  - is still employed, or if not the error was failure to correctly update the field
    TBD - Employee IS terminated, but the exact termination date has yet to be determined
    UNK - The employee IS terminated but the exact termination date has been lost for some reason (usually because it happened before the data was required to be kept).

    Dates are perhaps the biggest reason to use magic values, but just about any domain can benefit from them, assuming the sub-domain (the range of actual valid values) is smaller than the domain (the entire possible range of values).

    A "why" column adds complexity and size, especially considering we're speaking of domains here, not individual fields. You'd basically need a why column (or table) for every column in your database! Unacceptably complex and cumbersome, I think...

  • skeleton567 - Friday, March 3, 2017 9:25 AM

    Mark Dalley - Friday, March 3, 2017 3:35 AM

    Terje Hermanseter - Friday, March 3, 2017 2:51 AM

    I like NULL as the deault choice. All other values have a tendency to create confusion at some point in the future in my experience. Most people have a good understanding of what NULL values mean.

    I think NULL is the logical choice too, but I would also say that "knowing what NULL means" is pretty context-dependent! Take an employment termination date in a personnel record. While the employee is still employed, the termination date should (I suppose) be null, but that doesn't mean it's unknown, it means that because it hasn't yet happened yet it doesn't exist at all! Perhaps this is too much of a straw-splitting distinction.

    Sorting NULL fields can be a pain though. Often you want them to sort before any other date, as SQL Server does by default, but other times (as in the case of the termination date above) you want it to sort after. Hence the unwelcome appearance of magic values (one system I use uses a date in the year 2500 for this sort of thing).

    What might be cool would be to take a leaf from PostgreSQL's book and introduce two new values: positive infinity (i.e not NULL but it sorts after everything else) and negative infinity (not NULL but sorts before everything else). I've never actually used PostgreSQL so can't speak from experience, but it seems like a good idea. Some purists would no doubt disagree and I respect that. I have no interest in starting a religious war.

    I think that the root of the problem is that when you have several concepts (unknown, doesn't exst, not defined, ...) which are easily confused, context dependent and generally slippery to work with, trying to use just one thing (i.e. NULL) for all of them is necessarily going to involve a few compromises. And some extra work.

    MarkD

    Mark, I'm trying to wrap my head around the positive/negative infinity idea.  I admit it's an interesting comment, but somewhat worrisome.  I accumulated 42 years in IT, and can say I never faced any problem whatsoever with the concepts of NULL, except for the problems of inexperienced developers having to learn to understand and accommodate the concept.  And I can't come up with any reason to mix the concepts of UNKNOWN and INFINITE.  I guess by definition, if something is truly infinite, we think it's big or small, but we don't know HOW big or HOW small, so it's still UNKNOWN.  And if it is UNKNOWN, then can we safely make any assumption as to whether it is big or small ?  I think the concepts of BIG and SMALL assume a comparison to some other known value. 

    I have to say I am concerned about trying to attach significance to positive infinity or negative infinity as further complicating understanding of a truly unknown value.

    Something I neglected to add is that in my IT career I did see a number of instances where the 'MAGIC VALUE' concept caused problems.  These introduce the potential is accidentally  presenting invalid information due to not being correctly handled while being much more difficult to detect than 'missing' values.

    Rick
    Disaster Recovery = Backup ( Backup ( Your Backup ) )

  • Mark Dalley - Friday, March 3, 2017 3:35 AM

    ...

    What might be cool would be to take a leaf from PostgreSQL's book and introduce two new values: positive infinity (i.e not NULL but it sorts after everything else) and negative infinity (not NULL but sorts before everything else). I've never actually used PostgreSQL so can't speak from experience, but it seems like a good idea. Some purists would no doubt disagree and I respect that. I have no interest in starting a religious war.

    I  like this idea

  • Personally, I think it really varies on the data. There are some things where I will use NULL, where other times I will use a "Magic Number", or something else. A good example I can think of is with some parts of our web site. If a customer is answering some yes/no questions, a further text field may be opened up (maybe they said they've previously been declared bankrupt and we need to know details). The Default value for that column in the database would be NULL, as this means the data is "missing" (not required). If, however, the customer is presented with the textbox and leaves it blank (although the website should force them to enter something) a blank string ('') would be stored. This instead shows that the value is empty, rather than missing. We can then treat the two values differently.

    With Numbers, this is a little different. Getting an average of a bunch of Integers, with a default of 0 could skew your data, if 0 could represent both 0 or unknown. You may well want to omit unknown values, and thus NULL could be a logical choice (as it's not used in aggregate functions). If you used a "magic number" and someone didn't know your data well, they could well include values of '9999999' in their aggregates, and that would also blow your data out of proportion.

    There are times to use and not use NULL as a default value, but the key part is knowledge of what NULL represents for that specific dataset. Simply using NULL to represent absence of data, regardless of it it means it's missing, unknown, or was not supplied when it should have been though, I feel is a bad idea.

    Thom~

    Excuse my typos and sometimes awful grammar. My fingers work faster than my brain does.
    Larnu.uk

  • Steve Jones - SSC Editor - Friday, March 3, 2017 9:46 AM

    Mark Dalley - Friday, March 3, 2017 3:35 AM

    ...

    What might be cool would be to take a leaf from PostgreSQL's book and introduce two new values: positive infinity (i.e not NULL but it sorts after everything else) and negative infinity (not NULL but sorts before everything else). I've never actually used PostgreSQL so can't speak from experience, but it seems like a good idea. Some purists would no doubt disagree and I respect that. I have no interest in starting a religious war.

    I  like this idea

    Actually, I'm starting to have second thoughts. Skeleton567 has a point in that one doesn't want to introduce further potential sources of confusion (such as infinity, which, as my maths teacher was at pains to point out, was "not a number!")  without being very clear as to semantics.

    Maybe one could cover the case I was thinking about by specifying that NULL "values" sort either HIGH or LOW when defining the column in a table. Something like this:

    CREATE TABLE Employees (
        Employee_ID int IDENTITY(1,1) NOT NULL,
         FirstName varchar(100) NULL,
         StartDate Date NULL,              -- you might insist on NOT NULL for this one
         EndDate Date HIGH NULL        -- null values sort after non-null values
        
    CONSTRAINT PK_Employee PRIMARY KEY CLUSTERED
    ([Employee_ID] ASC)
    )

    MarkD

  • In terms of things like termination date, I see both sides. I use NULL here, because the data is unknown. As for why it's unknown, as Roger points out is an issue, I also need a flag as another field that is "IsEmployed" or "EmploymentStatus". the date has no meaning without the flag field.

    I do wish we had an infinity value, but I also wish we had a NULL-type flag that says unknown at this time rather than, unknown for some Schrodinger-type reason and could be anything.

  • I have really enjoyed the comments on this topic so far.

    Maybe I can add another element to this discussion.
    It does not matter if you use nulls or 'magic numbers' - please just be consistent in the database design.
    Don't change the design 5 years later just because you don't like it - or read something on a forum like this!
    (Unless it is possible to refactor everything and change all third-party applications).

    If you design a new database, please go ahead and implement the latest best practices.

    The same concept applies to the naming of tables, columns, views and stored procedures.
    Stick to the current design!

    Put it this way:
    Imagine a building designed by both Gaudi and Le Corbusier.
    Or a painting done as a collaboration between Rembrandt and Pollock...

Viewing 15 posts - 16 through 30 (of 41 total)

You must be logged in to reply to this topic. Login to reply