Null, empty string, or zero?

, 2015-04-15 (first published: )

The answer: It Depends.

imageOne of the more common problems I encounter when managing data quality, especially in an ETL process, is the handling of null values, blanks, and zeroes. When I put on my preaching shoes to talk about bad data, this is one the areas I have to spend a lot of time covering because it is so often overlooked as a stumbling block for data quality. Improperly interpreting what is represented by these values can lead to inconsistent results, dirty data, and incorrect answers to business questions.

Data Modeling

When managing text data, more often than not an empty string is treated the same way as a null value. For example, if a record in a Person table has a valid, non-blank value in the FirstName and LastName field but a non-null, empty string value in the MiddleName field, it’s usually assumed that the person has no middle name, or that the middle name is unknown. However, what if the data in question is the database for the official birth registrar for the municipality? A record with an empty string for the middle name could show that the person’s official birth record indicates that they legally do not have a middle name, while a null value in that field reflects an unknown value for the middle name (possibly to accommodate old records transcribed from incomplete handwritten birth certificates). In this case, an empty string value for MiddleName represents a completely different situation than a null in the same field.

It’s even trickier with numeric data. Zeroes and nulls are often treated as one and the same, but in many business cases, a null value is intended to be treated differently than a zero. Let’s say I’m working for a sales organization, and I want to calculate some metrics on some of our sales leads. If I’m arbitrarily interpreting a null value as a zero in the HouseholdIncome field, I’m going to get some incorrect results when calculating, for example, the average household income of our potential leads (since zeros are factored into averages, while nulls typically are not). In this case, a zero value indicates that the reported household income for that lead is zero, while a null value could indicate that the person refused to provide any answer for that field.

The ETL Factor

Complicating both of these is the exchange of data in ETL. Assuming the data is correct in the source system, it is possible – in fact, I would say it is likely – that some part of the ETL process can misinterpret what nulls, empty strings, and zeroes are supposed to represent. Part of this is a lack of standard nomenclature. How does one represent a null in a character field? If an empty string is used do indicate a null, then how does one indicate an empty string when that value should be separate and distinct from a null? On occasion, I’ve seen the literal word NULL used in text data to represent a null value. Find yourself a phone book (do they still print those?) of sufficient size, and you’ll find that there are individuals with the last name of Null. It can’t always be assumed that the string literal Null can be interpreted as a null value.

There’s no silver bullet for the null/empty string/zero quandary. But there are ways to mitigate the possibility of a misinterpretation causing pains in your database or ETL process:

  • Know your data. I don’t mean that you just need to know its technical structure – you need to really know what it represents to your employer or client.
  • Know how your data is being used. Talk to your users and find out how they create, curate, and report on the data. Ask them about any known outliers in the data. This will reveal a great deal about what the data means (see the previous bullet).
  • Pay special attention to ETL, reporting, and analytics. These areas are hotspots where a misinterpretation of a potentially null value can be especially problematic.
  • Have a solid, documented data mapping strategy. Given that many of the problems in this realm are rooted in moving data, having a well-documented and visible source-to-target mapping can help to clarify how these special values should be interpreted during data movement and transformation.
  • Know where the quicksand is. Sometimes, an empty string and a null are synonymous by design. Solve the problems that need to be solved, and don’t get stuck working on those that don’t.

The null/empty string/zero problem isn’t going away anytime soon, but simply being aware that it’s out there can help avoid misinterpretations leading to bad data.

The post Null, empty string, or zero? appeared first on Tim Mitchell.





Related content

Database Mirroring FAQ: Can a 2008 SQL instance be used as the witness for a 2005 database mirroring setup?

Question: Can a 2008 SQL instance be used as the witness for a 2005 database mirroring setup? This question was sent to me via email. My reply follows. Can a 2008 SQL instance be used as the witness for a 2005 database mirroring setup? Databases to be mirrored are currently running on 2005 SQL instances but will be upgraded to 2008 SQL in the near future.


1,567 reads

Networking - Part 4

You may want to read Part 1 , Part 2 , and Part 3 before continuing. This time around I'd like to talk about social networking. We'll start with social networking. Facebook, MySpace, and Twitter are all good examples of using technology to let...


1,530 reads

Speaking at Community Events - More Thoughts

Last week I posted Speaking at Community Events - Time to Raise the Bar?, a first cut at talking about to what degree we should require experience for speakers at events like SQLSaturday as well as when it might be appropriate to add additional focus/limitations on the presentations that are accepted. I've got a few more thoughts on the topic this week, and I look forward to your comments.


360 reads