In previous articles I've discussed SQL NULLs and practical methods for dealing with them on SQL Server, in the real world. Although I have mentioned previously that the use of NULL should be minimized in the database ("minimized" meaning as close to zero as practical), it has been suggested that I have not placed enough emphasis on NULL-avoidance. It has even been suggested that NULLs should not be dealt with in the real world at all. I intend to address some of these arguments in this article, and apply the advice of some of the more prolific experts to implement a "Zero-NULL" database in SQL.
Before we begin eliminating NULLs, we're going to lay down a few ground rules. The intent is to get rid of all NULLs from the database, period. To that end we will adhere strictly to the following:
Because we'll be discussing the practicality of eliminating all NULLs from a SQL database, we will consider several aspects, including development complexity, maintenance costs, and performance.
Our starter database is a simple database designed to hold some basic employee information. This is the database from which we will eliminate all traces of NULL. The tables look like this:
And the script to create and populate these tables is given below:
-- Here we create a table of contact number types CREATE TABLE ContactNumberType (Type CHAR(1) NOT NULL PRIMARY KEY, Description VARCHAR(20) NOT NULL); INSERT INTO ContactNumberType (Type, Description) VALUES ('H', 'Home Phone'); INSERT INTO ContactNumberType (Type, Description) VALUES ('W', 'Work Phone'); INSERT INTO ContactNumberType (Type, Description) VALUES ('C', 'Cell Number'); INSERT INTO ContactNumberType (Type, Description) VALUES ('F', 'Fax Number'); -- Here we create the employee and contact number tables CREATE TABLE Employee (EmployeeID INT NOT NULL PRIMARY KEY, FirstName VARCHAR(20) NOT NULL, MiddleName VARCHAR(20), LastName VARCHAR(20) NOT NULL, Salary DECIMAL(10, 2) NOT NULL, Bonus DECIMAL(10, 2)); CREATE TABLE ContactNumber (EmployeeID INT NOT NULL, Type CHAR(1) NOT NULL, Number VARCHAR(20) NOT NULL, PRIMARY KEY (EmployeeID, Type)); -- Here we add the appropriate foreign key constraints ALTER TABLE ContactNumber WITH CHECK ADD CONSTRAINT FK_ContactNumber_ContactNumberType FOREIGN KEY (Type) REFERENCES ContactNumberType (Type); ALTER TABLE ContactNumber WITH CHECK ADD CONSTRAINT FK_ContactNumber_Employee FOREIGN KEY (EmployeeID) REFERENCES Employee (EmployeeID); -- Here we add some employees INSERT INTO Employee (EmployeeID, FirstName, MiddleName, LastName, Salary, Bonus) VALUES (1, 'Nat', 'King', 'Cole', 50000.00, 5000.00); INSERT INTO Employee (EmployeeID, FirstName, MiddleName, LastName, Salary, Bonus) VALUES (2, 'Anita', NULL, 'Baker', 55000.00, 0.00); INSERT INTO Employee (EmployeeID, FirstName, MiddleName, LastName, Salary, Bonus) VALUES (3, 'Louis', 'Daniel', 'Armstrong', 58000.00, NULL); -- Now we add some phone numbers INSERT INTO ContactNumber (EmployeeID, Type, Number) VALUES (1, 'H', '1-900-872-3451'); INSERT INTO ContactNumber (EmployeeID, Type, Number) VALUES (2, 'H', '1-900-987-7892'); INSERT INTO ContactNumber (EmployeeID, Type, Number) VALUES (2, 'W', '1-900-356-4982');
The following sample query retrieves all employee data:
-- Simple query to retrieve all employee data -- from the initial database SELECT e.EmployeeID, e.FirstName, e.MiddleName, e.LastName, e.Salary, e.Bonus, cn.Number, cnt.Description FROM Employee e LEFT JOIN ContactNumber cn ON e.EmployeeID = cn.EmployeeID LEFT JOIN ContactNumberType cnt ON cn.Type = cnt.Type;
And the result looks like this:
As for performance, SQL Server reports a query execution plan with a total cost of 0.0510, with an execution plan that looks like this:
Now that we have a baseline, it's time to start the step-by-step process of eliminating all trace of NULL from the database.
In Fabian Pascal's "Nulls Nullified" article, he brings up a point about the confusion between the "absence of data", and "how such absence is represented and manipulated in databases."
We can easily address Pascal's issue with using NULL to represent the "absence of data" by simply not storing NULLs in the database. We will simply make all of the columns non-nullable and represent the "absence of data" by breaking out nullable columns into new tables with 1:0..1 relationships, without regard for so-called "attribute splitting." In our example this means breaking the nullable MiddleName and Bonus columns out into their own tables with EmployeeID as the Primary Key, in 1:0..1 relationships with Employee.
-- Changes to initial database to prevent NULLs from being -- stored in it CREATE TABLE Employee (EmployeeID INT NOT NULL PRIMARY KEY, FirstName VARCHAR(20) NOT NULL, LastName VARCHAR(20) NOT NULL, Salary DECIMAL(10, 2) NOT NULL); CREATE TABLE EmployeeMiddleName (EmployeeID INT NOT NULL PRIMARY KEY, MiddleName VARCHAR(20) NOT NULL); CREATE TABLE EmployeeBonus (EmployeeID INT NOT NULL PRIMARY KEY, Bonus DECIMAL(10, 2) NOT NULL); ALTER TABLE EmployeeMiddleName WITH CHECK ADD CONSTRAINT FK_EmployeeMiddleName_Employee FOREIGN KEY (EmployeeID) REFERENCES Employee (EmployeeID); ALTER TABLE EmployeeBonus WITH CHECK ADD CONSTRAINT FK_EmployeeBonus_Employee FOREIGN KEY (EmployeeID) REFERENCES Employee (EmployeeID);
Here is the newly created Zero-NULL database:
The query to grab all employee information from our revised database now looks like this:
-- Retrieve all employee data from the revised tables -- using LEFT JOINs SELECT e.EmployeeID, e.FirstName, emn.MiddleName, e.LastName, e.Salary, eb.Bonus, cn.Number, cnt.Description FROM Employee e LEFT JOIN ContactNumber cn ON e.EmployeeID = cn.EmployeeID LEFT JOIN ContactNumberType cnt ON cn.Type = cnt.Type LEFT JOIN EmployeeBonus eb ON e.EmployeeID = eb.EmployeeID LEFT JOIN EmployeeMiddleName emn ON e.EmployeeID = emn.EmployeeID
And this leads us right into the second part of Pascal's complaint concerning "how such absence is represented and manipulated in databases." This cannot be as easily dismissed in SQL as one might wish. If we eliminate the LEFT JOINs in the query above, for instance, the query fails to retrieve a complete list of employees:
--Removing the LEFT JOINs from our query SELECT e.EmployeeID, e.FirstName, emn.MiddleName, e.LastName, e.Salary, eb.Bonus, cn.Number, cnt.Description FROM Employee e INNER JOIN ContactNumber cn ON e.EmployeeID = cn.EmployeeID INNER JOIN ContactNumberType cnt ON cn.Type = cnt.Type INNER JOIN EmployeeBonus eb ON e.EmployeeID = eb.EmployeeID INNER JOIN EmployeeMiddleName emn ON e.EmployeeID = emn.EmployeeID;
The result of this query is one row:
Converting the LEFT JOINs to INNER JOINs results in a simple "all-or-nothing" proposition when querying the database. Any employees who are missing any "optional" data are not returned by the query. There are two workarounds to this problem:
Implementing the first option for the example database requires eight total queries to ensure that all the required data is returned in every required format. The possible combinations are represented in the table below, with "yes" indicating the optional information is available for the employee, and "no" indicating the optional information is not available for the employee:
You can easily calculate how many individual queries are needed for any number of optional values. The total number of individual queries needed is simply 2n where n is the number of optional values that can be represented. It is up to the client to handle the 2n different result sets returned.
The second option is marginally "better", since it allows you to return a single result set to the client. Like the first option, you still have to create eight separate queries, but you get to UNION them all together to create a "single-query" façade. Additionally, you get to show your creative side by defining several arbitrary query-specific placeholders. Here is a query based on the second option:
-- First grab the employees with Bonus, Contact Number, and Middle Name SELECT e.EmployeeID, e.FirstName, emn.MiddleName, e.LastName, e.Salary, eb.Bonus, cn.Number, cnt.Description FROM Employee e INNER JOIN ContactNumber cn ON e.EmployeeID = cn.EmployeeID INNER JOIN ContactNumberType cnt ON cn.Type = cnt.Type INNER JOIN EmployeeBonus eb ON e.EmployeeID = eb.EmployeeID INNER JOIN EmployeeMiddleName emn ON e.EmployeeID = emn.EmployeeID UNION -- Second grab the employees with Bonus, Contact Number, but no Middle Name SELECT e.EmployeeID, e.FirstName, 'NO MIDDLE NAME', e.LastName, e.Salary, eb.Bonus, cn.Number, cnt.Description FROM Employee e INNER JOIN ContactNumber cn ON e.EmployeeID = cn.EmployeeID INNER JOIN ContactNumberType cnt ON cn.Type = cnt.Type INNER JOIN EmployeeBonus eb ON e.EmployeeID = eb.EmployeeID WHERE NOT EXISTS ( SELECT 1 FROM EmployeeMiddleName emn WHERE emn.EmployeeID = e.EmployeeID) UNION -- Third grab the employees with Bonus, Middle Name, but no Contact Number SELECT e.EmployeeID, e.FirstName, emn.MiddleName, e.LastName, e.Salary, eb.Bonus, 'NO CONTACT NUMBER', 'NO CONTACT NUMBER DESCRIPTION' FROM Employee e INNER JOIN EmployeeBonus eb ON e.EmployeeID = eb.EmployeeID INNER JOIN EmployeeMiddleName emn ON emn.EmployeeID = e.EmployeeID WHERE NOT EXISTS ( SELECT 1 FROM ContactNumber cn WHERE cn.EmployeeID = e.EmployeeID) UNION -- Fourth grab the employees with Bonus, but no Contact Number or Middle Name SELECT e.EmployeeID, e.FirstName, 'NO MIDDLE NAME', e.LastName, e.Salary, eb.Bonus, 'NO CONTACT NUMBER', 'NO CONTACT NUMBER DESCRIPTION' FROM Employee e INNER JOIN EmployeeBonus eb ON e.EmployeeID = eb.EmployeeID WHERE NOT EXISTS ( SELECT 1 FROM EmployeeMiddleName emn WHERE emn.EmployeeID = e.EmployeeID) AND NOT EXISTS ( SELECT 1 FROM ContactNumber cn WHERE cn.EmployeeID = e.EmployeeID) UNION -- Fifth grab the employees with Contact Number and Middle Name, but no Bonus SELECT e.EmployeeID, e.FirstName, emn.MiddleName, e.LastName, e.Salary, -99999.99, cn.Number, cnt.Description FROM Employee e INNER JOIN EmployeeMiddleName emn ON emn.EmployeeID = e.EmployeeID INNER JOIN ContactNumber cn ON cn.EmployeeID = e.EmployeeID INNER JOIN ContactNumberType cnt ON cn.Type = cnt.Type WHERE NOT EXISTS ( SELECT 1 FROM EmployeeBonus eb WHERE eb.EmployeeID = e.EmployeeID) UNION -- Sixth grab the employees with Middle Name, but no Bonus or Contact Number SELECT e.EmployeeID, e.FirstName, emn.MiddleName, e.LastName, e.Salary, -99999.99, 'NO CONTACT NUMBER', 'NO CONTACT NUMBER TYPE' FROM Employee e INNER JOIN EmployeeMiddleName emn ON e.EmployeeID = emn.EmployeeID WHERE NOT EXISTS ( SELECT 1 FROM EmployeeBonus eb WHERE eb.EmployeeID = e.EmployeeID) AND NOT EXISTS ( SELECT 1 FROM ContactNumber cn WHERE cn.EmployeeID = e.EmployeeID) UNION -- Seventh grab the employees with Contact Number, but no Bonus or Middle Name SELECT e.EmployeeID, e.FirstName, emn.MiddleName, e.LastName, e.Salary, -99999.99, cn.Number, cnt.Description FROM Employee e INNER JOIN EmployeeMiddleName emn ON e.EmployeeID = emn.EmployeeID INNER JOIN ContactNumber cn ON cn.EmployeeID = e.EmployeeID INNER JOIN ContactNumberType cnt ON cn.Type = cnt.Type WHERE NOT EXISTS ( SELECT 1 FROM EmployeeBonus eb WHERE eb.EmployeeID = e.EmployeeID) UNION -- Finally, grab the employees with no Bonus, no Contact Number, and no Middle Name SELECT e.EmployeeID, e.FirstName, 'NO MIDDLE NAME', e.LastName, e.Salary, -99999.99, 'NO CONTACT NUMBER', 'NO CONTACT NUMBER TYPE' FROM Employee e WHERE NOT EXISTS ( SELECT 1 FROM EmployeeBonus eb WHERE eb.EmployeeID = e.EmployeeID) AND NOT EXISTS ( SELECT 1 FROM ContactNumber cn WHERE cn.EmployeeID = e.EmployeeID) AND NOT EXISTS ( SELECT 1 FROM EmployeeMiddleName emn WHERE emn.EmployeeID = e.EmployeeID)
The results look like this:
Notice the use of the arbitrary placeholders 'NO MIDDLE NAME', 'NO CONTACT NUMBER', and 'NO CONTACT NUMBER TYPE'. You would be forced to cast all bonus amounts to a character data type (VARCHAR, etc.) if you used a string like 'NO BONUS' as a missing data indicator for the Bonus column. Instead the query uses the "magic number" -99999.99 to indicate no bonus. The cost of performing this query is 0.411, or just over 8 times the cost of our original query. The query execution plan is fairly large for this one, but here's a portion of it:
At this point we've effectively eliminated all traces of NULL from the Zero-NULL database. We've also demonstrated several key points:
We did not consider the implications of adding unique constraints to the tables in this example. I'll leave it as an exercise to the reader to simulate a simple unique constraint on (FirstName, MiddleName, LastName) in the Zero-NULL database.
If you're not excited at the prospect of eliminating all NULLs yet, let's see if Sixth Normal Form provides that extra necessary motivation.
If you really want to be rigorous in your NULL elimination, decompose your database into Darwen and Date's Sixth Normal Form (6NF). A table is in 6NF if it satisfies no nontrivial join dependencies. Darwen summarizes 6NF very succinctly in his paper "How to Handle Missing Information Without Nulls":
This script creates the Zero-NULL database in 6NF:
CREATE TABLE ContactNumberType (Type CHAR(1) NOT NULL PRIMARY KEY, Description VARCHAR(20) NOT NULL); CREATE TABLE EmployeeFirstName (EmployeeID INT NOT NULL PRIMARY KEY, FirstName VARCHAR(20) NOT NULL); CREATE TABLE EmployeeMiddleName (EmployeeID INT NOT NULL PRIMARY KEY, MiddleName VARCHAR(20) NOT NULL); CREATE TABLE EmployeeLastName (EmployeeID INT NOT NULL PRIMARY KEY, LastName VARCHAR(20) NOT NULL); CREATE TABLE EmployeeSalary (EmployeeID INT NOT NULL PRIMARY KEY, Salary DECIMAL(10, 2) NOT NULL); CREATE TABLE EmployeeBonus (EmployeeID INT NOT NULL PRIMARY KEY, Bonus DECIMAL(10, 2) NOT NULL); CREATE TABLE ContactNumber (EmployeeID INT NOT NULL, Type CHAR(1) NOT NULL, Number VARCHAR(20) NOT NULL, PRIMARY KEY (EmployeeID, Type));
Queries, foreign key constraints, and unique constraints for the 6NF version of the Zero-NULL database are left as an exercise for the reader.
Getting back to Pascal's "Nulls Nullified" article, the author continues his example of storing "inapplicable data" in a SQL database by suggesting that NULLs are commonly stored in an employee database for those employees that do not earn a commission; that is, those employees for which commission is "inapplicable".
In his example Pascal is using NULL to represent a known amount of commission. Specifically, the amount is exactly 0.00. Using NULL to represent zero is misuse use of NULL that I covered in previous articles, and I won't re-hash it here. Pascal goes on to state:
He is correct to use the word if in this sentence. If, on the other hand, you use NULLs only to represent unknown values you won't ever encounter this particular issue. Just as you would not use the number "1" to represent both "1" and "2", NULL should not be used to represent both "inapplicable" and "unknown".
Dr. Codd himself recognized the inadequacy of using a single NULL mark to represent two different concepts. He later recommended the introduction of two types of NULL mark (and an accompanying 4-valued logic) to represent both "Missing but Applicable" and "Missing but Inapplicable" (Codd, The Relational Model for Database Management: Version 2).
Pascal goes on to address "defenders of NULLs":
I've seen no evidence that a single "special value" with a well-defined behavior (system-wide) incurs the "same kind of complexity in SQL" that potentially hundreds of arbitrary application-specific (or even query-specific) user-defined "special values" produces. In fact, the exact opposite appears to be true in the real world. Pascal further asserts:
As we've seen, it is not "necessary" to store NULLs in a SQL database or to even retrieve them in result sets, so long as you avoid outer joins like the plague. By the same logic, looping constructs (e.g., For...Next, While, Do...Repeat and equivalents) are "unnecessary" in procedural languages that have the equivalent of If...Then and Goto statements; therefore they should not be in their respective languages. Unless you "want to have it both ways", that is.
NULL critics are quick to point to flaws in the definitions of ANSI SQL convenience functions (like the aggregate functions) as proof that the entire concept of NULL itself is flawed. Hugh Darwen, in his presentation "The Askew Wall", takes on NULL with selected examples where SQL 3-valued logic (3VL) and NULL can cause "unexpected" results:
1. SELECT * FROM T WHERE X = Y OR NOT ( X = Y ) is not equivalent to SELECT * FROM T 2. SELECT SUM(X) + SUM(Y) FROM T is not equivalent to SELECT SUM(X + Y) FROM T 3. IF X = Y THEN 'Yes'; ELSE 'No' is not equivalent to IF NOT ( X = Y ) THEN 'No'; ELSE 'Yes'
As SQL developers, we are already aware of these issues, and none of the results they produce are actually "unexpected". Darwen's examples #1 and #3 produce the result we expect if we know the following simple rule:
It's a trivial matter to modify Darwen's examples to properly use SQL's 3-valued logic, and I'll leave that as an exercise for the reader.
SQL developers should not be surprised by Darwen's example #2 either. We already know that X + Y results in NULL if X or Y is NULL, and SUM (per the ANSI SQL definition) discards NULLs. The simple solution here (this assumes you're actually looking for solutions) is to use the aggregate function in accordance with its documented behaviors to obtain the results you seek. A couple of other options, if you feel that SUM(column) doesn't behave the way you feel it should, are:
Pascal may have said it best when he wrote:
You can easily do as Pascal suggests using standard SQL functionality. As an example here's SUM(X + Y) using COALESCE to produce the same results as SUM(X) + SUM(Y):
SELECT SUM(COALESCE(X, 0) + COALESCE(Y, 0)) FROM T
Another common example cited by NULL critics is that the SUM of the empty set should be zero. ANSI defines the result of SUM(column) on a table with no rows as NULL. The easy answer, if you want the SUM of a table with no rows to return zero instead of NULL, is to use SQL's COALESCE function, as in the following example:
SELECT COALESCE(SUM(X), 0) FROM T
Additionally, if you think SUM should return NULL if the column contains a NULL, you can use CASE to achieve that in a query as well:
SELECT CASE WHEN COUNT(*) = 0 THEN 0 ELSE CASE WHEN EXISTS ( SELECT 1 FROM T WHERE X IS NULL) THEN NULL ELSE SUM(X) END END FROM T
On SQL Server you can encapsulate such logic in a user-defined function that performs these aggregations and handles NULL in any manner you see fit.
As Pascal pointed out, aggregate function treatment of missing values is irrelevant to the discussion. Even if aggregate function NULL-handling in SQL raises your eyebrows, SQL Server provides sufficient means for creating your own user-defined functions and user-defined aggregates. With these tools you can implement your own "user-options" that treat missing values in any manner you choose.
In this article I've addressed some of the arguments made by critics of SQL NULLs, and used a short example to demonstrate what life without NULL in SQL really entails. On the other hand, there are those who would recommend a less dramatic method of dealing with NULL, namely: "learn how to use them properly, and minimize their use where practical."
To read the rest of this article, and access thousands of other articles, we ask you to register on the site and subscribe to our newsletters.
Subscribing to our newsletters gets you:
Steve Jones Editor, SQLServerCentral.com