01 July 2021

10029 views

4 0

Much Ado about nothing: missing data

Joe Celko explains how missing data is handled from the printing press to databases.

Last night I saw upon the stair
A little man who wasn’t there
He wasn’t there again today
Oh, how I wish he’d go away… -Glen Miller

Human beings are not very good at handling nothing. The printing press didn’t just lead to civilization as we know it, but it also changed our mindset about text. When we wrote text manually on paper, a blank or space was not seen as a character. It was just the background upon which characters were written.

It was centuries before the zero was accepted as a number. After all, it represents the absence of a quantity or magnitude or position; how could it possibly be a number? Before it was accepted as a number, it was considered a symbol or mark in a positional notation to indicate that there was nothing in that position.

Statistics and databases cause a similar mindset change as to how to handle missing information. SQL has its NULL, but that’s only a part of the model. Is a value missing because the attribute doesn’t exist, or is it missing because we don’t know what it is? Then we get into questions about underflows, overflows, and other reasons that we can’t represent the value. We have a whole range of reasons why we don’t know something.

Let’s take a quick look at the various kinds of missing data.

Shooting blanks

When we started using printing presses with metal type, a space became a physical character. More than that, the width of the space became important. The typewriter also changed the way spacing was handled. When this was carried over to computer printers, we had to define space, newline, carriage return, backspace, and tab characters. They were encoded in the ASCII or EBCDIC that the computer used. These characters were generally lumped together under the term “white space,” and each programming language defined various rules for using them.

For example, some languages allow a string to be of length zero, while others do not. The SQL standard requires that the string be of at least length one. If you have a string of length zero, you get into some problems with substrings which we wished to avoid. SQL Server does allow an “empty” string, but I’m not sure how Microsoft handles it.

In SQL, the rule is that when you compare two strings, the short string is padded out with trailing blanks, and then the strings are matched character for character, position by position. Well, that’s just one way of doing it. The xBase languages truncate longer strings and then compare position by position. Some languages are case-sensitive so that “Job” and “job” are seen as either an Old Testament prophet or a noun.

What had previously looked like blank space on a piece of paper could now be non-printing characters of various kinds that only look like blank space. I probably can assume that everyone who has been doing data processing for any length of time has had the problem of hunting down such non-printing characters.

Zero is my hero

Our zero was originally called the “cipher” from the original Arabic word for missing. The cipher evolved into the zero. Today, zero is accepted as a number, and it has mathematical properties that you learned in elementary school. We have agreed on a few rules about zero in numeric data over the centuries:

1) Zero times any number n equals zero. (0 × n) = (n × 0) = (0 × 0) = 0 seems to be intuitive.

2) Zero divided by any non-zero number is zero. (0 ÷ n) = 0 says when you have nothing to work with, you can’t make groups of size (n). That also seemed pretty logical. Dividing by zero is “undefined” and is not allowed because there is no logical answer. The quick proof is to assume (n ÷ 0) = a, then (a × 0) = n, which violates the first rule.

The special case of (0 ÷ 0) is left undefined in mathematics, and it messes up many programming languages, too.

Things get really bad when you use IEEE floating-point numbers. They have a positive and negative zero. Finally, they have the NaN concept. The abbreviation stands for “not a number,” and they have their own rules. Generally speaking, when you get a NaN, you are expected to set up error handling when the exception flag is raised. In IEEE 754 arithmetic, (n ÷ +0) is positive infinity when (n) is positive, negative infinity when (n) is negative, and NaN when n = ±0. The infinity signs change when dividing by −0 instead. The reason is to preserve the sign of the result in case of arithmetic underflow.

The problem is that you really can’t store these special symbols in SQL. In fact, you may have trouble using them in whatever host language is invoking your SQL. We tend to forget that SQL all too often stands for “Scarcely Qualifies as a Language” because it was never intended to be used by itself.

3) When we get to exponents, things get a little more complicated. We agree that zero raised to a non-zero value 0ⁿ= 0 is defined as a string of (n) zeros multiplied together. Again, this is intuitive. To preserve the laws of exponents, we also agree that n⁰= 1 So that we can say (n^j × n^k) = (n ^(j+k)) when one of the exponents is zero. Things get tricky when we have 0⁰. Not all programming languages handle this expression the same way; some will raise an exception, some will return 1. For example, the following interpretations make sense for returning 1 in a particular context.

a) In analysis, anything to the zero power is 1. Let’s keep the rule that we know best.

b) The combinatorial interpretation of n⁰ is the number of 0-tuples of elements from a n-element set. There is exactly one empty tuple.

c) In calculus, the power rule for derivatives is:

D_x(xⁿ) = (x × x⁽ⁿ⁻¹⁾) is valid for n = 1 at x = 0 only if 0⁰ = 1. This is an argument based on limits.

4) The set-theoretic interpretation of n⁰ is the number of functions from the empty set to an n-element set; there is exactly one such function: the empty function. Since SQL and the relational model are based on set theory, this has some unique appeal for database guys.

The empty set

The empty set ∅ is the set that has no elements. It is always a subset of any other set, including itself. It even gets that special symbol.

Three of the mathematicians who developed set theory were Dedekind, Cantor, and Peano. Dedekind and Peano required their sets to be non-empty, while Cantor considered the empty set to be a valid set. Cantor’s position ended up being dominant, but the models of Dedekind and Peano can be made to work too.

There are advantages to excluding the empty set and only accepting sets if they have at least one element. The advantages are similar to excluding 0 from numbers and only accepting positive numbers. For example, when Dedekind constructed real numbers from rational numbers, he used what we now call Dedekind cuts of rational numbers. A Dedekind cut of the set of rational numbers consists of partitioning the set of rational numbers into two non-empty subsets so that all the elements in one of those subsets are less than all the elements of the other subset.

But there are advantages to regarding the empty set as being a set. The biggest one is that intersections of sets always exist. If you exclude the empty set as a set, then disjoint sets don’t have intersections, and that means every time you use intersections of sets, you have to consider two cases: the case where the sets are disjoint and don’t have an intersection, and the case where the sets are not disjoint and do have an intersection. Let’s agree that S⊆T means every element of S (the subset) is also in T, and therefore every element that is not in T is not in S either. Step-by-step, we can deduce:

S⊆T Given

∀x:((x∈S) ⟹ (x∈T)) Definition of Subset

∃x:(¬(x∈T) ⟹ ¬(x∈S)) Contra-positive

This means there is no element in S that is not also in T. Now apply this definition to the empty set. There are no elements of ∅, from the definition of the empty set.

Cantor had another reason as well to include empty sets. They worked out well in his theories of ordinal and cardinal numbers. Always having an intersection is enough of an advantage that Cantor’s position won out. Finally, Hilbert, Frege, Russell, and Zermelo all accepted the empty set.

It’s not that set theory couldn’t be done excluding the empty set, but it would be much more complicated to exclude it than to include it. When it is actually required that a set needs to be non-empty, that can easily be done by including the word “non-empty” somewhere in the problem.

Empty tables

Tables are not like their ancestral forms, the files. Let’s start with empty files and then get to empty tables. A generic disk file consists of the name of the file name in a file allocation table of some kind, followed by a pointer to the first record in that file. The rest of the records are accessed by some kind of pointer chain, which can be followed down to the last record. Those pointer chains can be single or double linked lists. Early versions of Sybase SQL Server used the single linked list model, and they acted like an early magnetic tape drive in that cursors could only read forward. This was very deliberate because the files systems were built on top of the UNIX operating system, and that was its preferred behavior. If you wanted to move to a previous record, you had to close the cursor, reopen it and scan down to the desired location. In fact, if you look at the basic cursor operations, you can map the SQL commands directly into the first IBM magnetic tape system commands.

At the end of the file, there would be an EOF (end of file) mark of some kind. Originally, it was a physical reflective stripe on the tape that was read by an electric eye. Before you make fun of this technology, I’d like to point out it was many decades ago, and that it was replacing even worse tools – paper files, punch cards, and paper tape. Please remember that the first SQL systems were built on top of this technology. The most sophisticated techniques we had at the time were simple indexing and sorting.

In these early systems, the application programming languages determined where the records were relative to the start of the tape and how to read them. A FORTRAN record did not look anything like a COBOL record.

In SQL, the description of the data is held in the DDL (data definition language) and not the application program. The DDL also maintains data integrity. The data has to be physically materialized, but in SQL, it can be virtual. In SQL, the data is not directly read into an application program. It must be converted from the SQL data types to the application program datatypes via an interface. By definition, the rows in the table have no order. The columns have to be addressed by name because there’s no order to them within a row of the table.

Files and tables are so different conceptually that I don’t understand why people confuse them. As yet another difference, an empty file consists simply of an EOF mark, but an empty table has a structure defined in its DDL and can have constraints, references, etc. It simply has no data in it. As a quick aside, the constraints on an empty table always test TRUE, and a valid table must always have at least one column so that it can have a key.

SQL did not require a key on the table because we were dealing with legacy data from files, which may or may not have a key and may or may not allow duplicate records. As the late Jim Gray once said, 30 or 40 years ago, we didn’t really know what we were doing. This is why we still have an IDENTITY table property in SQL Server – It mimics a record number in a sequential file system instead of a key in a relational system.

Chris Date uses the Tutorial D language in his writings. He has defined two special tables: Table_Dum is empty, and Table_Dee has a single empty row. In effect, they perform the function of identity elements in other formal systems.

Erland Sommarskog pointed out an interesting fluke of SQL Server on the Microsoft Q&A forum. Using the old Sybase syntax with SELECT is the only way to assign multiple variables in a single statement in T-SQL since SQL Server does not yet support row constructors. (However, someone pointed out that the FETCH statement can also assign values to multiple variables in one statement.) The ANSI/ISO standard row constructor syntax looks like this:

Scalar version:

SET (a,b,c,..)

= (1, 2, 3,..);

table version:

SET Sometable (a,b,c,..)

= (11, 12, 13,..),

(21, 22, 23. ..),

(n1, n2, n3, ..);

As long as the SELECT statement returns a single value, it works like the ANSI Standard SET statement. You get a scalar assignment, and more than one row will raise an error. But when the SELECT returns multiple rows, it’s indeterministic, and you’re not sure what value is assigned to the variable. The real problem is when the SELECT returns no rows at all: this leaves the variable unchanged, which can cause some surprises if you are not prepared for it. You might have expected this to be an error or that it would assign defaults to the variables.

Conclusion

When you’re generating test data in SQL, it’s probably a good idea to include an empty table and a table with all NULLs. When you’re generating test data from non-RDBMS applications, you’d like to test values at the boundaries of your conditions. If you have a test for age, see how your system treats somebody with a birthday today. See how zeros and negative values work, and check your error handling. In addition to those problems, you also have to look out for empty tables, NULLs, and cardinality problems in SQL. RDBMS is not for wimps.

References

Date, C.J.; “Databases In-Depth; Relational Theory for Practitioners”; ISBN 0-596-0012-4.

Date, C.J.; “Database Design and Relational Theory”; ISBN 978-1–449-32801-6.

Much Ado about nothing: missing data

Shooting blanks

Zero is my hero

The empty set

Empty tables

Subscribe for more articles

Rate this article

Joe Celko

Related articles

Do not be surprised

Database Subsetting and Data Extraction

Tags