SQLServerCentral Article

Missing Data

,

We don't always have perfect knowledge of the reality we are trying to model in our database. The most general approaches to handling this missing data has been either ignore it, deduce it or impute it.

Ignoring something explains itself. In SQL, we very often use a NULL when we don’t have a value. While the use of NULLs is controversial, they do give us a uniform, consistent and well-defined set of rules for handling missing values. But from a modeling perspective, there are problems with them. Aside from special computational rules and the general principle that “NULLs propagate” can be difficult to learn, a more fundamental problem is that the general purpose NULL in SQL covers at least two very different cases. Dr. Codd later fix this in a revision of his relational model by adding A and I Types of missing values. In one case, the value is unknown, but the entity has the attribute. With this kind of NULL, it’s always possible we will discover a (known) value. The example I like to pull up is that when I’m wearing a hat,

it is always possible that I might have a hair color. To be very complete in my choice of values on this scale, I might also include “bald” or even “varicose veins, and liver spots” in the domain of my values for hair color. With this kind of NULL, it is (in theory) possible to come up with some meaningful value or token.

The second kind of NULL will never have a value because the attribute itself is missing from the entity. I can say with certainty, as a primate, I will not grow feathers and have feather colors as an option. This sort of NULL occurs in OUTER JOINs in other constructs that generate them. For example, GROUPING SETS, ROLLUP and CUBE constructs have to have a predicate to test for a NULL that was created versus one that was in the original data. GROUPING (<column reference list>) returns a binary number in which each position with a one in it has a created NULL. This might be easier to understand with a simple example of a SELECT statement done against a simple regional sales schema, in which cities exist within a region:

SELECT region_nbr,
 city_id,
 SUM(sale_amt) AS total_sales
FROM Sales AS S
GROUP BY ROLLUP (region_nbr, city_id);

Each of my cities will get a total for the amount of sales. But the ROLLUP will also add one row for each region level total and create a NULL for the city_id. Finally, there will be a grand total in which both the region number and the city columns will have generated NULLs. If I have a sale and I have a NULL region number or a NULL city identifier in the original data, then I have to use the GROUPING() predicate to tell whether or not it was created or actual data.

SELECT CASE WHEN GROUPING (region_nbr) THEN ‘region total’ ELSE region_nbr END
 AS region_nbr,
 CASE WHEN GROUPING (city_id) THEN ‘city total’ ELSE city_id END
 AS city_id,
 SUM(sale_amt) AS total_sales
FROM Sales AS S
GROUP BY ROLLUP (region_nbr, city_id);

Chris Date (Database Writings 1985–1989, Addison Wesley, 1990, ISBN 0-201-50881-8, chapter 8) gave seven kinds of missing data; value not applicable, value unknown, value does not exist, value is undefined, value is not valid, value is not supplied and the value is an empty set.

But this list is not complete. If you’ve worked with the original Sybase DATETIME, and SMALL DATETIME data types in your legacy systems, You know that there are certain temporal values which you simply cannot physically represent because of the internal storage format. DATETIME is limited to 3-1/3 milliseconds, while the modern DATETIME2 can be accurate to 100ns. You can Google around for several articles on it, if you want to know the details.

There are similar rounding and precision problems with floating-point numbers. The real number line is continuous, but approximate numbers that we use in the computer or more like dust on the number line. Fixed point numbers do not have same problem, but if you fail to declare them to the needed precision, it can be impossible to record a valid value. My favorite example was when Honda was entering the American market. Many decades ago, all Honda made were electrical generators and “a groovy little motorbike” (for young people, that was a reference to a lyric in a popular 1964 Beach Boys song about those motorbikes). The company simply assumed they would never have more than 10,000 dealerships in the United States, and that’s all the space they allowed in their COBOL files to identify a dealership. Oops!

Finally, the DATE data type is only defined in the range 0001-01-01 through 9999-12-31. Before you think that ought to be enough of a range for anything you’re going to use, I’d like to point out there is a common and very bad programming practice of using ‘9999-12-31’ as a token instead of as a date. The ISO temporal model is based on half open intervals, which have a known starting time, but might not have an ending time yet. Most common example will be an event which has not completed yet, such as checking into a hotel room, staying employed at a job, or an ongoing marriage that hasn’t ended in divorce or death. The preferred method for handling this unknown terminal date is to use a NULL and the DDL looks like this:

CREATE TABLE Events
(..
 event_start_date DATE NOT NULL,
 event_end_date DATE, -- NULL means on-going event
 CHECK(event_start_date <= event_end_date),
 ..);

As an aside, many programmers forget or never knew how UNKNOWN logical values work in a CHECK() constraint. We apply what we’ve called the “benefit of the doubt.” rule. A search condition in a CHECK() constraint that test to {true, unknown} accepts the data and only rejects rows that test to {false}. This is what lets us use NULL as the end date. This convention also lets us write things like:

DATEDIFF (DAY, event_start_date, COALESCE(event_end_date, CURRENT_TIMESTAMP))

This expression can determine the current hotel bill for a guest who hasn’t checked out yet. Obviously, an actual date can be put into a search condition. But you can also put in a CASE expression that picks a particular date based on several conditions.

Simple statistics can often provide missing data. For example, if I know the high temperatures on Monday and Wednesday of the week, It is reasonable to average them to find a high temperature on Tuesday of that week. The idea is that it’s very unlikely that one day would be either abnormally low or abnormally high. Please notice that this is smuggling in an assumption about temperatures. Not every attribute is going to have nice smooth set of values.

Another situation is where there is a strong mode (most frequent value) in the in the set of values on attribute. For example, if I have a Marine ambiguously named “Alex Mack”, I’m fairly safe in assuming that their sex code is going to be male. However, if I have a Marine named “Alexandra Mack”, then I can safely assume the sex code should be female.

Munqath Alattar and Attila Sali (Journal on Data Semantics (2020) 9.85 – 99) introduced the idea of “Strongly Possible Keys for SQL” in a paper published this year. The basic idea is that we have some missing values (NULLs), But instead of appearing in non-key values, they appear in candidate keys. Yes, I know technically I we cannot have NULLs in keys, But bear with us. They have an idea of a “possible world” in which the NULL is replaced with an actual value and it yields a proper key. To reiterate, a key is a subset of columns in the table such that none of its columns are NULL and each key is unique within the table. This was intended as a solution to a real-world problem. For example, in data warehousing applications, the database may be constructed from multiple sources which do not necessarily have the same attributes, so you’re assembling your data model from bits and pieces. The goal is to give us a possible world that is internally consistent, but not necessarily valid. The goal when constructing these possible keys is to make the fewest possible assumptions.

Floating Point Numbers

.You might have noticed that ANSI/ISO Standard SQL has three approximate data types, FLOAT, REAL and DOUBLE PRECISION. This is actually reflecting a bit of history. Before the IEEE 754 standards, each vendor had its own floating-point data types and nobody created them for data storage. These formats were all designed for computations and have rules for how special tokens are used. Floating-point numbers are based on the scientific notation (https://en.wikipedia.org/wiki/Scientific_notation) in which a number is represented as

m × 10n

or m times ten raised to the power of n, where the exponent n is an integer, and the coefficient m is any real number. The integer (n) is called the order of magnitude and the real number (m) is called the significand. The significand used to be called the mantissa, but that’s not the preferred term now because confuses another meaning for that term in logarithms. The 10 is the base and was used in early floating-point number implementations.

The IEEE 754 Standards have a binary base. They also have a wonderful set of bit configurations called NaNs, which stands for “not a number”, which have special interpretations. Systematic use of NaNs was introduced by the IEEE 754 floating-point standard in 1985, along with the representation of other non-finite quantities such as infinities. In mathematics, zero divided by zero is undefined as a real number, and is therefore represented by NaN in computing systems. The square root of a negative number is not a real number, and is therefore also represented by NaN in compliant computing systems. NaNs may also be used to represent missing values in computations.

There are two kinds of NaNs; “quiet NaNs” (qNaN) and “signaling NaNs” (sNaN). Quiet NaNs are used to propagate errors resulting from invalid operations or values. The idea is that these errors can be ignored until we get to a final result. Signaling NaNs can support advanced features such as mixing numerical and symbolic computation or other extensions to basic floating-point arithmetic.

Floating-point operations other than ordered comparisons normally propagate a quiet NaN. Most floating-point operations on a signaling NaN signal the invalid operation exception. Quiet NaNs do not raise any additional exceptions as they propagate through most operations. The exceptions are where the NaN cannot simply be passed through unchanged to the output, such as in format conversions or certain comparison operations.

Signaling NaNs raise an invalid operation exception which might then be "quieted" into a qNaN that may then propagate. Before the IEEE standards, programmers had to write explicit code to do this and it’s one of the things that made floating-point computations very difficult. An sNaN can signal an underflow, an overflow, a complex number and other such things. These things are all computational in nature, not related to data storage. The NaNs which are related to data are positive and negative infinity (±inf). These values would make much more sense than using arbitrary numeric value for a placeholder in an open interval. Likewise, it might be useful if temporal values had “± Eternity” token for its open intervals, but that’s another issue.

Encoding Schemes

When I’m teaching a class on data modeling, I always include the ISO/IEC 5218 sex codes as an example a good design. The four values are:

  • 0 = Not known
  • 1 = Male
  • 2 = Female
  • 9 = Not applicable; A lawful person such as a corporation or organization.

This encoding is embedded in several national identification numbers. The reasons for the use of zero and nine goes back to punch cards. In FORTRAN, a blank column on the punch card was read as a zero. Remember, this was a language that only had numbers. Yes, things used to be that primitive. My favorite example of this was the first ICD (International Classification of Disease) code for “cholera” was ‘000’, so when a disease was not yet diagnosed, people left those columns unpunched (blank). FORTRAN read the blanks, did statistical programs and promptly reported about a third to half of all incoming patients had cholera. This problem showed up in other encoding schemes, such as the Swedish accident reporting systems; this is where we got the myth of a high Scandinavian suicide rate.

The use of all nines guaranteed that miscellaneous codes would sort to the end of reports. Keypunch machines (and some old electric typewriters) had a feature called “typo-matic”, which meant that if you held the key down, it would repeat until that character filled a field in the punch card. A mechanical version of a default string option!

Today, were not so dependent on the mechanical implementation of the input of our data. It’s probably worth asking when you’re designing your encoding schemes, do you need more than one encoding for missing data? For example, statistically, a lack of a response on a questionnaire is statistically different from a refusal to answer that question. Likewise, questionnaires may have a very broad category. For example, imagine a form with check boxes where your age can be given as “65 and over” instead of asking for the year of birth, so the system could compute a current age.

The moral to the story is that not only do you have to design schemes for the information which you know, but you also have to design for the things that you don’t know or have.

Rate

Share

Share

Rate