Stairway to Data

Stairway to Data, Level 7: Data Encoding Schemes - Part I


Data is defined as information suitable for computer processing. This means it has to be encoded in a format that a computer can read. It is also nice if human beings can also read it. In the prior levels, we talked about scales and measurements. Now let's talk about how to get that stuff into the machine.

What to encode and how to do it as not always obvious. Entities can play roles in the data model. Do we encode the person receiving medical services or the policy that is paying for them? That might depend on whether the database is for the doctor or for the insurance company. Do we encode the first title of a song or the alternate title or both? Or should we include the music itself in a multimedia database? And should it be as an image of the sheet music or as an audio recording?

In the past, finding industry standards was hard. Most of them were in printed formats, not machine readable ones. If you did not know what to look for and where to go, you were out of luck. There is no reason for this today; we can do Google diligence. The problem ought to be which standardized encoding scheme to pick for a particular situation.

Where standardized encoding schemes exist, they are too often ignored in favor of some ad hoc scheme. The most recent example of this is Borders Books, now in bankruptcy. They decided to ignore the Industry Standard UPC bar codes that come pre-printed on packages. They spent the extra time and money to put their own labels on the back of their books and other merchandise.

Bad schemes give rise to errors in data entry and misreadings of outputs, and can lead to incorrect data models. Before we can discuss good encoding schemes we need to identify the bad ones.

Bad Encoding Schemes

An encoding scheme can be bad by simply not covering all the cases or range of values. A hospital system did not allow for anyone over 100 years of age, so geriatrics patients were sent to pediatrics wards. The old DOT (Dictionary of Occupational Titles) codes for occupations were created for the textile industry and before modern data processing existed.

A code can be bad because it does not allow for orderly expansion over time and circumstances. To use an actual example, the automobile tag system for a certain southern state started as a punch-card system written in COBOL. The auto tag system started with a single card column for a digit to indicate the type of tag: private car, chauffeured car, taxi, truck, public bus, and so forth. As time went on, more tag types were added for veterans of assorted wars, for State university alumni, and for whatever other pressure group happened to have the political power to pass a bill allowing it a prestige auto tag.

Soon there were more than 10 types, so a single-digit could not represent them.

But more than that, the resulting encoding was simply extended as each new tag type was enacted into law by using another non-digit character. Reporting was clumsy; there was no simple way to aggregate college, veteran, popular causes or other categories.

If the encoding scheme had been designed with two or three digits at the beginning, all of the problems would have been avoided. Talk to anyone who had to reconfigure a fixed-length record system to allow for the change from the old ZIP codes to the current ZIP+4 codes, or 10-digit UPC code to the current 13-digit EAN.

An encoding can be bad because it does not play well with others. This example was reported in Information Systems Week (ISW 1987). The first sentence told the whole story: "The chaos and rampant error rates in New York City's new Welfare Management System appear to be due to a tremendous increase in the number of codes it requires in data entry and the subsequent difficulty for users in learning to use it." The rest of the article explained how the new system attempted to merge several old existing systems. In the merger, the error rates increased from 2% to over 20% because the encoding schemes used could not be matched up and consolidated.

Another bad property is ambiguous encodings in the scheme. For example, when the ISBN (International Standard Book Number) was first created, several books were assigned the same code. Perhaps the funniest example of this problem was the Italian telephone system's attempt at a "time of day" service decades ago. They used a special three-digit number, like the 411 information number in the United States. But the three digits they picked were also those of a telephone exchange in Milan, so nobody could call into that exchange without getting the time signal before they completed their call.

This happens more often than you would think, but the form that it usually takes is that of a "miscellaneous" code that is too general. Very different cases are then encoded as identical and the user is given incorrect or misleading information when he does a query.

A bad encoding scheme lacks codes for missing, unknown, not applicable, or miscellaneous values. The classic story is the man who bought a prestige auto tag reading "NONE" and got thousands of traffic tickets as a result. The police had no special provision for a missing tag on the tickets, so when a car had no tag, they wrote "none" in the field for the tag number. The database simply matched his name and address to every unpaid missing-tag ticket on file at the time.

A NULL is not always the best way to show missing data. There are differences among "missing", "unknown", "not applicable", "miscellaneous" and erroneous values. For example, the International Classification of Disease uses 999.9 means “Other and unspecified complications of medical care, not elsewhere classified”; it means that we have diagnosed the patient, know that he has an illness, and cannot classify it; a very scary condition for the patient, but not quite the same thing as a missing disease code (just admitted, might not be sick) which is shown with 000.0, an inapplicable disease code (pregnancy complications in a male), a yet unknown disease code (sick and awaiting lab results) or an error in the diagnosis (the patient's temperature is recorded as 100 degrees Celsius, not Fahrenheit).

Encoding Scheme Types

The following is my classification system for encoding schemes and suggestions for using each of them. You will find some of these same ideas in library science and other fields, but I have never seen anyone else attempt a classification system for data processing.

Enumeration Encoding

An enumeration encoding arranges the attribute values in some order and assigns a number or a letter to each value. Numbers are usually a better choice than letters, because they can be increased without limit as more values are added. Enumeration schemes are a good choice for a short list of values, but a bad choice for a long list. It is too hard to remember a long list of codes, and very soon any natural ordering principle is violated as new values are tacked on the end.

A good heuristic is to order the values in some natural manner, if one exists in the data, so that table lookup will be easier. Chronological order (1 occurs before 2) or procedural order (1 must be done before 2) is often a good choice. Another good heuristic is to order the values from most common to least common. That way you will have shorter codes for the most common cases and easy default values. Other orderings could be based on physical characteristics such as largest to smallest, rainbow-color order, and so on.

After arguing for a natural order in the list, I must admit that the most common scheme is alphabetical order, because it is simple to implement on a computer and makes it easy for a person to look up values in a table.

The ANSI and FIPS Standards for “Identification of Counties of the United States for Information Interchange," encodes county names within a state by first alphabetizing the names, then numbering them with odd integers. Why odd numbers? So that when counties consolidate or split, it is easier to change the encoding scheme. The new county can be inserted between existing values.

Measurement Encoding

A measurement encoding is given in some unit of measure, such as pounds, meters, volts, or liters. Scales and measurement were already discussed, in painful detail in previous levels in this stairway.

But to re-iterate, it is a good idea to allow an extra decimal place in the data element's representation. Extra fractions make rounding more accurate; extra significant digits allow for expansion.

Jerry Weinberg urged us to do “reasonable checking” on data. This is not easy; Joe Cello buying 5000 bananas in his whole life is not reasonable. The Dairy Queen chain not buying 5000 bananas in a month is not reasonable.

Abbreviation Encoding

Abbreviation codes shorten the attribute values to fit into less storage space, but they are easily understood by the reader on sight. The codes can be either of fixed length or of variable length, but computer people tend to prefer fixed length. The most common example is the two-letter postal state abbreviations ('CA' for California, 'AL' for Alabama), which replaced the old variable-length abbreviations ('Calif.' for California, 'Ala.' for Alabama).

A good abbreviation scheme is very handy, but as the set of values becomes larger, the possibility for misunderstanding increases. The three-letter codes for airport baggage are pretty obvious for major cities: 'LAX' for Los Angeles, 'SFO' for San Francisco, 'BOS' for Boston, 'ATL' for Atlanta. But nobody can figure out the abbreviations for the really small airports. Often the airports are named for local  celebrities (Atlanta has “Hartsfield-Jackson International Airport” for example).

As another example, consider the ISO 3166 Country Codes, which come in two letter, three letter and non-abbreviation numeric forms. They are maintained by the RIPE Network Coordination Centre.

Algorithmic Encoding

Algorithmic encoding takes the value to be encoded and puts it through an algorithm to obtain the encodings. The algorithm should be reversible, so that the original value can be recovered. Though it is not required, the encoding is usually shorter (or at least of known maximum size) and more uniform in some useful way compared to the original value.

Encryption is the most common example of an algorithmic encoding scheme. But it is so important it needs to be considered as a topic in itself. The goal of encryption is NOT to be easily read by humans or processed by computers! Hashing is another example.

Computer people are used to using Julianized or ordinal dates, which convert a date into an integer. As an aside, please note that astronomers used the Julian Date, which is a large number that represents the number of days since a particular heavenly event, and it was not named after the Julian Calendar.

The ordinal date is a number between 1 and 365 or 366, which represents the ordinal position of the day within the year.

Algorithms take up computer time in both data input and output, but the encoding is useful in itself because it allows searching or calculations to be done that would be hard using the original data. Ordinal dates can be used for computations; Soundex names give a phonetic matching that would not be possible with the original text.

Another example is hashing functions, which convert numeric values into other numeric values for placing them in storage and retrieving them. Rounding numeric values before they go into the database is also a case of algorithmic encoding.

The difference between an abbreviation and an algorithmic encoding is not that clear. An abbreviation can be considered a special case of an algorithm, which tells you how to remove or replace letters. The tests to tell them apart are:

  1. When a human can read it without effort, it is an abbreviation.
  2. An algorithmic encoding is not easily human readable.
  3. An algorithmic encoding might return the same code for more than one value, but an abbreviation is always one-to-one.

Validation & Verification

Validation of an encoding means that the syntax is correct. Verification means that there is a trusted source or procedure for the encoding, so we know the value and the syntax is correct. Validation can be very exact or a little fuzzy.

Consider the five digit ZIP code. A quick validation can be the regular expression “zip_code CHAR(5) NOT NULL CHECK (zip_code LIKE '[0-9][0-9][0-9][0-9][0-9]')” which says that ZIP code is five digits. But this rule is not correct; the lowest ZIP code is 01001 which is for (Agawam, MA) and the highest ZIP code is 99950 (Ketchikan, AK). All zeros and all nines will never be used; they are deliberate dummy values for sorting missing data to the front or back of a list of ZIP codes. Are we done? Nope! There are unused ZIP codes between the highest and lowest code values; for example, Austin, TX has ZIP codes 78739, and 78741 but not 78740.

Lookup Tables

Lookup tables are eventually the safest way to validate encodings. But they are not always practical when there is too much data. You cannot load a database with all the valid URLs in the world, but you can write a regular expression that covers them.

I have already done an article on lookup tables, which are used with encodings to translate and verify them. Here are some quick links to related articles.

We will move on to more kinds of encoding schemes in our final level in this stairway.

This article is part of the parent stairway Stairway to Data


You rated this post out of 5. Change rating




You rated this post out of 5. Change rating