Stairway to Data

Stairway to Data, Level 8: Data Encoding Schemes - Part II


In the previous level, we looked at enumeration, abbreviation and algorithmic encodings. But there are more kinds of encoding schemes.

Hierarchical Encoding

A hierarchy partitions the set of values into disjoint categories, then partitions those categories into subcategories, and so forth until some final level is reached. Such schemes are shown either as nested sets or as tree charts. Each category has some meaning in itself and the subcategories refine meaning further.

The most common example is the Dewey Decimal Classification (DDC) system, used in public libraries in the English-speaking world. It is based on numeric ranges of decimal numbers. The 500 number series covers "Natural Sciences"; within that, the 510's cover "Mathematics"; finally, 512 deals with "Algebra" in particular. The scheme could be carried further, with decimal fractions for kinds of algebra.

Hierarchical encoding schemes are great for large data domains that have a natural hierarchy. They organize the data for searching and reporting along that natural hierarchy and make it very easy. But there can be problems in designing these schemes. First of all, the tree structure does not have to be neatly balanced, so some categories may need more codes than others and hence more breakdowns. Eastern and ancient religions were shortchanged in the original DDC, reflecting a prejudice toward Christian and Jewish writings. Asian religions were pushed into a very small set of codes. Today, the Library of Congress has more books on Buddhist thought than on any other religion on Earth.

Second, you might not have made the right choices as to where to place certain values in the tree. For example, in the original DDC, books on logic are in 164, a subcategory in the philosophy section, and not under the 510s, mathematics. In the nineteenth century, there was no mathematical logic. Today, nobody would think of looking for logic under philosophy – it is math. Dewey was simply following the conventions of his day. And like today's programmers, he found that the system specifications changed while he was working.

There are two ways to model these encoding schemes. The first is a simple list and the other is range pairs. The skeleton for Dewey might be:

CREATE TABLE Library_Catalog
CHECK (ddc LIKE '[0-9][0-9][0-9].[0-9][0-9][0-9]'),
ddc_name VARCHAR(35) NOT NULL,


CREATE TABLE Library_Catalog
CHECK (ddc LIKE '[0-9][0-9][0-9].[0-9][0-9][0-9]'),
high_ddc CHAR(7) NOT NULL
CHECK (ddc LIKE '[0-9][0-9][0-9].[0-9][0-9][0-9]'),
ddc_name VARCHAR(35) NOT NULL,

I prefer the range pair approach, since the description can be at the higher level. That is,

INSERT INTO Library_Catalog
VALUES ('500.000', '599.999', 'Natural Sciences'),
('510.000', '510.999', 'Mathematics'),
('511.000', '511.999', 'Algebra'),
etc ;

However, if the hierarchy is not large or deep, it might be better to use a simple list.

The worst possible way to do these encodings is to confuse a hierarchy and a hierarchical encoding. Yes, it actually happens! A posting by Westside2008, with narrative instead of DDL asked for a CTE to query a table that stores product categories. The table has three columns whose original names were not ISO-11179 standard, so I fixed them.

CREATE TABLE Product_Categories
product_cat_name VARCHAR(20) NOT NULL,
parent_product_cat INTEGER

The request was to find the superior and subordinate categories given an input. The answer is the usual CTE pattern, one for each direction:

Subordinate_CTE (product_cat, product_cat_name, parent_product_cat)
(SELECT product_cat, product_cat_name, parent_product_cat
FROM Product_Categories
WHERE parent_id = @in_search_cat)
SELECT P.product_cat, P.product_cat_name, P.parent_id
FROM Product_Categories AS P.
Subordinate_CTE AS S
WHERE S.product_cat = P.parent_id)
-- select from the cte
SELECT product_cat, product_cat_name, parent_product_cat
FROM Subordinate_CTE;

Compare this to the query for searching the Dewey Decimal Classification:

Here is with the simple list:

SELECT ddc, ddc_name
FROM Library_Catalog
WHERE ddc LIKE @in_ddc_pattern;

If I want to find all the math books, I use SET @in_ddc_pattern = '51_.%' and I am done. Since the ddc column is a key, it is indexed, there is no need for cursors and loops (which is what a recursive CTE is internally).

If I use the range pairs, the query becomes:

SELECT ddc, ddc_name
FROM Library_Catalog
WHERE @in_ddc BETWEEN low_ddc AND high_ddc;

Vector Encoding

A vector is made up of a fixed number of components. These components can be ordered or unordered, but are almost always ordered; they can be of fixed or variable length. The components can be dependent on or independent of each other, but the code applies to a single entity. The components of the vector can be determined by punctuation, symbol-set changes, or position within the code.

An example is the ISO code for tire sizes, which is made up of a wheel diameter (scaled in inches), a tire type (abbreviation code), and a width (scaled in centimeters). Thus, 15R155 means a 15-inch radial tire that is 155 millimeters wide, whereas 15SR155 is a steel-belted radial tire with the same dimensions. In spite of the mixed American and ISO units, this is a general physical description of a tire in a single code.

Vector schemes are very informative and allow you to pick the best scheme for each component. But they have to be disassembled to get to the components. Sorting by components is hard unless you want them in the order given; try to sort the tire sizes by construction, width, and diameter instead of by diameter, construction, and width.

Another disadvantage is that a bad choice in one component can destroy the usefulness of the whole scheme. Extending the code can be hard or easy. For example, if we invent a new kind of tire material, we jut add another abbreviation since this code is variable length.

But if the tire code had to be expanded to include thickness in millimeters, where would that measurement go? Another number would have to be separated by a punctuation mark. It could not be inserted into a position inside the code without giving ambiguous codes. The code cannot be easily converted to a fixed-position vector encoding without changing many of the database routines.

Concatenation Encoding

A concatenation code is made up of a variable number of components that are concatenated together. As in a vector encoding, the components can be ordered or unordered, dependent on or independent of each other, and determined by punctuation, symbol-set changes, or position.

A concatenation code is a cross between a hierarchy and a vector. The hierarchy is traversed by additions to the right. These are also known as facet codes in Europe. Or the code can be a list of features, any of which can be present or missing. The order of the components may or may not be important.

Concatenation codes were popular in machine shops at the turn of the 20-th century: a paper tag was attached to a piece of work, and workers at different stations would sign off on their parts of the manufacturing process. Concatenation codes are still used in parts of the airplane industry, where longer codes represent subassemblies of the assembly in the head (also called the root or parent) of the code.

Another type of concatenation code is a quorum code, which is not ordered. These codes say that (n) out of (k) marks must be present for the code to have meaning. For example, three out of five inspectors must approve a part before it passes.

The most common use of concatenation codes is in keyword lists in the header records of documents in document systems. The author or librarian assigns each article a list of keywords that describe the material covered by the article. The keywords are picked from a limited, specialized vocabulary that belongs to some particular discipline.

These codes could also be ambiguous if they were poorly designed. For example, is the head of 1234 the 1 or the 12 substring? When concatenation codes are used in databases, they usually become a set of "yes/no" check boxes, represented as adjacent columns in the file. This makes them Boolean vector codes, instead of true concatenation codes.

General Guidelines for Designing Encoding Schemes

These are general guidelines for designing encoding schemes in a database, not firm, hard rules. You will find exceptions to all of them.

Existing Encoding Standards

The use of existing standard encoding schemes is always recommended. The government or industry agency that provides these codes has someone who sat down and did nothing but work on this scheme. He probably did a better job than you could. In many cases he is also the trusted source for the data.

Allow for Expansion

Allow for expansion of the codes. The ALTER statement can create more storage when a single-character code becomes a two-character code, but it will not change the spacing on the printed reports and screens. Start with at least one more decimal place or position than you think you will need. Visual psychology makes "01" look like an encoding, whereas "1" looks like a quantity.

Use Explicit Missing Values to Avoid NULLs

Avoid the SQL general NULL as much as possible by putting special values in the encoding scheme instead. SQL handles NULLs differently from values and NULLs don't tell you what kind of missing value you are dealing with.

All-zeros are often used for missing values; all-nines, for miscellaneous values. For example, the ISO gender codes are 0 = Unknown, 1 = Male, 2 = Female, and 9 = Not Applicable. "Not applicable" means a lawful person, such as a corporation, which has no gender.

Keep the Codes in the Database

There should be a part of the database that has all of the codes stored in tables. These tables can be used to validate input, to translate codes in displays, and as part of the system documentation.

I was amazed to go to a major hospital in Los Angeles in mid-1993 and see the clerk still looking up codes in a dog-eared looseleaf notebook instead of bringing them up on her terminal screen. The hospital is still using a very old IBM mainframe system, which has "dumb" 3270 terminals, rather than a client/server system with workstations. There was not even a help screen available to the clerk.

The translation tables can be downloaded to the workstations in a client/server system to reduce network traffic. They can also be used to build picklists on interactive screens and thereby to reduce typographical errors. Changes to the codes are thereby propagated in the system without anyone having to rewrite application code. If the codes change over time, the table for a code should have to include a pair of "date effective" columns. This will allow a data warehouse to correctly read and translate old data.

Use One Table per Encoding

This should be obvious. But, there is a nightmare called the “One True Lookup Table” or OTLT. It is a relative of the EAV (Entity-Attribute-Value) design flaw. All the encodings are shoved into one table for the entire DB.

CHECK() constraints become impossible, the OTLT becomes huge, and it is constantly being locked for updates. Security becomes insanely complex.

Put Constraints on the Encodings

This should be obvious, too. It lets you define business rules one place, one way, one time and pass information. Without it, imagine adding CHECK() constraints on every table that uses, say, ZIP codes or telephone numbers.

Put VIEWs on the Encodings

This is not obvious. This tip usually is part of security. A Dentist should not be using medical procedure codes for heart surgeries. But it could be an honest typo by an input clerk. Even with the translation of the code on the screen, they might not recognize the technical terms and keep on going. The right way to do this is a VIEW of dental-only codes for the Dentist office. By keeping the whole encoding scheme in one place, instead of in CHECK() constraints, you guarantee consistency.

Translate Codes for the User

In some cases, you can display just the code to the end users. For example, most people do not need to see the two-letter state abbreviation written out in full in a street address. But if the encoding is obscure, you will want to inform the end user.

This article is part of the parent stairway Stairway to Data


You rated this post out of 5. Change rating




You rated this post out of 5. Change rating