In the final step of Database Design, Joe Celko gives a simple but effective explanation of the normalization process and why it is important.
Where is the wisdom?
Lost in the knowledge.
Where is the knowledge?
Lost in the information.
— T. S. Eliot —
Where is the information?
Lost in the data.
Where is the data?
Lost in the #@%&! database!
— Joe Celko —
So I am not the poet that T. S. Eliot was, but I can safely say that he never wrote a computer program. However, I agree with his point about wisdom and information. And had he known the distinction between data and information, I like to think that he would have agreed with mine.
I find it strange that most database programmers do not know about data qua data. The closest thing they are taught is the basic data types in some particular programming language. In fact, I have found that most of the younger programmers have had nothing beyond basic descriptive statistics. Beyond that, nobody is taught about measurement theory, encoding scheme design, check digits, the standardization process, or anything else related to data qua data.
What is Data?
“Data! Data! Data! I cannot make bricks without clay.”
— Sherlock Holmes (fictional detective of author Sir Arthur Conan Doyle)
The dictionary definition is that data is facts. Well, that is pretty generic. For purposes of databases, let’s make it more exact, more narrow. We want facts represented in such a way that they can be manipulated by a computer. Even more than that, we want a digital computer doing that manipulation. We want symbols in a linear language with known, deterministic rules of manipulation.
This narrowed definition disallows maps, pictures, video, music, dance, speeches, literature and a lot of good things that give us information. Life is full of trade-offs. The trade-off here is that a machine can do the heavy lifting for us.
Continuous versus Discrete
The most basic distinction we can make is between things that are continuous versus those that are discrete. A discrete set has individual members that you can pick out. Look at a six pack of beer; each can of beer is distinct. We like discrete data because it can be made digital instantly. Each value gets a symbol that we can manipulate. In theory, a set can be countably infinite, but we do not actually find infinite sets lying around in the database world.
In a continuum, you can find individual members, but there are always more members. Look at an analog wall clock; what time is it exactly? The second hand is always moving so the best you can do is to round the time to the nearest minute. Or 1/10 of a second if you are using a stopwatch at a track meet. Or 1/100 of a second if you are a photographer. Maybe nanoseconds if you are an electrical engineer.
This disconnection between continuous and discrete data is where Calculus came from. But for database purposes, we have to find ways of making a continuum fit into a discrete digital model. Keep this problem in mind when we get to floating point numbers and temporal data.
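To see what forcing a continuum into a discrete model looks like in practice, here is a minimal sketch. The `quantize` helper and the sample reading of 37.2846 seconds are invented for illustration; the point is that the same instant becomes a different stored value depending on the granularity you pick, and everything finer than the chosen step is simply lost.

```python
from decimal import Decimal, ROUND_HALF_UP

def quantize(reading, step):
    """Snap a continuous reading onto a discrete grid of the given step size."""
    d = Decimal(str(reading))
    q = Decimal(str(step))
    return (d / q).quantize(Decimal("1"), rounding=ROUND_HALF_UP) * q

# The same instant stored at three different granularities:
print(quantize(37.2846, 1))     # wall clock, nearest second:  37
print(quantize(37.2846, 0.1))   # stopwatch, nearest 1/10 s:   37.3
print(quantize(37.2846, 0.01))  # camera, nearest 1/100 s:     37.28
```

Once the value is stored, there is no way back: the database holds 37.3, not the "real" 37.2846, and no query can recover the difference.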
Quantitative versus Qualitative
Quantitative data allows for mathematical operations. It makes sense to “do math on” things. It makes sense to add kilograms to kilograms and get kilograms. This is possible because a quantity of a thing is uniform in its nature. I have a dozen eggs.
Qualitative data assumes things are not uniform. Individual elements in the set have attributes that vary from one element to the next. For example, given two eggs, one is a Grade ‘A’ egg and the other is a Grade ‘B’ egg. Sometimes this is a summary of quantitative factors – shape, size, weight, color, etc. – but it can also be a judgment call. Is an all-A omelet as good as an A-B omelet with truffles?
I will argue that the first databases were the precursors to written language which were found in the Middle East (see WRITING: THE STORY OF ALPHABETS AND SCRIPTS by Georges Jean, ISBN 978-0500300169). Shepherds keeping community flocks needed a way to manipulate ownership of the animals, so that everyone knew who owned how many rams, ewes, lambs and whatever else. Rather than branding the individual animals, as Americans did in the West, each member of the tribe had a set of baked clay tokens which represented ownership of one animal, but not of any animal in particular.
When you see the tokens, your first thought is that they are a primitive internal currency system. For a funny look at this, link over to:
This is true in part, because the tokens could be traded for other goods and services. But their real function was as a record keeping system, not as a way to measure and store economic value. That is, the trade happened first, then the tokens were changed and not vice versa.
The tokens had all the basic operations you would expect in a database. The tokens were updated when a lamb grew to become a ram or ewe, deleted when an animal was eaten or died, and new tokens were inserted when the new lambs were born in the spring.
One nice feature of this system is that the mapping from the model to the real world is one to one and could be done by a person who cannot count or read. He had to pass the flock through a gate and match one token to one animal; we would call this a “table scan” in SQL. He would hand the tokens over to someone with more math ability — the CPU for the tribe — who would update everyone’s set of tokens. The rules for this sort of updating can be fairly elaborate, based on dowry payments, oral traditions, familial relations, shares owned last year, etc.
The tokens were stored in soft clay bottles that were pinched shut to ensure that they were not tampered with once accounts were settled; we would call that “record locking” in a DBMS.
Data versus Information
Information is what you get when you distill data. A collection of raw facts does not help anyone to make a decision until it is reduced to a higher level abstraction. My sheep herders could count their tokens and get simple statistical summaries of their holdings (“Larry owns 15 ewes, 2 rams, and 13 lambs”) which are immediately useful, but it is very low level information.
If Larry collected all his data and reduced it to information for several years, then he could move up one more conceptual level and make more abstract statements like, “In the years when the locusts come, the number of lambs born is less than in the following two years,” which are of a different nature than a simple count. There is a long time horizon into the past and an attempt to make predictions about the future. The information is qualitative and not just quantitative.
Please do not think that qualitative information is to be preferred over quantitative information. SQL and the relational database model are based on sets and logic. This makes SQL very good at finding set relations, but very weak at finding statistical and other relations. A set relation might be an answer to the query “Do we have people who smoke, drink and have high blood pressure?”, which gives an existence result. A similar statistical query would be “How are smoking and drinking correlated to high blood pressure?”, which gives a numeric result that is more predictive of future events.
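The difference between an existence result and a statistical result can be sketched in a few lines. The patient records below are invented, and the correlation is computed by hand rather than through SQL, but the contrast is the same one the two queries above would show: the set question yields a yes/no answer, while the statistical question yields a number you can use for prediction.

```python
# Hypothetical patient records, invented for illustration.
patients = [
    {"name": "Al",  "smokes": 1, "drinks": 1, "high_bp": 1},
    {"name": "Bea", "smokes": 0, "drinks": 1, "high_bp": 0},
    {"name": "Cy",  "smokes": 1, "drinks": 0, "high_bp": 1},
    {"name": "Dee", "smokes": 0, "drinks": 0, "high_bp": 0},
]

# Set relation (what an SQL EXISTS answers): is there anyone with all three?
exists = any(p["smokes"] and p["drinks"] and p["high_bp"] for p in patients)

# Statistical relation: Pearson correlation between smoking and high blood pressure.
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

r = pearson([p["smokes"] for p in patients],
            [p["high_bp"] for p in patients])
print(exists)  # an existence result: True or False
print(r)       # a numeric result between -1.0 and 1.0
```

In this toy data, smoking and high blood pressure coincide exactly, so the correlation comes out at 1.0; a real data set would land somewhere in between, which is exactly the kind of answer a pure set operation cannot give.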
Information versus Wisdom
“Wisdom is putting knowledge into action”
– Bubishi: The Classic Manual of Combat by Patrick McCarthy (ISBN 978-0804838283)
Wisdom does not come out of the database or out of the information in a mechanical fashion. It is the insight that a person has to make from information to handle totally new situations. I teach data and information processing; I do not teach wisdom. However, I can make a few remarks about bad data and errors in the design of the data.
Errors in Data
Statisticians classify experimental errors as type I and type II. A type I error is accepting as false something which is true. A type II error is accepting as true something which is false. These are very handy concepts for database people, too.
The classic Type I database error is the installation in concrete of bad data, accompanied by the inability or unwillingness of the system to correct the error in the face of the truth. Anyone who has tried to correct a billing error knows this problem all too well.
The Type II error can be deliberate data fraud. But more often than not, it is internal problems. My favorite example was a major credit card company. The account data had a column for the date of the last usage. If the card had not been used in a certain length of time, it was automatically canceled.
They came out with a new card product and sent them out to old customers who requested them. The new cards needed customer data. The solution was obvious and simple; copy the existing data from the old account, without the balances, into the new account. The first batch of new card orders came in. Some orders were for replacement cards (close old account, open new), some were for the new card without any prior history and some were for the new account option while retaining the old account.
One of the columns was the date of first membership (the company thought that this date was very important since they used it in their marketing and advertising). The brand new accounts were fine. The replacement accounts were fine. But the members who picked the two card option were a bit distressed. The date that the system had to use as “date of last card usage” was the date that the original account was opened. This was almost always several years ago, since you needed a good credit history with the company to get offered the new card.
Before the shiny new cards had been printed and mailed out, the customers were getting drop letters on their new accounts. The switchboard in customer service looked like a Christmas tree. This is a Type II error – accepting as true the falsehood that the last usage date was the same as the acquisition date of the credit card.
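The mechanics of that failure fit in a few lines. The dates, the column names and the 180-day inactivity window below are all invented, but the shape of the bug is the one described above: the copy routine reuses the account-open date as the last-usage date, and the auto-cancel rule dutifully fires on the falsehood.

```python
from datetime import date, timedelta

CANCEL_AFTER = timedelta(days=180)  # assumed inactivity window for auto-cancel

old_account = {"member_since": date(2015, 3, 1),
               "last_usage":   date(2023, 9, 20)}

# The flawed copy: the new account inherits the *open* date as "last usage".
# This is the Type II error - accepting a false value as true.
new_account = {"member_since": old_account["member_since"],
               "last_usage":   old_account["member_since"]}

today = date(2023, 10, 1)
cancelled = today - new_account["last_usage"] > CANCEL_AFTER
print(cancelled)  # the batch job cancels the card before it is even mailed
```

Had the copy used `old_account["last_usage"]` instead, the new card would have survived; the data type was right, the constraint was right, and only the meaning of the value was wrong.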
Measurement Theory
Measure all that is measurable and attempt to make measurable that which is not yet so.
— Galileo (1564 to 1642)
Technically, Measurement Theory is a branch of applied mathematics that is useful in data analysis. This leads us to scales, which will be discussed later and in detail. For now, let’s look at general properties of measurements.
Strange as it might seem, measurement theory came from psychology, not mathematics, statistics or computer science. In particular, S. S. Stevens originated the idea of levels of measurement and classification of scales. Scales are classified into types by the properties they do or do not have. The properties with which we are concerned are the following.
- A natural “origin point” on the scale. This is sometimes called a zero, but it does not have to be literally a numeric zero. For example, if the measurement is the distance between objects, the natural zero is zero meters — you cannot get any closer than that. If the measurement is the temperature of objects, the natural zero is zero kelvin (0 K) — nothing can get any colder than absolute zero. However, consider time; it goes from an eternal past into an eternal future, so you cannot find a natural origin for it.
- Meaningful operations that can be performed on the units. It makes sense to add weights together to get a new weight. However, adding shoe sizes together is absurd.
- A natural ordering of the units. It makes sense to speak about an event occurring before or after another, or an object being heavier, longer, or hotter than another. But saying that Vanilla is better than Chocolate makes no sense.
- A natural metric function on the units. A metric function has nothing to do with the “metric system” of measurements, which is more properly called SI, for “Système International d’Unités” in French. Metric functions have the following three properties:
- The metric between an object and itself is the natural origin of the scale. We can write this in a semi-mathematical notation as M(a, a) = 0.
- The order of the objects in the metric function does not matter. Again in my informal notation, M(a, b) = M(b, a).
- There is a natural additive function and ordering that obeys the rule that M(a, b) + M(b, c) >= M(a, c), which is also known as the triangle inequality.
This notation is meant to be more general than just arithmetic. The “zero” in the first property is the origin of the scale, not just a numeric zero. The third property, defined with a “plus” and a “greater than or equal” sign, is a symbolic way of expressing more general ordering relationships. The “greater than or equal” sign refers to a natural ordering on the attribute being measured. The “plus” sign refers to a meaningful operation in regard to that ordering, not just arithmetic addition.
The special case of the third property, where the “greater than or equal to” is always “greater than”, is very desirable to people because it means that they can use numbers for units and do simple arithmetic with the scales. This is called a strong metric property.
It is possible to have a scale that has any combination of the metric properties. For example, instead of measuring the distance between two places in meters, measure it in units of effort – ancient Chinese uphill/downhill units. Does it have the property that M(a, a) = 0? Yes; it takes no effort to get to where you already are located. Does it have the property that M(a, b) = M(b, a)? No; it takes less effort to go downhill than to go uphill. Does it have the property that M(a, b) + M(b, c) >= M(a, c)? Yes; the effort needed to go directly to a place is never more than the effort of making another stop along the way.
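The three checks just walked through can be run mechanically. The point names and effort numbers in this table are invented for illustration; the code simply tests each property over every pair (and triple) of points, and confirms that the effort scale keeps properties 1 and 3 while losing symmetry.

```python
# A toy "effort" table between three points on a hill (invented numbers):
# going uphill costs more than coming back down.
effort = {
    ("base", "base"): 0, ("camp", "camp"): 0, ("peak", "peak"): 0,
    ("base", "camp"): 5, ("camp", "base"): 2,
    ("camp", "peak"): 7, ("peak", "camp"): 3,
    ("base", "peak"): 11, ("peak", "base"): 4,
}
points = ["base", "camp", "peak"]

def M(a, b):
    return effort[(a, b)]

# Property 1: M(a, a) = 0 -- no effort to stay where you are.
p1 = all(M(a, a) == 0 for a in points)

# Property 2: M(a, b) = M(b, a) -- fails here: downhill is cheaper than uphill.
p2 = all(M(a, b) == M(b, a) for a in points for b in points)

# Property 3: triangle inequality, M(a, b) + M(b, c) >= M(a, c).
p3 = all(M(a, b) + M(b, c) >= M(a, c)
         for a in points for b in points for c in points)

print(p1, p2, p3)  # True False True
```

A scale with properties 1 and 3 but not 2 is what mathematicians call a quasimetric; swapping the effort table for one with symmetric entries would make all three checks pass.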
Now, your first homework assignment is to come up with examples for the other cases. I offer an autographed copy of SQL FOR SMARTIES (4th edition) for examples of this.
Range, Granularity, Accuracy and Precision
Range, granularity, accuracy and precision are properties of a measurement, whether it is discrete or continuous. They have more to do with taking the measurement than with the property being measured. This will eventually become the choice of data types, policies, procedures, constraints and all the other machinery that a good DBA uses to obtain data integrity.
Measurements (and the tools that take the measurement) have a range — what are the highest and lowest values that can appear on the scale? Database designers do not have infinite storage, so we have to pick a subrange to use in the database when we have no upper or lower bound. For example, very few SQL calendar functions will handle geological time periods. But then very few companies have bills that have been outstanding for that long, either, so we do not mind. Temperatures go from absolute zero to infinity, but weather reports are kept within a much narrower range.
Look at a ruler and a micrometer. They both measure length, using the same SI unit scale, but there is a difference. A micrometer is more precise because it has a finer granularity of units. Granularity is a static property of the scale itself — how many notches there are on your ruler.
Carrying things to an insane number of decimal places is a waste, but it looks impressive (Mother Celko’s Heuristic: a statistic impresses people with the square of the decimal places).
Accuracy is how close the measurement, repeated again and again, comes to the actual value. Sometimes the “actual value” is a bit vague, but often it is not. There is an issue of timeliness – “Is that really my bank balance? I just made a deposit!”
Precision is a measure of how repeatable a measurement is. This is different from accuracy. If you ask people for their birth date, they will give you an answer that is verified by a passport, driver’s license, the testimony of living men, etc. Decade after decade, they will get the same answer. Then they find out that the original birth certificate was wrong.
Think of the pellets in the blast from a good shotgun. The choke is right, but the sights are bent. You can get a tight pattern off of the bull’s eye.
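The shotgun picture translates directly into two different statistics. The instruments and readings below are invented for illustration: the "tight pattern off the bull's eye" instrument has a small spread (high precision) but a large offset from the true value (poor accuracy), while the other instrument is the reverse.

```python
import statistics

true_value = 100.0

# Two hypothetical instruments measuring the same quantity five times:
tight_but_biased   = [103.1, 103.0, 103.2, 102.9, 103.1]  # precise, not accurate
loose_but_centered = [98.0, 101.5, 99.2, 102.0, 99.5]     # accurate, not precise

def accuracy_error(readings):
    """Accuracy: distance of the average reading from the true value."""
    return abs(statistics.mean(readings) - true_value)

def precision_spread(readings):
    """Precision: repeatability, measured as the standard deviation."""
    return statistics.stdev(readings)

print(accuracy_error(tight_but_biased), precision_spread(tight_but_biased))
print(accuracy_error(loose_but_centered), precision_spread(loose_but_centered))
```

The first instrument wins on spread and loses on offset; the second does the opposite. A good instrument, like a good measurement procedure in a database, needs both numbers to be small.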
This was the high-level overview of measurements in the abstract. The next step is to get back to the hardware and look at numeric data from a database/computer viewpoint.