Go back to level one of this stairway and review range, precision, accuracy, granularity and other terms related to a generic measurement: not the particular property being measured (say, distance), nor the scale used (millimeters or light-years), nor the tool used (micrometer or radio telescope).
I will not be talking about the tools; that is hardware and I only do software. But I do worry about the nature of the property being measured and modeled in a database. And I need to know about the scales used.
The reason for getting into data types in the prior levels was so that we can discuss how to get these measurements into a database.
There are often many scales for the same property. The lack or presence of precision and accuracy determines the scale you should choose. Scales are either quantitative or qualitative. The quantitative scales are what most people mean when they think of measurements, because these scales can be manipulated and are usually represented as numbers. Qualitative scales do not allow for computations -- just comparisons.
The simplest scales are the nominal scales. They simply assign a unique symbol, usually a number or a name, to each member of the set that they attempt to measure. For example, a list of city names is a nominal scale. In English, it's just a name.
Right away we are into philosophical differences, because many people do not consider naming to be measurement. Since there is no clear property except existence being measured, that school of thought would tell us this cannot be a scale.
There is no natural origin point for a set of names, no comparison rules other than equality and no ordering. We tend to use alphabetic ordering for names if we have a language with an alphabet. I had a friend who taught American Culture in Red China before the Tiananmen Square Massacre. Her class roster was translated into Pinyin for her. Each day, she got a new roster; it was never sorted or even ordered the same way twice. Alphabetical order is an incredibly powerful tool for handling data.
Nominal scales are very common in databases because they are used for unique identifiers, such as names and descriptions.
Now for a bit of culture
Abou Ben Adhem
--Leigh Hunt (1784-1859)
Abou Ben Adhem (May his tribe increase)
Awoke one night from a deep dream of peace
And saw, within the moonlight in his room,
Making it rich, and like a lily in bloom,
An Angel writing in a book of gold:
Exceeding peace had made Ben Adhem bold,
And to the Presence in the room he said,
"What writest thou?" The Vision raised its head,
And with a look made of all sweet accord
Answered, "The names of those who love the Lord."
"And is mine one?" said Abou. "Nay, not so,"
Replied the Angel. Abou spoke more low,
But cheerily still; and said, "I pray thee, then,
Write me as one that loves his fellow men."
The Angel wrote and vanished. The next night
It came again with a great wakening light.
And showed the names whom love of God had blessed,
And, lo! Ben Adhem's name led all the rest!
And do you know why Ben Adhem's name led all the rest? Because God files data in alphabetical order.
The next simplest scales are the categorical scales. They place an entity into a category which is assigned a unique symbol, usually a number or a name. For example, the class of animals might be categorized as reptiles, mammals and so forth. The categories have to be within the same class of things to make sense.
Again, many people do not consider categorizing to be measurement. The categories are probably defined by a large number of properties and there are two potential problems with them.
The first problem is that an entity might fall into more than one category. For example, a platypus is a furry, warm-blooded, egg-laying animal. Mammals are warm-blooded, but give live birth and optionally have fur. The second problem is that an entity might not fall into any of the categories at all. If we find a creature with chlorophyll and fur on Mars, we do not have a category of animal in which to place him.
The two common solutions are either to create a new category of animal (monotremes for the warm-blooded, egg-laying platypus and echidna) or to allow an entity to be a member of more than one category. The second is the less desirable of the two.
The categories have the same characteristics as the nominal scale. There is no natural origin and no natural linear ordering. The only meaningful operation that can be done with such a scale is a test for membership or subsets. Categories and sub-categories give us a non-linear ordering or nesting. This will come up again when I get to hierarchical encoding schemes.
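As a rough illustration of the membership test and the nesting of categories, here is a minimal Python sketch. The dictionaries and the is_member() helper are hypothetical names made up for this example, and the tiny taxonomy simply follows the "new category" solution described above; it is not meant as real zoology.

```python
# Hypothetical nesting: each category names its parent (None marks the top).
PARENT = {
    "animal": None,
    "mammal": "animal",
    "reptile": "animal",
    "monotreme": "animal",
}

# Each entity is assigned to exactly one category.
CATEGORY_OF = {
    "platypus": "monotreme",
    "echidna": "monotreme",
    "iguana": "reptile",
}


def is_member(entity, category):
    """Return True if the entity's category is `category` or is nested inside it."""
    current = CATEGORY_OF.get(entity)  # None if the entity fits no category
    while current is not None:
        if current == category:
            return True
        current = PARENT[current]
    return False


print(is_member("platypus", "animal"))        # walks up the nesting
print(is_member("platypus", "mammal"))        # sibling category, not a parent
print(is_member("furry martian", "animal"))   # entity with no category at all
```

Notice that the only questions the code can answer are "is it in this category?" and "is this category inside that one?"; nothing here supports sorting or arithmetic.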
More culture, when you least expected it!
From the essay "The Analytical Language of John Wilkins" by Jorge Luis Borges:
"These ambiguities, redundancies, and deficiencies recall those attributed by Dr. Franz Kuhn to a certain Chinese encyclopedia entitled Celestial Emporium of Benevolent Knowledge. On those remote pages it is written that animals are divided into (a) those that belong to the Emperor, (b) embalmed ones, (c) those that are trained, (d) suckling pigs, (e) mermaids, (f) fabulous ones, (g) stray dogs, (h) those that are included in this classification, (i) those that tremble as if they were mad, (j) innumerable ones, (k) those drawn with a very fine camel's hair brush, (l) others, (m) those that have just broken a flower vase, (n) those that resemble flies from a distance."
An absolute scale is a count of the elements in a set. Its natural origin is zero, or the empty set. The count is the ordering (a set of five elements is bigger than a set of three elements, and so on). Addition and subtraction are metric functions (if you don't remember what a metric function is, go back to level #1). Each element is taken to be interchangeable. For example, when you buy a dozen Grade A eggs, you assume that for your purposes any Grade A egg will do the same job as any other Grade A egg.
Again, absolute scales are very common in databases because they are used for quantities.
Here is where we get into traditional units. During Metrication in the UK, a dairy tried to sell eggs in packages of ten eggs, instead of by the dozen. People would not accept the change, even though the cost per egg was lower. Likewise, we have the gross (12 dozen), quire (24 or 25 sheets of paper) and so forth. But a ream of paper was formerly 480 sheets, now is 500 sheets and there is a printer's ream of 516 sheets to allow for waste in printing.
Ordinal scales put things in order, but have no origin and no operations. For example, geologists use the Mohs Scale of Hardness (MSH for short) for minerals in field work. The tool is a box with compartments for a set of standard minerals, which are ordered by relative hardness (talc = 1, gypsum = 2, calcite = 3, fluorite = 4, apatite = 5, feldspar = 6, quartz = 7, topaz = 8, sapphire = 9, diamond = 10). To measure an unknown mineral, you try to scratch the polished surface of one of the standard minerals with it; if it scratches the surface, the unknown rock is harder. Notice that I can get two different unknown minerals with the same measurement that are not equal to each other, and that I can get minerals that are softer than my lower bound or harder than my upper bound. There is no origin point and operations on the measurements make no sense (e.g., if I add 10 talc units I do not get a diamond).
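One way to picture "ordering but no operations" is a type that allows comparison and refuses arithmetic. The sketch below is only an illustration in Python; the class name Mohs and its layout are my own assumptions, and the values are the standard minerals listed above.

```python
from enum import Enum


class Mohs(Enum):
    """Standard minerals of the hardness scale; only the order is meaningful."""
    TALC = 1
    GYPSUM = 2
    CALCITE = 3
    FLUORITE = 4
    APATITE = 5
    FELDSPAR = 6
    QUARTZ = 7
    TOPAZ = 8
    SAPPHIRE = 9
    DIAMOND = 10

    # Comparisons are defined explicitly; arithmetic is deliberately left out.
    def __lt__(self, other):
        return self.value < other.value if isinstance(other, Mohs) else NotImplemented

    def __le__(self, other):
        return self.value <= other.value if isinstance(other, Mohs) else NotImplemented

    def __gt__(self, other):
        return self.value > other.value if isinstance(other, Mohs) else NotImplemented

    def __ge__(self, other):
        return self.value >= other.value if isinstance(other, Mohs) else NotImplemented


print(Mohs.QUARTZ > Mohs.CALCITE)   # comparing hardness is a legal question

try:
    Mohs.TALC + Mohs.TALC           # adding "talc units" is not
except TypeError as err:
    print("no arithmetic on an ordinal scale:", err)
```

Storing the bare integers 1 through 10 would let someone sum or average them by accident, which is exactly the mistake the scale does not support.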
Perhaps the most common use we see of ordinal scales today is to measure preferences or opinions. You are given a product or a situation and asked to decide how much you like or dislike it, how much you agree or disagree with a statement, and so forth. This is the Likert scale and it is usually given a set of labels such as "strongly agree" through "strongly disagree", or the labels are ordered from 1 to 5. Scales with more than five marks tend to give answers that are not as reproducible.
Consider pairwise choices between ice cream flavors. Saying that vanilla is preferred over wet moldy leather in our taste test might well be expressing a universal truth, but there is no objective unit of "likeability" to apply. The lack of a unit means that such things as opinion polls that try to average such scales are meaningless. The best you can do is a bar graph of the number of respondents in each category. The most meaningful statistic is a weighted median.
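Here is a small Python sketch of the summaries that do make sense for such a scale: a count per category and a median. The responses are made-up data, the label wording is an assumption, and the median shown is the plain (unweighted) one from the standard library rather than a weighted median.

```python
from collections import Counter
from statistics import median_low

# Hypothetical five-point labels; the numbers only record the order.
LABELS = {
    1: "strongly disagree",
    2: "disagree",
    3: "neutral",
    4: "agree",
    5: "strongly agree",
}

# Made-up survey answers for illustration.
responses = [5, 4, 4, 2, 5, 3, 4, 1, 4, 5]


def bar_chart(answers):
    """One text bar per category, showing how many respondents chose it."""
    counts = Counter(answers)
    return [f"{LABELS[k]:<17} {'#' * counts[k]}" for k in sorted(LABELS)]


for line in bar_chart(responses):
    print(line)

# median_low always returns one of the actual answers, so no averaging happens.
middle = median_low(responses)
print("median response:", LABELS[middle])
```

The design choice worth noticing is median_low() instead of mean(): a mean of 3.7 would pretend there is a unit of "agreement" between the marks.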
Another problem is that an ordinal scale may not be transitive. Transitivity is the property of a relationship in which if R(a, b) and R(b, c) then R(a, c). We like this property, and expect it in the real world where we have relationships like "heavier than", "older than", and so forth. This is the result of a strong metric property.
But an ice cream taster, who has just found out that the shop is out of vanilla, might prefer squid over wet leather, prefer wet leather over wood, and yet prefer wood over squid, so there is no metric function or linear ordering at all. Again, we are into philosophical differences, since many people do not consider a non-transitive relationship to be a scale.
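To make the definition concrete, here is a minimal Python check for transitivity over a set of pairwise preferences. The function name is_transitive() and the data are invented for this sketch; each pair (a, b) is read as "a is preferred over b", and only triples of three distinct items are examined.

```python
from itertools import permutations


def is_transitive(prefers):
    """Check that R(a, b) and R(b, c) always imply R(a, c) for distinct a, b, c."""
    items = {x for pair in prefers for x in pair}
    return all(
        (a, c) in prefers
        for a, b, c in permutations(items, 3)
        if (a, b) in prefers and (b, c) in prefers
    )


# The taster's choices from the text: a cycle, so no linear ordering exists.
taster = {
    ("squid", "wet leather"),
    ("wet leather", "wood"),
    ("wood", "squid"),
}

# A hypothetical "older than" relation, which behaves the way we expect.
older_than = {
    ("Tom", "Dick"),
    ("Dick", "Harry"),
    ("Tom", "Harry"),
}

print(is_transitive(taster))
print(is_transitive(older_than))
```

A check like this is cheap to run on survey data before anyone tries to sort the flavors into a single list.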
If you are interested in the problems of non-transitive relationships, look at the book “Wheels, Life, and Other Mathematical Amusements” (Martin Gardner, 1983, ISBN 978-0716715894), which has a chapter on non-transitive relationships. Arrow's Paradox in voting theory shows that there are still problems even when an individual's preferences are well ordered.
Rank scales have an origin and an ordering, but no natural operations. You can think of them as an ordinal scale with an origin. The most common example of this would be military ranks. Nobody is lower than a private and that rank is a starting point in your military career, but it makes no sense to somehow combine three privates to get a sergeant.
Rank scales have to be transitive: a sergeant gives orders to a private, and since a major gives orders to a sergeant, he can also give orders to a private. You will see ordinal and rank scales grouped together in some of the literature if the author does not allow non-transitive ordinal scales. You will also see the same fallacies committed when people try to do statistical summaries of such scales.
Scales versus Relative Positions
A measurement is made on an entity; statistics are made on a set. Do not confuse the two. Let's use four army buddies and try to measure some of their attributes on different scales. Let us start with their military ranks:
Tom = sergeant
Dick = corporal
Richard = corporal
Harry = private
Clearly, Tom has the highest rank and Harry has the lowest. Dick and Richard have the same rank. If Tom or Harry were to get shot in combat, Dick and Richard would still have the rank of corporal. Likewise, the rank of corporal would still be on the scale even if there were no members at that level. A ranking can have zero or more elements in each unit of the scale.
The boys now get out of the Army and go to school. The grades on their first report card are
Tom = A
Dick = B
Richard = B
Harry = C
Clearly, Tom is the top student. But note that Harry is fourth in his class because there are three students with better grades than his. If Richard drops out of school, Harry moves up to third in the class.
Class standing (DENSE_RANK() in SQL) is not a ranking; it is an ordering. Each position in the ordering must have one or more members, so there has to be a promotion rule when an element is taken from the data set. The convention in schools has been to give both Dick and Richard the honor of being second in their class, with Harry as third. That is, we fill all the gaps from first to n-th standing with one or more data items.
Now the boys go to work for the same company as salesmen. Their boss has a sales contest, in which they perform like this:
Tom = $100,000
Dick = $90,000
Richard = $90,000
Harry = $80,000
Nobody beats Tom, and he is the Salesman of the Month. Harry is fourth because there are three other people who sold more than he did. Again, the promotion rule applies if one of the boys should get hit by a beer truck. But what is the contest standing for Dick and Richard? The convention here is to allow gaps and say that they are both third and nobody is second.
These last two situations, class standing and contest placement, are not scales. Their values depend on a rule that needs the rest of the set, whereas a scale is external to the data set. If you try to put a standing or a placement in a database, then you have to recalculate it every time you change anything in the database. These last two situations are statistics and can be useful.
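The dependence on the rest of the set is easy to see in code. The following Python sketch computes both conventions described above from the sales figures; the function names class_standing() and contest_placement() are my own, and the tie rule in the second one follows the convention stated in this article (tied people share the lower position), which is not the same tie rule that SQL's RANK() uses.

```python
def class_standing(scores):
    """Dense ordering: ties share a position and no position is left empty."""
    distinct = sorted(set(scores.values()), reverse=True)
    return {name: distinct.index(value) + 1 for name, value in scores.items()}


def contest_placement(scores):
    """Gapped ordering: ties share the lowest position they occupy."""
    values = list(scores.values())
    return {
        name: sum(1 for v in values if v >= value)
        for name, value in scores.items()
    }


sales = {"Tom": 100_000, "Dick": 90_000, "Richard": 90_000, "Harry": 80_000}

print(class_standing(sales))      # Dick and Richard second, Harry third
print(contest_placement(sales))   # Dick and Richard third, nobody second

# Richard meets the beer truck: nobody's sales changed, but positions did.
survivors = {name: amount for name, amount in sales.items() if name != "Richard"}
print(class_standing(survivors))
print(contest_placement(survivors))
```

Dick's sales figure is a measurement and stays at $90,000 either way; his position is a statistic and has to be recomputed from whatever set he is in at the moment.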
Next time, more kinds of scales.