Click here to monitor SSC
SQLServerCentral is supported by Red Gate Software Ltd.
 
Log in  ::  Register  ::  Not logged in
 
 
 

Stairway to Data, Level 6: Types of Scales - Part II

By Joe Celko,

The Series

This article is part of the Stairway Series: Stairway to Data

IT projects can hit problems that turn out to be due to an insufficient understanding of the basic data and data-types, rather than the database design. It is a sorely neglected topic that might seem to be trivial, but certainly isn't. The DBA, with a broad perspective on corporate data can do a great deal to help application developers to avoid the common mistakes that so often happen, and Joe Celko's Stairway gives the busy IT professional a crash course to understanding the nature of the data being processed.

In the previous level in this stairway, we discussed Nominal, Categorical, Absolute, Ordinal and Rank scales. These were (in order) the weakest scales we can use. Now, it is time to get into the stronger scales that we data people like! The scales that have math! Fancy names! Symbols for their units!

Again, go back to the first levels of this service and refresh yourself on some terminology. In particular, the idea of a “metric function” has nothing to do with the “Metric System”; it means that certain computations are possible.

Interval Scales

“Time is what keeps everything from happening all at once.” – George Carlin

Interval scales have a unit or interval which is used do the measurement, in terms of multiples and fractions of that unit. Interval scales have a metric function, ordering, and meaningful operations among the units, but no natural origin. Calendars are the best example; some arbitrary historical event is the starting point for the scale and all measurements are related to it using days and aggregations of days from that origin point. Likewise, time extends from a past eternity to a future eternity. Real and integer numbers are another example of an interval scale that goes from a negative infinity to a positive infinity. This is shown in the IEEE floating point standard where there is both a positive and negative infinity. But let's use dates as our example in this discussion.

The metric function is the number of days between two dates. Looking at the three properties:

(1) M(a, a) = 0: there are zero days between today and today.
(2) M(a, b) = M(b, a): there are just as many days from today to next Monday as there are from next Monday back to today; notice we did not care about the sign of the direction of temporal travel; think absolute value.
(3) M(a, b) Å M(b, c) ? M(a, c): the number of days from today to next Monday plus the number of days from next Monday to Christmas is the same as the number of days from today until Christmas.

Ordering is very natural and strong: 2011-09-01 occurs before 2011-09-02. Time travel backwards is impossible (but it makes for good SF stories). Time travel forward is constant at the rate of one second per second.

Aggregations of the basic unit (days) into other units (weeks, months, and years) are also arbitrary. Romans and North Africans used a 10-day “week” while the “Children of Abraham” use a Lunar cycle of seven days. For a good discussion of the history of these aggregated units, see The Seven Day Circle (Zerubavel, 1985, ISBN 978-0226981659).

Log-interval

Please do not think that the only metric function is simple math; there are log-interval scales, too. The measurements are assigned numbers such that ratios between the numbers reflect ratios of the attribute. You then use formulas of the form (c * md), where c and d are constants, to do transforms and operations. For example, density = (mass/volume), fuel efficiency expressed in Miles per Gallon (mpg), decibel scale for sound and the Richter scale for earthquakes are exponential, so their functions involve logarithms and exponents.

Another example is temperature. A human body has an average temperature of 37°C, but having three people hold a pot will not boil water. Body mass is part of the metric function.

Ratio Scales

Finally! Ratio scales are what people really think of when they think about a measurement. Ratio scales are like interval scales with an origin (usually zero units). The operations that can be expressed in arithmetic. They are called ratio scales because all measurements are expressed as multiples or fractions (ratios) of a certain unit or interval. And other units can be computed from them, like derived units like kilometers per hour.

Length, mass, and volume are examples of this type of scale. The unit is what is arbitrary. The weight of a bag of sand is still weight whether it is measured in kilograms or in pounds. Another nice property is that the units are identical; a kilogram is still a kilogram whether it is measuring feathers or bricks.

But things can be tricky. When the Centigrade and Fahrenheit scales were set up, they were defined as interval scales. We believed in infinite temperatures. We then discovered absolute zero, Centigrade became Celsius, and we got the Kelvin as the unit of temperature. Now temperature is a ratio scale!

In practice, you use other scales for temperature. Your refrigerator might have the numbers 1 to 10 on a dial. Your room air conditioner might have a ranking scales expressed as (cold, cool, warm, hot) on push buttons. The Celsius scale is good for people in their daily business. When the UK went metric there were posters telling people that zero is freeze; 10's are cold; 20's are comfortable; 30's are hot; 40's are deadly.

Scale Conversion

An attribute might not fit exactly into any these scales. For example, you mix nominal and ordinal information in a single scale, such as in questionnaires that have several non-response categories. It is common to have scales that mix ordinal and interval scales by assuming the attribute is really a smooth monotone function. Subjective rating scales like the Likert scale ('strongly agree', 'agree', 'no opinion', 'disagree', 'strongly disagree') have no "equally spaced intervals" between the ratings, nor any objective way to find them if they do exist. A binary variable is at least an interval scale and it might a ratio or absolute scale, if it means that the attribute exists or does not exist.

The important principle of measurement theory is that you can convert from one scale to another only if they are of the same type of scale and measure the same attribute.

Absolute scales do not convert, which is why they are called absolute scales. Five apples are five apples, no matter how many times you count them or how you arrange them on the table. The most you can do is use diffrent aggregations. My favorite example is measuring beer cans by the eight-pack instead of the six-pack.

Nominal scales are converted to other nominal scales by a mapping between the scales. That means you look things up in a table. For example, I can convert my English city names to Polish city names with a dictionary. The problem comes when there is not a one-to-one mapping between the two nominal scales. For example, English uses the word "cousin" to identify the offspring of your parents' siblings, and tradition treats them all pretty much alike.

Chinese culture has separate words for the same relations based on the genders of your parents' siblings and the age relationships among them (e.g., the oldest son of your father's oldest brother is a particular type of cousin and you have different social obligations to him). Something is lost in translation.

Ordinal scales are converted to ordinal scales by a monotonic function. That means you preserve the ordering when you convert. Looking at the MSH scale for geologists, I can pick another set of minerals, plastics, or metals to scratch, but rock samples that were definitely softer than others will still be softer.
Again, there are problems when there is not a one-to-one mapping between the two scales. My new scale may be able to tell the difference between rocks where the MSH scale could not. Eventually, I will want to use the Rockwell Hardness scale which has seven specific “sub-scales” for industrial use.

Rank scales are converted to rank scales by a monotonic function that preserves the ordering, like ordinal scales. Again, there are problems when there is not a one-to-one mapping between the two scales. For example, in the US Army, a Warrant Officer has office privileges but British Warrant Officers do not. Even different military branches in the same country have slightly different ranks that don't quite correspond to each other.

In both the nominal and the ordinal scales, the problem was that things that looked equal on one scale were different on another. This has to do with range and granularity, which was discussed in stair #1 of these series.

Interval scales are converted to interval scales by a linear function; that is, a function of the form y = a*x + b. This preserves the ordering, but shifts the origin point when you convert. For example, I can convert temperature from degrees Celsius to degrees Fahrenheit using the formula °F = (9/5 * °C) + 32.

Ratio scales are converted to ratio scales by a constant multiplier, since both scales have the same ordering and origin point. For example, I can convert from pounds to kilograms using the formula p = 0.4536 * kg. This is why people like to use ratio scales.

Derived Units

Many of the scales that we use are not primary units, but derived units. These are measures that are constructed from primary units, such as miles per hour (time and distance), or square miles (distance and distance). You can use only ratio and interval scales to construct derived units.

If you use an absolute scale with a ratio or interval scale, you are dealing with statistics, not measurements. For example, using weight (ratio scale) and the number of people in New York (absolute scale), we can compute the average weight of a New Yorker, which is a statistic that deals with a set, not a unit of measurement which deals with an entity.

The SI measurements use a basic set of seven units (meter for length, kilogram for mass, second for time, ampere for electrical current, Kelvin for temperature, mole for molecules, and candela for light) and construct derived units. ISO standard 2955 ("Information Processing -- Representation of SI and other units for use in systems with limited character sets") has a notation for expressing SI units in ASCII character strings. The notation uses parentheses, spaces, multiplication (shown by a period), division (shown by a solidus, or slash) and exponents (shown by numerals immediately after the unit abbreviation). There are also names for most of the standard derived units. For example, "100 kg.m/s2" converts to 10 newtons (the unit of force), written as "10 N" instead.

Punctuation and Standard Units

Database stores measurements as numeric data represented in a binary format, but when the data is input or output, a human being wants readable characters and punctuation. Punctuation serves to identify the units being used, and can be used for prefix, postfix, or infix symbols. It can also be implicit or explicit.

If I write "$25.15", you know that the unit of measure is the dollar because of the explicit prefix dollar sign. If I write "160 lbs." you know that the unit of measure is pounds because of the explicit postfix abbreviation for the unit. If I write "2011 MAR 12" you know that this is a date because of the implicit infix separation among month, day, and year, achieved by changing from numerals to letters, and the optional spaces. The ISO and SQL defaults represent the same date, using explicit infix punctuation, with 2011-03-12 instead. Likewise, a column header on a report that gives the units used is explicit punctuation.

Databases do not generally store punctuation. Punctuation wastes storage space and the units can be represented in some internal format that can be used in calculations. Punctuation is only for display.

The DBA has to be sure that all data in a column of a table is expressed in the same units before it is stored. There are some horror stories about multinational companies sending the same input programs used in the United States to their European offices, where SI and English measurements were mixed into the same database without conversion. Likewise, we want to be sure that data is kept in the same units in all the tables in the database. If different units are needed, they can be provided in a VIEW that hides the conversions (thus the office in the United States sees English measurements and the European offices see SI units and date formats; neither is aware of the conversions being done for it).

It is possible to put the unit symbols in a column next to a numeric column that holds their quantities. This column is generally named “uom” for “unit of Measure” and it is needed for special situations. The most common one is for currency conversions that change daily.

General Guidelines for Using Scales in a Database

The following are general guidelines for using measurements and scales in a database and not firm, hard rules. You will find exceptions to all of them.

  1. In general, the more unrestricted the permissible transformations on a scale are, the more restricted the statistics. Almost all statistics are applicable to measurements made on ratio scales, but only a very limited group of statistics may be applied to measurements made on nominal scales.
  2. Use CHECK() clauses on table declarations to make sure that only the allowed values appear in the database. Nominal scales would have a list of possible values or a foreign key references to a lookup table. Other scales would have range-checking. Likewise, use the DEFAULT() clauses to be sure that each scale starts with its origin value, a NULL, or a default value that makes sense.
  3. Declare at least one more decimal place than you think you will need for units that can be expressed as fractions. Rounding and truncation will improve with more decimal places. The downside of SQL is that precision and the rules for truncation and rounding are implementation-dependent, so a query with calculations might not give the same results on another product.
  4. Try to store primary units rather than derived units. This is not always possible, since you might not be able to measure anything but the derived unit. Look at your new tire gauge; it is set for Pascals (Newtons per square meter) and will not tell you how many square meters you have on the surface of the tire or the force exerted by the air. And you simply cannot figure these things out from the Pascals given. A set of primary units can be computed in many different ways to construct any possible derived unit desired. Storing all those derived units in the same table would be redundant. It also opens the door to possible errors when a primary-unit column is changed and the derived units based on it are not updated. Also, most computers can recalculate the derived units in a computed column or a VIEW much faster than they can read a value from a disk drive.
  5. Use the same punctuation whenever a unit is displayed. For example, do not mix date formats, or express weight in pounds and kilograms on the same report. Ideally, everything should be displayed the same way in the entire application system.

You are thinking that not all things are expressible in SI units. You need to look at other ISO, ANSI or other industry Standards for your situation. Or sometimes you have to create your own data encodings. 

The last two levels in this staircase will deal with the kinds of encoding schemes, how to use them and how to design them. Yes, you have to design data! But we will get to that next time.

This article is part of the Stairway to Data Stairway

Sign up to our RSS feed and get notified as soon as we publish a new level in the Stairway! Rss

Total article views: 3814 | Views in the last 30 days: 20
 
Related Articles
FORUM

Buffer Cache Hit Ratio

Buffer Cache Hit Ratio

FORUM

Order by for Ratio Dataset

Order by for Ratio

ARTICLE

Scaling Out

Kumar discusses the differences between scale up and scale out, then does a very good walk through o...

FORUM

Time intervals

Creating time intervals on overlapping events

FORUM

sql time intervals problem

sql time intervals problem

Tags
data    
stairway series    
 
Contribute

Join the most active online SQL Server Community

SQL knowledge, delivered daily, free:

Email address:  

You make SSC a better place

As a member of SQLServerCentral, you get free access to loads of fresh content: thousands of articles and SQL scripts, a library of free eBooks, a weekly database news roundup, a great Q & A platform… And it’s our huge, buzzing community of SQL Server Professionals that makes it such a success.

Join us!

Steve Jones
Editor, SQLServerCentral.com

Already a member? Jump in:

Email address:   Password:   Remember me: Forgotten your password?
Steve Jones