Stairway to Data

Stairway to Data, Level 2: Numerics

,

Numbers and Numerals

Most people do not make a distinction between a number and a numeral in everyday conversation. A number is the abstraction represented by the numeral, which is a written symbol. Roman numerals, Chinese numerals, Mayan numerals and hundreds of other systems have existed to represent numbers in a written form so that they could either be recorded or manipulated or both.

The Hindu-Arabic numerals and place valued notation with the zero has proven so useful that it has replaced all other systems today. However, we forget that the zero had a long hard fight for acceptance. One fifteenth-century Frenchman wrote, 'Just as the rag doll wanted to be an eagle, the donkey a lion and the monkey a queen, the cifra (cypher, zero) put on airs and pretended to be a digit' (THE MEASURE OF REALITY: QUANTIFICATION AND WESTERN SOCIETY, 1250-1600). Many Italian Renaissance accounting houses advertised that were using the well-known and understood Roman numerals instead of the new, fancy Arabic ones.

While this universality is good, we use Hindu-Arabic numerals for many different purposes and we are not always clear about it. The three simplest uses for the numerals in a database are for cardinal numbers, ordinal numbers and tag numbers.

Tag Numbers

A tag number is simply a name written with digits. The first advantage of using digits instead of alphas is that they are linguistically neutral and there are simple rules for generating an unlimited set of such values. Tag numbers are often stored in strings rather than in an internal binary format. Arithmetic with tag numbers makes no sense at all -- you cannot add bus route #123 to bus route #456 and get bus route #579. Why keep them in a computable format?

However, other information can be embedded into them. In the US even numbered Interstate Highways run East to West; odd-numbered Interstate Highways run North to South; multiples of 200 are perimeter highways around major cities.

Later in this staircase, we will discuss check digits and types of encoding schemes that use tag numbers.

Ordinal Numbers

An ordinal number represents a position (first, second, third, ... ) in an ordering. An ordinal number also implies a corresponding cardinal number (1, 2, 3, ...), but knowing the ordinal position does not tell you about the cardinality of the whole set; in English, knowing that you are the third person in line for a promotion tells you that you have at least two competitors. It does not tell you if there is a fourth, fifth, .. , n-th candidate who were hoping for the job.

This question of position leads to another debate; is there such a thing as the zero-th ordinal number? Computer people like to have a 0-th position because it is handy in implementing data structures with relative positioning. For example, many programming languages have arrays whose indexes start with element zero.

The idea of a zero-th ordinal number is a mild mathematical heresy. To be in the zero-th position in a queue is to have arrived in the serving line before the first person in the line. And ordinal numbers never go negative.

Obviously, ordinal numbers exist in a database when there is an ordering in the data. This usually means sequences of some kind, such as check numbers, serial numbers in a manufactured product and other things. But, arithmetic with ordinal numbers makes no sense.

But ordinal numbers are often stored internally in binary format because it makes them easier to increment and to validate. Consider consecutively numbered theater tickets. If I buy the block of tickets for the second to the eighth seat, I have (8-2+1) = 7 tickets, not (8-2) = 6 tickets.

Cardinal Numbers

The usual definition of a cardinal number is something which represents a quantity or magnitude. They are integers. Well, sort of. For a mathematician, here is where we get into countable infinite sets and other things. The numbers database people need to worry about are not so exotic.

The natural or "counting" numbers, are positive integers {1, 2, 3, ..}. Then the "whole" numbers are the natural numbers together with zero {0, 1, 2, 3, ..}. Finally, we have the "integers", which are zero, the natural numbers and the negatives of the naturals {.. , –3, –2, –1, 0, 1, 2, 3, ..}, which is what the INTEGER data type stores in digital computers.

The natural and whole numbers can be used in a database with a CHECK() constraint on an INTEGER column declaration. The constraints preserve the appropriate range, but they also help the optimizer. The shame is that most SQL programmers do not bother to use the constraints.

Hardware Representations

Internally, computers represent these numbers as binary in the hardware. But what kind of binary? The sign-and-magnitude method is probably the most natural approach. The most significant digit is the sign of the integer in binary. It uses 0 for a positive number and 1 for a negative number. The remaining bits are the absolute value. A consequence of this representation is that there are two ways to represent zero, +0 and -0. Sign-and-magnitude is not completely gone; it is used for the in floating point numbers, which we will get to later..

There was also a ones' complement system, which uses the bitwise NOT of the positive number. This also has two representations of zero. To add two integers in this system, first do a conventional binary addition and add any resulting carry back into the result.

A two's-complement machine represents negative numbers by the two's complement of the absolute value. This system is the most common method of representing signed integers in computers today. This means that if you have (n) bits, you get a range of -2(n-1) to +2(n-1)-1. The math in this system is simple to implement and zero has only a single representation.

There are also more exotic binary representations that might make a good Science Fair project, but will not see commercial products. My favorites are the “nega-binary” and “Fibonacci” systems. The place values in the nega-binary system increase by powers of two but the sign switched each time (i.e. 1, -2, 4, -8, 16, ..), thus we have 1 = (1), -1 =(1,-2), 2 =(-2,4), 3 =(1,-2,4), etc. The place values in the Fibonacci system increase by the Fibonacci sequence with a sign bit (i.e. 1, 1, 2, 3, 5), we have 1 =(1), 2 = (1,1) or (2), 3 = (1,2) or (3), etc.

Don't laugh at the Fibonacci system. The Incas used it in the form of a counting tray called a “yupana” whose compartments had Fibonacci values. Addition and other math operations are easy.

BCD (Binary Coded Decimal) was popular with COBOL programmers and IBM. It uses a nibble (four bits) to represent the digits 0 to 9,positive and negative signs. It is still popular inside electronics that have to do display – each nibble maps to one display character. Then there was also Packed Decimal, Fixed-point Packed Decimal, Zoned Decimal and other forms.

This is why SQL and modern high level languages avoid bit and byte level operations, or make assumptions about physical storage. There is no guarantee about physical storage.

Rational Numbers

Fractions are better called Rational Numbers. It means that they are expressed as two integers, a/b with the denominator b not equal to zero. Since b may be equal to 1, every integer is a rational number.

Numbers of the form "a b/c" are found as a option on y pocket calculators. The calculations are usually handled internally as a binary number, then are converted to and from the fractional notation for display. However, you can also write functions to do the basic math with the (a, b, c) triples.

Fractions are handy for input when the data is given that way, as for example, English measurements with fractional inches. But frankly, with the SI (metric) system in universal use, there is less and less need for this data type. You are better off using a decimal fraction shown with as y decimal places as you need.

The decimal or binary expansion of a rational number always either terminates after finitely y digits or begins to repeat the same finite sequence of digits over and over.

A real number that is not rational is called irrational. The decimal or binary expansion of an irrational number continues forever without repeating. That means you can approximate them for calculations. Irrational numbers include v2, p and e, which are pretty common in computations. In fact, some programming languages have symbols for them.

Exact Numerics

An exact numeric value has a precision, P and a scale, S, The precision is a positive integer that determines the number of significant digits in a particular radix (formerly called the base of a number system). The radix is usually binary in a computer and decimal in the real world.

The scale is a non-negative integer that tells you how y decimal places the number has. The SMALLINT, INTEGER and BIGINT data types have a scale of zero. The SQL data types NUMERIC, DECIMAL, SMALLINT, INTEGER and BIGINT are exact numeric types. For example, DECIMAL(8,2) could be used to hold the number 123456.78, which has eight significant digits and two decimal places.

The difference between NUMERIC and DECIMAL is subtle. NUMERIC specifies the exact precision and scale to be used. DECIMAL specifies the exact scale, but the precision is implementation-defined to be equal to or greater than the specified value. It is a subtle difference, but T-SQL handles them the same way; DECIMAL is therefore the preferred choice for T-SQL and portability.

Approximate Numeric Representations versus the Continuum

A point is defined in mathematics as an indivisible position in some space. A continuum is defined as being made up of parts which are always further divisible. If you look at a line in geometry (or a number in analysis, an event in time, etc.), we speak about geometric points on a geometric line (or a number on the number line, a duration in time, etc.) and we regard the line (numbers, time) as being a continuum. Look at the number line; clearly given an segment on the number line, such as (1,2), we can always further divide the segment into smaller segments, such as (1.1, 1.5) and repeat this process forever and the same thing applies to geometric lines and to time.

This leads to a paradox. If a line is an infinitely divisible continuum, how can an indivisible point be a part of a line? I will let the reader worry about this and tell you that we do not worry about it in real databases. Instead, we have learned to live with approximate numeric values and a certain amount of inherent error.

An approximate numeric value consists of a significand (formerly called the mantissa) and an exponent. The significand is a signed numeric value; the exponent is a signed integer that specifies the magnitude of the significand. An approximate numeric value has a precision. The precision is a positive integer that specifies the number of significant binary digits in the significand. The value of an approximate numeric value is the significand multiplied by the base (usually 10) raised to the exponent.

The IEEE-754 Standard is the de facto hardware implementation. The IEEE floating point standard also has certain bit configurations, called NaNs (Not a Number), to represent overflow, underflow, errors and missing values; these provide a way to implement NULLs as well as to capture errors.

The advantage of floating-point representation over exact numeric representations is that it can support a much wider range of values. The trade-off is precision. Floating point calculations do not follow some basic algebraic laws. For example, (a + (b+c)) ? ((a + b)+c) and ((a + b)×c) ? (a×c + b×c) for all values. Furthermore, you cannot represent 0.1 and 0.01 exactly in binary, so rounding to those decimal places is a problem.

Today, there is little real use for floating point numbers. DECIMAL(s,p) can have huge integer values for the scale and precision in SQL products, far beyond what you are likely to ever use.

Zero, NULL and Math

The NULL in SQL is not a number. The usual description of NULLs is that they represent currently unknown values that might be replaced later with real values when we know something. You have to specifically forbid NULLs in a column by declaring the column with a NOT NULL constraint. But in SQL you can use the CAST function to declare a specific data type for a NULL, such as CAST (NULL AS INTEGER). The reason for this convention is practical; it let you pass information about how to create a column to the database engine.

The basic rule for math with NULLs is that they propagate. An arithmetic operation with a NULL will return a NULL. That makes sense; if a NULL is a missing value, then you cannot determine the results of a calculation with it.

However, the expression (NULL / 0) is confusing. The first thought is that a division by zero should return an error; if NULL is a true missing value, there is no value to which it can resolve and make that expression valid. However, SQL propagates the NULL and do not even issue a warning about division by zero when it appears as a constant in an expression. A non-NULL value divided by zero will cause an error, however.

SQL specifies two functions, NULLIF() and the related COALESCE(), that can be used to replace expressions with NULL and vice versa. James Anderson at the University of Reading, England proposed a new symbol, “Nullity” for the result of a division by zero. He had a brief burst of publicity in December 2006 on the BBC. However, commentators quickly pointed out that his ideas are just a variation of the standard IEEE 754 concept of NaN (Not a Number).

Zero and Powers

What does 00 (zero to zero power) mean? The schools of thought are that it is equal to one, that this expression is undefined or that it is an "indeterminate form", meaning that in some cases it has one value and in other cases it has another.

The discussion goes back to Euler, who argued for 00 = 1 since x0 = 1 for x ? 0 and this convention would avoid making zero an exception. This is enforced by that fact that the limit of xx as x ? 0, is one. The function f(x,y) = xy cannot be assigned values for x and y that will make it continuous at (0,0), since the limit along the line x = 0 is 0 and the limit along the line y = 0 is 1.

Donald Knuth thought that we must define x0 = 1 for all x, if the binomial theorem is to be valid when x = 0, y = 0 and/or x = -y. However, you can argue that giving a value to a function with an essential discontinuity at a point, such as xy at (0,0) should not be done.

Standard SQL has a POWER() function defined as:

  1. If either the base or exponent is NULL, then the result is NULL. The propagation rule applies.
  2. If the base is zero and ..
    1. the exponent is negative, then raise an exception.
    2. The exponent is zero, then the result is one. We went with Knuth because he is smart and a computer scientist.
    3. The exponent is positive, then the result is zero.
  3. If the base is negative and ..
    1. the exponent is not equal to an exact numeric value with scale zero, then an exception condition is raised.
    2. the exponent is equal to an exact numeric value with scale zero that is an even number, then the result is EXP(exponent × LN(-base)).
    3. the exponent is equal to an exact numeric value with scale of zero that is an odd number, then the result is -EXP(exponent × LN(-base)).
  4. Otherwise, the result is EXP(exponent × LN(base)).

Exotic Numbers

That covers the basic data types which are used in most database products, but there are also other numbers which are not often used. However, many databases today have user defined data types and are extensible, so these numbers might be found in packages that can be added to an existing database engine.

Repeating Decimals

The mathematical notation for repeating decimals is to put a bar over the digits in the decimal fraction which form the repeating group. Unlike fractions, there is no way to convert them into floating point or fixed decimal numbers without some loss.

Instead, symbolic arithmetic packages have to manipulate them as symbols. The rules are not the same as those for fractions kept as an internal data type. This is one of many places where infinite and finite things behave differently. Consider this statement:

1.0 = 0.99...

proof:

let:  x = 0.99...

therefore:  (10 * x) = 9.99...

subtract x from both sides:

(10 * x) - x = 9.99... - 0.99...

Therefore:

9 * x = 9

x = 1

There is no floating point rounding error in this calculation. But you can get a rational number that you cannot use in a computation.

Complex Numbers

Complex numbers are used only in mathematical, scientific and engineering calculations. Lots of modern programming languages have native complex numbers or an add-on package. Database products also have add-on packges for complex numbers. However, you have to decided if you want to use the Cartesian or polar notation or both. Standard Pascal allows both, but most languages use only Cartesian.

Coordinates

Coordinates are an important data type for Geographical Information Systems (GIS) or geobases. Within this general data type, you will find Cartesian coordinate, polar coordinates, longitude and latitude, Hierarchical Triangular Mesh (HTM)and other spatial systems for three dimensions. Since Geobases and spatial data is a separate topic in itself, with emerging standards, I will not discuss them in this book.

Extra Reading

The New York Times ran a series of 15 articles in their Opinionator section in 2010. They were writtn by Steven Strogatz, a professor of applied mathematics at Cornell University. The articles are short and start with basic counting and go up to calculus.

http://topics.nytimes.com/top/opinion/series/steven_strogatz_on_the_elements_of_math/index.html

This article is part of the parent stairway Stairway to Data

Rate

You rated this post out of 5. Change rating

Share

Share

Rate

You rated this post out of 5. Change rating