SQLServerCentral Article

Mathematicians and SQL

,

Georg Ferdinand Ludwig Philipp Cantor is considered the creator of set theory, and his theories are the basis for the naïve set theory you learned in school. But there are lots of other mathematicians you should know, such as Hilbert, Frege, Russell, Zermelo and Dedekind. They made a lot of contributions, too.

Hilbert Hilbert is best known for asking the first questions about infinities – all the mathematicians want to stay at “Hilbert’s hotel” and you can look up some YouTube video animations that explain his work. Dedekind is responsible for “Dedekind cuts” which give a good consistent model of how limits work. Peano is best known for “Peano’s Axioms”, a set of axioms for deriving integer arithmetic. He was also an early advocate of constructed languages (Latino sine flexione - "Latin without inflections").

Two of them, Dedekind and Peano, required their sets to be nonempty, while Cantor considered the empty set to be a valid set. If it’s any consolation, it took a long time for people to accept the idea that zero is a number.

There are advantages to excluding the empty set and only accepting sets if they have at least one element. The advantages are similar to excluding zero from numbers and only accepting positive numbers. Remember that Roman numerals built an entire civilization without a zero. For example, when Dedekind constructed real numbers from rational numbers, he used what we now call Dedekind cuts of rational numbers. A Dedekind cut of the set of rational numbers consists of partitioning the set of rational numbers into two nonempty subsets so that all the elements in one of those subsets are less than all the elements of the other subset. If you don’t consider the empty set to be a set, you can leave out the word “nonempty” in the definition of Dedekind cuts. So, for that and other reasons, it can be convenient to require sets to be non-empty.

But there are advantages to including the empty set in your system. The biggest advantage is that intersections of sets always exist. If you exclude the empty set as a set, then disjoint sets don’t have intersections, and that means every time you use intersections of sets, you have to consider two cases: the case where the sets are disjoint and don’t have an intersection, and the case where the sets are not disjoint and do have an intersection. Cantor had another reason as well to include empty sets. They worked out well in his theories of ordinal and cardinal numbers. Always having an intersection is enough of an advantage. Cantor’s position won out. Hilbert, Frege, Russell, and Zermelo all accepted the empty set and we added a special symbol for it: Ø.

SQL Tables

In RDBMS and SQL, we say that sets are the basis for tables. This is actually not true. First of all, no matter how big your big data is, it will never be infinite, but we love infinite sets. Next, SQL really uses multi-sets or bags, which allow duplicate elements. A key is defined as a subset of the columns that uniquely identify each entity in a table. If you want to assure that the rows of an SQL table are indeed unique, then we have to add explicit constraints, like PRIMARY KEY. The ANSI X3H2 Standards committee considered making such constraints a requirement, but decided not to do so. The early SQL engines were built on top of existing file systems and we wanted to be able to move legacy data over to SQL directly. This is the reason that new SQL Server programmers use IDENTITY as a universal, generic, all-purpose Kabbalah key. It’s a quick, easy way to fake having a key. Essentially, such programmers are using a proprietary feature to mimic pointer chains.

The abstract model is that table is an unordered set of rows. Let’s dig down another layer of abstraction; a row is defined as being an unordered set of columns. Each of the columns is supposed to be atomic and scalar. Other programming languages use a physical position within a storage unit some kind to locate data. In SQL, we can reference data by column names within a table. Well, we have some short hands in the form of the * and an assumption of ordering in the columns, but these are just conventions and short hands for the programming language implementation.

Let’s take a little time to compare a table to a file, a row to a record and a column to a field. The strongest difference is that files and records in traditional file-based systems assume ordering. SQL and RDBMS reject ordering. A procedural language statement like “READ a, b, c;” is different from “READ b, c, a;” because of ordering of fields in records, but “SELECT a, b, c FROM Foobar;” is not different from “SELECT b, c, a FROM Foobar;” or any other permutation of the column list in the SELECT clause. The only thing we don’t want to see is a column name duplicated in the list.

A column has constraints, a strong data type and its value can be referenced by columns in the same or other tables in the schema. Fields have to get any constraints, typing from the program using their data. Files are pretty much independent of each other, while tables are parts of a common schema.

In a simple file, the fields are determined by the program reading the file. Any constraints on the data have to be put into the programs. The records are read from left to right, sometimes with the ability to navigate records, back and forth, within the file. The fields are also read from left to right, as opposed to the “All at once” set-oriented approach we use in SQL and RDBMS.

But what happens when the file is empty? Most simple file systems have an EOF (end of file) flag that signals the program when it has run out of data in the sequence of records. This is a source of another problem with file systems. Is this flag raised when you read the last record in the sequence or when you try to read past the last record in the sequence? Since a set has no first, last, or next concept in its structure, there is no such problem with tables.

But let’s go back to declaring a table. I can name my columns and I can put constraints on those columns. For example:

CREATE TABLE Empty_Table
(foo_key CHAR(10) NOT NULL PRIMARY KEY,
 floob INTEGER NOT NULL,
 flib INTEGER NOT NULL CHECK (flib > 0),
 CONSTRAINT Impossible CHECK (flib + 1 < flib));

I decided to prohibit putting any data in this table. Will the impossible constraint cause an error? Officially, according to the standards, this is just fine as long as the table is empty. The rule is that all constraints are assumed to be TRUE when the table is empty. This goes back to an old logical principle called Existential Import.

You probably remember Lewis Carroll (Charles Lutwidge Dodgson) as the author of “Alice in Wonderland”, but he was also one of the pioneers of symbolic logic. In his day logic was based on classic Aristotelian syllogisms (“all x are y”, “no x are y”, and “some x are y”). The question in their day, much like our empty set in the current discussion, was whether or not a proposition like “all men are mortal” should be taken to mean both “no men are not mortal” and “some men are mortal” – one or more men exist with property of mortality. Existential import lost because logic shifted to a Boolean model in which the universal qualifier, now written as ?(x): f(x), means that unspecified stream of conjunction (AND) operators between each term --- f(x1)^ ..^ f(xn). This works just fine with an empty set. In English, this means that you can say “all unicorns are pink.” and it’s true!

Another way to construct a table is to use a table constructor:

VALUES (‘alpha’, 1), (‘beta’, 2) (‘gamma’, 3) [AS foobar [(foo_name, foo_count)]]

I have a list of three rows, each of which includes two columns. But until I add names to the table and those two columns, we are not quite sure what to do. SQL does not like to use positional numbers; that’s like the indexes in an array declaration in a procedural language. Implementations will sometimes create system generated names for the columns and the tables, but these names are usually pretty horrible because you need to do something to guarantee uniqueness. But how do I create an empty table with this syntax?

As an aside, Chris Date has defined two tables, Dee and Dum, in his Tutorial D language. His Dum table is empty and contains no rows at all; Dee has a single row, But this row has no values in its column (This makes keys a bit difficult to define). He uses them to model TRUE and FALSE, respectively. It might be worth remarking that tutorial D has no concept of NULL.

Department of Redundancy Department

In mathematics, you can not have two copies of the same set. There’s only one set of integers, one set of reals, etc. However, in SQL you can create two or more tables with exactly the same structure. David McGovern Has written essays and articles on why this is a huge design flaw, and the problems it creates. In practice, you might want to do this for storage reasons, but it usually the result of a design flaw called attribute splitting. This design error has many forms, but at the table level It usually takes the form of converting an attribute value into an attribute.

This is easier to explain with an example. Let’s create a general table for personnel in our schema, but we have two tables the same structure. One is for male employees and one is for female employees.

CREATE TABLE Male_Personnel (…);
CREATE TABLE Female_Personnel (…);
CREATE VIEW Personnel (…)
AS
SELECT ‘1’ AS sex_code FROM Male_Personnel
UNION
SELECT ‘2’ AS sex_code FROM Female_Personnel;

This is pretty clearly absurd (and probably illegal) because we’ve split out the sex code values into two tables. One problem with split tables is that you have to split all of the values. The ISO six codes consist of (0= unknown, 1= male, 2= female, 9= lawful person (corporations, organizations, Etc.). A second problem is if I want to do anything with personnel,  I’m going to be writing a VIEW that UNIONs all of the tables that resulted from the splitting. This can be a lot of overhead.  A third problem is overlap. In ANSI/ISO Standard SQL, I could avoid this by having a CREATE ASSERTION statement in my schema. Performance will probably be awful.

Before you think that nobody would make this mistake, consider how many people implement a database by copying the old paper and magnetic tape filesystems they previously had. In particular, consider splitting on a date, with one table for each month in the year. This mimics file folders you used with paper invoices, so your mindsets locked into it. They have started thinking of a month is a thing in itself, a set. However, the truth is that it is a unit of measurement on a scale called the calendar.

Conclusion

SQL and the mathematics of relational algebra don’t map one-to-one, any more than algebra and FORTRAN did. Likewise, graph databases are not the same as mathematical graph theory. But if you don’t have a foundation or an appreciation of the various flavors of the underlying mathematical foundations upon which your computer tools are based, you will not ever have a good feel for how to properly use them.

Rate

4.33 (3)

You rated this post out of 5. Change rating

Share

Share

Rate

4.33 (3)

You rated this post out of 5. Change rating