Degree of Duplication

Technology is constantly moving forward, but it is also helpful to understand how we arrived where we are today. Joe Celko reminisces about the history of database design and how it relates to the concept of ‘Degree of Duplication’ in this article.

The relational model was strongly based on sets. It wasn’t enough that a table had a key (originally a primary key, but later we realized all keys are keys). As Jim Gray put it in the early days of RDBMS, “We had no idea what we were doing and just sort of made it up as we went along.” We were building our first relational databases on top of existing filesystems, and those filesystems were built on top of punch cards and magnetic tapes. Sequential tape files don’t work very well unless they are kept in sorted order for access, so what had been the sort key in the magnetic tape systems became the primary key in our new RDBMS model. A bit later, however, Dr. Codd realized that a key is a key, regardless of whether we decide to make it special.

But it went a little further than this. One or two of the early SQL products automatically eliminated duplicate rows from result sets. This follows Dr. Codd’s original definition of what a relation (table) should be. It’s still a good idea for the result sets coming back from a query to be properly defined tables, but SQL doesn’t require this. In fact, one of our big debates in the early days of ANSI X3H2 was whether or not to require a PRIMARY KEY in table declarations, or to force the results of SQL statements to remove duplicate rows.

Largely because the early SQL products were built on top of filesystems, we decided not to require a key in a table declaration (knowing full well that this meant you could declare what were basically tape files or decks of punch cards in our new language). The idea was to allow people to directly move from their old systems into SQL. If you go back to the old IBM tape systems and compare the basic commands on the original tape drives to the basic SQL cursor statements, they match one for one.

It was in this environment that Dr. Codd introduced the concept of ‘degree of duplication’ to handle repeated data. This meant that SQL moved from a language based on sets to a language based on multisets (also called ‘bags’), and we also picked up three-valued logic and NULLs. NULLs are another topic in themselves.

I’m going to assume that by now everybody has filled out a paper form to order something. Order forms have all looked the same for centuries. There’s a header that tells us who is ordering the merchandise, followed by several lines of order details. Those lines consist of a quantity, the item’s identifier (this can be a SKU, but in the days of paper forms it was generally a name), the unit price, and an extension (order quantity times unit price).

The trouble with paper forms is, well, they are paper. Unlike a relational database, a paper form has no relationship to anything else in the system. It doesn’t check for typos, current prices, or guarantee that we’ve done the correct extension.

It also doesn’t check to see if we’ve ordered the same item on two different lines of the form. Ideally, if two lines order the same item, they should be consolidated into one line with the combined quantity from the original lines. The number of lines duplicated for a particular item is part of what Dr. Codd called the degree of duplication in the data.
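That consolidation rule is exactly what GROUP BY does in SQL. Here is a minimal sketch, run through Python’s sqlite3 module; the OrderDetails table, its columns, and the data are made up for illustration:

```python
import sqlite3

# Hypothetical order-details table: one row per order line.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE OrderDetails (sku TEXT NOT NULL, order_qty INTEGER NOT NULL)")
con.executemany(
    "INSERT INTO OrderDetails VALUES (?, ?)",
    [("cat_food", 3), ("tuna", 1), ("cat_food", 2)],  # cat_food appears on two lines
)

# Consolidate: one row per item, duplicate lines folded into a combined quantity.
rows = con.execute("""
    SELECT sku, SUM(order_qty) AS total_qty
      FROM OrderDetails
     GROUP BY sku
     ORDER BY sku
""").fetchall()

print(rows)  # [('cat_food', 5), ('tuna', 1)]
```

The degree of duplication for an item is simply COUNT(*) within its group; the query above removes it by aggregating.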

A properly formed set or multiset is a collection of the same kind of things. The term ‘same kind of thing’ is a bit vague, but it is important for a relational database. In a relational database, the rows of a table must be instances of the same entity; that is, a Personnel table is made up of rows that represent individual employees, not a mix of employees, sardines and automobiles.

Sets and RDBMS

The reason you want to work with a properly formed set is that an operation on a generic element makes sense when applied to all the elements of the set. For example, you can eat a sardine, but you cannot eat an automobile, and you should not eat an employee.

However, a grouped table built from the Personnel table, say by grouping on departments, is not the same kind of thing. In the grouped table, the rows are aggregates and not individuals. Departmental data is a different level of abstraction and cannot be mixed with individual data. Degree of duplication is something like this.

The basic set operations are:

  1. Membership: This operation says how elements are related to a set. An element either is or is not a member of a particular set. The symbol is a stylized letter epsilon, written as a ∈ A in mathematical notation.

  2. Containment: One set A contains another set B if all the elements of B are also elements of A; B is called a subset of A. This includes the case where A and B are the same set, but if there are elements of A which are not in B, then the relationship is called proper containment. The symbol is a stylized letter U turned on its side; if you need to show ‘contains or equal to’, a horizontal bar can be placed under the symbol. The notation is B ⊂ A and B ⊆ A.

    It is important to note that the empty set is not a proper subset of every set. If A is a subset of B, the containment is proper if and only if there exists an element b in B such that b is not in A. Since every set contains itself, the empty set is a subset of the empty set but this is not proper containment, so the empty set is not a proper subset of every set.

  3. Union: The union of two sets is a single new set which contains all the elements in both sets. The symbol is a stylized letter U and the formal mathematical definition is:

    A ∪ B = {x : x ∈ A ∨ x ∈ B}

    To translate this into simple English: for every x that is a member of A OR a member of B, x is a member of A UNION B.

  4. Intersection: The intersection of two sets is a single new set which contains all the elements common to both sets. The symbol is an inverted stylized letter U and the formal mathematical definition is:

    A ∩ B = {x : x ∈ A ∧ x ∈ B}

    To translate this into simple English: for every x that is a member of A AND a member of B, x is a member of A INTERSECT B.

    This gets a little tricky with multisets. What if set A is bigger than set B? The definition of set equality consists of being able to match one for one between two sets, much like a caveman matching his arrowheads to his seashells, one for one.

  5. Difference: The difference of two sets A and B is a single new set which contains the elements of A which are not in B. The symbol is a minus sign, and the formal mathematical definition is:

    A − B = {x : x ∈ A ∧ x ∉ B}

    To translate this into simple English: for every x that is a member of A AND not a member of B, x is a member of A MINUS B.

  6. Cardinality: This is the number of elements in a set or multiset. A point that is hard to make is that finding cardinality is not the same as counting. Cantor originally invented set theory to handle infinite sets which would obviously be impossible to enumerate. I will use ‘card(A)’, but you will see #A and other notations. For database work, let’s assume it returns an integer greater than or equal to zero. The empty set is the only set which has a cardinality of zero.
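These six operations map directly onto Python’s built-in set type, which makes for a quick sanity check. The sets here are arbitrary examples, not tied to any database:

```python
A = {1, 2, 3, 4}
B = {3, 4, 5}

# Membership: x ∈ A
assert 3 in A and 5 not in A

# Containment: B is not a subset of A here, but {3, 4} is a proper subset of A
assert not B <= A
assert {3, 4} < A

# The empty set is a subset of every set, including itself,
# but it is not a *proper* subset of itself
assert set() <= set() and not set() < set()

# Union, intersection, difference
assert A | B == {1, 2, 3, 4, 5}
assert A & B == {3, 4}
assert A - B == {1, 2}

# Cardinality: card(A); only the empty set has cardinality zero
assert len(A) == 4 and len(set()) == 0
```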

Multisets

A multiset (also called a ‘bag’) is a collection of elements of the same type that allows duplicate elements. There is no ordering of the elements in a multiset, and we still have the empty multiset. Multisets are the basis for SQL, while sets are the basis for Dr. Codd’s relational model.

The basic multiset operations are derived from set operations, but have extensions to handle duplicates.

  1. Membership: An element either is or is not a member of a particular multiset. The symbol is the stylized letter epsilon. In addition to a value, an element also has a degree of duplication which tells you the number of times it appears in the multiset.

    Everyone agrees that the Degree of Duplication of an element can be greater than zero. However, there is some debate as to whether the Degree of Duplication can be zero, to show that an element is not a member of a multiset. Nobody has proposed using a negative Degree of Duplication, but I do not know of any reason not to do so, other than the fact that it does not make any intuitive sense.

    For the rest of this discussion, let me introduce a notation for finding the degree of duplication of an element in a multiset:

    dod(x, A) = n

    where the integer value n is zero or greater. I’m not going to allow for infinite multisets, although that should in theory be possible. It’s just not useful for real databases.

  2. Reduction: This operation removes redundant duplicates from the multiset and converts it into a set. In SQL, this is the effect of using SELECT DISTINCT.

    For the rest of this discussion, let me also introduce a notation for the reduction of a multiset:

    red(A) = {x : dod(x, A) > 0}

  3. Containment: One multiset A contains another multiset B if:

    dod(x, B) ≤ dod(x, A) for every element x of B

    This definition includes the case where A and B are the same multiset, but if there are elements of A which are not in B, then the relationship is called proper containment. The easiest way to think of this is that there is a mapping of each value in multiset B onto an element of the same value in multiset A. We just happen to take the count using the dod() function, rather than set up an explicit correspondence.

  4. Union: The union of two multisets is a single new multiset which contains all the elements in both multisets. A more formal definition is:

    dod(x, A ∪ B) = dod(x, A) + dod(x, B)

    The degree of duplication in the union is the sum of the degrees of duplication from both multisets. Again, the simplest way to think of this is that you just ‘dump everything into the new multiset’, in the same way that you create the union of two sets. At this point, you are probably wising up to the fact that we are talking about UNION and UNION ALL in SQL; we are just using a bunch of mathematical symbolism.

  5. Intersection: The intersection of two multisets is a single new multiset which contains the elements common to both multisets, such that:

    dod(x, A ∩ B) = min(dod(x, A), dod(x, B))

    The degree of duplication in the intersection is based on the idea that you match up pairs of elements from the two multisets. Notice that this means the smaller multiset will be paired up completely, leaving elements from the larger multiset unpaired.

    ANSI/ISO Standard SQL has an INTERSECT [ALL] option, which pairs values in one multiset with values in the other and discards any elements of the larger multiset that did not match up.

  6. Difference: The difference of two multisets A and B is a single new multiset which contains the elements from A which are not in B after pairs are matched from the two multisets. More formally:

    dod(x, A − B) = max(dod(x, A) − dod(x, B), 0)

    Again, ANSI/ISO Standard SQL makes pairs between the two multisets involved and removes those pairs; this is the EXCEPT [ALL] operator.

  7. Cardinality: Again, it is the number of elements in a multiset and it is not the same as identifying and counting individual elements. The late David Beech wrote a letter about sets versus multisets for use in relational databases which became the famous ‘cat food’ problem (see “Toil and Trouble” by Chris Date, Relational Database Programming & Design, 1994 January and “New Life for SQL” by David Beech, Datamation 1989 February). The situation is that I have a collection of identical objects, cans of cat food that sell for the same price. Should I have only one occurrence of the cat food with a total price or is it all right to have multiple occurrences?
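Python’s collections.Counter behaves like a multiset, with dod(x, A) as an element’s count, so the definitions above can be checked directly. The multisets here are arbitrary examples; conveniently, Counter’s +, &, and - operators implement exactly the sum, min(), and bounded-subtraction rules just described:

```python
from collections import Counter

A = Counter({"cat_food": 3, "tuna": 2})
B = Counter({"cat_food": 1, "salmon": 4})

# Degree of duplication: dod(x, A) is the element's count;
# a missing element reports zero
assert A["cat_food"] == 3 and A["salmon"] == 0

# Reduction: drop the duplicates to get back a set (SELECT DISTINCT)
assert set(A) == {"cat_food", "tuna"}

# Containment: B ⊆ A iff dod(x, B) <= dod(x, A) for every x in B
assert not all(B[x] <= A[x] for x in B)   # B has salmon, A does not

# Union (UNION ALL): degrees of duplication are summed
assert A + B == Counter({"cat_food": 4, "tuna": 2, "salmon": 4})

# Intersection (INTERSECT ALL): pair up, keep min(dod(x, A), dod(x, B))
assert A & B == Counter({"cat_food": 1})

# Difference (EXCEPT ALL): remove matched pairs, never going below zero
assert A - B == Counter({"cat_food": 2, "tuna": 2})

# Cardinality counts the duplicates too
assert sum(A.values()) == 5
```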

Obviously, I can get the total price if I know the total quantity and the unit price. There’s a problem here; it’s called the Law of Identity and it is the foundation of Western logic. Thanks to Ayn Rand, people falsely believe that Aristotle created The Law of Identity, but it actually predates him. Simply stated, it says “to be is to be something in particular; to be nothing in particular or anything in general, is to be nothing at all.” Everything in the universe has an identity. Yet at the same time, we look at categories and collections whose elements are seen as generic (i.e. cans of cat food are interchangeable).

There is another way to do this. Weigh the shopping bag, grab any can from it, weigh it, throw it back in the bag and do a division. You distinguished a representative subset of the cans from the rest of the set, but you never identify an individual can. Distinguishing is a more primitive operation than counting. I might have big hands and have grabbed two or three cans to throw on my scale. That is fine; I just talk about quantity as pairs, triples or as a ‘handful’ of cans. This is the ‘set oriented’ approach and it is how you should shape your thinking about SQL and relational databases.

Have we now wandered into the area of philosophy? At what point of aggregation do we consider something to be an entity? Is it each individual can of cat food? Or the category of cans of cat food? There really is no one right answer. This is why data modeling is hard.

Conclusion

None of this should really come as a surprise to anyone. There’s a standard pattern to the development of a new technology, especially when it’s replacing an older one. It first copies features of the old technology, partly out of convenience for the user, and partly because this is the only way we know how to think about the problem. This is why the Model T Ford had a buggy whip holster. This is why I wanted to have a rotary dial, or at least the familiar dial tones, rings, and so forth, on my first cell phone.