SQL Data Aggregation Aggravation

When we have to deal with and store a lot of data, it makes sense to aggregate it so that we store only the information we actually need. If we get this right, it works well, but designing the system takes care and thought, because the problems can be subtle and varied. Joe Celko describes some of the ways that things can go wrong and end up producing incorrect, inaccurate or misleading results.

Raw data is usually bulky, so we like to aggregate it into smaller chunks to make it easier to digest, but there are a lot of subtle problems with aggregating data. We like to aggregate data in such a way that when we put the small chunks back together, we get the original set. In mathematics, this is called a partition of a set. When you can do it, it’s very, very nice. But it’s not always possible.

Defining Subgroups

Your first thought is that you’re going to partition the data by using some kind of GROUP BY clause. But back up a step. Many years ago there was a popular puzzle, originally from a Martin Gardner Scientific American column, called The Vanishing Leprechaun (see the YouTube demonstration here). The puzzle consists of three parts, like a jigsaw puzzle, which can be rearranged in one of two ways. Depending on how the pieces are arranged, one of the leprechauns seems to disappear.

The puzzle asks the false question: “where did the leprechaun go?” The problem is that the “leprechaun” was never well-defined. If we were to use the same flour, eggs, butter and sugar either to bake a dozen cookies or to make a small cake, you would not look at the cake and ask where the individual cookies are inside it, nor where the cake is inside the individual cookies. You would realize that the same raw materials were simply aggregated differently in the two cases. The cake is simply different from the cookies, and vice versa.

As an example of a well-defined aggregation, consider the children’s game Cootie (for an explanation, see the Wikipedia page). The object of the game is to be the first player to build a “cootie” piece by piece from various plastic body parts that include a beehive-like body, a head, antennae, eyes, a coiled proboscis, and six legs. Body parts are acquired by rolling a die. The winner is the first player to completely assemble a cootie. The number of each body part required to make a valid cootie is very well-defined; if the bug is missing a part, it’s not a cootie.

The moral to the story is that the set of aggregations is not always the same as the original set.

Equivalence Classes and Relations

In set theory, we have equivalence classes and equivalence relations. They partition the set based on some relationship that is well-defined. The two that come to mind in SQL are equality and grouping. The difference is that simple equality (=) doesn’t work with NULLs, only with known values. We also have the MOD(n) function, which creates n equivalence classes; the simplest example is breaking the integers into odd and even classes.
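As a minimal sketch of that idea, assuming a hypothetical Numbers table with an integer column nbr, grouping on the remainder splits the rows into exactly two equivalence classes:

-- nbr % 2 (MOD(nbr, 2) in some dialects) puts every row into
-- one of two equivalence classes: even (0) or odd (1).
SELECT (nbr % 2) AS parity_class,
       COUNT(*) AS class_size
  FROM Numbers
 GROUP BY (nbr % 2);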

Grouping (as in the GROUP BY clause, the PARTITION BY clause of window functions, and some other places in SQL) treats NULLs as all falling into the same equivalence class.
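For example (the Shoes table and its columns are made up for illustration), the predicate shoe_color = shoe_color is UNKNOWN for a row with a NULL color, yet GROUP BY quietly gathers all of those rows into one group of their own:

-- All rows with a NULL shoe_color end up in a single group,
-- even though (NULL = NULL) is UNKNOWN rather than TRUE.
SELECT shoe_color, COUNT(*) AS shoe_tally
  FROM Shoes
 GROUP BY shoe_color;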

By using the AVG() and SUM() math functions with GROUP BY, you reduce a grouped attribute to a single value. These are the ones we picked in the early days of SQL, because they were very easy and very well understood. The problem is that they depend on numeric values with which we can compute.
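A quick sketch, using a hypothetical Census table of households: each group collapses into one row of computed values, and nothing says those values have to appear anywhere in the original data.

-- Each city collapses into a single row of computed values.
SELECT city_name,
       SUM(child_cnt)       AS total_children,
       AVG(child_cnt * 1.0) AS avg_children  -- can come out as 2.3, a value no household has
  FROM Census
 GROUP BY city_name;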

We could use the median, or a whole bunch of other measures of central tendency whose names involve a famous statistician or a Greek letter. They have a common characteristic, however: they depend on numeric values upon which we can do computations. In short, they’re not very good for discrete or descriptive values. In fact, they don’t make sense for such things. What is the average of the colors of shoes when you group by a city? If both yellow and red are equally represented, is the average color orange? The question is absurd. This is a classic issue of continuous versus discrete variables when you are doing statistics. Did you notice that the value returned does not have to be a member of the set from which it was created? That’s important. It means that it’s perfectly logical for the families in the population to have an average of 2.3 children. But in practice, it’s very embarrassing if the police find you with a 0.3 child in your house.

SQL does give you some aggregate functions that apply to discrete variables. The extrema functions (MIN() and MAX()) extract a single value from the class. These values have to be attributes of at least one element in that set. The obvious discrete statistical function that is missing is MODE(), the most frequently occurring value in the set. It’s a terrible measure of central tendency, and there can be several of them in the case of ties, but it has the advantage of representing majority-rule voting when it is unique.
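Since most SQL products have no MODE() aggregate, you have to roll your own with COUNT(*). A sketch against the hypothetical Shoes table again; note that ties are deliberately kept rather than arbitrarily broken:

-- Most frequent shoe_color in each city; tied colors all share rank 1
-- and are all returned.
WITH ColorTally AS
 (SELECT city_name, shoe_color, COUNT(*) AS color_cnt,
         RANK() OVER (PARTITION BY city_name
                      ORDER BY COUNT(*) DESC) AS rnk
    FROM Shoes
   GROUP BY city_name, shoe_color)
SELECT city_name, shoe_color AS modal_color, color_cnt
  FROM ColorTally
 WHERE rnk = 1;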

A major problem with the mode, and other discrete statistical functions, is that they behave differently for the whole population than in subsets of the population. The best-known example of this is Simpson’s paradox (see my article: Data is crazier than you think).

Simpson’s paradox has nothing to do with Homer Simpson or O. J. Simpson; it is named after the British statistician Edward Hugh Simpson who wrote about it in the 1950s. Simpson’s paradox happens when a trend that appears in separate groups of data disappears when these groups are combined, and the reverse trend appears for the aggregate data. It tends to happen when you partition your data into subsets where each of the subsets is a little skewed and of odd sizes, and illustrates the dangers of errors in population sampling.
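To make the effect concrete, here is a tiny made-up data set (the numbers are invented purely so that the comparison flips). Treatment A has the better success rate in each subgroup, yet treatment B has the better rate when the subgroups are combined, because the subgroups are skewed and of very different sizes:

-- Per-subgroup rates: A beats B in X (25% vs 20%) and in Y (80% vs 75%).
-- Combined rates: B beats A (7/13 = 54% vs 6/13 = 46%).
WITH Trials AS
 (SELECT *
    FROM (VALUES ('A', 'X', 2, 8), ('A', 'Y', 4, 5),
                 ('B', 'X', 1, 5), ('B', 'Y', 6, 8)
         ) AS T (treatment, subgroup, successes, cases))
SELECT treatment, subgroup, 100.0 * successes / cases AS success_pct
  FROM Trials
UNION ALL
SELECT treatment, 'combined', 100.0 * SUM(successes) / SUM(cases)
  FROM Trials
 GROUP BY treatment;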

Gerrymandering

Instead of trying to break up our population into subsets that avoid Simpson’s paradox, we can actually try to increase it! This is called gerrymandering. For a really nice animation on the basic principles go to Gerrymandering Explained.

The example is a population of 50 voters, each of whom belongs to either the Red or the Blue party. The voters are 60% Blue (30 voters) and 40% Red (20 voters). In what is called a perfect representation, each of the five districts of 10 voters has six Blues and four Reds in it, the same 3:2 ratio as the population as a whole. This gives the Blues absolute domination of all five districts, and we have the tyranny of the majority.

However, I can group the voters in such a way that I get three all-Blue districts and two all-Red districts, which matches the 60/40 split. Even better (for the Reds), create two districts of nine Blues and one Red, and three districts of six Reds and four Blues. Suddenly the Reds control three of the five districts!
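In SQL terms, the district winner is just an aggregate computed over an aggregate, and it depends entirely on how rows were assigned to groups. A sketch, assuming a hypothetical Voters table with a party column and a district_id assignment:

-- Change the district_id assignments and the same 50 voters
-- elect a different set of district winners.
SELECT district_id,
       CASE WHEN SUM(CASE WHEN party = 'Blue' THEN 1 ELSE 0 END)
               > SUM(CASE WHEN party = 'Red'  THEN 1 ELSE 0 END)
            THEN 'Blue' ELSE 'Red' END AS district_winner
  FROM Voters
 GROUP BY district_id;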

In the case of gerrymandering, this is done deliberately, but in the case of sampling, it might be accidental, or at least unintentional. Those of my generation remember the draft lottery in 1969 for the Vietnam war. A piece of paper with each calendar day on it was put in a capsule, and the capsules were drawn from a bingo-game tumbling cage. The capsules were drawn by hand: the first date drawn was assigned draft number 1, the next date number 2, and so on until the last one was number 366 (leap day was included).

Bad Sampling

These numbers determined the order in which people would be drafted, and it was projected that about 10% of the males would be sent to Vietnam immediately after bootcamp and that another 10% would go to Vietnam within their period of enlistment.

The correlation between day of birth (1–366) and draft number was -0.28. A random process would have produced a correlation around zero. Because this correlation was statistically significant, the lottery was challenged in court. The reason the drawing didn’t work properly was the way the capsules were put in the tumbler and drawn from it. This was fixed in the following draft lottery, which was based on initials.

NTILE() Partitions

One obvious way to split up your data is with the RANK and DENSE_RANK functions. The problem with this is that the groups they form have a uniform value for whatever you ranked on. We are really trying to find subsets of things. Yes, they should have a commonality, but they should not be identical. Think back to our voting districts; they really should have some variety of voters.
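For reference, a quick sketch of what those two functions do with a hypothetical contribution_amt column; every row with the same value gets the same rank, which is exactly why ranking alone does not give you mixed groups:

-- Tied contribution amounts always share a rank; DENSE_RANK() leaves
-- no gaps after a tie, RANK() does.
SELECT voter_id, contribution_amt,
       RANK()       OVER (ORDER BY contribution_amt DESC) AS rnk,
       DENSE_RANK() OVER (ORDER BY contribution_amt DESC) AS dense_rnk
  FROM Voters;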

The NTILE() function varies a bit from one SQL or statistical product to another, but they are all basically the same in intent. R and statistical packages tend to have more options than just SQL. The Microsoft syntax is a windowed function call:

NTILE(<positive integer constant>) OVER ([<partition by clause>] <order by clause>)

The <positive integer constant> specifies the number of groups into which each partition must be divided. Without the ORDER BY clause this makes no sense, and without the PARTITION BY clause the whole result set is treated as a single partition. A ROWS or RANGE frame clause would make no sense here. No surprises.

Basically, you’re going to take the result set, order it, and partition it into <positive integer constant> groups. To go back to the previous example, if I had a set of 50 voters, I could order them on, say, campaign contribution amounts, put the fat cats (the largest contributors) in one group, and rank the rest of them down to zero contributions. Obviously, we have some problems with ties. I’m sure that a lot of voters on my mailing list haven’t sent me anything and just want to get the free coffee mug or whatever I’m using as a bribe.

If the total number of rows is divisible by the number of groups, the rows will be evenly distributed among the groups. For example, if the total number of rows is 50, and there are five groups, each bucket will contain 10 rows.

If the number of rows in a partition is not divisible by <positive integer constant>, I will have odd men left over and need a rule for assigning them to some group. The rule, which is arbitrary, is that larger groups come before smaller groups in the order specified by the OVER clause. For example, if the total number of rows is 53 and the number of groups is five, the first three groups will have 11 rows and the two remaining groups will have 10 rows each.
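A sketch with the same hypothetical Voters table: with 53 rows, the tiles come out as 11, 11, 11, 10 and 10 rows:

-- Larger tiles come first in the ORDER BY sequence.
SELECT voter_id, contribution_amt,
       NTILE(5) OVER (ORDER BY contribution_amt DESC) AS contribution_tile
  FROM Voters;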

NTILE() is not deterministic. In fact, it’s really pretty weird. Consider this quick example, which lacks a key, and is therefore not a valid table:
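Something along these lines will do; the values are made up, and any keyless multiset with heavy ties shows the same effect:

-- Ten rows of tied values forced into four groups of 3, 3, 2 and 2 rows.
SELECT x,
       NTILE(4) OVER (ORDER BY x) AS quartile_nbr
  FROM (VALUES (1), (1), (2), (2), (3), (3), (3), (4), (4), (4)
       ) AS Multiset (x);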

Now every quartile group shares a value with the one next to it in sequence! But what else should it do? The description of this function is to split the original data into <positive integer constant> groups. If I wanted to get rid of ties, I would have to handle them myself in my query.

Summary

The message of this little thought piece is that “it’s really hard to put together leprechauns from leprechaun parts” – which is really kind of gruesome when you think about it. But I hope it will at least make you think about your data when you try to aggregate it from a raw set of rows.