The SQL of Membership: Equivalence Classes & Cliques

It is awkward to do 'Graph databases' in SQL to explore the sort of relationships and memberships in social networks because equivalence relations are classes (a set of sets) rather than sets. However one can explore graphs in SQL if the relationship has all three of the mathematical properties needed for an equivalence relationship.

Equivalence Classes & Cliques

We keep harping that SQL is based on sets, but how many of us ever go back and re-read any old text books on set theory? In particular, one concept gets forgotten is equivalence relations on sets, which create equivalence classes. These classes are disjoint and we can put an element from a set into one of them with some kind of rule.

Let’s use ~ as the notation for an equivalence relation. The definition is that it has three properties:

  • The relation is reflexive: A ~ A. This means the relation applies to itself.
  • The relation is symmetric: : if (A ~ B) ⇔ (B ~ A). This means when the relation applies to two different elements in the set, it applies both ways.
  • The relation is transitive: (A ~ B) Λ (B ~ C) ⇒ (A ~ C). This means we can deduce all elements in a class by applying the relation to any known members of the class. This is pretty important for programming, as we will see.

So, if this is such a basic set operation, why isn’t it in SQL? The problem is that it produces classes, not a set! A class is a set of sets, but a table is just a set. The notation for each class is a pair of square bracket that contain some representative of each set. Assume we have

  • a ∈ A
  • b ∈ B

These expressions are all the same:

  • A ~ B
  • [a] = [b]
  • [a] ∩ [b] ≠ Φ

A common example is the set Z of integers and any MOD() operator will give you equivalence classes. The MOD(n, 2) operation gives you one class consisting of all even numbers, and the other consisting of all odd numbers. The nice part is that you can compute the MOD() function with arithmetic.

An equivalence classes can also be defined by a set of sets. This is an important concept in graph theory and graph database. If you have not seen it, Google “Six Degrees of Kevin Bacon“. The idea is that any individual involved in the Hollywood film industry can be linked through his or her film roles to Kevin Bacon within six steps.

Let’s steal a sample set from a SQL Forum posting of friends, where ~ now means “is a friend of” and that we can use “foaf” — “Friend of a friend” — to build cliches.

  • {Fred, Jessica, John, Mary}
  • {Albert, Nancy, Peter}
  • {Abby}
  • {Frank, Joe}

A complete graph means that each node is connected to every other node by one edge. If a subgraph is complete, it is actually called a Clique in graph theory!

The complete graph Kn of order n is a simple graph with n vertices in which every vertex is adjacent to every other. The complete graph on n vertices has n(n-1)/2 edges, which corresponding to all possible choices of pairs of vertices. But this is not an equivalence relation because it does not include the reference of each node to itself. Remember A~A? Think about Abby in this set, assuming that she likes herself, in spite of her lack of social skills.

Obviously, if you have a clique with (k) nodes in it, you can a complete subgraph of any size (j <k). A maximal clique is a clique that is not a subset of any other clique (some authors reserve the term clique for maximal clique.

SQL seems to have two equivalence relations that are major parts of the language. The first is plain old vanilla equal (=) for scalar values in the base data types. SQL is just like any other programming language, but we also have NULLs to consider. Oops! We all know that (NULL = NULL) is not true. So we have to exclude NULLs or work around them.

The other “almost” equivalence relation is GROUP BY in which the class is the grouping columns. A quick example would be “SELECT city_name, state_code FROM Customers GROUP BY state_code;” and this is still not quite right because we have to do some kind of aggregation on the non-grouping columns in the query.

Graphs in SQL

The idiom for graphs in SQL is to use a table with the nodes involved and a second table with pairs of nodes (a, b) to model the edges, which model our ~ relation.. This is classic RDBMS; the first table is entities (e.g. cities on a map, electronic components in a device, etc) and the second table is a relationship (e.g. roads between cities, wiring between components, etc).

The bad news is that when you use it for equivalence relations, it can get big. A class with one member is has one row to show the edge back to itself. A class with two members {a, b} has {(a, a), (b, b), (a, b), (b, a)} to show the ~ properties as edges. Three members gives us six rows; { (a, a), (b, b), (c, c), (a, b), (a, c), (b, a), (b, c), (c, a), (c, b)}. The general formula is (n * (n-1)) + n, which is pretty close to (n!).

First, load the table a few facts we might already know:

Reflexive Rows

The first thing we noticed is that only Abby has a reflexive property row. We need to add those rows and can do this with basic set operators. But it is not that easy, as you can see with this insertion statement:

The use of table aliases is tricky. You have to be sure that the SQL engine will construct a table in such a way that you do not get scoping problems. The UNION and EXCEPT operators are used to assure that we do not have primary key violations.

Abby

Abby

Albert

Albert

Frank

Frank

Fred

Fred

Jessica

Jessica

Jessica

Fred

Joe

Joe

Joe

Frank

John

Mary

John

John

Mary

Mary

Mary

Jessica

Nancy

Nancy

Peter

Peter

Peter

Nancy

Peter

Albert

Symmetric Rows

Look at the update table and there are no { (a, b), (b, a)} pairs. We need to add this second property to the relation table. This follows the pattern of set operators we just used:

This will give us:

Abby

Abby

Albert

Peter

Albert

Albert

Frank

Joe

Frank

Frank

Fred

Jessica

Fred

Fred

Jessica

Mary

Jessica

Jessica

Jessica

Fred

Joe

Joe

Joe

Frank

John

Mary

John

John

Mary

Mary

Mary

John

Mary

Jessica

Nancy

Peter

Nancy

Nancy

Peter

Peter

Peter

Nancy

Peter

Albert

If you do quick GROUP BY query, you see that Abby is seriously anti-social with a count of 1 , but Mary and Jessica look very social with a count of 3. This is not quite true because the property is the dreaded “Rock-paper-scissors-lizard-Spock” relationship.”

Transitive Rows

Transitivity goes on forever. Well, until we have what mathematicians call a closure. This means we have gotten a set of elements that have all of the valid relationships. In some cases, these sets are infinite. But in database, we can only have insanely large tables. Let’s pull a subset that is part oi a clique of 4 friends.

Fred

Jessica

Fred

Fred

Jessica

Mary

Jessica

Jessica

Jessica

Fred

John

Mary

John

John

Mary

Mary

Mary

John

Mary

Jessica

Now apply the transitive relation to get some of the missing edges of a graph:

This is a simple statement using the definition of transitivity, without a universal quantifier or loop on it. It will add these rows to this subset:

Fred

Mary

Jessica

John

John

Jessica

Mary

Fred

When you look at Mary and Jessica, you have all of their friends. However, Fred and John do not know that they are friends. So we invoke the statement again and get those two rows. If you try it a third time, there are zero rows added.

The first thought is this sounds like a job for a recursive statement. But it is not that easy. If the original graph has a cycle in it, you can hang in an infinite loop if you try to use a recursive CTE. The assumption is that each clique has a spanning graph in the pairs in. Oops! New term: a spanning graph is a sub-graph that includes the nodes and some or all of the edges of the graph. A complete graph is the spanning graph has all the possible edges, so each node is directly connected to any other node.

Cliques

Now change your mindset from graphs to sets. Let’s look at the size of the cliques:

Abby

1

Frank

2

Joe

2

Nancy

3

Peter

3

Albert

3

Fred

4

Jessica

4

John

4

Mary

4

What we want to do is to assign a number to each clique. This sample data is biased by the fact that the cliques are all different sizes. You cannot simply use the size to assign a clique_nbr.

Create this table load it with the names.

We can start by assigning everyone their own clique, using whatever your favorite method is:

Everyone is equally a member of their clique in this model. That means we could start with anyone to get the rest of their clique. The update is very straight forward. Take any clique number, find the member to whom it belongs and use that name to build that clique. But which number to we use? We could use the MIN, MAX or a random number in the clique; I will use the MAX for no particular reason

I keep thinking that there is a recursive update statement that will do this in one statement. But I know it will not port (Standard SQL has no recursive update statement right now. We would do it with SQL/PSM or a host language) and I think it would be expensive. Recursion will happen for each member of the whole set, but if we consolidate a clique for one person, we have removed his friends from consideration.

The worst situation would be a bunch of hermits, so the number of clique would be the cardinality of set. That is not likely or useful in the real world. Let’s put the update in the body of a loop.

As a programming assignment, replace the simple counting loop with one that does not do updates with clique_loop values that were removed in the prior iteration. This can save a lot of work in a larger social network than the sample data used here.

Adding Members

Now that we have a Cliques table, we can do a simple insertion if the new guy belongs to one clique. However, we can have someone who knows a people in different cliques. This will merge the two cliques into one.

Deleting a member or moving him to another clique is trivial.

Conclusion

This article is full of handy tricks to use with SQL. But cliques and equivalence classes are better handled with a graph database than with SQL. This is no surprise; that is what they were meant to do! The graph database can also handle relationships that do not have all three of the mathematical properties needed for an equivalence relationship.

If any readers want to add to this set of tricks, please do.