This article is a sequel to "Ordering Tables To Preserve Referential Integrity" which appeared in www.sqlservercentral.com on 20 Mar 2009
Overview
Email broadcasting is nothing more than redirecting emails to multiple addresses, but it’s a common tool for many companies. However, maintaining broadcast lists can be surprisingly difficult. Customers sometimes receive multiple copies of the same email (for no obvious reason) or none at all. A frequent complaint is that recipients ask to be removed from a mailing list but continue to receive them even after the sender attempts manual corrections. This article will show two ways in which subtle errors may occur, and offers a general method for cleaning up broadcast tables using a bit of mathematics and a nice feature of SQL Server 2005.
Although mailing lists are conceptually simple, several people may be independently maintaining them (using private spreadsheets, for example). This can cause many errors. And some of them aren’t even evident until after the redirections have been collated into a master list (e.g. two people independently putting the same email on their own spreadsheets).
To show how this can happen, take a fictitious company called My Company whose email redirections look like the following (after Sales, Marketing, and Admin have submitted their redirections to their email administrator):
bill@MyCompany.com bill@Microsoft.com
SALES@MyCompany.com lan@MyCompany.com
SALES@MyCompany.com ADMIN@MyCompany.com
SALES@MyCompany.com janet@IBM.com
SALES@MyCompany.com accounts
ADMIN@MyCompany.com accounts
sue@MyCompany.com bill@Microsoft.com
jane@MyCompany.com webmaster
mary@MyCompany.com glen@MyCompany.com
glen@MyCompany.com janet@IBM.com
natasha@MyCompany.com MARKETING@MyCompany.com
lan@MyCompany.com natasha@MyCompany.com
fair@MyCompany.com president
MARKETING@MyCompany.com fair@MyCompany.com
MARKETING@MyCompany.com SALES@MyCompany.com
Here, any email sent to bill@MyCompany.com would be automatically redirected to bill@Microsoft.com (the parent company). SALES@MyCompany.com is a mailing list whose recipients are lan@MyCompany.com, ADMIN@MyCompany.com (another mailing list), janet@IBM.com, and accounts (a user account on the company’s email server).
Already there are two errors in this broadcast (you did see them, right?):
CIRCULARITY (return to sender):
MARKETING@MyCompany.com > SALES@MyCompany.com > lan@MyCompany.com > natasha@MyCompany.com > MARKETING@MyCompany.com
REDUNDANCY (recipient occasionally receives multiple copies of the same email):
SALES@MyCompany.com > accounts
SALES@MyCompany.com > ADMIN@MyCompany.com > accounts
The scripts below locate these problems in any email broadcast.
How the scripts work
For notational simplicity, label the email addresses from 1 to 16:
1 bill@MyCompany.com
2 sue@MyCompany.com
3 jane@MyCompany.com
4 mary@MyCompany.com
5 glen@MyCompany.com
6 natasha@MyCompany.com
7 lan@MyCompany.com
8 fair@MyCompany.com
9 MARKETING@MyCompany.com
10 SALES@MyCompany.com
11 ADMIN@MyCompany.com
12 bill@Microsoft.com
13 janet@IBM.com
14 webmaster
15 president
16 accounts
Then the 15 redirections defining the broadcast look like this:
What a difference a picture makes!
Now it’s easy to see why 9 (Marketing) eventually redirects mail back to itself and why 16 (accounts) always gets multiple emails sent to 10 (SALES). For some reason ADMIN (11) wasn’t aware that SALES was already copying her emails to accounts. For many organizations there could be hundreds of redirections from a single email, which is why manual tinkering is ineffective.
To diagnose these pictures, let’s view them as directed graphs where the above problems can be approached in a natural way (for an overview of graphs, see http://en.wikipedia.org/wiki/Graph_theory).
Recall that a directed graph is any finite set of elements (called vertices), together with a set of directed edges between them. Multiple edges between any pair of vertices are not permitted, as well as edges involving the same vertex. The scripts below will assume these conditions to keeps things simple. In any event, they’re easy to test.
By definition, for any directed edge A > B between vertices A and B, the element A is called the LHS (“lefthand side”) and B is called the RHS (“righthand side”) of the edge.
Lists of redirections (i.e. single broadcast) will be represented by a table whose columns are LHS and RHS. Later we’ll see why this edgeoriented approach is a natural choice over its vertexoriented alternative.
For directed graphs, any finite sequence of directed edges V1> V2> … > Vn on the vertices V1, V2, … Vn is called a path (of length n1), provided that the Vi are distinct (except that V1may be Vn). A path is circular if V1= Vn.
The scripts below will test any broadcast for the following conditions:
· There are no circular paths
· At most one directed path exists between any two vertices
Showing that a directed graph has no circular paths is fairly straightforward.
First, a simple definition. The outdegree of any vertex V is the number of edges of the form V > W. This is nothing more than the number of edges leading from V to other vertices. Indegrees are defined in the same way.
If the graph has no circular paths, then there must be a vertex V whose outdegree is 0, because the number of vertices is finite.
Our test will attempt to list the vertices in the following way. Choose a vertex whose outdegree is 0 as the first member of the list. Remove it and all edges into it, and repeat for the remaining graph. This process will eventually choose all of the vertices if, and only if, there is no circular path in the original graph (use induction on the number of vertices). Note that if you remove a vertex on the RHS of an edge coming from another vertex with no other edges, then that vertex can also be dropped at this stage because the next iteration will drop it anyway. But since we are storing graphs as edges, this will happen automatically (hence the edgeoriented approach).
For the above example, the vertices with outdegree = 0 are first chosen:
12, 13, 14, 15, 16
After removing these vertices and their edges, the vertices on the remaining graph with outdegree = 0 are chosen:
5, 8, 11
After removing these vertices and their edges, the remaining graph still contains the vertices 9, 10, 7, 6 but none have outdegree = 0. So the test fails, and the remaining graph must have a circular path which, in this case, traverses all of the remaining vertices. Note that if there were three more vertices connected in the following way 17 > 18 > 19 > 17 then there would be two circular paths, but neither would traverse all of the remaining vertices. So this algorithm won’t immediately produce a circular path – it only tests whether one exists. But to find one, simply choose any vertex that remains and follow its edges until it returns to some vertex already traversed. That will always happen since there are only finitely many vertices, and none of them have outdegree = 0. Note that the vertex you choose may not be part of the circular path that you found this way. In fact, it won’t be part of any circular path if its indegree = 0.
The script for the circularity test is the following:
/*
Script: Circularity test for email lists
Platform: SQL Server 2005
Purpose: This script accepts a table of email
redirections and determines if
circularity exists.
Redirections are represented by a table called Edge
whose columns LHS and RHS represent the left and
righthand sides of a redirection.
Output: List of redirections containing all circular paths
Comment: This test will destroy Edge, so a working copy is used
*/
 No counting
SET NOCOUNT OFF
 Make working copy of Edge
IF EXISTS (SELECT* FROM sys.objects WHERE object_id =OBJECT_ID(N'[dbo].[EdgeTemp]') ANDtype in (N'U'))
DROP TABLE [dbo].[EdgeTemp]
SELECT LHS, RHS INTO EdgeTemp FROM Edge
CREATE INDEX NDX_EdgeTemp_RHS ON EdgeTemp(RHS)
 Recursively delete edges whose RHS vertices have outdegree = 0.
 Note that the RHS of such edges will disappear from Edge.
 So will the LHS of such edges if they have outdegree = 1.
 Note that @@ROWCOUNT = 0 at this point.
DELETE
FROM EdgeTemp
WHERE RHS NOT IN (SELECTDISTINCT LHS FROM EdgeTemp)
 If @@ROWCOUNT > 0 at this point, then continue
WHILE @@ROWCOUNT > 0
DELETE
FROM EdgeTemp
WHERE RHS NOTIN (SELECTDISTINCT LHS FROM EdgeTemp)
 Original graph will have no circular paths
 if and only if there are no remaining edges
 in its working copy.
 All edges will have outdegree > 0 at this point.
SELECT * FROM EdgeTemp
 Drop working table
DROP TABLE EdgeTemp
This test concludes with a list of edges containing all circular paths.
Showing that at most one path exists between any two vertices can be tested in the following way: Copy each edge to a temporary collection. Initially, this collection can be viewed as the LHS and RHS of all paths of length 1. Then match the RHS of each edge of the graph with the LHS of any member in the collection, adding the LHS of the edge and the RHS of the member to the collection. Now we have the LHS and RHS of all paths of length 1 or 2. Repeat this matching (using just the newlyintroduced members of length 2) to get the LHS and RHS of all paths of length 1, 2, or 3. Continue this way until no more new edges are available for the next iteration of this recursion. After that, no duplicate records exist in the temporary collection if, and only if, there is at most one path between any two vertices of the original graph.
Now if you’re still awake, notice that the second test assumes that there are no circular paths! Otherwise, you’ll go into an infinite loop. So the first test must succeed before the second test is even performed.
So let’s change the above example by truncating the edge 9 > 10. Now there are no circular paths and we can perform the second test:
The first iteration produces the LHS and RHS of all paths of length 1 (i.e. all edges):
LHS 
RHS 

LHS 
RHS 

LHS 
RHS 

LHS 
RHS 

LHS 
RHS 

LHS 
RHS 

LHS 
RHS 
1 
12 

2 
12 

3 
14 

4 
5 

5 
13 

6 
9 

7 
6 
8 
15 

9 
8 

10 
7 

10 
11 

10 
13 

10 
16 

11 
16 
The next iteration produces the LHS and RHS all paths of length 2:
LHS 
RHS 

LHS 
RHS 

LHS 
RHS 

LHS 
RHS 

LHS 
RHS 

LHS 
RHS 
4 
13 

6 
8 

7 
9 

9 
15 

10 
6 

10 
16 
The next iteration produces the LHS and RHS all paths of length 3:
LHS 
RHS 

LHS 
RHS 

LHS 
RHS 
6 
15 

7 
8 

10 
9 
The next iteration produces the LHS and RHS all paths of length 4:
LHS 
RHS 

LHS 
RHS 
7 
15 

10 
8 
The final iteration produces the LHS and RHS all paths of length 5:
Note that duplicate LHS, RHS combinations produced in a given iteration must represent different paths (use simple induction on the preceding iteration to prove this). But paths represented by those LHS, RHS combinations in different iterations have different lengths (by construction) so they must be different paths. Therefore, duplicate LHS, RHS combinations always represent different paths between the same pair of vertices. So in our truncated example, there are two paths from 10 to 16 and no other redundancy exists.
We will use SQL Server 2005’s elegant Common Table Expressions to seek duplicates while recursively listing the LHS, RHS combinations. This test will output all pairs of vertices with multiple paths between them, and count how many there are. Optionally, you may use the diagnostics option to list the iterations.
The script for the redundancy test is the following:
/*
Script: Redundancy test for email lists
Platform: SQL Server 2005
Purpose: This script accepts a table of email
redirections and determines if
any recipient received multiple emails.
Redirections are represented by a table called Edge
whose columns LHS and RHS represent the left and
righthand sides of a redirection.
Output: Number of redundant paths between any pair of emails having them
Note: Assumes no circular paths with redirections.
Otherwise, script will go into an infinite loop.
Use sister script to check this beforehand.
*/
 No counting
SET NOCOUNT ON
 Determine redundant paths between any pair of emails, and count them
;WITH Collect AS
(
 Collect the LHS and RHS of all paths of length L = 1 (i.e. all edges).
 Use i to record iteration for diagnostics.
SELECT i= 1, LHS, RHS
FROM Edge
UNION ALL  Must use ALL because duplicates will always represent different paths
 Collect the LHS and RHS all paths of length L
 by joining redirections with previously collected paths
 in the collection of length L  1
SELECT i = i + 1, Edge.LHS, Collect.RHS
FROM Edge INNER JOIN Collect
ON Edge.RHS = Collect.LHS
)
 Output result.
 Set MAXRECURSION to 0 to allow unlimited recursion.
SELECT LHS, RHS,COUNT(*) AS NumPath FROM Collect GROUP BY LHS,RHS HAVING COUNT(*)> 1 OPTION (MAXRECURSION 100)
 Diagnostics:
 SELECT i, LHS, RHS FROM Collect ORDER BY CAST(i AS INT), CAST(LHS AS INT), CAST(RHS AS INT) OPTION (MAXRECURSION 100)
Finally, here’s a simple script to produce random graphs for testing:
/*
Script: Produce random directed graph
Platform: SQL Server 2005
Purpose: This script builds a random directed graph called Edge
*/
 No counting
SET NOCOUNT ON
 Declarations
DECLARE @Vertex1 INT
DECLARE @Vertex2 INT
 Create Edge
IF EXISTS (SELECT * FROM sys.objects WHERE object_id =OBJECT_ID(N'Edge')AND type in(N'U'))
DROP TABLE Edge
CREATE TABLE Edge(
LHS varchar(255),
RHS varchar(255)
)ON [PRIMARY]
 Initialise the first vertex
SET @Vertex1 = 0
 Enumerate all pairs of vertices from 1 to 10,
 where the second vertex is less than the first,
 and create a directed edge with 20% probability
 on each pair chosen (50% either way for those chosen).
 Note that higher probabilities will create more edges.
WHILE @Vertex1 < 10  Number of vertices to build
BEGIN
SET @Vertex1 = @Vertex1 + 1
SET @Vertex2 = 1
WHILE @Vertex2 < @Vertex1
BEGIN
 20% probability of creating directed edge (i.e. 1 out of 5)
IF CONVERT(INT, 5*RAND())= 1
 50% probability direction goes this way (i.e. 1 out of 2)
IF CONVERT(INT, 2*RAND())= 1
INSERT INTO Edge(LHS,RHS)VALUES(@Vertex1, @Vertex2)
ELSE
 50% probability direction goes that way
INSERT INTO Edge(LHS,RHS)VALUES(@Vertex2, @Vertex1)
SET @Vertex2 = @Vertex2 + 1
END
END
 Index Edge
CREATE INDEX IDX_Edge_RHS ON Edge(RHS)
 Display graph
SELECT * FROM Edge
After producing a random graph, run the first test for circularity. If it’s successful (and only if it’s successful), run the second test for redundancy. Otherwise, produce another random graph and start over.
Randomly generated graphs readily produce circular paths, unlike those built manually from real broadcasts, where most emails are directed to external sites (where we know nothing about their redirections). You can adjust the probabilities to influence how many circular paths should be created. Of course, performance will be an issue with the second test on large nonsparse graphs since the expected number of paths goes up very quickly with the number of edges. Fortunately, real broadcasts are relatively sparse and wellbehaved. In practice, however, even one redundancy is one too many since its affect on customers can be quite annoying.
An enhanced version of these scripts is used by a midsize marketing company to verify the integrity of their broadcasts, which change monthly. Users submit their redirections to their email coordinator, who verifies the collated lists before sending them to their email administrator. And that saves everybody phone calls and manual debugging.
The above scripts are quite general, and may be applied to any situation where hierarchies exist in the form of directed graphs.