SQLServerCentral Article

Mathematical Auditing Of Email Broadcasts

,

This article is a sequel to "Ordering Tables To Preserve Referential Integrity" which appeared in www.sqlservercentral.com on 20 Mar 2009

Overview

Email broadcasting is nothing more than re-directing emails to multiple addresses, but it’s a common tool for many companies. However, maintaining broadcast lists can be surprisingly difficult. Customers sometimes receive multiple copies of the same email (for no obvious reason) or none at all. A frequent complaint is that recipients ask to be removed from a mailing list but continue to receive them even after the sender attempts manual corrections. This article will show two ways in which subtle errors may occur, and offers a general method for cleaning up broadcast tables using a bit of mathematics and a nice feature of SQL Server 2005.

Although mailing lists are conceptually simple, several people may be independently maintaining them (using private spreadsheets, for example). This can cause many errors. And some of them aren’t even evident until after the re-directions have been collated into a master list (e.g. two people independently putting the same email on their own spreadsheets).

To show how this can happen, take a fictitious company called My Company whose email re-directions look like the following (after Sales, Marketing, and Admin have submitted their re-directions to their email administrator):

bill@MyCompany.com                            bill@Microsoft.com
SALES@MyCompany.com                      lan@MyCompany.com
SALES@MyCompany.com                      ADMIN@MyCompany.com
SALES@MyCompany.com                      janet@IBM.com
SALES@MyCompany.com                      accounts
ADMIN@MyCompany.com                     accounts
sue@MyCompany.com                           bill@Microsoft.com
jane@MyCompany.com                          webmaster
mary@MyCompany.com             glen@MyCompany.com
glen@MyCompany.com                          janet@IBM.com
natasha@MyCompany.com                     MARKETING@MyCompany.com
lan@MyCompany.com                            natasha@MyCompany.com
fair@MyCompany.com                           president
MARKETING@MyCompany.com            fair@MyCompany.com
MARKETING@MyCompany.com            SALES@MyCompany.com

Here, any email sent to bill@MyCompany.com would be automatically re-directed to bill@Microsoft.com (the parent company). SALES@MyCompany.com is a mailing list whose recipients are lan@MyCompany.com, ADMIN@MyCompany.com (another mailing list), janet@IBM.com, and accounts (a user account on the company’s email server).

Already there are two errors in this broadcast (you did see them, right?):

CIRCULARITY (return to sender):

MARKETING@MyCompany.com -> SALES@MyCompany.com -> lan@MyCompany.com -> natasha@MyCompany.com -> MARKETING@MyCompany.com

REDUNDANCY (recipient occasionally receives multiple copies of the same email):

SALES@MyCompany.com -> accounts

SALES@MyCompany.com -> ADMIN@MyCompany.com -> accounts

The scripts below locate these problems in any email broadcast.

How the scripts work

For notational simplicity, label the email addresses from 1 to 16:

1 bill@MyCompany.com

2 sue@MyCompany.com

3 jane@MyCompany.com

4 mary@MyCompany.com

5 glen@MyCompany.com

6 natasha@MyCompany.com

7 lan@MyCompany.com

8 fair@MyCompany.com

9 MARKETING@MyCompany.com

10 SALES@MyCompany.com

11 ADMIN@MyCompany.com

12 bill@Microsoft.com

13 janet@IBM.com

14 webmaster

15 president

16 accounts

Then the 15 re-directions defining the broadcast look like this:

Arrow diagram

What a difference a picture makes!

Now it’s easy to see why 9 (Marketing) eventually re-directs mail back to itself and why 16 (accounts) always gets multiple emails sent to 10 (SALES). For some reason ADMIN (11) wasn’t aware that SALES was already copying her emails to accounts. For many organizations there could be hundreds of re-directions from a single email, which is why manual tinkering is ineffective.

To diagnose these pictures, let’s view them as directed graphs where the above problems can be approached in a natural way (for an overview of graphs, see http://en.wikipedia.org/wiki/Graph_theory).

Recall that a directed graph is any finite set of elements (called vertices), together with a set of directed edges between them. Multiple edges between any pair of vertices are not permitted, as well as edges involving the same vertex. The scripts below will assume these conditions to keeps things simple. In any event, they’re easy to test.

By definition, for any directed edge A -> B between vertices A and B, the element A is called the LHS (“left-hand side”) and B is called the RHS (“right-hand side”) of the edge.

Lists of re-directions (i.e. single broadcast) will be represented by a table whose columns are LHS and RHS. Later we’ll see why this edge-oriented approach is a natural choice over its vertex-oriented alternative.

For directed graphs, any finite sequence of directed edges V1-> V2-> … -> Vn on the vertices V1, V2, … Vn is called a path (of length n-1), provided that the Vi are distinct (except that V1may be Vn). A path is circular if V1= Vn.

The scripts below will test any broadcast for the following conditions:

· There are no circular paths

· At most one directed path exists between any two vertices

Showing that a directed graph has no circular paths is fairly straightforward.

First, a simple definition. The out-degree of any vertex V is the number of edges of the form V -> W. This is nothing more than the number of edges leading from V to other vertices. In-degrees are defined in the same way.

If the graph has no circular paths, then there must be a vertex V whose out-degree is 0, because the number of vertices is finite.

Our test will attempt to list the vertices in the following way. Choose a vertex whose out-degree is 0 as the first member of the list. Remove it and all edges into it, and repeat for the remaining graph. This process will eventually choose all of the vertices if, and only if, there is no circular path in the original graph (use induction on the number of vertices). Note that if you remove a vertex on the RHS of an edge coming from another vertex with no other edges, then that vertex can also be dropped at this stage because the next iteration will drop it anyway. But since we are storing graphs as edges, this will happen automatically (hence the edge-oriented approach).

For the above example, the vertices with out-degree = 0 are first chosen:

12, 13, 14, 15, 16

After removing these vertices and their edges, the vertices on the remaining graph with out-degree = 0 are chosen:

5, 8, 11

After removing these vertices and their edges, the remaining graph still contains the vertices 9, 10, 7, 6 but none have out-degree = 0. So the test fails, and the remaining graph must have a circular path which, in this case, traverses all of the remaining vertices. Note that if there were three more vertices connected in the following way 17 -> 18 -> 19 -> 17 then there would be two circular paths, but neither would traverse all of the remaining vertices. So this algorithm won’t immediately produce a circular path – it only tests whether one exists. But to find one, simply choose any vertex that remains and follow its edges until it returns to some vertex already traversed. That will always happen since there are only finitely many vertices, and none of them have out-degree = 0. Note that the vertex you choose may not be part of the circular path that you found this way. In fact, it won’t be part of any circular path if its in-degree = 0.

The script for the circularity test is the following:

/*
Script:     Circularity test for email lists
Platform:   SQL Server 2005
Purpose:    This script accepts a table of email
             re-directions and determines if
             circularity exists.
             Redirections are represented by a table called Edge
             whose columns LHS and RHS represent the left and
             right-hand sides of a re-direction.
Output:     List of re-directions containing all circular paths
Comment:    This test will destroy Edge, so a working copy is used
*/-- No counting
SET NOCOUNT OFF
-- Make working copy of Edge
IF EXISTS (SELECT* FROM sys.objects WHERE object_id =OBJECT_ID(N'[dbo].[EdgeTemp]') ANDtype in (N'U'))
      DROP TABLE [dbo].[EdgeTemp]
SELECT LHS, RHS INTO EdgeTemp FROM Edge
CREATE INDEX NDX_EdgeTemp_RHS ON EdgeTemp(RHS)
-- Recursively delete edges whose RHS vertices have out-degree = 0.
-- Note that the RHS of such edges will disappear from Edge.
-- So will the LHS of such edges if they have out-degree = 1.
-- Note that @@ROWCOUNT = 0 at this point.
DELETE     
FROM EdgeTemp
WHERE RHS NOT IN (SELECTDISTINCT LHS FROM EdgeTemp)
-- If @@ROWCOUNT > 0 at this point, then continue
WHILE @@ROWCOUNT > 0
      DELETE
      FROM EdgeTemp
      WHERE RHS NOTIN (SELECTDISTINCT LHS FROM EdgeTemp)
-- Original graph will have no circular paths
-- if and only if there are no remaining edges
-- in its working copy.
-- All edges will have out-degree > 0 at this point.
SELECT * FROM EdgeTemp
-- Drop working table
DROP TABLE EdgeTemp

This test concludes with a list of edges containing all circular paths.

Showing that at most one path exists between any two vertices can be tested in the following way: Copy each edge to a temporary collection. Initially, this collection can be viewed as the LHS and RHS of all paths of length 1. Then match the RHS of each edge of the graph with the LHS of any member in the collection, adding the LHS of the edge and the RHS of the member to the collection. Now we have the LHS and RHS of all paths of length 1 or 2. Repeat this matching (using just the newly-introduced members of length 2) to get the LHS and RHS of all paths of length 1, 2, or 3. Continue this way until no more new edges are available for the next iteration of this recursion. After that, no duplicate records exist in the temporary collection if, and only if, there is at most one path between any two vertices of the original graph.

Now if you’re still awake, notice that the second test assumes that there are no circular paths! Otherwise, you’ll go into an infinite loop. So the first test must succeed before the second test is even performed.

So let’s change the above example by truncating the edge 9 -> 10. Now there are no circular paths and we can perform the second test:

The first iteration produces the LHS and RHS of all paths of length 1 (i.e. all edges):

LHSRHS LHSRHS LHSRHS LHSRHS LHSRHS LHSRHS LHSRHS
112 212 314 45 513 69 76
815 98 107 1011 1013 1016 1116

The next iteration produces the LHS and RHS all paths of length 2:

LHSRHS LHSRHS LHSRHS LHSRHS LHSRHS LHSRHS
413 68 79 915 106 1016

The next iteration produces the LHS and RHS all paths of length 3:

LHSRHS LHSRHS LHSRHS
615 78 109

The next iteration produces the LHS and RHS all paths of length 4:

LHSRHS LHSRHS
715 108

The final iteration produces the LHS and RHS all paths of length 5:

LHSRHS
1015

Note that duplicate LHS, RHS combinations produced in a given iteration must represent different paths (use simple induction on the preceding iteration to prove this). But paths represented by those LHS, RHS combinations in different iterations have different lengths (by construction) so they must be different paths. Therefore, duplicate LHS, RHS combinations always represent different paths between the same pair of vertices. So in our truncated example, there are two paths from 10 to 16 and no other redundancy exists.

We will use SQL Server 2005’s elegant Common Table Expressions to seek duplicates while recursively listing the LHS, RHS combinations. This test will output all pairs of vertices with multiple paths between them, and count how many there are. Optionally, you may use the diagnostics option to list the iterations.

The script for the redundancy test is the following:

/*
Script:     Redundancy test for email lists
Platform:   SQL Server 2005
Purpose:    This script accepts a table of email
             re-directions and determines if
             any recipient received multiple emails.
             Redirections are represented by a table called Edge
             whose columns LHS and RHS represent the left and
             right-hand sides of a re-direction.
Output:     Number of redundant paths between any pair of emails having them             
Note:       Assumes no circular paths with re-directions.
             Otherwise, script will go into an infinite loop.
             Use sister script to check this beforehand.
*/-- No counting
SET NOCOUNT ON
-- Determine redundant paths between any pair of emails, and count them
;WITH Collect AS
(
-- Collect the LHS and RHS of all paths of length L = 1 (i.e. all edges).
-- Use i to record iteration for diagnostics.
SELECT i= 1, LHS, RHS
FROM Edge
UNION ALL  -- Must use ALL because duplicates will always represent different paths
-- Collect the LHS and RHS all paths of length L
-- by joining re-directions with previously collected paths
-- in the collection of length L - 1
SELECT i = i + 1, Edge.LHS, Collect.RHS
FROM Edge INNER JOIN Collect
 ON Edge.RHS = Collect.LHS
)
-- Output result.
-- Set MAXRECURSION to 0 to allow unlimited recursion.
SELECT LHS, RHS,COUNT(*) AS NumPath FROM Collect GROUP BY LHS,RHS HAVING COUNT(*)> 1 OPTION (MAXRECURSION 100)
-- Diagnostics:
-- SELECT i, LHS, RHS FROM Collect ORDER BY CAST(i AS INT), CAST(LHS AS INT), CAST(RHS AS INT) OPTION (MAXRECURSION 100)

Finally, here’s a simple script to produce random graphs for testing:

/*
Script:     Produce random directed graph
Platform:   SQL Server 2005
Purpose:    This script builds a random directed graph called Edge
*/-- No counting
SET NOCOUNT ON
-- Declarations
DECLARE @Vertex1  INT
DECLARE @Vertex2  INT
-- Create Edge
IF EXISTS (SELECT * FROM sys.objects WHERE object_id =OBJECT_ID(N'Edge')AND type in(N'U'))
      DROP TABLE Edge
CREATE TABLE Edge(
      LHS varchar(255),
      RHS varchar(255)
)ON [PRIMARY]
-- Initialise the first vertex
SET @Vertex1 = 0
-- Enumerate all pairs of vertices from 1 to 10,
-- where the second vertex is less than the first,
-- and create a directed edge with 20% probability
-- on each pair chosen (50% either way for those chosen).
-- Note that higher probabilities will create more edges.
WHILE @Vertex1 < 10     -- Number of vertices to build 
      BEGIN
      SET @Vertex1 = @Vertex1 + 1
      SET @Vertex2 = 1
      WHILE @Vertex2 < @Vertex1
            BEGIN
            -- 20% probability of creating directed edge (i.e. 1 out of 5)
            IF CONVERT(INT, 5*RAND())= 1
                 -- 50% probability direction goes this way (i.e. 1 out of 2)
                 IF CONVERT(INT, 2*RAND())= 1
                       INSERT INTO Edge(LHS,RHS)VALUES(@Vertex1, @Vertex2)
                 ELSE
                       -- 50% probability direction goes that way
                       INSERT INTO Edge(LHS,RHS)VALUES(@Vertex2, @Vertex1)                         
            SET @Vertex2 = @Vertex2 + 1
            END
      END
-- Index Edge
CREATE INDEX IDX_Edge_RHS ON Edge(RHS)
-- Display graph
SELECT * FROM Edge

After producing a random graph, run the first test for circularity. If it’s successful (and only if it’s successful), run the second test for redundancy. Otherwise, produce another random graph and start over.

Randomly generated graphs readily produce circular paths, unlike those built manually from real broadcasts, where most emails are directed to external sites (where we know nothing about their re-directions). You can adjust the probabilities to influence how many circular paths should be created. Of course, performance will be an issue with the second test on large non-sparse graphs since the expected number of paths goes up very quickly with the number of edges. Fortunately, real broadcasts are relatively sparse and well-behaved. In practice, however, even one redundancy is one too many since its affect on customers can be quite annoying.

An enhanced version of these scripts is used by a mid-size marketing company to verify the integrity of their broadcasts, which change monthly. Users submit their re-directions to their email coordinator, who verifies the collated lists before sending them to their email administrator. And that saves everybody phone calls and manual debugging.

The above scripts are quite general, and may be applied to any situation where hierarchies exist in the form of directed graphs.

Rate

4.38 (13)

You rated this post out of 5. Change rating

Share

Share

Rate

4.38 (13)

You rated this post out of 5. Change rating