SQLServerCentral Article

Generating n-Tuples with SQL

,

I ran across an interesting problem the other day that required this and I thought I’d post on this topic of n-Tuples as a result. 

First what is an n-Tuple?  Suppose you have a list of unique character strings like ‘1’ ‘2’ ‘3’ and these are stored as rows in a table. Tuples are simply combinations of these values, so 1-Tuples would be:

1

2

3

 2-Tuples would be:

1,2

1,3

2,1

2,3

3,1

3,2

And of course there are these 3-Tuples:

1,2,3

1,3,2

2,1,3

2,3,1

3,1,2

3,2,1

Let’s now write a SQL query that generates Tuples for all n where we have unique strings we want to combine:

DECLARE @t TABLE (strcol CHAR(3))
INSERT INTO @t (strcol)
SELECT '001' UNION ALL SELECT '002' UNION ALL SELECT '003'
--UNION ALL SELECT '004' UNION ALL SELECT '005' UNION ALL SELECT '006'
--UNION ALL SELECT '007'  UNION ALL SELECT '008'
--UNION ALL SELECT '009' UNION ALL SELECT '010'
;WITH nTuples (n, Tuples) AS (
      SELECT DISTINCT 1, CAST(strcol AS VARCHAR(max))
      FROM @t
      UNION ALL
      SELECT 1 + n.n, t.strcol + ',' + n.Tuples
      FROM @t t JOIN nTuples n ON t.strcol <> n.Tuples
      WHERE CHARINDEX(t.strcol, n.Tuples) = 0
)
SELECT *
 FROM nTuples
 ORDER BY n, Tuples

This recursive CTE returns the following results set when run for n=3:

n             Tuples

1              001

1              002

1              003

2              001,002

2              001,003

2              002,001

2              002,003

2              003,001

2              003,002

3              001,002,003

3              001,003,002

3              002,001,003

3              002,003,001

3              003,001,002

3              003,002,001

Interestingly enough, that wasn’t really the result I needed but it was pretty close.  What I really needed was what I call “unique, ordered” n-Tuples.  In other words, if I list out the 2-Tuples from above, I can show you only the ones I need:

1,2   [Want]

1,3   [Want]

2,1   [Don’t Want]

2,3   [Want]

3,1   [Don’t Want]

3,2   [Don’t Want]

This means that 1,2 is equivalent to 2,1 but I want only want the first one where the data is ordered across the Tuple.  Likewise, there will then be only 1 3-Tuple that I’m interested in: 1,2,3

This actually ends up being a pretty simple change to the above CTE.  The not equals (<>) has been changed to less than (<).

;WITH UNIQUEnTuples (n, Tuples) AS (
      SELECT DISTINCT 1, CAST(strcol AS VARCHAR(max))
      FROM @t
      UNION ALL
      SELECT 1 + n.n, t.strcol + ',' + n.Tuples
      FROM @t t JOIN UNIQUEnTuples n ON t.strcol < n.Tuples
      WHERE CHARINDEX(t.strcol, n.Tuples) = 0
)
SELECT *
 FROM UNIQUEnTuples
 ORDER BY n, Tuples

The results from running this SQL with n=3 are:

n             Tuples

1              001

1              002

1              003

2              001,002

2              001,003

2              002,003

3              001,002,003

Thinking about it, I realized that the row sets generated by these bad boys are going to grow quite quickly!  After a little bit of playing around and without resorting to looking up the combinatorial statistics, I was able to deduce that the number of records in the respective row sets are generated by these formulas:

n-Tuples: Sum over n + n*(n-1) + n*(n-1)*(n-2) + … + n*(n-1)*…*(n-(n-1))

Unique, ordered, n-Tuples: 2^n-1

For the n-Tuples formula, you can verify in a spreadsheet that the last 2 factors in the series for n>=2 are n! and that, my intrepid reader, becomes a really big number very quickly!

Of course, being constantly performance-minded, I couldn’t help but wonder how well SQL would handle these huge row sets, so I set about to running my SQL across incrementing n’s.

For n-Tuples, I got these results (CPU and elapsed time are in milliseconds):

n  #n-Tuples CPU    Elapsed

1  1         0      1

2  4         0      1

3  15        0      1

4  64        0      2

5  325       15     83

6  1956      47     202

7  13699     468    831

8  109600    3682   7033

9  986409    41809  116940

10 9864100   485054 1410456

At n=10, I decided to stop because I’d be too impatient to run for n=11!

But you can take the unique, ordered, n-Tuples much further as the progression of increase in number of the rows in the row sets is much slower.

n  #n-Tuples CPU    Elapsed

1  1         0      0

2  3         0      0

3  7         0      0

4  15        0      0

5  31        0      2

6  63        15     2

7  127       0      4

8  255       15     62

9  511       0      34

10 1023      16     70

11 2047      63     264

12 4095      109    218

13 8191      234    392

14 16383     453    799

15 32767     1030   1717

16 65535     2230   3711

17 131071    4555   8535

18 262143    10452  27458

19 524287    21871  58594

20 1048575   45802  135406

21 2097151   --     --

22 4194303   --     --

23 8388607   --     --

Once again, I’ll be too impatient to wait for n=21 to complete!

There are some caveats to these functions that you should be aware of:

  1. If there are any duplicates in your source table, these will be eliminated in the results (by design).
  2. You need to take care if your data points look like A, AA, AAA.  Note how the source table has defined this column as CHAR(3).  This is also by design.  The CHARINDEX statement would remove the n-Tuples (where n>1) of this particular set if you used VARCHAR(3) instead.  So if you’re working with integers you want n-Tuples of, your best course of action is to left pad them with 0’s or right pad them with blanks.

We hope you enjoyed this quick excursion into the world of recursive CTEs and n-Tuples.  We know we most certainly did.  Perhaps even one day it will be as useful for you as it was for me.

Now if I can only figure out how these pesky recursive CTEs really work, then I could post something really interesting!

Rate

4.6 (20)

You rated this post out of 5. Change rating

Share

Share

Rate

4.6 (20)

You rated this post out of 5. Change rating