Gaps and Islands in SQL Server data

The fun of exploring problems such as Gaps and Islands is all the greater when we have a test harness to try alternative solutions.

The word ‘Gaps’ in the title refers to gaps in sequences of values. Islands are unbroken sequences delimited by gaps. The ‘Gaps and Islands’ problem is that of using SQL to rapidly detect the unbroken sequences, and the extent of the gaps between them in a column.

Islands and gaps appear in all sorts of sequences, be they IDENTITY columns where some rows have been removed or dates that occur in sequence (but some are missing).  In all cases, the sequences do not contain duplicates. The ‘Gaps and Islands’ problem isn’t entirely an academic game, since a number of business processes demand some way of detecting gaps and islands in sequences. A typical example might occur in the express distribution business where a consignment has many packages numbered sequentially.  Typically you scan all packages when a consignment reaches the depot.  If packages are missing those represent the gaps.  So if you want to represent the event “consignment arrived at the depot” and list the package numbers that did arrive, you’d want to group them like 1-10, 15-18, etc. where those are the islands.

It is a complex and interesting task to come up with a solution that performs and scales well.  Chapter 5 of SQL Server MVP Deep Dives, Gaps and Islands by Itzik Ben-Gan, is probably the best and most thorough explanation of the main solutions.  My interest was piqued by that chapter, and I played about with the examples to try to gain a little more insight.  While doing so, I happened upon a rather different solution to the problem, and was sufficiently intrigued to develop a test harness to evaluate how the various algorithms performed and scaled.

Let’s dip our toe in the pond by looking at some sample data.
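A minimal sketch of how such sample data might be set up is shown here. The temporary table name #GapsIslands and the primary key on (ID, SeqNo) are assumptions made for illustration; the test harness scripts may name and index things differently.

```sql
-- Hypothetical sample table; the name #GapsIslands is an assumption.
CREATE TABLE #GapsIslands
(
    ID    INT NOT NULL,
    SeqNo INT NOT NULL,
    PRIMARY KEY (ID, SeqNo)   -- keeps SeqNo unique within each ID
);

-- Twelve sequence numbers for a single ID, with deliberate holes.
INSERT INTO #GapsIslands (ID, SeqNo)
VALUES (1, 1), (1, 2), (1, 5), (1, 6), (1, 8), (1, 9),
       (1, 10), (1, 12), (1, 20), (1, 21), (1, 25), (1, 26);

SELECT ID, SeqNo
FROM #GapsIslands
ORDER BY ID, SeqNo;
```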

We’ve thrown in an ID column because you may need to retrieve islands and/or gaps across more than one record grouping.  This value is not used in the basic examples, but it will be used in the later performance test harnesses.  This produces a table of sample data (on the left), and I’ve added the results we should expect for islands (in the middle) and gaps (on the right):

Sample data    Islands (expected results)   Gaps (expected results)
ID  SeqNo      ID  StartSeqNo  EndSeqNo     ID  StartSeqNo  EndSeqNo
1   1          1   1           2            1   3           4
1   2          1   5           6            1   7           7
1   5          1   8           10           1   11          11
1   6          1   12          12           1   13          19
1   8          1   20          21           1   22          24
1   9          1   25          26
1   10
1   12
1   20
1   21
1   25
1   26

In our example, islands are contiguous groups of integers.  In a way, gaps are the inverse of islands: each gap is the range of missing values that lies between two neighboring islands.

It is interesting to note that there will always be one fewer row of gaps than islands, because every gap sits between two consecutive islands.

Some Islands Solutions

In the SQL MVP Deep Dives book, the author provides four solutions to the islands problem.  From his timing results, two of these appear to be much better than the others, judged on elapsed time alone.  Let’s take a look at these two.
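As a hedged sketch, here are two well-known ways of finding islands, written against a table variable so the batch is self-contained. They are not copied from the book: the first pairs island start points with end points found by NOT EXISTS subqueries, and the second groups on the difference between SeqNo and a row number. The names @GapsIslands, StartingPoints, EndingPoints and rn are assumptions for illustration.

```sql
-- Self-contained sample data (hypothetical table variable).
DECLARE @GapsIslands TABLE
    (ID INT NOT NULL, SeqNo INT NOT NULL, PRIMARY KEY (ID, SeqNo));

INSERT INTO @GapsIslands (ID, SeqNo)
VALUES (1, 1), (1, 2), (1, 5), (1, 6), (1, 8), (1, 9),
       (1, 10), (1, 12), (1, 20), (1, 21), (1, 25), (1, 26);

-- Approach A: a start point has no predecessor, an end point has no
-- successor; the k-th start pairs with the k-th end.
WITH StartingPoints AS
(
    SELECT A.ID, A.SeqNo,
           rn = ROW_NUMBER() OVER (ORDER BY A.ID, A.SeqNo)
    FROM @GapsIslands AS A
    WHERE NOT EXISTS (SELECT 1 FROM @GapsIslands AS B
                      WHERE B.ID = A.ID AND B.SeqNo = A.SeqNo - 1)
),
EndingPoints AS
(
    SELECT A.ID, A.SeqNo,
           rn = ROW_NUMBER() OVER (ORDER BY A.ID, A.SeqNo)
    FROM @GapsIslands AS A
    WHERE NOT EXISTS (SELECT 1 FROM @GapsIslands AS B
                      WHERE B.ID = A.ID AND B.SeqNo = A.SeqNo + 1)
)
SELECT S.ID, StartSeqNo = S.SeqNo, EndSeqNo = E.SeqNo
FROM StartingPoints AS S
JOIN EndingPoints AS E ON E.rn = S.rn
ORDER BY S.ID, S.SeqNo;

-- Approach B: within an island, SeqNo minus its row number is constant,
-- so that difference can serve as a grouping key.
-- This relies on SeqNo being unique within each ID.
SELECT ID, StartSeqNo = MIN(SeqNo), EndSeqNo = MAX(SeqNo)
FROM (SELECT ID, SeqNo,
             rn = SeqNo - ROW_NUMBER() OVER (PARTITION BY ID ORDER BY SeqNo)
      FROM @GapsIslands) AS a
GROUP BY ID, rn
ORDER BY ID, StartSeqNo;
```

Both queries should return the six islands listed in the expected results: 1-2, 5-6, 8-10, 12-12, 20-21 and 25-26.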

We have modified the code provided to support the ID column, which may reflect different user IDs, machine IDs, or whatever for your particular case.  When you run this code, you’ll find that the results precisely match the expected results for islands shown in the table above.  It is important to note that in at least the second case, the sequence numbers must be unique within an ID in order for the solution to work.

The following performance was achieved by these two solutions across 1,000,000 rows, with the index being essential to achieving these results.  This basically confirms the findings in the book: solution #3 is faster.

Islands Test Harness – 1,000,000 Rows

Solution      CPU (ms)   Elapsed Time (ms)   Logical IOs
Islands #1    1545       1611                10568
Islands #3    733        807                 2642

In the resources file, you may run the script: Islands Test Harness #1.sql to confirm these results for yourself and see how they may vary based on your machine’s configuration.
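If you would like to collect comparable figures for your own queries, one option is SQL Server’s SET STATISTICS TIME and SET STATISTICS IO settings, which report CPU time, elapsed time and logical reads in the Messages output. The sketch below is an assumption about how one might measure a query; we are not claiming the harness scripts collect their numbers this way, and the table name #GapsIslands is hypothetical.

```sql
-- Hypothetical temp table with a handful of rows; a real test would
-- load a much larger data set before measuring.
CREATE TABLE #GapsIslands
    (ID INT NOT NULL, SeqNo INT NOT NULL, PRIMARY KEY (ID, SeqNo));

INSERT INTO #GapsIslands (ID, SeqNo)
VALUES (1, 1), (1, 2), (1, 5), (1, 6), (1, 8), (1, 9),
       (1, 10), (1, 12), (1, 20), (1, 21), (1, 25), (1, 26);

SET STATISTICS IO ON;     -- reports logical reads per table
SET STATISTICS TIME ON;   -- reports CPU time and elapsed time in ms

-- The query being measured (row-number difference islands technique).
SELECT ID, StartSeqNo = MIN(SeqNo), EndSeqNo = MAX(SeqNo)
FROM (SELECT ID, SeqNo,
             rn = SeqNo - ROW_NUMBER() OVER (PARTITION BY ID ORDER BY SeqNo)
      FROM #GapsIslands) AS a
GROUP BY ID, rn;

SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;

DROP TABLE #GapsIslands;
```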

The Traditional Gaps Solutions

The gaps problem, in my mind at least, is intrinsically a bit more challenging than islands because, in effect, you need to “make up” the data points that are the endpoints of each gap.  By definition, the endpoints of a gap do not already exist in the data.

As I thumb through my extensively dog-eared copy of the SQL MVP Deep Dives book, I find there are four solutions proposed for gaps.  Of these, three have elapsed times within roughly the same order of magnitude.
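To make the idea of “making up” the endpoints concrete, here is a sketch of one traditional subquery-based way of finding gaps. It is an illustration under our own assumptions (the table variable @GapsIslands is hypothetical) rather than a copy of any of the book’s solutions: a gap starts just after any value that has no successor, and ends just before the next value that does exist.

```sql
-- Self-contained sample data (hypothetical table variable).
DECLARE @GapsIslands TABLE
    (ID INT NOT NULL, SeqNo INT NOT NULL, PRIMARY KEY (ID, SeqNo));

INSERT INTO @GapsIslands (ID, SeqNo)
VALUES (1, 1), (1, 2), (1, 5), (1, 6), (1, 8), (1, 9),
       (1, 10), (1, 12), (1, 20), (1, 21), (1, 25), (1, 26);

SELECT A.ID,
       StartSeqNo = A.SeqNo + 1,               -- first missing value
       EndSeqNo   = (SELECT MIN(B.SeqNo)       -- next existing value...
                     FROM @GapsIslands AS B
                     WHERE B.ID = A.ID
                       AND B.SeqNo > A.SeqNo) - 1   -- ...minus one
FROM @GapsIslands AS A
WHERE NOT EXISTS (SELECT 1                     -- A.SeqNo has no successor
                  FROM @GapsIslands AS B
                  WHERE B.ID = A.ID AND B.SeqNo = A.SeqNo + 1)
  AND A.SeqNo < (SELECT MAX(B.SeqNo)           -- ignore the final value
                 FROM @GapsIslands AS B
                 WHERE B.ID = A.ID)
ORDER BY A.ID, StartSeqNo;
```

For the sample data this should return the five gaps 3-4, 7-7, 11-11, 13-19 and 22-24.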

Once again a check of the results confirms that they are identical to those shown in the preceding table for gaps.  Test harness Gaps Test Harness #1.sql confirms that solution #1 from SQL MVP Deep Dives is the fastest.

Gaps Test Harness – 1,000,000 Rows

Solution    CPU (ms)   Elapsed Time (ms)   Logical IOs
Gaps #1     903        810                 23012
Gaps #2     5056       1522                3192926
Gaps #3     1591       1593                5278

An Alternative Gaps Solution

Since we have a relatively fast method of determining islands, perhaps it’s possible to convert that solution into one that will identify our gaps.  Let’s take a look at the islands results to see how they might be converted into gaps.

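[Figure: the islands results with two added columns, StartSeqNo - 1 and EndSeqNo + 1, and the gap endpoints circled]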

If we first insert two columns and fill the first with one less than the StartSeqNo and the second with one greater than the EndSeqNo, we can circle the numbers that represent the gaps.  Each pair of colored, circled numbers in the rightmost table perfectly represents our gaps.

This suggests to us that UNPIVOTing the islands might give us something to work with to arrive at the gaps.  There are at least three methods to un-pivot two columns: 1) using the SQL UNPIVOT keyword, 2) doing a UNION ALL between each of the columns and 3) using the CROSS APPLY VALUES (CAV) approach to UNPIVOT.  We’ll choose the last of these because, as the article An Alternative (Better?) Method to UNPIVOT seems to demonstrate, that approach may be the swiftest.  We’ll also move the fastest islands bit of code magic into a Common Table Expression (CTE) so we can focus on our gaps solution.
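A sketch of this intermediate step might look like the following. The islands logic sits in a CTE, CROSS APPLY VALUES turns each island row into two rows (StartSeqNo - 1 and EndSeqNo + 1), and two helper columns are added. The names @GapsIslands, Islands, n and m are our own assumptions for illustration.

```sql
-- Self-contained sample data (hypothetical table variable).
DECLARE @GapsIslands TABLE
    (ID INT NOT NULL, SeqNo INT NOT NULL, PRIMARY KEY (ID, SeqNo));

INSERT INTO @GapsIslands (ID, SeqNo)
VALUES (1, 1), (1, 2), (1, 5), (1, 6), (1, 8), (1, 9),
       (1, 10), (1, 12), (1, 20), (1, 21), (1, 25), (1, 26);

WITH Islands AS
(
    -- Row-number difference technique for islands.
    SELECT ID, StartSeqNo = MIN(SeqNo), EndSeqNo = MAX(SeqNo)
    FROM (SELECT ID, SeqNo,
                 rn = SeqNo - ROW_NUMBER() OVER (PARTITION BY ID ORDER BY SeqNo)
          FROM @GapsIslands) AS a
    GROUP BY ID, rn
)
SELECT ID, SeqNo, n,
       m = n / 2              -- integer division pairs up neighboring rows
FROM (SELECT i.ID, v.SeqNo,
             n = ROW_NUMBER() OVER (PARTITION BY i.ID ORDER BY v.SeqNo)
      FROM Islands AS i
      -- Un-pivot: one row per island becomes two rows.
      CROSS APPLY (VALUES (i.StartSeqNo - 1), (i.EndSeqNo + 1)) AS v (SeqNo)
     ) AS u
ORDER BY ID, n;
```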

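We’ve thrown in a couple of row numbers (a plain row number and its integer half, m) to see if they might help us with the grouping we need.  For the sample data, the un-pivoted SeqNo values in order are 0, 3, 4, 7, 7, 11, 11, 13, 19, 22, 24 and 27, and the corresponding values of m are 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5 and 6.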

We certainly have the numbers in SeqNo that represent the endpoints of our gap and we even have a grouping column (m) that will allow us to group our gap start/end points, which we arrived at by the totally simplistic method of dividing by two.  Now all we have to do is figure out a way to remove the first and last row from the result set!

Since all the gap endpoints of interest are in groups of two, this brings to mind the HAVING clause.  Let’s give that a try.
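Here is a sketch of how the HAVING clause can finish the job, again with names (@GapsIslands, Islands, m) that are our own assumptions. Grouping on m and keeping only the groups with exactly two rows discards the unpaired first and last rows for each ID.

```sql
-- Self-contained sample data (hypothetical table variable).
DECLARE @GapsIslands TABLE
    (ID INT NOT NULL, SeqNo INT NOT NULL, PRIMARY KEY (ID, SeqNo));

INSERT INTO @GapsIslands (ID, SeqNo)
VALUES (1, 1), (1, 2), (1, 5), (1, 6), (1, 8), (1, 9),
       (1, 10), (1, 12), (1, 20), (1, 21), (1, 25), (1, 26);

WITH Islands AS
(
    SELECT ID, StartSeqNo = MIN(SeqNo), EndSeqNo = MAX(SeqNo)
    FROM (SELECT ID, SeqNo,
                 rn = SeqNo - ROW_NUMBER() OVER (PARTITION BY ID ORDER BY SeqNo)
          FROM @GapsIslands) AS a
    GROUP BY ID, rn
)
SELECT ID, StartSeqNo = MIN(SeqNo), EndSeqNo = MAX(SeqNo)
FROM (SELECT i.ID, v.SeqNo,
             m = ROW_NUMBER() OVER (PARTITION BY i.ID ORDER BY v.SeqNo) / 2
      FROM Islands AS i
      CROSS APPLY (VALUES (i.StartSeqNo - 1), (i.EndSeqNo + 1)) AS v (SeqNo)
     ) AS u
GROUP BY ID, m
HAVING COUNT(*) = 2          -- drops the lone first and last rows per ID
ORDER BY ID, StartSeqNo;
```

For the sample data the surviving groups are 3-4, 7-7, 11-11, 13-19 and 22-24.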

Inspection of the results set clearly shows we’ve achieved our expected results – the gaps!

Performance Comparison of Gaps Solutions

Performance always counts so let’s see how well this solution performs in a sufficiently large test harness versus the solutions proposed in the MVP Deep Dives book. 

In the two test harness scripts we mentioned previously, the code to generate a large test data set looks like this:
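The harness code itself is in the downloadable scripts. Purely as an illustration of the general idea, and not a reproduction of those scripts, a large data set with regular gaps could be generated along these lines. The table name #GapsIslands, the variable @Rows, the 10,000-value ID buckets and the rule of dropping every fifth number are all assumptions.

```sql
DECLARE @Rows INT = 1000000;   -- number of candidate sequence numbers

-- Hypothetical target table.
CREATE TABLE #GapsIslands
(
    ID    INT NOT NULL,
    SeqNo INT NOT NULL,
    PRIMARY KEY (ID, SeqNo)
);

WITH Tally (n) AS
(
    -- Generate 1..@Rows by numbering a large cross join of a system view.
    SELECT TOP (@Rows) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
    FROM sys.all_columns AS a
    CROSS JOIN sys.all_columns AS b
)
INSERT INTO #GapsIslands (ID, SeqNo)
SELECT ID    = 1 + (n - 1) / 10000,   -- a new ID every 10,000 numbers
       SeqNo = n
FROM Tally
WHERE n % 5 <> 0;                     -- omit every fifth number to create gaps

SELECT RowsLoaded = COUNT(*) FROM #GapsIslands;

DROP TABLE #GapsIslands;
```

Note that this particular rule produces many short islands of four values each; a different pattern of gaps may well change the relative timings of the solutions.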

While reviewing this article, Peter Larsson (alias PESO) suggested an alternative using CROSS APPLY VALUES that may improve the speed.  So let’s take a look at that method also.

Using the gaps solutions identified previously, a quick check of the results at 1,000,000 rows delivers these results (run Gaps Test Harness #2.sql).

Gaps Test Harness – 1,000,000 Rows

Solution                               CPU (ms)   Elapsed Time (ms)   Logical IOs
SQL MVP Deep Dives Gaps #1             967        827                 22997
SQL MVP Deep Dives Gaps #2             5258       1653                3192923
SQL MVP Deep Dives Gaps #3             1840       1834                5278
CROSS APPLY VALUES – Islands to Gaps   718        933                 2639
CAV – Islands to Gaps (by PESO)        640        711                 2639

Elapsed times for the new CAV (Cross Apply Values) methods appear quite competitive, and both logical IOs and CPU time appear improved.  In a moment we’ll see if these results can be reproduced and how they scale.

An Alternative Islands Solution

If it is possible to convert Islands to Gaps, perhaps it is also possible to convert Gaps to Islands.  Let’s consider a data transformation as follows:

[Figure: the data transformation from gaps to islands]

For this solution, we’ll take the end point and add one, while we subtract one from the start point.  We need to add a row, and that row will be the first and last sequence number in our series.  Finally, we’ll un-pivot and add a row number that groups each resulting sequence number in pairs (note that you could either add 1 to or subtract 1 from the row number before dividing by 2).

So now, skipping the intermediate explanatory step as it is quite similar to what we did before, we can arrive at the following code to calculate islands from gaps:
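The following is a sketch of that approach under our own assumptions, not necessarily the exact code from the test harness. A Gaps CTE uses a subquery-based gaps technique, a MinMax CTE supplies the first and last sequence number for each ID, and the un-pivoted values are then paired up by row number. The table variable @GapsIslands and the aliases are hypothetical.

```sql
-- Self-contained sample data (hypothetical table variable).
DECLARE @GapsIslands TABLE
    (ID INT NOT NULL, SeqNo INT NOT NULL, PRIMARY KEY (ID, SeqNo));

INSERT INTO @GapsIslands (ID, SeqNo)
VALUES (1, 1), (1, 2), (1, 5), (1, 6), (1, 8), (1, 9),
       (1, 10), (1, 12), (1, 20), (1, 21), (1, 25), (1, 26);

WITH Gaps AS
(
    SELECT A.ID,
           StartSeqNo = A.SeqNo + 1,
           EndSeqNo   = (SELECT MIN(B.SeqNo) FROM @GapsIslands AS B
                         WHERE B.ID = A.ID AND B.SeqNo > A.SeqNo) - 1
    FROM @GapsIslands AS A
    WHERE NOT EXISTS (SELECT 1 FROM @GapsIslands AS B
                      WHERE B.ID = A.ID AND B.SeqNo = A.SeqNo + 1)
      AND A.SeqNo < (SELECT MAX(B.SeqNo) FROM @GapsIslands AS B
                     WHERE B.ID = A.ID)
),
MinMax AS
(
    -- The extra row: first and last sequence number per ID.
    SELECT ID, MinSeqNo = MIN(SeqNo), MaxSeqNo = MAX(SeqNo)
    FROM @GapsIslands
    GROUP BY ID
)
SELECT ID, StartSeqNo = MIN(SeqNo), EndSeqNo = MAX(SeqNo)
FROM (SELECT ID, SeqNo,
             -- Subtract 1 before halving so rows pair up as (1,2), (3,4), ...
             m = (ROW_NUMBER() OVER (PARTITION BY ID ORDER BY SeqNo) - 1) / 2
      FROM (SELECT g.ID, v.SeqNo
            FROM Gaps AS g
            CROSS APPLY (VALUES (g.StartSeqNo - 1), (g.EndSeqNo + 1)) AS v (SeqNo)
            UNION ALL
            SELECT x.ID, v.SeqNo
            FROM MinMax AS x
            CROSS APPLY (VALUES (x.MinSeqNo), (x.MaxSeqNo)) AS v (SeqNo)
           ) AS u
     ) AS p
GROUP BY ID, m
ORDER BY ID, StartSeqNo;
```

For the sample data this should return the same six islands as before: 1-2, 5-6, 8-10, 12-12, 20-21 and 25-26.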

The Gaps CTE above is the SQL MVP Deep Dives fastest (elapsed time) approach to gaps and the MinMax CTE is used to compute the additional row we’ll need to go from gaps to islands.

So how do these results compare to our fastest islands solutions at 1,000,000 rows?

Islands Test Harness – 1,000,000 Rows

Solution                               CPU (ms)   Elapsed Time (ms)   Logical IOs
SQL MVP Deep Dives Islands #1          1217       1256                10568
SQL MVP Deep Dives Islands #3          593        727                 2641
CROSS APPLY VALUES – Gaps to Islands   1358       890                 25690

The SQL MVP Deep Dives Islands solution #3 still wins the elapsed time race but our approach is a respectable showing, being solidly in second place in terms of elapsed time.  It appears that parallelism accounts for this positioning, as CPU for the CAV – Gaps to Islands solution exceeds the elapsed time.  These results may be reproduced by running Islands Test Harness #2.sql.

Scalability

Using each #2 test harness, we’ll generate test data from 200,000 to 2,000,000 rows and chart the results for comparison.  You’ll see that the final results appear to be a bit mixed.

[Figures: scalability charts for the gaps solutions from 200,000 to 2,000,000 rows, covering CPU, elapsed time and logical IOs]

The CAV (Cross Apply Values) solutions for islands to gaps rank #1 for CPU across the board, although it is a slim lead and the convergence at 2M rows suggests that they may start lagging behind above that.  For elapsed time, the CAV solutions appear to be basically tied with SQL MVP Deep Dives solution #1 (MVP DD S#1) up to about 1.2M rows, but they fall to second place above that row count.  In logical IOs they are the clear winner across the board (SQL MVP Deep Dives solution #2 was omitted because it would distort the scale of the chart).  MVP DD S#3 does get good marks for logical IOs.

[Figures: scalability charts for the islands solutions from 200,000 to 2,000,000 rows, covering CPU, elapsed time and logical IOs]

The CAV (Cross Apply Values) solution for gaps to islands achieves a #2 ranking for CPU above 1M rows but lags slightly behind below that.  For elapsed time, above 1.8M rows it seems to achieve the #1 ranking of the three solutions (below 1.8M it is #2), but our scaling tests don’t go above 2M rows so we’re not sure how long it would maintain that lead.  On logical IOs, however, it is a clear #3 (and would probably remain so) regardless of the number of rows in the test harness.

It also appears that the CAV gaps to islands solution lost the parallelism that produced its highly favorable elapsed time in the earlier test, possibly because SQL Server had cached a non-parallelized execution plan at the lower row counts.

Ultimately the CAV (Cross Apply Values) approaches may offer benefits under some conditions but your choice should be made based on a careful analysis of which solution works best for your constraints of data rows, parallelism and the specific resource that you most wish to conserve (CPU, elapsed time or logical IOs).


A Google search for anything similar to this approach for calculating islands and gaps came up empty, but we’d love to hear if anyone has published anything on it before or has other improvements to suggest.  And we’d most certainly be thrilled to hear from our valued readers if this solution has been of benefit to them.  Perhaps in SQL Server 2012 someone can come up with something faster utilizing a window frame.  We look forward to hearing about that too.
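As one possible starting point for such experiments, here is a sketch that finds gaps with the LEAD window function, which was introduced in SQL Server 2012. It has not been run through the test harness, so we make no claim about how it compares with the solutions above; the table variable @GapsIslands and the column NextSeqNo are assumptions for illustration.

```sql
-- Requires SQL Server 2012 or later (LEAD is not available in 2008 R2).
DECLARE @GapsIslands TABLE
    (ID INT NOT NULL, SeqNo INT NOT NULL, PRIMARY KEY (ID, SeqNo));

INSERT INTO @GapsIslands (ID, SeqNo)
VALUES (1, 1), (1, 2), (1, 5), (1, 6), (1, 8), (1, 9),
       (1, 10), (1, 12), (1, 20), (1, 21), (1, 25), (1, 26);

SELECT ID,
       StartSeqNo = SeqNo + 1,        -- first missing value after this row
       EndSeqNo   = NextSeqNo - 1     -- last missing value before the next row
FROM (SELECT ID, SeqNo,
             NextSeqNo = LEAD(SeqNo) OVER (PARTITION BY ID ORDER BY SeqNo)
      FROM @GapsIslands) AS a
WHERE NextSeqNo - SeqNo > 1           -- NULL for the last row, so it is excluded
ORDER BY ID, StartSeqNo;
```

For the sample data this should return the gaps 3-4, 7-7, 11-11, 13-19 and 22-24.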

 Timing tests run for this article were done in SQL Server 2008 R2 on a Dell Inspiron Core i5 2.4 GHz with 8GB of memory. 

 Until next time

The Excel file with the results of the tests, and the graphs, is attached, as are the four SQL source files for the tests. Links to them can be found at the head of this article.
