Efficient setbased solution for islands and gaps
Introduction
Alexander Kozak described an efficient rowbased solution for the problem of identifying islands and gaps in sequential numbers in his article, Islands and Gaps in Sequential Numbers. The setbased solutions described there were not very efficient, especially when the input data was too fragmented. This article provides the logical next step, since it describes an efficient setbased solution; one that performs equally well for small and large fragmentation of the input data.
For easier reading and testing, here is the table structure and a small (but illustrative) set of test data, copied from the Kozak's article.
CREATE TABLE gaps(gapID int NOT NULL PRIMARY KEY)
INSERT INTO gaps values(1)
INSERT INTO gaps values(2)
INSERT INTO gaps values(3)
INSERT INTO gaps values(4)
INSERT INTO gaps values(6)
INSERT INTO gaps values(7)
INSERT INTO gaps values(8)
INSERT INTO gaps values(10)
INSERT INTO gaps values(14)
INSERT INTO gaps values(15)
INSERT INTO gaps values(16)
INSERT INTO gaps values(17)
INSERT INTO gaps values(38)
Listing 1: The original test table structure and data
For any further details on the solutions proposed in the previous article, please refer to the link above. The solution proposed here requires at least SQL Server 2000 (so much of ANSI portability :)), though I think an ORACLE equivalent can be easily implemented.
The Pitfall of SetBased Solutions and How To Eliminate It
The proposed setbased solutions used two intermediate rowsets, one for the lower and one for the upper bound of islands or gaps and there is nothing wrong there. To obtain the final result set, these two rowsets are JOINed. Alexander Kozak himself detected that setbased solutions had one major pitfall: for large fragmentation of input data, setbased solutions performed poorly. Due to the large number of records in each rowset, the JOIN took too much time.
So, how can you optimize those JOINs ? The execution plan of all three setbased solutions shows several things:
 A MERGE JOIN is used to obtain each intermediate rowset, and
 NESTED LOOPS are used to JOIN two rowsets into the final result.
The MERGE JOIN is said to be a more efficient way for JOINing rowsets that are already ordered. So the two intermediate rowsets are obtained in the most efficient way, but the matching between each island/gap's lower bound to its corresponding upper bound is performed less efficiently. The solution of this part of the puzzle leads to the solution of the problem.
The trick used here originates from the logical order of islands/gaps bounds. Lets take a look at the desired result. For islands, it should be:
island_start island_end
1 4
6 8
10 10
14 17
38 38
For gaps, it should be:
gap_start gap_end
5 5
9 9
11 13
18 37
When two columns in any of the result sets are considered separately (as an intermediate rowset), the following is noticed:
when they are ORDERed, the first row in the lower bound rowset corresponds to the first row in the upper bound rowset, the second row in the lower bound rowset corresponds to the second row in the upper bound rowset, and so forth. My opinion is that matching island/gap's bounds with correlated subqueries using <= operator was the major reason for inefficiency of the setbased solutions described by Kozak.
Now, let's consider how to number a rowset. MS SQL Server doesn't give too many choices  the IDENTITY function/property is usually used for that purpose. It can be used as a column property or as a derived column in SELECT ... INTO statement. But, since the solution should be setbased, it should return a result set that can be easily SELECTed or used otherwise in SQL statements. That is why a stored procedure with a temporary tables was not a choice. The other alternative is to use the IDENTITY as a column attribute. So our table should have one IDENTITY column that will number the rows of intermediate rowsets. Due to the desired setbased nature, I decided to use a multistatement table valued function. Each of the four intermediate rowsets (lower and upper bounds for islands and gaps) should be returned by UDF of this kind. The result table of the UDF should contain one IDENTITY column (for numbering) and one bound column (for the lower or upper bound).
Finding Islands
The final result is a JOIN of two rowsets, returned by two UDFs. Here is the code of the function that retrieves the islands lower bounds:
CREATE FUNCTION dbo.udfFindIslandStarts()
RETURNS @res_tbl TABLE
(row_num int identity(1, 1) NOT NULL PRIMARY KEY,
island_start int NOT NULL)
AS
BEGIN
INSERT INTO @res_tbl (island_start)
SELECT gapID FROM gaps AS g1
WHERE NOT EXISTS (SELECT gapID FROM gaps AS g2
WHERE (g2.gapID = g1.gapID  1))
ORDER BY gapID
OPTION (MERGE JOIN)
RETURN
END
GO
Listing 2: The UDF for finding island lower bounds
The ORDER BY ensures the proper ordering of bounds in all four UDFs, so the numbering can be used. The MERGE JOIN hint is added to ensure that the execution will use that method to make the JOIN. The result set of
SELECT * FROM dbo.udfFindIslandStarts()
looks like this:
row_num island_start
 
1 1
2 6
3 10
4 14
5 38
which means that the first island starts at number 1, the second starts at number 6 etc.
Similarly, the code that finds the island upper bounds is:
CREATE FUNCTION dbo.udfFindIslandEnds()
RETURNS @res_tbl TABLE (
row_num int identity(1, 1) NOT NULL PRIMARY KEY,
island_end int not null)
AS
BEGIN
INSERT INTO @res_tbl (island_end)
SELECT gapID FROM gaps AS g1
WHERE NOT EXISTS (SELECT gapID FROM gaps AS g2
WHERE (g2.gapID = g1.gapID + 1))
ORDER BY gapID
OPTION (MERGE JOIN)
RETURN
END
GO
Listing 3: The UDF for finding island upper bounds
The result set of
SELECT * FROM dbo.udfFindIslandEnds()
looks like this:
row_num island_end
 
1 4
2 8
3 10
4 17
5 38
which means that the first island ends at 4, the second one ends at 8 etc.
And finally, the SQL statement that retrieves the final result:
SELECT
t1.gap_start, t2.gap_end
FROM
dbo.udfFindGapStarts() AS t1
INNER JOIN dbo.udfFindGapEnds() AS t2
ON (t2.row_num = t1.row_num)
OPTION
(MERGE JOIN)
Listing 4: The final SQL statement that finds the islands
Another MERGE JOIN hint is necessary, to ensure optimal execution.
Finding Gaps
Similarly, another two UDFs do the job for gaps. Here is the first one:
CREATE FUNCTION dbo.udfFindGapStarts()
RETURNS @res_tbl TABLE (
row_num int identity(1, 1) NOT NULL PRIMARY KEY,
gap_start int NOT NULL)
AS
BEGIN
INSERT INTO @res_tbl (gap_start)
SELECT gapID + 1
FROM gaps AS g1
WHERE NOT EXISTS (SELECT gapID FROM gaps AS g2
WHERE (g2.gapID = g1.gapID + 1))
ORDER BY gapID + 1
OPTION (MERGE JOIN)
RETURN
END
GO
Listing 5: The UDF for finding gap lower bounds
The result set of
SELECT * FROM dbo.udfFindGapStarts()
looks like this:
row_num gap_start
 
1 5
2 9
3 11
4 18
5 39
which means that the first gap starts at 5, the second one starts at 9, ... OOOPS ! At the end, there is one obsolete record, that means that a gap starts at 39. Obviously, 39 is not in the result set, and it is here due to the specific rule for finding gaps  dealing with data that does not really exists. This last record will be ignored in the final result.
At last, the code that finds gap upper bounds:
CREATE FUNCTION dbo.udfFindGapEnds()
RETURNS @res_tbl TABLE (
row_num int identity(0, 1) NOT NULL PRIMARY KEY,
gap_end int NOT NULL)
AS
BEGIN
INSERT INTO @res_tbl (gap_end)
SELECT gapID  1
FROM gaps AS g1
WHERE NOT EXISTS (SELECT gapID FROM gaps AS g2
WHERE (g2.gapID = g1.gapID  1))
ORDER BY gapID  1
OPTION (MERGE JOIN)
RETURN
END
GO
Listing 6: The UDF for finding gap upper bounds
The result set of
SELECT * FROM dbo.udfFindGapEnds()
looks like this:
0 0
1 5
2 9
3 13
4 37
Another phantom row here, the first one, that is a result of the effort of the UDF to find a gap before the first record.
This last UDF is little unusual. If the IDENTITY starts at 1, the row_num column in the result set will be shifted by 1, yielding into a result set like this one:
row_num gap_end
 
1 0
2 5
3 9
4 13
5 37
This will require a little correction in the JOIN condition, so that corresponding row_nums are matched. But, that "little" correction may become expensive, if it is done for large number of rows. That is why, the IDENTITY here starts at 0.
And finally, the SQL statement for gaps, that JOINS two UDF results.
SELECT
t1.island_start, t2.island_end
FROM
dbo.udfFindIslandStarts() AS t1
INNER JOIN dbo.udfFindIslandEnds() AS t2
ON (t2.row_num = t1.row_num)
OPTION
(MERGE JOIN)
Listing 7: The final SQL statement that finds the gaps
Due to the JOIN condition, two phantom records do not appear in the result set.
Testing and Results
I used the UDF from the
script
I contributed to SQLServerCentral earlier, to load test data into the gaps table. After the loading, I deleted each third record (as Kozak did in his tests), to simulate large data fragmentation. The upper bound value in test tables and charts is the upper bound of input data.
Since I was interested in the performance of core statements, I wrote stored procedures that created temporary tables and simply filled them wit the result of two final SQL statements. So, the times measured were the times necessary to load the data into these temp tables. I didn't SELECT all those data to the client. I modified Kozak's rowbased solution accordingly, by commenting the last SELECT statement.
First test starts finding islands and gaps at upper bound of 100,000, ending at upper bound of 1,000,000, with a step of 100,000.
Upper bound  Old (islands)  Optimized (islands)  Optimized (gaps) 
100.000  12  7  7 
200.000  24  15  15 
300.000  36  22  25 
400.000  48  30  34 
500.000  59  39  42 
600.000  72  46  51 
700.000  84  54  61 
800.000  96  62  70 
900.000  109  72  79 
1.000.000  120  79  91 
The Excel chart based on this test data shows this:
Second test starts finding islands and gaps at upper bound of 1,000,000, ending at upper bound of 10,000,000, with a step of 1,000,000. The time measured and the chart based on them are shown below:
Upper bound  Old (islands)  Optimized (islands)  Optimized (gaps) 
1.000.000  131  81  89 
2.000.000  248  166  188 
3.000.000  377  306  354 
4.000.000  591  454  490 
5.000.000  722  644  648 
6.000.000  870  711  799 
7.000.000  1037  837  944 
8.000.000  1149  932  1034 
9.000.000  1285  1005  1150 
10.000.000  1440  1143  1280 
Conclusions
Obviously, when preordered result sets should be JOINed, the MERGE JOIN is the most efficient way. Ordering can be achieved by using IDENTITY columns in table valued UDFs to obtain a numbered result set. Regarding the islands and gaps problem, it seems that the SQL Server feature that usually causes fragmentation of sequential number (IDENTITY columns) helped to efficiently detect that fragmentation ... something like "fight fire with fire" :).
Total article views: 6528

Views in the last 30 days: 9

The challenge is to find the Islands(gaps) in sequential dates. You need to write a query to identif...
A script to find data islands within a sequence of numbers.
copy the select results into text file
SQL select statement results to XL file
Learn how you can update multiple servers in a single bound with this technique from Kimberly Killia...
