An Efficient Set-based Solution for Islands and Gaps

Introduction

Alexander Kozak described an efficient row-based solution for the problem of identifying islands and gaps in sequential numbers in his article, Islands and Gaps in Sequential Numbers. The set-based solutions described there were not very efficient, especially when the input data was too fragmented. This article provides the logical next step, since it describes an efficient set-based solution; one that performs equally well for small and large fragmentation of the input data.

For easier reading and testing, here is the table structure and a small (but illustrative) set of test data, copied from the Kozak's article.

CREATE TABLE gaps(gapID int NOT NULL PRIMARY KEY)
INSERT INTO gaps values(1)
INSERT INTO gaps values(2)
INSERT INTO gaps values(3)
INSERT INTO gaps values(4)
INSERT INTO gaps values(6)
INSERT INTO gaps values(7)
INSERT INTO gaps values(8)
INSERT INTO gaps values(10)
INSERT INTO gaps values(14)
INSERT INTO gaps values(15)
INSERT INTO gaps values(16)
INSERT INTO gaps values(17)
INSERT INTO gaps values(38)

Listing 1: The original test table structure and data

For any further details on the solutions proposed in the previous article, please refer to the link above. The solution proposed here requires at least SQL Server 2000 (so much of ANSI portability :-)), though I think an ORACLE equivalent can be easily implemented.

The Pitfall of Set-Based Solutions and How To Eliminate It

The proposed set-based solutions used two intermediate rowsets, one for the lower and one for the upper bound of islands or gaps and there is nothing wrong there. To obtain the final result set, these two rowsets are JOINed. Alexander Kozak himself detected that set-based solutions had one major pitfall: for large fragmentation of input data, set-based solutions performed poorly. Due to the large number of records in each rowset, the JOIN took too much time.

So, how can you optimize those JOINs ? The execution plan of all three set-based solutions shows several things:

A MERGE JOIN is used to obtain each intermediate rowset, and
NESTED LOOPS are used to JOIN two rowsets into the final result.

The MERGE JOIN is said to be a more efficient way for JOINing rowsets that are already ordered. So the two intermediate rowsets are obtained in the most efficient way, but the matching between each island/gap's lower bound to its corresponding upper bound is performed less efficiently. The solution of this part of the puzzle leads to the solution of the problem.

The trick used here originates from the logical order of islands/gaps bounds. Lets take a look at the desired result. For islands, it should be:

island_start   island_end
1              4
6              8
10             10
14             17
38             38

For gaps, it should be:

gap_start      gap_end
5              5
9              9
11             13
18             37

When two columns in any of the result sets are considered separately (as an intermediate rowset), the following is noticed: when they are ORDERed, the first row in the lower bound rowset corresponds to the first row in the upper bound rowset, the second row in the lower bound rowset corresponds to the second row in the upper bound rowset, and so forth. My opinion is that matching island/gap's bounds with correlated subqueries using <= operator was the major reason for inefficiency of the set-based solutions described by Kozak.

Now, let's consider how to number a rowset. MS SQL Server doesn't give too many choices - the IDENTITY function/property is usually used for that purpose. It can be used as a column property or as a derived column in SELECT ... INTO statement. But, since the solution should be set-based, it should return a result set that can be easily SELECTed or used otherwise in SQL statements. That is why a stored procedure with a temporary tables was not a choice. The other alternative is to use the IDENTITY as a column attribute. So our table should have one IDENTITY column that will number the rows of intermediate rowsets. Due to the desired set-based nature, I decided to use a multi-statement table valued function. Each of the four intermediate rowsets (lower and upper bounds for islands and gaps) should be returned by UDF of this kind. The result table of the UDF should contain one IDENTITY column (for numbering) and one bound column (for the lower or upper bound).

Finding Islands

The final result is a JOIN of two rowsets, returned by two UDFs. Here is the code of the function that retrieves the islands lower bounds:

CREATE FUNCTION dbo.udfFindIslandStarts()
  RETURNS @res_tbl TABLE
    (row_num int identity(1, 1) NOT NULL PRIMARY KEY,
     island_start int NOT NULL)
AS
BEGIN
  INSERT INTO @res_tbl (island_start)
    SELECT gapID FROM gaps AS g1
    WHERE NOT EXISTS (SELECT gapID FROM gaps AS g2
                      WHERE (g2.gapID = g1.gapID - 1))
    ORDER BY gapID
    OPTION (MERGE JOIN)
  RETURN
END
GO

Listing 2: The UDF for finding island lower bounds

The ORDER BY ensures the proper ordering of bounds in all four UDFs, so the numbering can be used. The MERGE JOIN hint is added to ensure that the execution will use that method to make the JOIN. The result set of

SELECT * FROM dbo.udfFindIslandStarts()

looks like this:

row_num     island_start 
----------- ------------ 
1           1
2           6
3           10
4           14
5           38

which means that the first island starts at number 1, the second starts at number 6 etc.

Similarly, the code that finds the island upper bounds is:

CREATE FUNCTION dbo.udfFindIslandEnds()
  RETURNS @res_tbl TABLE (
    row_num int identity(1, 1) NOT NULL PRIMARY KEY,
    island_end int not null)
AS
BEGIN
  INSERT INTO @res_tbl (island_end)
    SELECT gapID FROM gaps AS g1
    WHERE NOT EXISTS (SELECT gapID FROM gaps AS g2
                      WHERE (g2.gapID = g1.gapID + 1))
    ORDER BY gapID
    OPTION (MERGE JOIN)
  RETURN  
END
GO

Listing 3: The UDF for finding island upper bounds

The result set of

SELECT * FROM dbo.udfFindIslandEnds()

looks like this:

row_num     island_end  
----------- ----------- 
1           4
2           8
3           10
4           17
5           38

which means that the first island ends at 4, the second one ends at 8 etc.

And finally, the SQL statement that retrieves the final result:

SELECT
  t1.gap_start, t2.gap_end
FROM
  dbo.udfFindGapStarts() AS t1
  INNER JOIN dbo.udfFindGapEnds() AS t2
    ON (t2.row_num = t1.row_num)
OPTION
  (MERGE JOIN)

Listing 4: The final SQL statement that finds the islands

Another MERGE JOIN hint is necessary, to ensure optimal execution.

Finding Gaps

Similarly, another two UDFs do the job for gaps. Here is the first one:

CREATE FUNCTION dbo.udfFindGapStarts()
  RETURNS @res_tbl TABLE (
    row_num int identity(1, 1) NOT NULL PRIMARY KEY,
    gap_start int NOT NULL)
AS
BEGIN
  INSERT INTO @res_tbl (gap_start)
    SELECT gapID + 1
    FROM gaps AS g1
    WHERE NOT EXISTS (SELECT gapID FROM gaps AS g2
                      WHERE (g2.gapID = g1.gapID + 1))
    ORDER BY gapID + 1
    OPTION (MERGE JOIN)
  RETURN  
END
GO

Listing 5: The UDF for finding gap lower bounds

The result set of

SELECT * FROM dbo.udfFindGapStarts()

looks like this:

row_num     gap_start   
----------- ----------- 
1           5
2           9
3           11
4           18
5           39

which means that the first gap starts at 5, the second one starts at 9, ... OOOPS ! At the end, there is one obsolete record, that means that a gap starts at 39. Obviously, 39 is not in the result set, and it is here due to the specific rule for finding gaps - dealing with data that does not really exists. This last record will be ignored in the final result.

At last, the code that finds gap upper bounds:

CREATE FUNCTION dbo.udfFindGapEnds()
  RETURNS @res_tbl TABLE (
    row_num int identity(0, 1) NOT NULL PRIMARY KEY,
    gap_end int NOT NULL)
AS
BEGIN
  INSERT INTO @res_tbl (gap_end)
    SELECT gapID - 1
    FROM gaps AS g1
    WHERE NOT EXISTS (SELECT gapID FROM gaps AS g2
                      WHERE (g2.gapID = g1.gapID - 1))
    ORDER BY gapID - 1
    OPTION (MERGE JOIN)
  RETURN  
END
GO

Listing 6: The UDF for finding gap upper bounds

The result set of

SELECT * FROM dbo.udfFindGapEnds()

looks like this:

0           0
1           5
2           9
3           13
4           37

Another phantom row here, the first one, that is a result of the effort of the UDF to find a gap before the first record.

This last UDF is little unusual. If the IDENTITY starts at 1, the row_num column in the result set will be shifted by 1, yielding into a result set like this one:

row_num     gap_end     
----------- ----------- 
1           0
2           5
3           9
4           13
5           37

This will require a little correction in the JOIN condition, so that corresponding row_nums are matched. But, that "little" correction may become expensive, if it is done for large number of rows. That is why, the IDENTITY here starts at 0.

And finally, the SQL statement for gaps, that JOINS two UDF results.

SELECT
  t1.island_start, t2.island_end
FROM
  dbo.udfFindIslandStarts() AS t1
  INNER JOIN dbo.udfFindIslandEnds() AS t2
    ON (t2.row_num = t1.row_num)
OPTION
  (MERGE JOIN)

Listing 7: The final SQL statement that finds the gaps

Due to the JOIN condition, two phantom records do not appear in the result set.

Testing and Results

I used the UDF from the script I contributed to SQLServerCentral earlier, to load test data into the gaps table. After the loading, I deleted each third record (as Kozak did in his tests), to simulate large data fragmentation. The upper bound value in test tables and charts is the upper bound of input data.

Since I was interested in the performance of core statements, I wrote stored procedures that created temporary tables and simply filled them wit the result of two final SQL statements. So, the times measured were the times necessary to load the data into these temp tables. I didn't SELECT all those data to the client. I modified Kozak's row-based solution accordingly, by commenting the last SELECT statement.

First test starts finding islands and gaps at upper bound of 100,000, ending at upper bound of 1,000,000, with a step of 100,000.

The Excel chart based on this test data shows this:Second test starts finding islands and gaps at upper bound of 1,000,000, ending at upper bound of 10,000,000, with a step of 1,000,000. The time measured and the chart based on them are shown below:ConclusionsObviously, when pre-ordered result sets should be JOINed, the MERGE JOIN is the most efficient way. Ordering can be achieved by using IDENTITY columns in table valued UDFs to obtain a numbered result set. Regarding the islands and gaps problem, it seems that the SQL Server feature that usually causes fragmentation of sequential number (IDENTITY columns) helped to efficiently detect that fragmentation ... something like "fight fire with fire" :-).

Upper bound	Old (islands)	Optimized (islands)	Optimized (gaps)
100.000	12	7	7
200.000	24	15	15
300.000	36	22	25
400.000	48	30	34
500.000	59	39	42
600.000	72	46	51
700.000	84	54	61
800.000	96	62	70
900.000	109	72	79
1.000.000	120	79	91
Upper bound	Old (islands)	Optimized (islands)	Optimized (gaps)
1.000.000	131	81	89
2.000.000	248	166	188
3.000.000	377	306	354
4.000.000	591	454	490
5.000.000	722	644	648
6.000.000	870	711	799
7.000.000	1037	837	944
8.000.000	1149	932	1034
9.000.000	1285	1005	1150
10.000.000	1440	1143	1280

An Efficient Set-based Solution for Islands and Gaps

Introduction

The Pitfall of Set-Based Solutions and How To Eliminate It

Finding Islands

Finding Gaps

Testing and Results

Rate

Share

Categories

Share

Rate

An Efficient Set-based Solution for Islands and Gaps

Introduction

The Pitfall of Set-Based Solutions and How To Eliminate It

Finding Islands

Finding Gaps

Testing and Results

Rate

Share

Categories

Share

Rate

Related content

Tame Those Strings - Using SUBSTRING

Tame Those Strings! Part 2 - Using CHARINDEX

Tame Those Strings! Part 5 - Using STUFF

SP_MSFOREACHDB - Getting a list of databases via TSQL.

SP_MSForeachtable - Life Without Cursors.