SQLServerCentral Article

Cursor-Killing: Accessing Data in the Next Row

,

Problem

We often have complex data requirements that seem to defy an elegant set-based approach.  One specific case that often causes difficulty is having to perform calculations between rows of data.  For example, we have a data set such as this one:

InvID ActivityCode Date      OldLocationID NewLocationID    
***** ************ ********* ************* ************
1     1            1/1/2014  NULL          10
1     3            2/1/2014  10            13
1     3            2/5/2014  13            10
1     3            2/10/2014 10            13   
1     3            2/15/2014 13            10
1     3            2/20/2014 10            13
1     3            2/21/2014 10            10
1     3            2/24/2014 10            13
1     3            2/28/2014 10            10
1     3            3/10/2014 10            13
2     1            1/1/2014  NULL          10
2     3            2/1/2014  10            13
2     3            2/5/2014  13            10
2     3            2/10/2014 10            13
2     3            2/15/2014 13            10

We are interested in calculating the days that an inventory item is located in a given location, but want to do so efficiently in a single select statement.

How do we write a query to do this without a WHILE loop or a cursor so that we can retrieve this data quickly and efficiently?  What is the best way to relate data in one row to another row before or after it?

Solution

We somehow need to link any given row to the next row and calculate the number of days in between.  There are two set-based approaches that we will look at here; the first will work on any version of SQL Server since 2005 whereas the second will only work on SQL Server 2012 and later.

The first solution takes the problem and provides a literal approach.  If we can perform a join between one row and the next in the table, then we can easily calculate the result we are looking for.

SELECT
       *
FROM history HISTORY_CURRENT_ROW
INNER JOIN history HISTORY_NEXT_ROW
ON ?????

Since this data set doesn’t have an identity or reliable primary key, filling in those question marks will require a bit more work.  If we use the ROW_NUMBER window function, we can create a surrogate key on the fly, and then use it to solve our problem:

;
WITH    CTE_HISTORY
          AS ( SELECT ROW_NUMBER() OVER ( PARTITION BY InventoryItemID ORDER BY Date ) AS row_num
                ,   InventoryItemID
                ,   ActivityCodeID
                ,   Date
                ,   OldLocationID
                ,   NewLocationID
                FROM history
             )
    SELECT HISTORY_CURRENT_ROW.*
        ,   DATEDIFF(DAY, HISTORY_CURRENT_ROW.Date, HISTORY_NEXT_ROW.Date) AS days_in_current_location
        FROM CTE_HISTORY HISTORY_CURRENT_ROW
            LEFT JOIN CTE_HISTORY HISTORY_NEXT_ROW
            ON HISTORY_CURRENT_ROW.row_num + 1 = HISTORY_NEXT_ROW.row_num -- Join to the next row in a given frame
               AND HISTORY_CURRENT_ROW.InventoryItemID = HISTORY_NEXT_ROW.InventoryItemID -- Also join on InventoryItemID, so we don't duplicate results

The CTE creates a row number set for each inventory item, ordering by date so that each move is chronological order.  We then join the CTE to itself based on InventoryItemID and our newly created row_num to link one row to the next.  A LEFT JOIN is used since the last row for each inventory item will not have a next row to join to, and we do not want to inadvertently filter these out.  Finally, the DATEDIFF finishes the job, calculating the days in between when our item arrived in its current location and when it arrives in its next location.

This solution returns the results we were looking for, but it isn’t terribly efficient.  A quick look at our execution plan reveals 2 table scans:

IO statistics confirms this and shows the need for a worktable in TempDB to support our table spool operations.  For a large data set, this could be a very slow query.

Table 'Worktable'. Scan count 1, logical reads 39, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table 'history'. Scan count 2, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

If you are using SQL Server 2012, you can take advantage of one of the new window functions to get the same results as above far more efficiently.  LEAD and LAG can be used to retrieve a column from a previous or upcoming row using these standard window functions.  You can implement LEAD like this:

SELECT
       LEAD(Date) OVER (PARTITION BY InventoryItemID ORDER BY Date) AS next_location_date,
       InventoryItemID,
       ActivityCodeID,
       Date,
       OldLocationID,
       NewLocationID
FROM history

Note that the PARTITION BY and ORDER BY are exactly the same as before.  Since LEAD and LAG are treated like standard functions within a SELECT query, you can manipulate them at will to get our days in each location by wrapping a DATEDIFF around the date in the current row and our LEAD date:

SELECT
       InventoryItemID,
       ActivityCodeID,
       Date,
       OldLocationID,
       NewLocationID,
       DATEDIFF(DAY, history.Date, LEAD(Date) OVER (PARTITION BY InventoryItemID ORDER BY Date)) AS days_in_current_location
FROM history

By using LEAD, we avoid the need for a self-join, which greatly improves performance.  The result is a much cleaner execution plan and statistics that confirm the reduced reads necessary to calculate our results.  The worktable used by the window spool is far more efficient than the table spool required previously.

The query statistics:

Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table 'history'. Scan count 1, logical reads 1, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

As an additional bit of trivia, you can add an offset to LEAD, which will allow you to access data from a row in any given span from the current row.  For example, LEAD(Date, 2) would not get the date from the next row, but from 2 rows ahead.  LAG(Date, 5) would get the date from 5 rows prior to the current row.

Conclusion

Calculating date between different rows can be done efficiently without resorting to an iterative solution.  By utilizing a few SQL Server window functions and DATEDIFF, we can turn a complex TSQL challenge into an 8 line solution.  When confronted with a situation where it seems as though no set-based solution is available, consider experimenting with CTEs and window functions as both can allow you to restructure or enhance your data set to include additional useful information.

Rate

4.86 (94)

You rated this post out of 5. Change rating

Share

Share

Rate

4.86 (94)

You rated this post out of 5. Change rating