The next version of SQL Server (v11.0), SQL Server 2012, formerly code-named “Denali”, is currently in a prerelease status known as RC0 (Release Candidate 0). It is possible that the information in this article may change when this version is released to manufacturing (RTM). However, the features described in this article adhere to the ANSI specification, so I don’t expect to see any changes.
In SQL Server 2012, the T-SQL windowing functions all utilize the OVER clause to perform the windowing operations. The functions come in four groups: Ranking, Aggregate, Analytic, and Sequence Number generation. This article will cover the new analytic functions. Please see my article “The OVER Clause enhancements in SQL Server code named Denali, CTP3” for more details of the OVER clause and its enhancements.
Analytic Functions
It’s hard to describe the overall purpose of the analytic functions better than how Books Online (BOL) does: “Analytic functions compute an aggregate value based on a group of rows. However, unlike aggregate functions, they can return multiple rows for each group. You can use analytic functions to compute moving averages, running totals, percentages or top-N results within a group.”
SQL Server 2012 adds eight analytic functions, all of which require the use of the OVER clause. These are LAG, LEAD, FIRST_VALUE, LAST_VALUE, CUME_DIST, PERCENT_RANK, PERCENTILE_CONT, and PERCENTILE_DISC. We will look at each one of these in more detail.
LAG / LEAD
The LAG/LEAD functions return a column from a previous/following row in the partition, with respect to the current row, as specified by the row offset in the function, without the use of a self-join. The syntax for these functions is:
LAG | LEAD ( scalar_expression [, offset] [, default] )
OVER ( [ partition_by_clause ] order_by_clause )
Note that the partition by clause is optional, the order by clause is required, and the windowing clause (ROWS | RANGE) is not supported.
The scalar_expression is the value to be returned – this will normally be a column, but it can also be a subquery or any other expression that results in a single value.
The offset is the number of previous (LAG) or following (LEAD) rows from the current row from which to obtain a value, and if not specified it will default to 1.
The default is the value to use if the value at the offset is NULL, and if not specified it will default to NULL.
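Their behavior can be modeled outside the database. The following Python sketch (an illustrative model, not T-SQL) treats an ordered partition as a list and mimics the offset and default rules described above:

```python
def lag(rows, i, offset=1, default=None):
    """Value `offset` rows before index i, or `default` if out of range."""
    j = i - offset
    return rows[j] if j >= 0 else default

def lead(rows, i, offset=1, default=None):
    """Value `offset` rows after index i, or `default` if out of range."""
    j = i + offset
    return rows[j] if j < len(rows) else default

vals = [10, 20, 30]           # one ordered partition
print(lag(vals, 0))           # no previous row -> None (the default)
print(lead(vals, 1))          # next row's value -> 30
print(lag(vals, 2, 2, -1))    # two rows back -> 10
```

Just as in T-SQL, omitting the offset behaves like an offset of 1, and omitting the default yields NULL (None here) when the offset falls outside the partition.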
Using the LEAD function
An example for using LEAD would be to determine the gaps of a numeric column. For the following table, we want to determine the starting and ending range of values for each gap:
DECLARE @Gaps TABLE (col1 int PRIMARY KEY CLUSTERED);
INSERT INTO @Gaps (col1)
VALUES (1), (2), (3),
(50), (51), (52), (53), (54), (55),
(100), (101), (102),
(500),
(950), (951), (952);
The gaps are determined by comparing col1 in the current row to col1 in the next row, and if the difference is > 1 then there is a gap. For each row that has a gap, the next row’s value - 1 is the end of the gap, and the current row’s value + 1 is the start of the gap. Since we want to compare the current row to the next row, we’ll use the LEAD function. The code for this query and its results are:
WITH cte AS
(
SELECT curr = col1,
nxt = LEAD(col1, 1, NULL)
OVER (ORDER BY col1)
FROM @Gaps
)
SELECT [Start of Gap] = cte.curr + 1,
[End of Gap] = cte.nxt - 1
FROM cte
WHERE cte.nxt - cte.curr > 1;
Start of Gap   End of Gap
------------   ----------
4              49
56             99
103            499
501            949
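The same gap logic can be checked outside SQL Server. This Python sketch (an illustrative model, not T-SQL) pairs each value with its successor, exactly as the LEAD call does, and applies the same start/end arithmetic:

```python
# The values from the @Gaps table, in col1 order.
vals = [1, 2, 3, 50, 51, 52, 53, 54, 55, 100, 101, 102, 500, 950, 951, 952]

# zip(vals, vals[1:]) pairs each row with the next row, like LEAD(col1, 1).
# A gap exists where the difference is > 1; the gap runs from curr+1 to nxt-1.
gaps = [(curr + 1, nxt - 1)
        for curr, nxt in zip(vals, vals[1:])
        if nxt - curr > 1]

print(gaps)  # [(4, 49), (56, 99), (103, 499), (501, 949)]
```

The output matches the query’s result set above.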
Using the LAG function
Let’s say that we have RFID tags attached to different items. Sensors periodically determine the location of each tag, and store the item, the location where it was detected, and a timestamp of when the sensor picked up the item. The following code creates a sample table, and then builds 100 rows of data (10 items at each of 10 locations), each with a random timestamp. The code then replaces rows 3, 6 and 9 for each item (as ordered by the timestamp) with the previous row’s location value, giving us for each item some rows that are changing location, and some rows that are in the same location.
DECLARE @Locations TABLE (
ItemId INT,
LocationId INT,
TrackDateTime DATETIME);
-- build 100 rows of data: 10 items * 10 locations,
-- and each row has a random date.
WITH
Tens (N) AS (SELECT 1 UNION ALL SELECT 2 UNION ALL
SELECT 3 UNION ALL SELECT 4 UNION ALL
SELECT 5 UNION ALL SELECT 6 UNION ALL
SELECT 7 UNION ALL SELECT 8 UNION ALL
SELECT 9 UNION ALL SELECT 10)
INSERT INTO @Locations (ItemId, LocationId, TrackDateTime)
SELECT t1.N,
t2.N,
CAST(ABS(CHECKSUM(NEWID()))%3653+36524 AS DATETIME)
FROM Tens t1
CROSS JOIN Tens t2;
WITH CTE AS
(
SELECT ItemId, LocationId, TrackDateTime,
PriorLocation = LAG(LocationId)
OVER (PARTITION BY ItemId
ORDER BY TrackDateTime),
RN = ROW_NUMBER()
OVER (PARTITION BY ItemId
ORDER BY TrackDateTime)
FROM @Locations
)
UPDATE CTE
SET LocationId = PriorLocation
WHERE RN%3=0;
Here the LAG function was utilized to get the previous row’s Location within each Item, where the previous row is determined by its timestamp value. It’s already been a helpful function, and I still have the rest of the code to show!
Using both the LAG and LEAD functions
Now let’s see a list of items that did not move – that is, where the current location and the previous location are the same. And for good measure, let’s look at where the item next moved to. We’ll use the LAG function to get the previous value and the LEAD function to get the next value, and then return just the rows where the current location and prior location are the same. The code and the results are:
WITH CTE AS
(
SELECT *,
PriorLocation = LAG(LocationId)
OVER (PARTITION BY ItemId
ORDER BY TrackDateTime),
NextLocation = LEAD(LocationId)
OVER (PARTITION BY ItemId
ORDER BY TrackDateTime)
FROM @Locations
)
SELECT *
FROM CTE
WHERE LocationId = PriorLocation;
ItemId   LocationId   TrackDateTime             PriorLocation   NextLocation
------   ----------   -----------------------   -------------   ------------
1        10           2004-10-27 00:00:00.000   10              5
1        3            2008-04-07 00:00:00.000   3               7
1        6            2009-01-17 00:00:00.000   6               2
2        2            2004-01-02 00:00:00.000   2               4
2        1            2007-06-12 00:00:00.000   1               6
2        3            2008-10-12 00:00:00.000   3               5
3        8            2002-12-16 00:00:00.000   8               9
3        6            2005-08-21 00:00:00.000   6               10
3        2            2007-07-16 00:00:00.000   2               3
4        1            2003-07-03 00:00:00.000   1               6
4        7            2004-11-09 00:00:00.000   7               3
4        2            2008-08-26 00:00:00.000   2               4
5        8            2002-09-30 00:00:00.000   8               5
5        4            2004-04-03 00:00:00.000   4               10
5        6            2008-05-19 00:00:00.000   6               2
6        5            2003-07-28 00:00:00.000   5               8
6        4            2006-01-11 00:00:00.000   4               9
6        1            2008-11-20 00:00:00.000   1               2
7        6            2002-02-08 00:00:00.000   6               3
7        4            2003-03-18 00:00:00.000   4               2
7        1            2006-08-05 00:00:00.000   1               10
8        4            2002-10-03 00:00:00.000   4               5
8        3            2005-08-04 00:00:00.000   3               8
8        1            2006-07-05 00:00:00.000   1               10
9        4            2001-11-17 00:00:00.000   4               6
9        3            2005-04-18 00:00:00.000   3               10
9        5            2006-08-02 00:00:00.000   5               2
10       6            2002-03-15 00:00:00.000   6               2
10       8            2004-01-30 00:00:00.000   8               1
10       4            2008-06-01 00:00:00.000   4               5
I think that these two functions offer some incredible possibilities by taking a peek at the rows adjacent to the current row. The ability to look a specified number of rows away from the current row leads to some very interesting possibilities – for instance, by looking at a row that is four rows away, you can compare a quarter to the corresponding quarter of the previous year.
FIRST_VALUE / LAST_VALUE
The FIRST_VALUE and LAST_VALUE functions return the first or last value in the ordered set of your partition, optionally applying a windowing clause to limit the rows being considered. The syntax of these functions is:
FIRST_VALUE | LAST_VALUE ( scalar_expression )
OVER ( [ partition_by_clause ] order_by_clause [ rows_range_clause ] )
Note that the order by clause is required, and that the partition by and windowing clauses are optional. The scalar_expression is the value to be returned – this will normally be a column, but it can also be a subquery or any other expression that results in a single value.
For this example, I’m going to use the AdventureWorks for Denali database, which can be downloaded from http://sqlserversamples.codeplex.com/ (it will probably be renamed to 2012 soon). In this database, there is a Sales.SalesOrderHeader table, which has information about each order. This information includes CustomerID, OrderDate, and TotalDue columns. Let’s run a query that shows, for each CustomerID, the OrderDate for when they placed their least and most expensive orders. What this entails is using the FIRST_VALUE and LAST_VALUE functions to return the OrderDate column. The OVER clause specifies that the PARTITION BY is the CustomerID, the ORDER BY is the TotalDue, and the ROWS clause specifies all rows for that CustomerID. The resulting code and results are:
SELECT DISTINCT TOP (10)
CustomerID,
OrderDateLow = FIRST_VALUE(OrderDate)
OVER (PARTITION BY CustomerID
ORDER BY TotalDue
ROWS BETWEEN UNBOUNDED PRECEDING
AND UNBOUNDED FOLLOWING),
OrderDateHigh = LAST_VALUE(OrderDate)
OVER (PARTITION BY CustomerID
ORDER BY TotalDue
ROWS BETWEEN UNBOUNDED PRECEDING
AND UNBOUNDED FOLLOWING)
FROM Sales.SalesOrderHeader
ORDER BY CustomerID;
CustomerID   OrderDateLow              OrderDateHigh
----------   -----------------------   -----------------------
11000        2007-07-22 00:00:00.000   2005-07-22 00:00:00.000
11001        2008-06-12 00:00:00.000   2005-07-18 00:00:00.000
11002        2007-07-04 00:00:00.000   2005-07-10 00:00:00.000
11003        2007-07-09 00:00:00.000   2005-07-01 00:00:00.000
11004        2007-07-26 00:00:00.000   2005-07-26 00:00:00.000
11005        2007-07-27 00:00:00.000   2005-07-02 00:00:00.000
11006        2007-07-02 00:00:00.000   2005-07-27 00:00:00.000
11007        2007-07-25 00:00:00.000   2005-07-12 00:00:00.000
11008        2007-07-07 00:00:00.000   2005-07-28 00:00:00.000
11009        2007-07-24 00:00:00.000   2005-07-30 00:00:00.000
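The explicit ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING clause matters here: when an ORDER BY is present, the default window frame ends at the current row, so LAST_VALUE would otherwise simply return the current row’s own value. This Python sketch (a model of the framing rule, not T-SQL) shows the difference:

```python
def last_value(ordered, i, full_frame):
    """LAST_VALUE over an ordered partition, for the row at index i.

    full_frame=False models the default frame (UNBOUNDED PRECEDING to
    CURRENT ROW); full_frame=True models an explicit frame that extends
    to UNBOUNDED FOLLOWING.
    """
    frame = ordered if full_frame else ordered[:i + 1]
    return frame[-1]

totals = [10.0, 99.0, 250.0]   # TotalDue values for one customer, ascending

# Default frame: the "last" value visible from the first row is... itself.
print(last_value(totals, 0, full_frame=False))  # 10.0

# Full frame: the true last value of the partition.
print(last_value(totals, 0, full_frame=True))   # 250.0
```

Without the ROWS clause, every customer’s OrderDateHigh would vary row by row instead of being the date of the most expensive order.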
CUME_DIST / PERCENT_RANK
The CUME_DIST function is the number of rows with a value (as specified in the ORDER BY clause) less than or equal to the current value, divided by the total number of rows in this partition.
The PERCENT_RANK function returns the percentage that the value (as specified in the ORDER BY clause) is within all of the values of the partition.
For both functions, NULLs are treated as the lowest possible value. The syntax of these functions is:
CUME_DIST | PERCENT_RANK ( )
OVER ( [ partition_by_clause ] order_by_clause )
Note that the ROWS | RANGE clause is not supported, the ORDER BY clause is required, and the PARTITION BY clause is optional. These functions will be a bit easier to understand when we see an example, so let’s just jump straight to one, and then we’ll explain what it is doing.
DECLARE @Salaries TABLE (
DepartmentID INT,
Salary INT);
INSERT INTO @Salaries
VALUES (1, 15000),
(1, 18000),
(1, 23000),
(1, 23000),
(1, 25000);
SELECT DepartmentID,
Salary,
CumeDist = CUME_DIST()
OVER (PARTITION BY DepartmentID
ORDER BY Salary),
PercentRank = PERCENT_RANK()
OVER (PARTITION BY DepartmentID
ORDER BY Salary)
FROM @Salaries;
DepartmentID   Salary   CumeDist   PercentRank
------------   ------   --------   -----------
1              15000    0.2        0
1              18000    0.4        0.25
1              23000    0.8        0.5
1              23000    0.8        0.5
1              25000    1          1
Here we have 4 distinct salary values in 5 rows. For CUME_DIST, on the row with the first value, there is 1 row with that value or lower. For the second row, there are 2 rows with that value or lower. For the third and fourth rows, there are 4 rows with that value or lower. And, for the fifth row, there are 5 rows with that value or lower. Each of these row counts is divided by the total number of rows in the partition (5). So, 1/5 = 0.2; 2/5 = 0.4; 4/5 = 0.8, and 5/5 = 1.
For PERCENT_RANK, there are 4 unique values. The first value has a percent rank of 0, and the last value has a percent rank of 1. The remaining values are spaced evenly across (number of rows - 1) steps. Since there are 5 rows, and the first PERCENT_RANK starts at 0, the remaining 4 steps are divided equally, so each step is 1/4 (or 0.25). Looking at the example, you can see how the first value is 0, the second value is 0.25, the third value is 0.5 (for the third and fourth rows), and the last value is 1.
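Both calculations can be reproduced directly from their definitions. This Python sketch (illustrative only, not T-SQL) mirrors BOL’s formulas for a single partition:

```python
def cume_dist(vals):
    """CUME_DIST: rows with a value <= the current value, over total rows."""
    n = len(vals)
    return [sum(1 for w in vals if w <= v) / n for v in vals]

def percent_rank(vals):
    """PERCENT_RANK: (rank - 1) / (rows - 1); rank - 1 is the count of
    rows with a strictly smaller value. Assumes at least two rows."""
    n = len(vals)
    return [sum(1 for w in vals if w < v) / (n - 1) for v in vals]

salaries = [15000, 18000, 23000, 23000, 25000]
print(cume_dist(salaries))     # [0.2, 0.4, 0.8, 0.8, 1.0]
print(percent_rank(salaries))  # [0.0, 0.25, 0.5, 0.5, 1.0]
```

The output matches the query results above, including the tied 23000 rows sharing one value for each function.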
Notice that for CUME_DIST, if there is more than one row with the same value, they all share the highest CUME_DIST calculation; for PERCENT_RANK, rows with the same value all share the lowest PERCENT_RANK calculation.
These functions really start to dive into probability and analytics – areas that I’m not very strong at. I’d really like to get your feedback on real-world examples of where these would be used.
PERCENTILE_CONT / PERCENTILE_DISC
The PERCENTILE_CONT function calculates a percentile based on a continuous distribution of the column’s value.
The PERCENTILE_DISC function returns the column’s value for the smallest CUME_DIST value that is greater than or equal to the specified percentile value.
For the PERCENTILE_CONT function, the value returned may or may not exist in the column being analyzed; for the PERCENTILE_DISC function the value returned will belong to one of the rows in the partition. Thus, they may or may not return the same value. The syntax for these functions is:
PERCENTILE_CONT | PERCENTILE_DISC ( numeric_literal )
WITHIN GROUP ( ORDER BY order_by_expression [ ASC | DESC ] )
OVER ( [ <partition_by_clause> ] )
The first thing that you immediately notice (well, at least the first thing that I immediately noticed) is the WITHIN GROUP clause. This clause specifies a list of numeric values to sort and compute the percentile over. Note that only one order_by_expression is allowed. When using the PERCENTILE_CONT function, it must evaluate to one of the exact or approximate numeric data types.
Within the OVER clause, the PARTITION BY is optional, and the ORDER BY and ROWS | RANGE clauses are not supported. Yes, this means that you can have an empty OVER clause.
The numeric_literal is the percentile to compute, and it must be within the range of 0.0 to 1.0.
For both functions, NULL values are ignored.
Let's look at an example. We will make an employee table, with employees assigned to a department and what each employee’s salary is. We’ll return each employee, along with the median salary in the department and the median salary in the entire table. The medians will be calculated by using both functions, and we’ll pass in a percentile of 0.5 (50%). For good measure, I’m going to include the AVG and CUME_DIST values also.
DECLARE @Employees TABLE (
EmplId INT PRIMARY KEY CLUSTERED,
DeptId INT,
Salary NUMERIC(8,2));
INSERT INTO @Employees
VALUES (1, 1, 10000),
(2, 1, 11000),
(3, 1, 12000),
(4, 2, 25000),
(5, 2, 35000),
(6, 2, 100000),
(7, 2, 100000);
SELECT EmplId,
DeptId,
Salary,
PctC1 = PERCENTILE_CONT(0.5)
WITHIN GROUP (ORDER BY Salary ASC)
OVER (PARTITION BY DeptId),
PctD1 = PERCENTILE_DISC(0.5)
WITHIN GROUP (ORDER BY Salary ASC)
OVER (PARTITION BY DeptId),
CD1 = CUME_DIST()
OVER (PARTITION BY DeptId
ORDER BY Salary),
AVG1 = AVG(Salary)
OVER (PARTITION BY DeptId),
PctC2 = PERCENTILE_CONT(0.5)
WITHIN GROUP (ORDER BY Salary ASC)
OVER (),
PctD2 = PERCENTILE_DISC(0.5)
WITHIN GROUP (ORDER BY Salary ASC)
OVER (),
CD2 = CUME_DIST()
OVER (ORDER BY Salary ASC),
AVG2 = AVG(Salary)
OVER (ORDER BY Salary
ROWS BETWEEN UNBOUNDED PRECEDING
AND UNBOUNDED FOLLOWING)
FROM @Employees
ORDER BY DeptId, EmplId;
Results:
EmplId   DeptId   Salary      PctC1   PctD1      CD1    AVG1    PctC2   PctD2      CD2    AVG2
------   ------   ---------   -----   --------   ----   -----   -----   --------   ----   -----
1        1        10000.00    11000   11000.00   0.33   11000   25000   25000.00   0.14   41857
2        1        11000.00    11000   11000.00   0.66   11000   25000   25000.00   0.28   41857
3        1        12000.00    11000   11000.00   1      11000   25000   25000.00   0.42   41857
4        2        25000.00    67500   35000.00   0.25   65000   25000   25000.00   0.57   41857
5        2        35000.00    67500   35000.00   0.5    65000   25000   25000.00   0.71   41857
6        2        100000.00   67500   35000.00   1      65000   25000   25000.00   1      41857
7        2        100000.00   67500   35000.00   1      65000   25000   25000.00   1      41857
The PctC# columns utilize the PERCENTILE_CONT function, and the PctD# columns utilize the PERCENTILE_DISC function. The columns ending with “1” are partitioned by the Department, and the columns ending in “2” are across the entire table.
Notice that for Dept 1, that the PctC1 and PctD1 columns both return the same value, and that this value is one of the Salary values for this department. However, when we take a look at Dept 2, we see that the PctC1 column has a value that is the median of the four salaries, and it is a value that is not one of the salaries. The PctD1 column returns the value from the first row where the CUME_DIST is >= 0.5, and in this case it is the second row. But, where is that value in the PctC1 column coming from? After a bit of searching on the internet, I found an article at http://download.oracle.com/docs/cd/B19306_01/server.102/b14200/functions110.htm that explains how this function is implemented in Oracle. In quoting from it:
“The result of PERCENTILE_CONT is computed by linear interpolation between values after ordering them. Using the percentile value (P) and the number of rows (N) in the aggregation group, we compute the row number we are interested in after ordering the rows with respect to the sort specification. This row number (RN) is computed according to the formula RN = (1 + (P * (N - 1))). The final result of the aggregate function is computed by linear interpolation between the values from rows at row numbers CRN = CEILING(RN) and FRN = FLOOR(RN).”
Okay, so now I can see that for Department 2, that it is getting the Salary values for the two middle rows (35,000 and 100,000), adding them together and then dividing by 2, resulting in 67,500.
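That interpolation, and PERCENTILE_DISC’s pick-the-first-qualifying-row rule, can be reproduced in a few lines. The Python sketch below (a model of the formulas, not T-SQL) uses a zero-based row number, which is equivalent to Oracle’s one-based RN:

```python
import math

def percentile_cont(vals, p):
    """Linear interpolation at percentile p over a continuous distribution."""
    ordered = sorted(vals)
    rn = (len(ordered) - 1) * p          # zero-based equivalent of RN = 1 + P*(N-1)
    frn, crn = math.floor(rn), math.ceil(rn)
    if frn == crn:
        return float(ordered[frn])       # RN landed exactly on a row
    # Interpolate between the two straddling rows.
    return ordered[frn] + (rn - frn) * (ordered[crn] - ordered[frn])

def percentile_disc(vals, p):
    """First actual value whose cumulative distribution is >= p."""
    ordered = sorted(vals)
    n = len(ordered)
    for i, v in enumerate(ordered, start=1):
        if i / n >= p:
            return v

dept2 = [25000, 35000, 100000, 100000]
print(percentile_cont(dept2, 0.5))  # 67500.0 -- the interpolated median
print(percentile_disc(dept2, 0.5))  # 35000   -- an actual Salary value
```

For Department 2, RN falls exactly halfway between the second and third rows, so the interpolation reduces to the average of 35,000 and 100,000, which is the 67,500 seen in the PctC1 column.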
Summary
The LAG, LEAD, FIRST_VALUE and LAST_VALUE functions have a lot of potential, as I’ve shown. I don’t know if I’ll ever use the other functions, but at least now I understand what they do. If you have a real-world use for these, please post a comment explaining it.
Thanks for taking the time to read this article.
Support for OVER clauses by function
FUNCTION          OVER Clause   PARTITION BY   ORDER BY   ROWS/RANGE
---------------   -----------   ------------   --------   ----------
CUME_DIST         R             O              R          X
FIRST_VALUE       R             O              R          O
LAG               R             O              R          X
LAST_VALUE        R             O              R          O
LEAD              R             O              R          X
PERCENTILE_CONT   R             O              X          X
PERCENTILE_DISC   R             O              X          X
PERCENT_RANK      R             O              R          X

O = Optional, R = Required, X = Not Supported
REFERENCES / Links