
The new Analytic functions in SQL Server 2012


The next version of SQL Server (v11.0), SQL Server 2012, formerly code-named “Denali”, is currently in a pre-release status known as RC0 (Release Candidate 0). It is possible that the information in this article may change when this version is released to manufacturing (RTM). However, the changes described in this article adhere to the ANSI specification, so I don’t expect to see any changes.

In SQL Server 2012, the T-SQL windowing functions all utilize the OVER clause to perform their windowing operations. The functions come in four groups: Ranking, Aggregate, Analytic, and Sequence Number generation. This article will cover the new analytic functions. Please see my article “The OVER Clause enhancements in SQL Server code named Denali, CTP3” for more details on the OVER clause and its enhancements.

Analytic Functions

It’s hard to describe the overall purpose of the analytic functions better than how Books Online (BOL) does: “Analytic functions compute an aggregate value based on a group of rows. However, unlike aggregate functions, they can return multiple rows for each group. You can use analytic functions to compute moving averages, running totals, percentages or top-N results within a group.”

SQL Server 2012 adds eight analytic functions, all of which require the use of the OVER clause. These are LAG, LEAD, FIRST_VALUE, LAST_VALUE, CUME_DIST, PERCENT_RANK, PERCENTILE_CONT, and PERCENTILE_DISC. We will look at each one of these in more detail.

LAG / LEAD

The LAG/LEAD functions return a column from a previous/following row in the partition, with respect to the current row, as specified by the row offset in the function, without the use of a self-join. The syntax for these functions is:

LAG|LEAD (scalar_expression [,offset] [,default])

    OVER ( [ partition_by_clause ] order_by_clause )

Note that the partition by clause is optional, the order by clause is required, and the windowing clause (ROWS|RANGE) is not supported.

The scalar_expression is the value to be returned – this will normally be a column, but it can also be a subquery or any other expression that results in a single value.

The offset is the number of rows back (LAG) or forward (LEAD) from the current row from which to obtain the value; if not specified, it defaults to 1.

The default is the value to return when the value at the offset is NULL (for instance, when the offset reaches past the start or end of the partition); if not specified, it defaults to NULL.
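As a minimal sketch of these two arguments (using a made-up @MonthlySales table, not part of the examples that follow), an offset of 2 reaches two rows back, and a default of 0 is returned for the first two rows, where no row exists at that offset:

DECLARE @MonthlySales TABLE (MonthNo INT, MonthlyTotal INT);
INSERT INTO @MonthlySales (MonthNo, MonthlyTotal)
VALUES (1, 100), (2, 110), (3, 120), (4, 130);
-- LAG(MonthlyTotal, 2, 0): look two rows back; return 0 when no such row exists.
SELECT MonthNo,
       MonthlyTotal,
       TwoMonthsAgo = LAG(MonthlyTotal, 2, 0)
                      OVER (ORDER BY MonthNo)
  FROM @MonthlySales;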

Using the LEAD function

An example of using LEAD is finding the gaps in a numeric column. For the following table, we want to determine the starting and ending values of each gap:

DECLARE @Gaps TABLE (col1 int PRIMARY KEY CLUSTERED);
INSERT INTO @Gaps (col1)
VALUES (1), (2), (3),
       (50), (51), (52), (53), (54), (55),
       (100), (101), (102),
       (500),
       (950), (951), (952);

The gaps are determined by comparing col1 in the current row to col1 in the next row, and if the difference is > 1 then there is a gap. For each row that has a gap, the next row’s value -1 is the end of the gap, and the current row’s value +1 is the start of the gap. Since we want to compare the current row to the next row, we’ll use the LEAD function. The code for this query and its results are:

WITH cte AS
(
SELECT curr = col1,
       nxt = LEAD(col1, 1, NULL)
             OVER (ORDER BY col1)
  FROM @Gaps
)
SELECT [Start of Gap] = cte.curr + 1,
       [End of Gap] = cte.nxt - 1
  FROM cte
 WHERE cte.nxt - cte.curr > 1;

Start of Gap End of Gap
------------ -----------
4            49
56           99
103          499
501          949

Using the LAG function

Let’s say that we have RFID tags attached to different items. Sensors periodically determine the location of each tag, and store the item, the location where it was detected, and a timestamp of when the sensor detected it. The following code creates a sample table, and then builds 100 rows of data – 10 items, each appearing at each of 10 locations, with a random timestamp. The code then replaces the location in rows 3, 6 and 9 for each item (as ordered by the timestamp) with the previous row’s location value – giving us, for each item, some rows where the item changed location and some rows where it stayed in the same location.

DECLARE @Locations TABLE (
  ItemId INT,
  LocationId INT,
  TrackDateTime DATETIME);
-- build 100 rows of data. 10 items * 10 locations,
-- and each row has a random date.
WITH
Tens (N) AS (SELECT 1 UNION ALL SELECT 2 UNION ALL
             SELECT 3 UNION ALL SELECT 4 UNION ALL
             SELECT 5 UNION ALL SELECT 6 UNION ALL
             SELECT 7 UNION ALL SELECT 8 UNION ALL
             SELECT 9 UNION ALL SELECT 10)
INSERT INTO @Locations (ItemId, LocationId, TrackDateTime)
SELECT t1.N,
       t2.N,
       CAST(ABS(CHECKSUM(NEWID()))%3653+36524 AS DATETIME)
  FROM Tens t1
       CROSS JOIN Tens t2;
 
WITH CTE AS
(
SELECT ItemId, LocationId, TrackDateTime,
       PriorLocation = LAG(LocationId)
                       OVER (PARTITION BY ItemId
                                 ORDER BY TrackDateTime),
       RN = ROW_NUMBER()
            OVER (PARTITION BY ItemId
                      ORDER BY TrackDateTime)
  FROM @Locations
)
UPDATE CTE
   SET LocationId = PriorLocation
 WHERE RN%3=0;

Here the LAG function was utilized to get the previous row’s Location within each Item, where the previous row is determined by its timestamp value. It’s already been a helpful function, and I still have the rest of the code to show!

Using both the LAG and LEAD functions

Now let’s see a list of items that did not move – that is, where the current location and the previous location are the same. And for good measure, let’s also look at where the item moved to next. We’ll use the LAG function to get the previous location and the LEAD function to get the next location, and then return just the rows where the current location and the prior location are the same. The code and the results are:

WITH CTE AS
(
SELECT *,
       PriorLocation = LAG(LocationId)
                       OVER (PARTITION BY ItemId
                                 ORDER BY TrackDateTime),
       NextLocation  = LEAD(LocationId)
                       OVER (PARTITION BY ItemId
                                 ORDER BY TrackDateTime)
  FROM @Locations
)
SELECT *
  FROM CTE
 WHERE LocationId = PriorLocation;

ItemId      LocationId  TrackDateTime           PriorLocation NextLocation
----------- ----------- ----------------------- ------------- ------------
1           10          2004-10-27 00:00:00.000 10            5
1           3           2008-04-07 00:00:00.000 3             7
1           6           2009-01-17 00:00:00.000 6             2
2           2           2004-01-02 00:00:00.000 2             4
2           1           2007-06-12 00:00:00.000 1             6
2           3           2008-10-12 00:00:00.000 3             5
3           8           2002-12-16 00:00:00.000 8             9
3           6           2005-08-21 00:00:00.000 6             10
3           2           2007-07-16 00:00:00.000 2             3
4           1           2003-07-03 00:00:00.000 1             6
4           7           2004-11-09 00:00:00.000 7             3
4           2           2008-08-26 00:00:00.000 2             4
5           8           2002-09-30 00:00:00.000 8             5
5           4           2004-04-03 00:00:00.000 4             10
5           6           2008-05-19 00:00:00.000 6             2
6           5           2003-07-28 00:00:00.000 5             8
6           4           2006-01-11 00:00:00.000 4             9
6           1           2008-11-20 00:00:00.000 1             2
7           6           2002-02-08 00:00:00.000 6             3
7           4           2003-03-18 00:00:00.000 4             2
7           1           2006-08-05 00:00:00.000 1             10
8           4           2002-10-03 00:00:00.000 4             5
8           3           2005-08-04 00:00:00.000 3             8
8           1           2006-07-05 00:00:00.000 1             10
9           4           2001-11-17 00:00:00.000 4             6
9           3           2005-04-18 00:00:00.000 3             10
9           5           2006-08-02 00:00:00.000 5             2
10          6           2002-03-15 00:00:00.000 6             2
10          8           2004-01-30 00:00:00.000 8             1
10          4           2008-06-01 00:00:00.000 4             5

I think that these two functions offer some incredible possibilities by letting you take a peek at the rows adjacent to the current row. The ability to look a specified number of rows away from the current row is also very powerful – for instance, by looking at the row that is four rows away, you can compare each quarter to the corresponding quarter of the previous year.
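Here is a minimal sketch of that quarter-over-quarter idea (using a made-up @QuarterlySales table): with one row per quarter, ordered chronologically, an offset of 4 reaches back to the same quarter of the previous year:

DECLARE @QuarterlySales TABLE (Yr INT, Qtr INT, Amount INT);
INSERT INTO @QuarterlySales (Yr, Qtr, Amount)
VALUES (2010, 1, 100), (2010, 2, 120), (2010, 3, 110), (2010, 4, 150),
       (2011, 1, 105), (2011, 2, 130), (2011, 3, 115), (2011, 4, 160);
-- With one row per quarter, LAG(Amount, 4) returns the same quarter of the prior year.
SELECT Yr, Qtr, Amount,
       SameQtrLastYear    = LAG(Amount, 4) OVER (ORDER BY Yr, Qtr),
       ChangeFromLastYear = Amount - LAG(Amount, 4) OVER (ORDER BY Yr, Qtr)
  FROM @QuarterlySales;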

FIRST_VALUE / LAST_VALUE

The FIRST_VALUE and LAST_VALUE functions return the first or last value in the ordered set of your partition, optionally applying a windowing clause to limit the rows being considered. The syntax of these functions is:

FIRST_VALUE | LAST_VALUE ( scalar_expression )

    OVER ( [ partition_by_clause ] order_by_clause [ rows_range_clause ] )

Note that the order by clause is required, and that the partition by and windowing clauses are optional. The scalar_expression is the value to be returned – this will normally be a column, but it can also be a subquery or any other expression that results in a single value.

For this example, I’m going to use the AdventureWorks for Denali database, which can be downloaded from http://sqlserversamples.codeplex.com/ (it will probably be renamed to 2012 soon). In this database, there is a Sales.SalesOrderHeader table, which has information about each order, including the CustomerID, OrderDate, and TotalDue columns. Let’s run a query that shows, for each CustomerID, the OrderDate for when they placed their least and most expensive orders. This entails using the FIRST_VALUE and LAST_VALUE functions to return the OrderDate column. The OVER clause specifies that the PARTITION BY is the CustomerID, the ORDER BY is the TotalDue, and the ROWS clause specifies all rows for that CustomerID. The code and results are:

SELECT DISTINCT TOP (10)
       CustomerID,
       OrderDateLow  = FIRST_VALUE(OrderDate)
                       OVER (PARTITION BY CustomerID
                                 ORDER BY TotalDue
                                  ROWS BETWEEN UNBOUNDED PRECEDING
                                           AND UNBOUNDED FOLLOWING),
       OrderDateHigh = LAST_VALUE(OrderDate)
                       OVER (PARTITION BY CustomerID
                                 ORDER BY TotalDue
                                  ROWS BETWEEN UNBOUNDED PRECEDING
                                           AND UNBOUNDED FOLLOWING)
  FROM Sales.SalesOrderHeader
 ORDER BY CustomerID;

CustomerID  OrderDateLow              OrderDateHigh
----------- -----------------------   -----------------------
11000       2007-07-22 00:00:00.000   2005-07-22 00:00:00.000
11001       2008-06-12 00:00:00.000   2005-07-18 00:00:00.000
11002       2007-07-04 00:00:00.000   2005-07-10 00:00:00.000
11003       2007-07-09 00:00:00.000   2005-07-01 00:00:00.000
11004       2007-07-26 00:00:00.000   2005-07-26 00:00:00.000
11005       2007-07-27 00:00:00.000   2005-07-02 00:00:00.000
11006       2007-07-02 00:00:00.000   2005-07-27 00:00:00.000
11007       2007-07-25 00:00:00.000   2005-07-12 00:00:00.000
11008       2007-07-07 00:00:00.000   2005-07-28 00:00:00.000
11009       2007-07-24 00:00:00.000   2005-07-30 00:00:00.000

CUME_DIST / PERCENT_RANK

The CUME_DIST function returns the number of rows with a value (as determined by the ORDER BY clause) less than or equal to the current row’s value, divided by the total number of rows in the partition.

The PERCENT_RANK function returns the relative rank of the current row’s value (as determined by the ORDER BY clause) within the partition, expressed as a number between 0 and 1.

For both functions, NULLs are treated as the lowest possible values. The syntax of these functions is:

CUME_DIST | PERCENT_RANK

    OVER ( [ partition_by_clause ] order_by_clause )

Note that the ROWS|RANGE clause is not supported, the ORDER BY clause is required, and the PARTITION BY clause is optional. These functions will be a bit easier to understand when we see an example, so let’s just jump straight to one, and then we’ll explain what it is doing.

DECLARE @Salaries TABLE (
    DepartmentID INT,
    Salary       INT);
INSERT INTO @Salaries
VALUES (1, 15000),
       (1, 18000),
       (1, 23000),
       (1, 23000),
       (1, 25000);
SELECT DepartmentID,
       Salary,
       CumeDist    = CUME_DIST()
                     OVER (PARTITION BY DepartmentID
                               ORDER BY Salary),
       PercentRank = PERCENT_RANK()
                     OVER (PARTITION BY DepartmentID
                               ORDER BY Salary)
  FROM @Salaries;

DepartmentID Salary      CumeDist               PercentRank
------------ ----------- ---------------------- ----------------------
1            15000       0.2                    0
1            18000       0.4                    0.25
1            23000       0.8                    0.5
1            23000       0.8                    0.5
1            25000       1                      1

Here we have four distinct salary values across five rows. For CUME_DIST, on the row with the first value, there is 1 row with that value or lower. For the second row, there are 2 rows with that value or lower. For the third and fourth rows, there are 4 rows with that value or lower. And for the fifth row, there are 5 rows with that value or lower. Each of these row counts is divided by the total number of rows in the partition (5). So, 1/5 = 0.2; 2/5 = 0.4; 4/5 = 0.8; and 5/5 = 1.

For PERCENT_RANK, the calculation is (the row’s rank - 1) divided by (the number of rows in the partition - 1). The first value always has a percent rank of 0, and the last value always has a percent rank of 1. Since there are 5 rows here, the denominator is 4, so each step in rank moves the result by 0.25. Looking at the example, you can see how the first value is 0, the second value is 0.25, the third value is 0.5 (for the third and fourth rows, which tie and therefore share the same rank), and the last value is 1.

Notice that when more than one row has the same value, CUME_DIST assigns them all the highest calculation for that value, while PERCENT_RANK assigns them all the lowest calculation for that value.
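If it helps to see the arithmetic, the following sketch (my own illustration, run in the same batch as the example above since table variables are batch-scoped) reproduces both calculations with ordinary window aggregates: CUME_DIST is the count of rows with a value less than or equal to the current value divided by the partition row count, and PERCENT_RANK is (RANK() - 1) divided by (partition row count - 1).

-- Illustration only: reproduce CUME_DIST and PERCENT_RANK with RANK/COUNT.
-- (Assumes the partition has more than one row, to avoid dividing by zero.)
SELECT DepartmentID,
       Salary,
       CumeDistCalc    = CAST(COUNT(*) OVER (PARTITION BY DepartmentID
                                                 ORDER BY Salary
                                                 RANGE BETWEEN UNBOUNDED PRECEDING
                                                           AND CURRENT ROW) AS FLOAT)
                         / COUNT(*) OVER (PARTITION BY DepartmentID),
       PercentRankCalc = (RANK() OVER (PARTITION BY DepartmentID
                                           ORDER BY Salary) - 1) * 1.0
                         / (COUNT(*) OVER (PARTITION BY DepartmentID) - 1)
  FROM @Salaries;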

These functions really start to dive into probability and analytics – areas that I’m not very strong at. I’d really like to get your feedback on real-world examples of where these would be used.

PERCENTILE_CONT / PERCENTILE_DISC

The PERCENTILE_CONT function calculates a percentile based on a continuous distribution of the column’s value.

The PERCENTILE_DISC function returns the column’s value for the smallest CUME_DIST value that is greater than or equal to the specified percentile value.

For the PERCENTILE_CONT function, the value returned may or may not exist in the column being analyzed; for the PERCENTILE_DISC function the value returned will belong to one of the rows in the partition. Thus, they may or may not return the same value. The syntax for these functions is:

PERCENTILE_CONT | PERCENTILE_DISC ( numeric_literal )

    WITHIN GROUP ( ORDER BY order_by_expression [ ASC | DESC ] )

    OVER ( [ <partition_by_clause> ] )

The first thing that you immediately notice (well, at least the first thing that I immediately noticed) is the WITHIN GROUP clause. This clause specifies a list of numeric values to sort and compute the percentile over. Note that only one order_by_expression is allowed. When using the PERCENTILE_CONT function, it must evaluate to one of the exact or approximate numeric data types.

Within the OVER clause, the PARTITION BY is optional, and the ORDER BY and ROW|RANGE clauses are not supported. Yes, this means that you can have an empty OVER clause.

The numeric_literal is the percentile to compute, and it must be within the range of 0.0 to 1.0.

For both functions, NULL values are ignored.

Let's look at an example. We will make an employee table, where each employee is assigned to a department and has a salary. We’ll return each employee, along with the median salary in the department and the median salary across the entire table. The medians will be calculated by using both functions, passing in a percentile of 0.5 (50%). For good measure, I’m going to include the AVG and CUME_DIST values as well.

DECLARE @Employees TABLE (
    EmplId INT PRIMARY KEY CLUSTERED,
    DeptId INT,
    Salary NUMERIC(8,2));
INSERT INTO @Employees
VALUES (1, 1, 10000),
       (2, 1, 11000),
       (3, 1, 12000),
       (4, 2, 25000),
       (5, 2, 35000),
       (6, 2, 100000),
       (7, 2, 100000);
SELECT EmplId,
       DeptId,
       Salary,
       PctC1 = PERCENTILE_CONT(0.5)
               WITHIN GROUP (ORDER BY Salary ASC)
               OVER (PARTITION BY DeptId),
       PctD1 = PERCENTILE_DISC(0.5)
               WITHIN GROUP (ORDER BY Salary ASC)
               OVER (PARTITION BY DeptId),
       CD1   = CUME_DIST()
               OVER (PARTITION BY DeptId
                         ORDER BY Salary),
       AVG1  = AVG(Salary)
               OVER (PARTITION BY DeptId),
       PctC2 = PERCENTILE_CONT(0.5)
               WITHIN GROUP (ORDER BY Salary ASC)
               OVER (),
       PctD2 = PERCENTILE_DISC(0.5)
               WITHIN GROUP (ORDER BY Salary ASC)
               OVER (),
       CD2   = CUME_DIST()
               OVER (ORDER BY Salary ASC),
       AVG2  = AVG(Salary)
               OVER (ORDER BY Salary
                      ROWS BETWEEN UNBOUNDED PRECEDING
                               AND UNBOUNDED FOLLOWING)
  FROM @Employees
 ORDER BY DeptId, EmplId;

Results:

EmplId DeptId Salary     PctC1 PctD1     CD1   AVG1   PctC2  PctD2     CD2   AVG2
------ ------ ---------  ----- --------  ----  -----  -----  --------  ----  -----
1      1      10000.00   11000 11000.00  0.33  11000  25000  25000.00  0.14  41857
2      1      11000.00   11000 11000.00  0.66  11000  25000  25000.00  0.28  41857
3      1      12000.00   11000 11000.00  1     11000  25000  25000.00  0.42  41857
4      2      25000.00   67500 35000.00  0.25  65000  25000  25000.00  0.57  41857
5      2      35000.00   67500 35000.00  0.5   65000  25000  25000.00  0.71  41857
6      2      100000.00  67500 35000.00  1     65000  25000  25000.00  1     41857
7      2      100000.00  67500 35000.00  1     65000  25000  25000.00  1     41857

The PctC# columns utilize the PERCENTILE_CONT function, and the PctD# columns utilize the PERCENTILE_DISC function. The columns ending with “1” are partitioned by the Department, and the columns ending in “2” are calculated across the entire table.

Notice that for Dept 1, the PctC1 and PctD1 columns both return the same value, and that this value is one of the Salary values in that department. However, when we look at Dept 2, we see that the PctC1 column has a value that is the median of the four salaries, and it is not one of the actual salary values. The PctD1 column returns the value from the first row where the CUME_DIST is >= 0.5, which in this case is the second row. But where is the value in the PctC1 column coming from? After a bit of searching on the internet, I found an article at http://download.oracle.com/docs/cd/B19306_01/server.102/b14200/functions110.htm that explains how this function is implemented in Oracle. Quoting from it:

“The result of PERCENTILE_CONT is computed by linear interpolation between values after ordering them. Using the percentile value (P) and the number of rows (N) in the aggregation group, we compute the row number we are interested in after ordering the rows with respect to the sort specification. This row number (RN) is computed according to the formula RN = (1 + (P*(N-1))). The final result of the aggregate function is computed by linear interpolation between the values from rows at row numbers CRN = CEILING(RN) and FRN = FLOOR(RN).”

Okay, so now I can see that for Department 2 it is taking the Salary values from the two middle rows (35,000 and 100,000) and interpolating between them; since RN = 2.5 falls exactly halfway, that amounts to adding them together and dividing by 2, resulting in 67,500.
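If it helps to see that arithmetic spelled out, here is my own worked example of the formula quoted above (not output from the query); the SELECT just verifies the final number:

-- Department 2: N = 4 rows, P = 0.5
-- RN  = 1 + P * (N - 1) = 1 + 0.5 * 3 = 2.5
-- FRN = FLOOR(2.5)   = 2  (Salary  35,000)
-- CRN = CEILING(2.5) = 3  (Salary 100,000)
-- Result = (CRN - RN) * 35000 + (RN - FRN) * 100000
SELECT PctContDept2 = (3 - 2.5) * 35000 + (2.5 - 2) * 100000;   -- = 67500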

Summary

The LAG, LEAD, FIRST_VALUE and LAST_VALUE functions have a lot of potential, as I’ve shown. I don’t know if I’ll ever use the other functions, but at least now I understand what they do. If you have a real-world use for these, please post a comment explaining it.

Thanks for taking the time to read this article. 

Support for OVER clauses by function

FUNCTION         OVER Clause Usage  PARTITION BY  ORDER BY  ROWS/RANGE
---------------  -----------------  ------------  --------  ----------
CUME_DIST        R                  O             R         X
FIRST_VALUE      R                  O             R         O
LAG              R                  O             R         X
LAST_VALUE       R                  O             R         O
LEAD             R                  O             R         X
PERCENTILE_CONT  R                  O             X         X
PERCENTILE_DISC  R                  O             X         X
PERCENT_RANK     R                  O             R         X

O=Optional, R=Required, X=Not Supported

REFERENCES / Links

OVER Clause (BOL)

http://msdn.microsoft.com/en-us/library/ms189461(v=sql.110).aspx

Analytic Functions (BOL)

http://msdn.microsoft.com/en-us/library/hh213234(v=sql.110).aspx

CUME_DIST

http://msdn.microsoft.com/en-us/library/hh231078(v=SQL.110).aspx

FIRST_VALUE

http://msdn.microsoft.com/en-us/library/hh213018(v=SQL.110).aspx

LAG

http://msdn.microsoft.com/en-us/library/hh231256(v=SQL.110).aspx

LAST_VALUE

http://msdn.microsoft.com/en-us/library/hh231517(v=SQL.110).aspx

LEAD

http://msdn.microsoft.com/en-us/library/hh213125(v=SQL.110).aspx

PERCENTILE_CONT

http://msdn.microsoft.com/en-us/library/hh231473(v=SQL.110).aspx

PERCENTILE_DISC

http://msdn.microsoft.com/en-us/library/hh231327(v=SQL.110).aspx

PERCENT_RANK

http://msdn.microsoft.com/en-us/library/hh213573(v=SQL.110).aspx

PERCENTILE_CONT (Oracle link)

http://download.oracle.com/docs/cd/B19306_01/server.102/b14200/functions110.htm

OVER Clause enhancements in SQL Server code-named “Denali”, CTP3

http://www.sqlservercentral.com/articles/T-SQL/75466/

AdventureWorks for Denali

http://sqlserversamples.codeplex.com/
