Analytic Functions – PERCENTILE_CONT / PERCENTILE_DISC

This post is part of the series discussing the new Analytic functions in SQL Server “Denali”, CTP3.

Analytic Functions in SQL Server “Denali”

All of the new Analytic functions require the use of the OVER clause. To see what changes to the OVER clause that “Denali” brings, please view my article The OVER Clause enhancements in SQL Server code named “Denali”, CTP3 on SQLServerCentral.com.

Introducing the PERCENTILE_CONT and PERCENTILE_DISC functions

Rounding out the Analytic Functions series are the PERCENTILE_CONT and PERCENTILE_DISC functions. The PERCENTILE_CONT function calculates a percentile based on a continuous distribution of the column's values. The PERCENTILE_DISC function returns the column value for the smallest CUME_DIST value that is greater than or equal to the specified percentile. For PERCENTILE_CONT, the value returned may or may not exist in the column being analyzed; for PERCENTILE_DISC, the value returned always belongs to one of the rows in the partition. Thus, the two functions may or may not return the same value. The syntax for these functions is:

PERCENTILE_CONT | PERCENTILE_DISC ( numeric_literal )
    WITHIN GROUP ( ORDER BY order_by_expression [ ASC | DESC ] )
    OVER ( [ <partition_by_clause> ] )

The first thing that you immediately notice (well, at least the first thing that I immediately noticed) is the WITHIN GROUP clause. This clause specifies the expression whose sorted values the percentile is computed over. Note that only one order_by_expression is allowed. When using the PERCENTILE_CONT function, it must evaluate to one of the exact or approximate numeric data types.

Within the OVER clause, the PARTITION BY is optional, and the ORDER BY and ROWS | RANGE clauses are not supported.

The numeric_literal is the percentile to compute, and it must be within the range of 0.0 to 1.0.

For both functions, NULL values are ignored.
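To make the PERCENTILE_DISC definition concrete before diving into the T-SQL example, here is a small Python sketch of the logic just described. This is my own illustration of the documented behavior, not SQL Server's internal implementation:

```python
def percentile_disc(values, p):
    """Return the value (in sorted order) with the smallest CUME_DIST
    that is >= p. NULLs (None) are ignored, and the result is always
    an actual value from the input."""
    vals = sorted(v for v in values if v is not None)
    n = len(vals)
    for v in vals:
        # CUME_DIST of v: the fraction of rows with a value <= v
        cume_dist = sum(1 for x in vals if x <= v) / n
        if cume_dist >= p:
            return v

# Department 2 salaries from the example below:
print(percentile_disc([25000, 35000, 100000, 100000], 0.5))  # 35000
```

Note that the result (35,000) is one of the input salaries, which is exactly what distinguishes PERCENTILE_DISC from PERCENTILE_CONT.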

Okay, let’s make an employee table, with each employee assigned to a department and given a salary. We’ll return each employee, along with the median salary in the department and the median salary in the entire table. The medians will be calculated by using both functions, and we’ll pass in a percentile of 0.5 (50%). For good measure, I’m going to include the AVG and CUME_DIST values also.

DECLARE @test TABLE (
    EmplId INT PRIMARY KEY CLUSTERED,
    DeptId INT,
    Salary NUMERIC(8,2));
 
INSERT INTO @test
VALUES (1, 1, 10000),
       (2, 1, 11000),
       (3, 1, 12000),
       (4, 2, 25000),
       (5, 2, 35000),
       (6, 2, 100000),
       (7, 2, 100000);
 
SELECT EmplId,
       DeptId,
       Salary,
       PctC1 = PERCENTILE_CONT(0.5)
               WITHIN GROUP (ORDER BY Salary ASC)
               OVER (PARTITION BY DeptId),
       PctD1 = PERCENTILE_DISC(0.5)
               WITHIN GROUP (ORDER BY Salary ASC)
               OVER (PARTITION BY DeptId),
       CD1   = CUME_DIST()
               OVER (PARTITION BY DeptId
                         ORDER BY Salary),
       AVG1  = AVG(Salary)
               OVER (PARTITION BY DeptId),
       PctC2 = PERCENTILE_CONT(0.5)
               WITHIN GROUP (ORDER BY Salary ASC)
               OVER (),
       PctD2 = PERCENTILE_DISC(0.5)
               WITHIN GROUP (ORDER BY Salary ASC)
               OVER (),
       CD2   = CUME_DIST()
               OVER (ORDER BY Salary ASC),
       AVG2  = AVG(Salary)
               OVER (ORDER BY Salary
                      ROWS BETWEEN UNBOUNDED PRECEDING
                               AND UNBOUNDED FOLLOWING)
  FROM @test
 ORDER BY DeptId, EmplId;

This query returns the following result set:

EmplId DeptId Salary     PctC1 PctD1     CD1   AVG1   PctC2  PctD2     CD2   AVG2
------ ------ ---------  ----- --------  ----  -----  -----  --------  ----  -----
1      1      10000.00   11000 11000.00  0.33  11000  25000  25000.00  0.14  41857
2      1      11000.00   11000 11000.00  0.66  11000  25000  25000.00  0.28  41857
3      1      12000.00   11000 11000.00  1     11000  25000  25000.00  0.42  41857
4      2      25000.00   67500 35000.00  0.25  65000  25000  25000.00  0.57  41857
5      2      35000.00   67500 35000.00  0.5   65000  25000  25000.00  0.71  41857
6      2      100000.00  67500 35000.00  1     65000  25000  25000.00  1     41857
7      2      100000.00  67500 35000.00  1     65000  25000  25000.00  1     41857

The PctC# columns utilize the PERCENTILE_CONT function, and the PctD# columns utilize the PERCENTILE_DISC function. The columns ending with "1" are partitioned by the Department, and the columns ending in "2" are across the entire table.

Notice that for Dept 1, the PctC1 and PctD1 columns both return the same value, and this value is one of the Salary values for this department. However, when we take a look at Dept 2, we see that the PctC1 column holds the median of the four salaries, and it is a value that is not one of them. The PctD1 column returns the value from the first row where the CUME_DIST is >= 0.5, which in this case is the second row. But where is the value in the PctC1 column coming from? After a bit of searching on the internet, I found this article that explains how this function is implemented in Oracle. Quoting from it:

“The result of PERCENTILE_CONT is computed by linear interpolation between values after ordering them. Using the percentile value (P) and the number of rows (N) in the aggregation group, we compute the row number we are interested in after ordering the rows with respect to the sort specification. This row number (RN) is computed according to the formula RN = (1 + (P*(N-1))). The final result of the aggregate function is computed by linear interpolation between the values from rows at row numbers CRN = CEILING(RN) and FRN = FLOOR(RN).”

Okay, so now I can see how the value for Department 2 is produced: with P = 0.5 and N = 4, RN = 1 + 0.5 * (4 - 1) = 2.5, so the function takes the Salary values from rows 2 and 3 (35,000 and 100,000) and interpolates exactly halfway between them, resulting in 67,500.
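That interpolation is easy to reproduce outside of SQL Server. The sketch below is my own Python translation of the quoted formula (an illustration, not SQL Server's internal code), and it reproduces both PERCENTILE_CONT medians from the result set above:

```python
import math

def percentile_cont(values, p):
    """Linear interpolation between the values at rows FRN = FLOOR(RN)
    and CRN = CEILING(RN), where RN = 1 + p * (n - 1). NULLs (None)
    are ignored."""
    vals = sorted(v for v in values if v is not None)
    n = len(vals)
    rn = 1 + p * (n - 1)                   # fractional "row number"
    frn, crn = math.floor(rn), math.ceil(rn)
    lo, hi = vals[frn - 1], vals[crn - 1]  # 1-based rows -> 0-based index
    return lo + (rn - frn) * (hi - lo)     # interpolate between the two rows

# Department 2 salaries: RN = 2.5, halfway between 35,000 and 100,000
print(percentile_cont([25000, 35000, 100000, 100000], 0.5))  # 67500.0

# Whole table: RN = 4.0 lands exactly on a row, so no interpolation
print(percentile_cont([10000, 11000, 12000, 25000, 35000, 100000, 100000], 0.5))  # 25000.0
```

When RN lands exactly on a row, FRN and CRN are equal and the function simply returns that row's value, which is why PctC2 matches PctD2 for the whole table.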

I don’t know if I’ll ever use these functions, but at least now I understand what they do. If you have a real-world use for these, please post a comment explaining it.

This concludes this series on the new Analytic Functions presented by SQL Server “Denali”.

References:

PERCENTILE_CONT
PERCENTILE_DISC
PERCENTILE_CONT (Oracle)

Analytic Function Series:

Overview
LAG/LEAD
FIRST_VALUE/LAST_VALUE
CUME_DIST/PERCENT_RANK
