This post is part of the series discussing the new Analytic functions in SQL Server “Denali”, CTP3.
Analytic Functions in SQL Server “Denali”
All of the new Analytic functions require the use of the OVER clause. To see what changes to the OVER clause that “Denali” brings, please view my article The OVER Clause enhancements in SQL Server code named “Denali”, CTP3 on SQLServerCentral.com.
Introducing the CUME_DIST / PERCENT_RANK functions
The CUME_DIST function is the number of rows with a value (as specified in the ORDER BY clause) less than or equal to the current value, divided by the total number of rows in this partition.
The PERCENT_RANK function return the percentage that the value (as specified in the ORDER BY clause) is within all of the values of the partition.
For both functions, NULLS are treated as the lowest possible value.
The syntax of these functions is:
CUME_DIST | PERCENT_RANK OVER ( [ partition_by_clause ] order_by_clause )
Note that the ROWS|RANGE clause is not supported, the ORDER BY clause is required, and the PARTITION BY clause is optional.
This will be a bit easier to understand when we see an example, so let’s just jump straight to one, then we’ll explain what it’s doing.
DECLARE @test TABLE ( DepartmentID INT, Salary INT); INSERT INTO @test VALUES (1, 15000), (1, 18000), (1, 23000), (1, 23000), (1, 25000); SELECT DepartmentID, Salary, CumeDist = CUME_DIST() OVER (PARTITION BY DepartmentID ORDER BY Salary), PercentRank = PERCENT_RANK() OVER (PARTITION BY DepartmentID ORDER BY Salary) FROM @test;
DepartmentID Salary CumeDist PercentRank ------------ ----------- ---------------------- ---------------------- 1 15000 0.2 0 1 18000 0.4 0.25 1 23000 0.8 0.5 1 23000 0.8 0.5 1 25000 1 1
Here we have 4 salary values in 5 rows. For CUME_DIST, on the row with the first value, there is 1 row with that value or lower. For the second row, there are 2 rows with that value or lower. For the third and fourth rows, there are 4 rows with that value or lower. And, for the fifth row, there are 5 rows with that value or lower. Each of these row count values are divided by the total number of rows in that partition (5). So, 1/5 = 0.2; 2/5 = 0.4; 4/5 = 0.8, and 5/5 = 1.
For PERCENT_RANK, there are 4 unique values. The first value has a percent rank of 0, and the last value has a percent rank of 1. The other values are divided up by their percentage through the number of rows – 1. Since there are 5 rows, and the first PERCENT_RANK starts at 0, the others (5-1) are divided equally, so each row has a PERCENT_RANK of 1/4 (or .25). Looking at the example, you can see how the first value is 0, the second value is 0.25, the third value is 0.5 (for the third and fourth rows), and the last value is 1.
Notice that for CUME_DIST, if there are more than one row with the same value, they all have the highest CUME_DIST calculation; for PERCENT_RANK, each row with the same value they all have the lowest PERCENT_RANK calculation.
These functions really start to dive into probability and analytics – areas that I’m not very strong at. I’d really like to get your feedback on real-world examples of where these would be used.