SQLServerCentral Article

ROW_NUMBER(): An Efficient Alternative to Subqueries

,

Introduction

SQL Server 2005 offers an array of ranking

and windowing functions that can be used to evaluate the data and only return

appropriate rows.  For instance, a development cycle of a product may

include hundreds of releases, and those releases may have versions associated

with them: a "0" version with its many minor versions, a

"1" version with its minor versions, etc.  If the history of a

particular product's releases is kept, then analysis can be done to see how far

a version is developed before it is released.  An ORDER BY clause alone cannot fulfill this need

because the query would most likely still return the entire history of the product,

not necessarily the last release of every version.  The code is also available

for download.  The name of the file is RowCountScenario1-CodeDownload.sql.

Scenario 1 – Versioning

In order to demonstrate the usage

of the ROW_NUMBER() windowing function, I started

with Microsoft's AdventureWorks database.  In particular, I used the data

in the Production.ProductCostHistory table.  The products in this table

are identified by the ProductID column; this is a foreign key to the

Production.Product table.  Using the Production.ProductCostHistory table,

I mocked up some data to create versions for each Product in the table.  I

used a random number generation process to create attributes called Version,

MinorVersion and ReleaseVersion for each product.  These attributes are meant

to show detailed information about the product.  Together 7.0.59 represents

that the 7th version of the product is currently being used, a minor

version represents the iteration of the version, and the release version of

this particular installation is 59.  The next iteration of the product's life

cycle could result with 7.2.19.  I also used the existing StandardCost to

create different costs for each of the Versions, to create some sense of value

for the particular Version.

I created a table called Production.ProductVersion

with the ProductID, Version, MinorVersion and ReleaseVersion defined as the primary

key and the StandardCost as an attribute.  I inserted the mocked up data

generated by the code into this table to model a simple product version/cost history.

CREATE TABLE Production.ProductVersion
(
      ProductID int NOT NULL,
      Version int NOT NULL,
      MinorVersion int NOT NULL,
      ReleaseVersion int NOT NULL,
      StandardCost numeric(30, 4) NOT NULL,
      CONSTRAINT PK_ProductVersion PRIMARY KEY CLUSTERED 
      (
            ProductID ASC,
            Version ASC,
            MinorVersion ASC,
            ReleaseVersion ASC
      )
);

I used the following code to populate the table with randomized data.  I created the data using a common

table expression (CTE) and inserted the data into the table after it was generated. The data is based on the Production.ProductCostHistory table.

WITH ProductVersion
AS
(
   SELECT ProductID,
      1 AS Version,
      CAST((ABS(CHECKSUM(NEWID())) % 1001) AS INT) AS MinorVersion,
      CAST((ABS(CHECKSUM(NEWID())) % 20001) AS INT) AS ReleaseVersion,
      CAST(StandardCost  AS NUMERIC(30,4)) AS StandardCost
   FROM Production.ProductCostHistory WITH (NOLOCK)
   UNION ALL
   SELECT
      ProductID,
      ABS(CHECKSUM(NEWID())) % 3 AS Version,
      CAST((ABS(CHECKSUM(NEWID())) % 1001) AS INT) AS MinorVersion,
      CAST((ABS(CHECKSUM(NEWID())) % 20001) AS INT) AS ReleaseVersion,
      CAST((CAST(StandardCost  AS NUMERIC(30,4)) * 1.10) AS NUMERIC(30,4)) AS StandardCost
   FROM Production.ProductCostHistory WITH (NOLOCK)
   UNION ALL
   SELECT
      ProductID,
      ABS(CHECKSUM(NEWID())) % 5 AS Version,
      CAST((ABS(CHECKSUM(NEWID())) % 1001) AS INT) AS MinorVersion,
      CAST((ABS(CHECKSUM(NEWID())) % 20001) AS INT) AS ReleaseVersion,
      CAST((CAST(StandardCost  AS NUMERIC(30,4)) * 2.10) AS NUMERIC(30,4)) AS StandardCost
   FROM Production.ProductCostHistory WITH (NOLOCK)
)
INSERT INTO Production.ProductVersion
SELECT ProductID,
   Version,
   MinorVersion,
   ReleaseVersion,
   MAX(StandardCost) AS StandardCost
FROM ProductVersion
GROUP BY ProductID,
   Version,
   MinorVersion,
   ReleaseVersion;

The ABS(CHECKSUM(NEWID()))

is utilized as the random number generator; the modulus operator provides the

upper bound for the random number that was generated.  The NEWID()

function is guaranteed to generate a globally unique identifier for each row. 

The GROUP BY clause is

used to avoid any Primary Key constraint violations that might be encountered.

The purpose of this exercise is to

avoid the complexity of certain code by using the ROW_NUMBER()

windowing function.  Suppose you are required to return only the latest version

of a Product with its associated MinorVersion, ReleaseVersion and

StandardCost.  The following query will not return the correct result set.

SELECT
   ProductID,
   MAX(Version) AS Version,
   MAX(MinorVersion) AS MinorVersion,
   MAX(ReleaseVersion) AS ReleaseVersion,
   MAX(StandardCost) AS StandardCost
FROM Production.ProductVersion WITH (NOLOCK)
GROUP BY ProductID;

In fact, this query violates

integrity of the rows of data in the table.  It simply returns the maximum

Version, the maximum MinorVersion, the maximum ReleaseVersion and the maximum

StandardCost of a particular Product.  This is an easy and tempting trap to

fall into.  Compare the results displayed in Figure 2 and Figure 3.  The actual

data that is in Production.ProductVersion is in Figure 1.

The following sample query captures

the actual requirements, and returns the correct result, but it is long and

convoluted.

SELECT ProductID,
   (
       SELECT MAX(Version)
       FROM Production.ProductVersion pv2 WITH (NOLOCK)
       WHERE pv2.ProductID = pv.ProductID
   ) AS Version,
   (
       SELECT MAX(MinorVersion)
       FROM Production.ProductVersion pv3 WITH (NOLOCK)
       WHERE pv3.ProductID = pv.ProductID
          AND pv3.Version = (
                           SELECT MAX(Version)
                           FROM Production.ProductVersion pv2 WITH (NOLOCK)
                           WHERE pv2.ProductID = pv.ProductID
                           )
   ) AS MinorVersion,
   (
       SELECT MAX(ReleaseVersion)
       FROM Production.ProductVersion pv4 WITH (NOLOCK)
       WHERE pv4.ProductID = pv.ProductID
          AND pv4.Version = (
                           SELECT MAX(Version)
                           FROM Production.ProductVersion pv2 WITH (NOLOCK)
                           WHERE pv2.ProductID = pv.ProductID
                           )
          AND pv4.MinorVersion = (
                               SELECT MAX(MinorVersion)
                               FROM Production.ProductVersion pv3 WITH (NOLOCK) 
                               WHERE pv3.ProductID = pv.ProductID
                                  AND pv3.Version = (
                                                   SELECT MAX(Version)
                                                   FROM Production.ProductVersion pv2 WITH (NOLOCK)
                                                   WHERE pv2.ProductID = pv.ProductID
                                                   )
                               )
   ) AS ReleaseVersion,
   (
       SELECT StandardCost
       FROM Production.ProductVersion pv5 WITH (NOLOCK)
       WHERE pv5.ProductID = pv.ProductID
          AND pv5.Version = (
                            SELECT MAX(Version)
                           FROM Production.ProductVersion pv2 WITH (NOLOCK)
                           WHERE pv2.ProductID = pv.ProductID
                           )
          AND pv5.MinorVersion = (
                               SELECT MAX(MinorVersion)
                               FROM Production.ProductVersion pv3 WITH (NOLOCK)
                               WHERE pv3.ProductID = pv.ProductID
                                  AND pv3.Version = (
                                                   SELECT MAX(Version)
                                                   FROM Production.ProductVersion pv2 WITH (NOLOCK)
                                                   WHERE pv2.ProductID = pv.ProductID
                                                   )
                               )
          AND pv5.ReleaseVersion = (
                                  SELECT MAX(ReleaseVersion)
                                  FROM Production.ProductVersion pv4 WITH (NOLOCK)
                                  WHERE pv4.ProductID = pv.ProductID
                                      AND pv4.Version = (
                                                       SELECT MAX(Version)
                                                       FROM Production.ProductVersion pv2 WITH (NOLOCK)
                                                       WHERE pv2.ProductID = pv.ProductID
                                                       )
                                      AND pv4.MinorVersion = (
                                                          SELECT MAX(MinorVersion)
                                                          FROM Production.ProductVersion pv3 WITH (NOLOCK)
                                                          WHERE pv3.ProductID = pv.ProductID
                                                              AND pv3.Version = (
                                                                           SELECT MAX(Version)
                                                                           FROM Production.ProductVersion pv2 WITH (NOLOCK)
                                                                               WHERE pv2.ProductID = pv.ProductID
                                                                               )
                                                          )
                               )
   ) AS StandardCost
FROM Production.ProductVersion pv WITH (NOLOCK)
GROUP BY ProductID;

This query utilizes nested

subqueries in order to ensure the integrity of each row.  The first subquery

(lines 117-25 in the code download) provides the maximum Version for each

Product.  The second subquery provides the maximum MinorVersion for the maximum

Version of each Product.  The subsubquery in the WHERE clause ensures that the

MinorVersions and the Versions match.  The third subquery,

SELECT MAX(ReleaseVersion)
FROM Production.ProductVersion pv4 WITH (NOLOCK)
WHERE pv4.ProductID = pv.ProductID
   AND pv4.Version = (
                     SELECT MAX(Version)
                     FROM Production.ProductVersion pv2 WITH (NOLOCK)
                    WHERE pv2.ProductID = pv.ProductID
                     )
   AND pv4.MinorVersion = (
                        SELECT MAX(MinorVersion)
                        FROM Production.ProductVersion pv3 WITH (NOLOCK)
                        WHERE pv3.ProductID = pv.ProductID
                           AND pv3.Version = (
                                             SELECT MAX(Version)
                                             FROM Production.ProductVersion pv2 WITH (NOLOCK)
                                             WHERE pv2.ProductID = pv.ProductID
                                             )
                        )

(lines 127-46 in the code download) provides the maximum

ReleaseVersion for the maximum MinorVersion of the maximum Version of each

Product.  Once again the subqueries ensure that only complete rows of data are retrieved. 

If this logic sounds too complicated, that is simply because it is.  The

estimated subtree cost for this query turns out to be 0.583005.  This includes

several Clustered Index Scan and Seek operations.  The query plan is displayed

in Figure 6.  The complexity of the subquery approach can increase if the

requirements change.

A simplified approach uses the ROW_NUMBER()

function as shown below.

WITH RowExample1
AS
(
   SELECT ROW_NUMBER() OVER(PARTITION BY ProductID
                     ORDER BY ProductID,
                        Version DESC,
                        MinorVersion DESC,
                        ReleaseVersion DESC
      ) AS MaxVersion,
      ProductID,
      Version,
      MinorVersion,
      ReleaseVersion,
      StandardCost
   FROM Production.ProductVersion pv WITH (NOLOCK)
)
SELECT ProductID,
   Version,
   MinorVersion,
   ReleaseVersion,
   StandardCost
FROM RowExample1
WHERE MaxVersion = 1
ORDER BY ProductID;

The result set for this query looks like the Figure 3.

Figure 1: Sample from Production.ProductVersion

Figure 2: Sample from incorrect query

Figure 3: Sample from properly implemented query

The PARTITION BY clause allows a set of row numbers to be

assigned for all distinct Products.  When a new ProductID is encountered, the

row numbering will start over at 1 and continue incrementing for each row with

the same ProductID.  The row number will be assigned according to the sort

order of the columns that you specify in the OVER clause's ORDER BY clause.  The estimated subtree cost for

this improved query is 0.039954.  This query has only one Clustered Index Scan

operation.  The query plan is displayed in Figure 7.

With the OVER clause, the ROW_NUMBER() function can efficiently assign

a row number to each row in a query.  The fact that I've ordered the partitions

by ProductID and then in descending order by Version, MinorVersion, and

ReleaseVersion, guarantees the maximum version will be in the first row of each

ProductID partition.  This allows me to use a simple WHERE MaxVersion = 1 predicate in place of the convoluted

sub-query logic in the previous sample query.

To test the effects of indexing on difference between the two methods, I used the following table.

CREATE TABLE Production.ProductVersion2
(
   ProductID int NOT NULL,
   Version int NOT NULL,
   MinorVersion int NOT NULL,
   ReleaseVersion int NOT NULL,
   StandardCost numeric(30, 4) NOT NULL,
);

I used the following query to generate a large set of randomized data to compare the estimated query costs for different record size sets.

WITH RCTE
AS
(
   SELECT 0 AS X,
       ProductID,
       1 AS Version,
       CAST((ABS(CHECKSUM(NEWID())) % 1001) AS INT) AS MinorVersion,
       CAST((ABS(CHECKSUM(NEWID())) % 20001) AS INT) AS ReleaseVersion,
       CAST(StandardCost AS NUMERIC(30,4)) AS StandardCost
   FROM Production.ProductCostHistory WITH (NOLOCK)
   UNION ALL
   SELECT X + 1,
       pv.ProductID,
       ABS(CHECKSUM(NEWID())) % 3 AS Version,
       CAST((ABS(CHECKSUM(NEWID())) % 1001) AS INT) AS MinorVersion,
       CAST((ABS(CHECKSUM(NEWID())) % 20001) AS INT) AS ReleaseVersion,
       CAST((CAST(pv.StandardCost AS NUMERIC(30,4)) * 1.10) AS NUMERIC(30,4)) AS StandardCost
   FROM RCTE pv
   WHERE X < 14
   UNION ALL
   SELECT X + 2,
       pv.ProductID,
       ABS(CHECKSUM(NEWID())) % 5 AS Version,
       CAST((ABS(CHECKSUM(NEWID())) % 1001) AS INT) AS MinorVersion,
       CAST((ABS(CHECKSUM(NEWID())) % 20001) AS INT) AS ReleaseVersion,
       CAST((CAST(pv.StandardCost AS NUMERIC(30,4)) * 2.10) AS NUMERIC(30,4)) AS StandardCost
   FROM RCTE pv
   WHERE X < 15
)
INSERT INTO Production.ProductVersion2
SELECT TOP (100000) ProductID,
      Version,
      MinorVersion,
      ReleaseVersion,
      MAX(StandardCost)
FROM RCTE
GROUP BY ProductID,
      Version,
      MinorVersion,
      ReleaseVersion
OPTION (MAXRECURSION 0);

The following charts show the estimated query costs for different row sizes.  I chose to start at 1,000

because there are 286 distinct ProductIDs, starting at 100 would have eliminated too many rows.

The following figure shows the estimated query costs for the Production.ProductVersion.  The subquery implementation actually took less than 1 second to complete where as the ROW_NUMBER() implementation took about 2 seconds to complete for 1,000,000 rows.

Row Size

Subquery

Implementation Cost

ROW_NUMBER()

Implementation Cost

1000

0.0652462

0.0355736

10000

0.238573

0.673282

100000

2.2258

5.97198

1000000

14.3881

83.7228

Figure 4: Indexed estimated query costs

The following figure shows the

estimated query costs for the Production.ProductVersion2.  The subquery

implementation took 43 seconds to complete where as the ROW_NUMBER()

implementation took 5 seconds to complete for 1,000,000 rows.

Row Size

Subquery Implementation

Cost

ROW_NUMBER()

Implementation Cost

1000

0.0355736

0.225896

10000

1.6397

0.673282

100000

44.1332

5.97202

1000000

448.47

83.7229

Figure 5: Non-indexed estimated query costs

These results may differ according

to the hardware used to run the queries.  A quick look at the ROW_NUMBER()

implementation column shows that indexing does not significantly impact this

implementation's query cost where as it is very important to the subquery

implementation's query cost.

Scenario 1 – Change in Requirements

Suppose the requirement changes and

you need to grab the maximum MinorVersions for every (ProductID, Version)

combination.  Changing the subquery implementation has a large overhead, namely

breaking down the logic.  The subquery approach looks like this with the new

set of requirements:

SELECT ProductID,
   Version,
   (
       SELECT MAX(MinorVersion)
       FROM Production.ProductVersion pv3 WITH (NOLOCK)
       WHERE pv3.ProductID = pv.ProductID
          AND pv3.Version = pv.Version
   ) AS MinorVersion,
   (
       SELECT MAX(ReleaseVersion)
       FROM Production.ProductVersion pv4 WITH (NOLOCK)
       WHERE pv4.ProductID = pv.ProductID
          AND pv4.Version = pv.Version
          AND pv4.MinorVersion = (
                               SELECT MAX(MinorVersion)
                               FROM Production.ProductVersion pv3 WITH (NOLOCK)
                               WHERE pv3.ProductID = pv.ProductID
                                  AND pv3.Version = pv.Version
                               )
   ) AS ReleaseVersion,
   (
       SELECT StandardCost
       FROM Production.ProductVersion pv5 WITH (NOLOCK)
       WHERE pv5.ProductID = pv.ProductID
          AND pv5.Version = pv.Version
          AND pv5.MinorVersion = (
                               SELECT MAX(MinorVersion)
                               FROM Production.ProductVersion pv3 WITH (NOLOCK)
                               WHERE pv3.ProductID = pv.ProductID
                                  AND pv3.Version = pv.Version
                               )
          AND pv5.ReleaseVersion = (
                               SELECT MAX(ReleaseVersion)
                               FROM Production.ProductVersion pv4 WITH (NOLOCK)
                               WHERE pv4.ProductID = pv.ProductID
                                  AND pv4.Version = pv.Version
                                  AND pv4.MinorVersion = (
                                                       SELECT MAX(MinorVersion)
                                                       FROM Production.ProductVersion pv3 WITH (NOLOCK)
                                                       WHERE pv3.ProductID = pv.ProductID
                                                          AND pv3.Version = pv.Version
                                                       )
                               )
   ) AS StandardCost
FROM Production.ProductVersion pv WITH (NOLOCK)
GROUP BY ProductID,
   Version;

The new requirements actually

eliminate some of the nested subqueries.  The estimated query cost, however,

does not change significantly for Production.ProductVersion.

This new change requires only a small modification to the ROW_NUMBER() implementation. This is what the ROW_NUMBER()

implementation looks like for the new set of requirements:

WITH RowExample2
AS
(
   SELECT ROW_NUMBER() OVER(PARTITION BY ProductID,
                        Version
                     ORDER BY ProductID,
                        Version,
                        MinorVersion DESC,
                        ReleaseVersion DESC
      ) AS MaxVersion,
      ProductID,
      Version,
      MinorVersion,
      ReleaseVersion,
      StandardCost
   FROM Production.ProductVersion pv WITH (NOLOCK)
)
SELECT ProductID,
   Version,
   MinorVersion,
   ReleaseVersion,
   StandardCost
FROM RowExample2
WHERE MaxVersion = 1
ORDER BY ProductID;

In this example, the row numbers

are partitioned according to both ProductID and Version.  The WHERE

clause is still valid here because the maximum MinorVersion for each

(ProductID, Version) combination is guaranteed to be the first row.  The

estimated query cost did not change greatly for the modified code.  The

readability, manageability and the efficiency of the function make it a better

choice than the subquery approach.

Estimated Query Plans

Figure 6: Subquery approach

Figure 7: ROW_NUMBER() approach

Conclusion

These examples show the critical

role of indexing in the subquery approach.  The ROW_NUMBER() implementation is

far more readable and therefore easier to maintain.  It also remains relatively

independent of indexing even for large amounts of data.  Since the function takes

advantage of the SQL Server's ability to sort records, most queries that need

to uphold a level of sequencing should at the very least explore its

implementation.  The sorting itself can greatly reduce or replace all together

the extra logic necessary to enforce the integrity of data at the row level. 

The readability of the function's implementation also plays a key role in its

manageability.  Modifying the code with the ROW_NUMBER()

implementation is easy because the logic is performed in easy to spot areas and

is performed once, whereas in the subquery the logic appears in several places

and could be repeated.

Rate

4.52 (122)

You rated this post out of 5. Change rating

Share

Share

Rate

4.52 (122)

You rated this post out of 5. Change rating