Click here to monitor SSC
SQLServerCentral is supported by Red Gate Software Ltd.
 
Log in  ::  Register  ::  Not logged in
 
 
 

ROW_NUMBER(): An Efficient Alternative to Subqueries

By Francis Rodrigues, (first published: 2009/05/12)

Introduction

SQL Server 2005 offers an array of ranking and windowing functions that can be used to evaluate the data and only return appropriate rows.  For instance, a development cycle of a product may include hundreds of releases, and those releases may have versions associated with them: a "0" version with its many minor versions, a "1" version with its minor versions, etc.  If the history of a particular product's releases is kept, then analysis can be done to see how far a version is developed before it is released.  An ORDER BY clause alone cannot fulfill this need because the query would most likely still return the entire history of the product, not necessarily the last release of every version.  The code is also available for download.  The name of the file is RowCountScenario1-CodeDownload.sql.

Scenario 1 – Versioning

In order to demonstrate the usage of the ROW_NUMBER() windowing function, I started with Microsoft's AdventureWorks database.  In particular, I used the data in the Production.ProductCostHistory table.  The products in this table are identified by the ProductID column; this is a foreign key to the Production.Product table.  Using the Production.ProductCostHistory table, I mocked up some data to create versions for each Product in the table.  I used a random number generation process to create attributes called Version, MinorVersion and ReleaseVersion for each product.  These attributes are meant to show detailed information about the product.  Together 7.0.59 represents that the 7th version of the product is currently being used, a minor version represents the iteration of the version, and the release version of this particular installation is 59.  The next iteration of the product's life cycle could result with 7.2.19.  I also used the existing StandardCost to create different costs for each of the Versions, to create some sense of value for the particular Version.

I created a table called Production.ProductVersion with the ProductID, Version, MinorVersion and ReleaseVersion defined as the primary key and the StandardCost as an attribute.  I inserted the mocked up data generated by the code into this table to model a simple product version/cost history.

CREATE TABLE Production.ProductVersion
(
      ProductID int NOT NULL,
      Version int NOT NULL,
      MinorVersion int NOT NULL,
      ReleaseVersion int NOT NULL,
      StandardCost numeric(30, 4) NOT NULL,
      CONSTRAINT PK_ProductVersion PRIMARY KEY CLUSTERED 
      (
            ProductID ASC,
            Version ASC,
            MinorVersion ASC,
            ReleaseVersion ASC
      )
);

I used the following code to populate the table with randomized data.  I created the data using a common table expression (CTE) and inserted the data into the table after it was generated. The data is based on the Production.ProductCostHistory table.

WITH ProductVersion
AS
(
   SELECT ProductID,
      1 AS Version,
      CAST((ABS(CHECKSUM(NEWID())) % 1001) AS INT) AS MinorVersion,
      CAST((ABS(CHECKSUM(NEWID())) % 20001) AS INT) AS ReleaseVersion,
      CAST(StandardCost  AS NUMERIC(30,4)) AS StandardCost
   FROM Production.ProductCostHistory WITH (NOLOCK)
   UNION ALL
   SELECT
      ProductID,
      ABS(CHECKSUM(NEWID())) % 3 AS Version,
      CAST((ABS(CHECKSUM(NEWID())) % 1001) AS INT) AS MinorVersion,
      CAST((ABS(CHECKSUM(NEWID())) % 20001) AS INT) AS ReleaseVersion,
      CAST((CAST(StandardCost  AS NUMERIC(30,4)) * 1.10) AS NUMERIC(30,4)) AS StandardCost
   FROM Production.ProductCostHistory WITH (NOLOCK)
   UNION ALL
   SELECT
      ProductID,
      ABS(CHECKSUM(NEWID())) % 5 AS Version,
      CAST((ABS(CHECKSUM(NEWID())) % 1001) AS INT) AS MinorVersion,
      CAST((ABS(CHECKSUM(NEWID())) % 20001) AS INT) AS ReleaseVersion,
      CAST((CAST(StandardCost  AS NUMERIC(30,4)) * 2.10) AS NUMERIC(30,4)) AS StandardCost
   FROM Production.ProductCostHistory WITH (NOLOCK)
)

INSERT INTO Production.ProductVersion
SELECT ProductID,
   Version,
   MinorVersion,
   ReleaseVersion,
   MAX(StandardCost) AS StandardCost
FROM ProductVersion
GROUP BY ProductID,
   Version,
   MinorVersion,
   ReleaseVersion;

The ABS(CHECKSUM(NEWID())) is utilized as the random number generator; the modulus operator provides the upper bound for the random number that was generated.  The NEWID() function is guaranteed to generate a globally unique identifier for each row.  The GROUP BY clause is used to avoid any Primary Key constraint violations that might be encountered.

The purpose of this exercise is to avoid the complexity of certain code by using the ROW_NUMBER() windowing function.  Suppose you are required to return only the latest version of a Product with its associated MinorVersion, ReleaseVersion and StandardCost.  The following query will not return the correct result set.

SELECT
   ProductID,
   MAX(Version) AS Version,
   MAX(MinorVersion) AS MinorVersion,
   MAX(ReleaseVersion) AS ReleaseVersion,
   MAX(StandardCost) AS StandardCost
FROM Production.ProductVersion WITH (NOLOCK)
GROUP BY ProductID;
In fact, this query violates integrity of the rows of data in the table.  It simply returns the maximum Version, the maximum MinorVersion, the maximum ReleaseVersion and the maximum StandardCost of a particular Product.  This is an easy and tempting trap to fall into.  Compare the results displayed in Figure 2 and Figure 3.  The actual data that is in Production.ProductVersion is in Figure 1.

The following sample query captures the actual requirements, and returns the correct result, but it is long and convoluted.

SELECT ProductID,
   (
       SELECT MAX(Version)
       FROM Production.ProductVersion pv2 WITH (NOLOCK)
       WHERE pv2.ProductID = pv.ProductID
   ) AS Version,
   (
       SELECT MAX(MinorVersion)
       FROM Production.ProductVersion pv3 WITH (NOLOCK)
       WHERE pv3.ProductID = pv.ProductID
          AND pv3.Version = (
                           SELECT MAX(Version)
                           FROM Production.ProductVersion pv2 WITH (NOLOCK)
                           WHERE pv2.ProductID = pv.ProductID
                           )
   ) AS MinorVersion,
   (
       SELECT MAX(ReleaseVersion)
       FROM Production.ProductVersion pv4 WITH (NOLOCK)
       WHERE pv4.ProductID = pv.ProductID
          AND pv4.Version = (
                           SELECT MAX(Version)
                           FROM Production.ProductVersion pv2 WITH (NOLOCK)
                           WHERE pv2.ProductID = pv.ProductID
                           )
          AND pv4.MinorVersion = (
                               SELECT MAX(MinorVersion)
                               FROM Production.ProductVersion pv3 WITH (NOLOCK) 
                               WHERE pv3.ProductID = pv.ProductID
                                  AND pv3.Version = (
                                                   SELECT MAX(Version)
                                                   FROM Production.ProductVersion pv2 WITH (NOLOCK)
                                                   WHERE pv2.ProductID = pv.ProductID
                                                   )
                               )
   ) AS ReleaseVersion,
   (
       SELECT StandardCost
       FROM Production.ProductVersion pv5 WITH (NOLOCK)
       WHERE pv5.ProductID = pv.ProductID
          AND pv5.Version = (
                            SELECT MAX(Version)
                           FROM Production.ProductVersion pv2 WITH (NOLOCK)
                           WHERE pv2.ProductID = pv.ProductID
                           )
          AND pv5.MinorVersion = (
                               SELECT MAX(MinorVersion)
                               FROM Production.ProductVersion pv3 WITH (NOLOCK)
                               WHERE pv3.ProductID = pv.ProductID
                                  AND pv3.Version = (
                                                   SELECT MAX(Version)
                                                   FROM Production.ProductVersion pv2 WITH (NOLOCK)
                                                   WHERE pv2.ProductID = pv.ProductID
                                                   )
                               )
          AND pv5.ReleaseVersion = (
                                  SELECT MAX(ReleaseVersion)
                                  FROM Production.ProductVersion pv4 WITH (NOLOCK)
                                  WHERE pv4.ProductID = pv.ProductID
                                      AND pv4.Version = (
                                                       SELECT MAX(Version)
                                                       FROM Production.ProductVersion pv2 WITH (NOLOCK)
                                                       WHERE pv2.ProductID = pv.ProductID
                                                       )
                                      AND pv4.MinorVersion = (
                                                          SELECT MAX(MinorVersion)
                                                          FROM Production.ProductVersion pv3 WITH (NOLOCK)
                                                          WHERE pv3.ProductID = pv.ProductID
                                                              AND pv3.Version = (
                                                                           SELECT MAX(Version)
                                                                           FROM Production.ProductVersion pv2 WITH (NOLOCK)
                                                                               WHERE pv2.ProductID = pv.ProductID
                                                                               )
                                                          )
                               )
   ) AS StandardCost
FROM Production.ProductVersion pv WITH (NOLOCK)
GROUP BY ProductID;

This query utilizes nested subqueries in order to ensure the integrity of each row.  The first subquery (lines 117-25 in the code download) provides the maximum Version for each Product.  The second subquery provides the maximum MinorVersion for the maximum Version of each Product.  The subsubquery in the WHERE clause ensures that the MinorVersions and the Versions match.  The third subquery,

SELECT MAX(ReleaseVersion)
FROM Production.ProductVersion pv4 WITH (NOLOCK)
WHERE pv4.ProductID = pv.ProductID
   AND pv4.Version = (
                     SELECT MAX(Version)
                     FROM Production.ProductVersion pv2 WITH (NOLOCK)
                    WHERE pv2.ProductID = pv.ProductID
                     )
   AND pv4.MinorVersion = (
                        SELECT MAX(MinorVersion)
                        FROM Production.ProductVersion pv3 WITH (NOLOCK)
                        WHERE pv3.ProductID = pv.ProductID
                           AND pv3.Version = (
                                             SELECT MAX(Version)
                                             FROM Production.ProductVersion pv2 WITH (NOLOCK)
                                             WHERE pv2.ProductID = pv.ProductID
                                             )
                        )

(lines 127-46 in the code download) provides the maximum ReleaseVersion for the maximum MinorVersion of the maximum Version of each Product.  Once again the subqueries ensure that only complete rows of data are retrieved.  If this logic sounds too complicated, that is simply because it is.  The estimated subtree cost for this query turns out to be 0.583005.  This includes several Clustered Index Scan and Seek operations.  The query plan is displayed in Figure 6.  The complexity of the subquery approach can increase if the requirements change.

A simplified approach uses the ROW_NUMBER() function as shown below.

WITH RowExample1
AS
(
   SELECT ROW_NUMBER() OVER(PARTITION BY ProductID
                     ORDER BY ProductID,
                        Version DESC,
                        MinorVersion DESC,
                        ReleaseVersion DESC
      ) AS MaxVersion,
      ProductID,
      Version,
      MinorVersion,
      ReleaseVersion,
      StandardCost
   FROM Production.ProductVersion pv WITH (NOLOCK)
)
SELECT ProductID,
   Version,
   MinorVersion,
   ReleaseVersion,
   StandardCost
FROM RowExample1
WHERE MaxVersion = 1
ORDER BY ProductID;

The result set for this query looks like the Figure 3.

Figure 1: Sample from Production.ProductVersion

Figure 2: Sample from incorrect query

Figure 3: Sample from properly implemented query

The PARTITION BY clause allows a set of row numbers to be assigned for all distinct Products.  When a new ProductID is encountered, the row numbering will start over at 1 and continue incrementing for each row with the same ProductID.  The row number will be assigned according to the sort order of the columns that you specify in the OVER clause's ORDER BY clause.  The estimated subtree cost for this improved query is 0.039954.  This query has only one Clustered Index Scan operation.  The query plan is displayed in Figure 7.

With the OVER clause, the ROW_NUMBER() function can efficiently assign a row number to each row in a query.  The fact that I've ordered the partitions by ProductID and then in descending order by Version, MinorVersion, and ReleaseVersion, guarantees the maximum version will be in the first row of each ProductID partition.  This allows me to use a simple WHERE MaxVersion = 1 predicate in place of the convoluted sub-query logic in the previous sample query.

To test the effects of indexing on difference between the two methods, I used the following table.

CREATE TABLE Production.ProductVersion2
(
   ProductID int NOT NULL,
   Version int NOT NULL,
   MinorVersion int NOT NULL,
   ReleaseVersion int NOT NULL,
   StandardCost numeric(30, 4) NOT NULL,
);

I used the following query to generate a large set of randomized data to compare the estimated query costs for different record size sets.

WITH RCTE
AS
(
   SELECT 0 AS X,
       ProductID,
       1 AS Version,
       CAST((ABS(CHECKSUM(NEWID())) % 1001) AS INT) AS MinorVersion,
       CAST((ABS(CHECKSUM(NEWID())) % 20001) AS INT) AS ReleaseVersion,
       CAST(StandardCost AS NUMERIC(30,4)) AS StandardCost
   FROM Production.ProductCostHistory WITH (NOLOCK)
   UNION ALL
   SELECT X + 1,
       pv.ProductID,
       ABS(CHECKSUM(NEWID())) % 3 AS Version,
       CAST((ABS(CHECKSUM(NEWID())) % 1001) AS INT) AS MinorVersion,
       CAST((ABS(CHECKSUM(NEWID())) % 20001) AS INT) AS ReleaseVersion,
       CAST((CAST(pv.StandardCost AS NUMERIC(30,4)) * 1.10) AS NUMERIC(30,4)) AS StandardCost
   FROM RCTE pv
   WHERE X < 14
   UNION ALL
   SELECT X + 2,
       pv.ProductID,
       ABS(CHECKSUM(NEWID())) % 5 AS Version,
       CAST((ABS(CHECKSUM(NEWID())) % 1001) AS INT) AS MinorVersion,
       CAST((ABS(CHECKSUM(NEWID())) % 20001) AS INT) AS ReleaseVersion,
       CAST((CAST(pv.StandardCost AS NUMERIC(30,4)) * 2.10) AS NUMERIC(30,4)) AS StandardCost
   FROM RCTE pv
   WHERE X < 15
)
INSERT INTO Production.ProductVersion2
SELECT TOP (100000) ProductID,
      Version,
      MinorVersion,
      ReleaseVersion,
      MAX(StandardCost)
FROM RCTE
GROUP BY ProductID,
      Version,
      MinorVersion,
      ReleaseVersion
OPTION (MAXRECURSION 0);

The following charts show the estimated query costs for different row sizes.  I chose to start at 1,000 because there are 286 distinct ProductIDs, starting at 100 would have eliminated too many rows.

The following figure shows the estimated query costs for the Production.ProductVersion.  The subquery implementation actually took less than 1 second to complete where as the ROW_NUMBER() implementation took about 2 seconds to complete for 1,000,000 rows.

Row Size

Subquery Implementation Cost

ROW_NUMBER() Implementation Cost

1000

0.0652462

0.0355736

10000

0.238573

0.673282

100000

2.2258

5.97198

1000000

14.3881

83.7228

Figure 4: Indexed estimated query costs

The following figure shows the estimated query costs for the Production.ProductVersion2.  The subquery implementation took 43 seconds to complete where as the ROW_NUMBER() implementation took 5 seconds to complete for 1,000,000 rows.

Row Size

Subquery Implementation Cost

ROW_NUMBER() Implementation Cost

1000

0.0355736

0.225896

10000

1.6397

0.673282

100000

44.1332

5.97202

1000000

448.47

83.7229

Figure 5: Non-indexed estimated query costs

These results may differ according to the hardware used to run the queries.  A quick look at the ROW_NUMBER() implementation column shows that indexing does not significantly impact this implementation's query cost where as it is very important to the subquery implementation's query cost.

Scenario 1 – Change in Requirements

Suppose the requirement changes and you need to grab the maximum MinorVersions for every (ProductID, Version) combination.  Changing the subquery implementation has a large overhead, namely breaking down the logic.  The subquery approach looks like this with the new set of requirements:

SELECT ProductID,
   Version,
   (
       SELECT MAX(MinorVersion)
       FROM Production.ProductVersion pv3 WITH (NOLOCK)
       WHERE pv3.ProductID = pv.ProductID
          AND pv3.Version = pv.Version
   ) AS MinorVersion,
   (
       SELECT MAX(ReleaseVersion)
       FROM Production.ProductVersion pv4 WITH (NOLOCK)
       WHERE pv4.ProductID = pv.ProductID
          AND pv4.Version = pv.Version
          AND pv4.MinorVersion = (
                               SELECT MAX(MinorVersion)
                               FROM Production.ProductVersion pv3 WITH (NOLOCK)
                               WHERE pv3.ProductID = pv.ProductID
                                  AND pv3.Version = pv.Version
                               )
   ) AS ReleaseVersion,
   (
       SELECT StandardCost
       FROM Production.ProductVersion pv5 WITH (NOLOCK)
       WHERE pv5.ProductID = pv.ProductID
          AND pv5.Version = pv.Version
          AND pv5.MinorVersion = (
                               SELECT MAX(MinorVersion)
                               FROM Production.ProductVersion pv3 WITH (NOLOCK)
                               WHERE pv3.ProductID = pv.ProductID
                                  AND pv3.Version = pv.Version
                               )
          AND pv5.ReleaseVersion = (
                               SELECT MAX(ReleaseVersion)
                               FROM Production.ProductVersion pv4 WITH (NOLOCK)
                               WHERE pv4.ProductID = pv.ProductID
                                  AND pv4.Version = pv.Version
                                  AND pv4.MinorVersion = (
                                                       SELECT MAX(MinorVersion)
                                                       FROM Production.ProductVersion pv3 WITH (NOLOCK)
                                                       WHERE pv3.ProductID = pv.ProductID
                                                          AND pv3.Version = pv.Version
                                                       )
                               )
   ) AS StandardCost
FROM Production.ProductVersion pv WITH (NOLOCK)
GROUP BY ProductID,
   Version;

The new requirements actually eliminate some of the nested subqueries.  The estimated query cost, however, does not change significantly for Production.ProductVersion.

This new change requires only a small modification to the ROW_NUMBER() implementation. This is what the ROW_NUMBER() implementation looks like for the new set of requirements:

WITH RowExample2
AS
(
   SELECT ROW_NUMBER() OVER(PARTITION BY ProductID,
                        Version
                     ORDER BY ProductID,
                        Version,
                        MinorVersion DESC,
                        ReleaseVersion DESC
      ) AS MaxVersion,
      ProductID,
      Version,
      MinorVersion,
      ReleaseVersion,
      StandardCost
   FROM Production.ProductVersion pv WITH (NOLOCK)
)
SELECT ProductID,
   Version,
   MinorVersion,
   ReleaseVersion,
   StandardCost
FROM RowExample2
WHERE MaxVersion = 1
ORDER BY ProductID;

In this example, the row numbers are partitioned according to both ProductID and Version.  The WHERE clause is still valid here because the maximum MinorVersion for each (ProductID, Version) combination is guaranteed to be the first row.  The estimated query cost did not change greatly for the modified code.  The readability, manageability and the efficiency of the function make it a better choice than the subquery approach.

Estimated Query Plans

Figure 6: Subquery approach

Figure 7: ROW_NUMBER() approach

Conclusion

These examples show the critical role of indexing in the subquery approach.  The ROW_NUMBER() implementation is far more readable and therefore easier to maintain.  It also remains relatively independent of indexing even for large amounts of data.  Since the function takes advantage of the SQL Server's ability to sort records, most queries that need to uphold a level of sequencing should at the very least explore its implementation.  The sorting itself can greatly reduce or replace all together the extra logic necessary to enforce the integrity of data at the row level.  The readability of the function's implementation also plays a key role in its manageability.  Modifying the code with the ROW_NUMBER() implementation is easy because the logic is performed in easy to spot areas and is performed once, whereas in the subquery the logic appears in several places and could be repeated.

Total article views: 68327 | Views in the last 30 days: 32
 
Related Articles
FORUM

SELECT @@VERSION

SELECT @@VERSION

BLOG

How to Interpret SELECT @@VERSION in SQL Server

One common point of confusion that I often see when people run the SELECT @@VERSION query is how to ...

FORUM

ProductId Search

Return Unique Values

BLOG

How to Read the Results of SELECT @@VERSION in SQL Server

The simple query SELECT @@VERSION can quickly tell you quite a bit about a SQL Server installation (...

FORUM

Select

Select particular table

Tags
row_number()    
t-sql    
 
Contribute

Join the most active online SQL Server Community

SQL knowledge, delivered daily, free:

Email address:  

You make SSC a better place

As a member of SQLServerCentral, you get free access to loads of fresh content: thousands of articles and SQL scripts, a library of free eBooks, a weekly database news roundup, a great Q & A platform… And it’s our huge, buzzing community of SQL Server Professionals that makes it such a success.

Join us!

Steve Jones
Editor, SQLServerCentral.com

Already a member? Jump in:

Email address:   Password:   Remember me: Forgotten your password?
Steve Jones