ROW_NUMBER(): An Efficient Alternative to Subqueries

Question

ROW_NUMBER(): An Efficient Alternative to Subqueries

Viewing 15 posts - 16 through 30 (of 61 total)

You must be logged in to reply to this topic. Login to reply

Ryan C. Price Ten Centuries Points: 1184 More actions · Answer 1

I think this was a good demonstration of where these 2005+ features can be useful. I have a job coming up where this could really help.

IMO, none of the few alternative queries posted so far are as clear in their intent, which I think is almost as important as performance, and an acceptable trade-off where performance isn't dramatically affected.

Slightly off-topic, but I couldn't help being reminded of someone's suggestion I saw recently of a 'FIRST' aggregate (MS Access has one), which of course would make the query really simple, and probably perform really well. I suspect there must be someone who'd done one using CLR:

SELECT ProductID,

FIRST(Version) AS Version,

FIRST(MinorVersion) AS MinorVersion,

FIRST(ReleaseVersion) AS ReleaseVersion,

FIRST(StandardCost) AS StandardCost

FROM Production.ProductVersion

ORDER BY ProductID, Version DESC, MinorVersion DESC, ReleaseVersion DESC

GROUP BY ProductID

chill-1062987 SSC Enthusiast Points: 158 More actions · Answer 2

Thanks for a good article, I definitely learned what Row_Number() is - definitely had no idea before.

cantor Say Hey Kid Points: 667 More actions · Answer 3

I have seen similar examples of ROW_NUMBER() usage and the following has always puzzled me:

Why do we include columns in the ORDER BY when they are already in the PARTITION BY ?

i.e. I would expect

SELECT ROW_NUMBER() OVER(PARTITION BY ProductID,

Version ORDER BY ProductID, Version, MinorVersion DESC, ReleaseVersion DESC) AS MaxVersion

to be equivalent to

SELECT ROW_NUMBER() OVER(PARTITION BY ProductID,

Version ORDER BY MinorVersion DESC, ReleaseVersion DESC) AS MaxVersion

I know I must be missing obvious with this but I'm hanged if I can figure out what it is...

RBarryYoung SSC Guru Points: 143329 More actions · Answer 4

UncleJimBob (5/12/2009)
I have seen similar examples of ROW_NUMBER() usage and the following has always puzzled me:
Why do we include columns in the ORDER BY when they are already in the PARTITION BY ?
i.e. I would expect
SELECT ROW_NUMBER() OVER(PARTITION BY ProductID,
Version ORDER BY ProductID, Version, MinorVersion DESC, ReleaseVersion DESC) AS MaxVersion
to be equivalent to
SELECT ROW_NUMBER() OVER(PARTITION BY ProductID,
Version ORDER BY MinorVersion DESC, ReleaseVersion DESC) AS MaxVersion
I know I must be missing obvious with this but I'm hanged if I can figure out what it is...

AFAIK, you are correct. There's no reason to include a column in the ORDER BY if it's already in the PARTITION BY, it has no effect.

[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung [font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]

wbrianwhite SSC Eights! Points: 954 More actions · Answer 5

William Hutton (5/12/2009)
I think it's just a matter of ...elegance? To me using Row_Number() statements (and CTEs in general) are easier to write and easier to follow in specific situations (provided the one reading knows what they do). And I can tell you just from testing this morning that the Row_Number() query has a performance improvement over the left joins and isnull tests I had been using previously.

Do they? How big is it? Is that before or after you make a clustered index on the four significant columns? In figure 4 it looks like a well-indexed table is faster when queried by the sub-query approach. And since this table is going to be used in lots and lots of joins besides the one in the article, it needs to be well-indexed anyway.

The comparison of query plans in the article is skewed because the 14 nested subqueries can be replaced by a single select with 4 left outer joins, and it is that query plan comparison I'd be interested in seeing.

Also, isn't that subquery getting the max release version and standard cost as well, while the Row_Number() version is not?

William Hutton-462896 SSC Enthusiast Points: 173 More actions · Answer 6

I just tried it on my query and both of them appear to have returned the same results, so I suspect that you're correct about them being the same.

Francis Rodrigues-459442 SSC Eights! Points: 860 More actions · Answer 7

wbrianwhite (5/12/2009)
William Hutton (5/12/2009)
I think it's just a matter of ...elegance? To me using Row_Number() statements (and CTEs in general) are easier to write and easier to follow in specific situations (provided the one reading knows what they do). And I can tell you just from testing this morning that the Row_Number() query has a performance improvement over the left joins and isnull tests I had been using previously.
Do they? How big is it? Is that before or after you make a clustered index on the four significant columns? In figure 4 it looks like a well-indexed table is faster when queried by the sub-query approach. And since this table is going to be used in lots and lots of joins besides the one in the article, it needs to be well-indexed anyway.
The comparison of query plans in the article is skewed because the 14 nested subqueries can be replaced by a single select with 4 left outer joins, and it is that query plan comparison I'd be interested in seeing.
Also, isn't that subquery getting the max release version and standard cost as well, while the Row_Number() version is not?

The MAX release version is inherently pulled by the ORDER BY ... DESC. The standard cost that is selected is the one that is associated with that particular minor version and release version for a product.

Ion Freeman Right there with Babe Points: 781 More actions · Answer 8

Well, exactly. I wouldn't compare this to subqueries, I'd compare it to using top and a sort. Is it faster than that? I'd expect them to be about the same -- I don't see what Row_Number() gets us.

Mike C SSC-Insane Points: 23224 More actions · Answer 9

wbrianwhite (5/12/2009)
In what way was simple set logic unable to perform this operation? It looks like this could be solved with left outer joins and is null tests. I don't have the test db mentioned, so my > may be wrong, it could need to be a <, I can never remember without testing it. But something like this should do the trick

There's always more than one way to skin a cat. The question is twofold: (1) What's the performance of the alternative solution and (2) is the alternative solution easier to manage? I think the demonstration here was that the performance and manageability are both improved with CTEs and ranking/windowing functions.

wbrianwhite SSC Eights! Points: 954 More actions · Answer 10

Mike C (5/13/2009)
There's always more than one way to skin a cat. The question is twofold: (1) What's the performance of the alternative solution and (2) is the alternative solution easier to manage? I think the demonstration here was that the performance and manageability are both improved with CTEs and ranking/windowing functions.

But the performance of the subquery approach was actually 100% - 500% BETTER than the performance of the ROW_NUMBER() approach, when the table in question was properly indexed (and tables should always be properly indexed). From Figure 4:

Rows - Sub - Row_Number()
100000 - 2.2258 - 5.97198
1000000 -14.3881 - 83.7228

Ignoring a 100% - 500% performance improvement to use a nifty new trick is what I would consider suicidal.

Mike C SSC-Insane Points: 23224 More actions · Answer 11

wbrianwhite (5/13/2009)
Mike C (5/13/2009)
There's always more than one way to skin a cat. The question is twofold: (1) What's the performance of the alternative solution and (2) is the alternative solution easier to manage? I think the demonstration here was that the performance and manageability are both improved with CTEs and ranking/windowing functions.
But the performance of the subquery approach was actually 100% - 500% BETTER than the performance of the ROW_NUMBER() approach, when the table in question was properly indexed (and tables should always be properly indexed). From Figure 4:
Rows - Sub - Row_Number()
100000 - 2.2258 - 5.97198
1000000 -14.3881 - 83.7228
Ignoring a 100% - 500% performance improvement to use a nifty new trick is what I would consider suicidal.

When you say "properly indexed" you're referring to the clustered index on the versioning columns, correct? I'm trying to think of situations where I would use my single clustered index per table on the versioning columns. I think a more realistic example would be no index on the versioning columns, or nonclustered index on them in which case you might have to contend with RID lookups which can be even more expensive (unless you create covering indexes of course).

Pritesh Keniya SSC Journeyman Points: 75 More actions · Answer 12

Thanks for introducing the concept of Row_Number(). A good example to quickly understand the concept.

Just out of curiosity I thought: "What if we use CTE?"

------------------------------------------------------------------------------------------------

With MaxVer as

(

SELECT ProductID, MAX(Version) AS Version

from Production.ProductVersion WITH (NOLOCK)

Group by ProductID

),

MaxMinorVer as

(

Select PV.ProductID, PV.Version, MAX(MinorVersion) as MinorVersion

From Production.ProductVersion PV WITH (NOLOCK)

Inner Join MaxVer MV on MV.ProductID = PV.ProductID and MV.Version = PV.Version

Group by PV.ProductID, PV.Version

),

MaxReleaseVersion as

(

Select PV.ProductID, PV.Version, PV.MinorVersion, MAX(ReleaseVersion) as ReleaseVersion

From Production.ProductVersion PV WITH (NOLOCK)

Inner Join MaxMinorVer MV on MV.ProductID = PV.ProductID and MV.Version = PV.Version and MV.MinorVersion = PV.MinorVersion

Group by PV.ProductID, PV.Version, PV.MinorVersion

)

Select PV.* from Production.ProductVersion PV

Inner Join MaxReleaseVersion MV WITH (NOLOCK) on MV.ProductID = PV.ProductID and MV.Version = PV.Version

and MV.MinorVersion = PV.MinorVersion and MV.ReleaseVersion = PV.ReleaseVersion

order by PV.ProductID

-------------------------------------------------------------------------------------------------

And I found out:

Following is the subtree cost (CTE vs Row_Number()) for 100000 rows:

CTE: 0.670628

Row_Number(): 5.97198

Just an other approach to the example.

Francis Rodrigues-459442 SSC Eights! Points: 860 More actions · Answer 13

Pritesh Keniya (5/21/2009)
Thanks for introducing the concept of Row_Number(). A good example to quickly understand the concept.
Just out of curiosity I thought: "What if we use CTE?"
------------------------------------------------------------------------------------------------
With MaxVer as
(
SELECT ProductID, MAX(Version) AS Version
from Production.ProductVersion WITH (NOLOCK)
Group by ProductID
),
MaxMinorVer as
(
Select PV.ProductID, PV.Version, MAX(MinorVersion) as MinorVersion
From Production.ProductVersion PV WITH (NOLOCK)
Inner Join MaxVer MV on MV.ProductID = PV.ProductID and MV.Version = PV.Version
Group by PV.ProductID, PV.Version
),
MaxReleaseVersion as
(
Select PV.ProductID, PV.Version, PV.MinorVersion, MAX(ReleaseVersion) as ReleaseVersion
From Production.ProductVersion PV WITH (NOLOCK)
Inner Join MaxMinorVer MV on MV.ProductID = PV.ProductID and MV.Version = PV.Version and MV.MinorVersion = PV.MinorVersion
Group by PV.ProductID, PV.Version, PV.MinorVersion
)
Select PV.* from Production.ProductVersion PV
Inner Join MaxReleaseVersion MV WITH (NOLOCK) on MV.ProductID = PV.ProductID and MV.Version = PV.Version
and MV.MinorVersion = PV.MinorVersion and MV.ReleaseVersion = PV.ReleaseVersion
order by PV.ProductID
-------------------------------------------------------------------------------------------------
And I found out:
Following is the subtree cost (CTE vs Row_Number()) for 100000 rows:
CTE: 0.670628
Row_Number(): 5.97198
Just an other approach to the example.

This approach is fine. The estimate even indicates that it is faster. One of my main goals was to show that if changes needed to be made to the code, ROW_NUMBER() offers one place for the change to be made rather than cascading through different sections. Ultimately it is up to the programmers to decide what suits their needs the best.

bob amy Valued Member Points: 70 More actions · Answer 14

I had come to this conclusion independently. It is nice to see collaboration.

One difference is that in my test case, the ROW_NUMBER was always faster even when properly indexed.

One other thing I noticed; my subquery returns ties, whereas ROW_NUMBER does not. In order to return ties (two or more emp_id with the same comm), I used RANK() instead of ROW_NUMBER.

The RANK() solution took about twice as long as the ROW_NUMBER() solution, but returned the same results as the correlated subquery. It was still a faster solution, and the IO was much less

-- create clustered index cicomm on comm(emp_id,comm desc)

set statistics time on

set statistics io on

DBCC DROPCLEANBUFFERS WITH NO_INFOMSGS -- clear all data cache

DBCC FREEPROCCACHE WITH NO_INFOMSGS -- clear stored procedure cache

print '********************CORRELATED SUBQUERY'

select *

from comm a

where a.comm in (select top 3 comm from comm cq

where cq.emp_id = a.emp_id

order by comm desc )

order by emp_id, comm desc

;

DBCC DROPCLEANBUFFERS WITH NO_INFOMSGS -- clear all data cache

DBCC FREEPROCCACHE WITH NO_INFOMSGS -- clear stored procedure cache

print '********************ROW_NUMBER'

select *

from (

select *, ROW_NUMBER() over(partition by emp_id order by emp_id, comm desc ) rank

from comm) as a

where rank <= 3

DBCC DROPCLEANBUFFERS WITH NO_INFOMSGS -- clear all data cache

DBCC FREEPROCCACHE WITH NO_INFOMSGS -- clear stored procedure cache

print '********************RANK'

select *

from (

select *, RANK() over(partition by emp_id order by emp_id, comm desc ) rank

from comm) as a

where rank <= 3

Excerpt from messages:

********************CORRELATED SUBQUERY

(29999 row(s) affected)

Table 'comm'. Scan count 430135, logical reads 1375435, physical reads 4, read-ahead reads 1119, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:

CPU time = 983 ms, elapsed time = 1748 ms.

********************ROW_NUMBER

(29997 row(s) affected)

Table 'comm'. Scan count 1, logical reads 1123, physical reads 4, read-ahead reads 1119, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:

CPU time = 156 ms, elapsed time = 381 ms.

********************RANK

(29999 row(s) affected)

Table 'comm'. Scan count 1, logical reads 1123, physical reads 3, read-ahead reads 1119, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

(1 row(s) affected)

SQL Server Execution Times:

CPU time = 343 ms, elapsed time = 567 ms.

grxcls SSC Enthusiast Points: 195 More actions · Answer 15

Just another great article. I've recently discovered Window functions in SQL Server 2005 and they certainly eliminate some of the overhead caused by subqueries and allow me to get data out easier and quicker.

ROW_NUMBER(): An Efficient Alternative to Subqueries

Cookies on SQLServerCentral