ROW_NUMBER(): An Efficient Alternative to Subqueries

  • I think this was a good demonstration of where these 2005+ features can be useful. I have a job coming up where this could really help.

    IMO, none of the few alternative queries posted so far are as clear in their intent, which I think is almost as important as performance, and an acceptable trade-off where performance isn't dramatically affected.

    Slightly off-topic, but I couldn't help being reminded of someone's suggestion I saw recently of a 'FIRST' aggregate (MS Access has one), which of course would make the query really simple, and probably perform really well. I suspect there must be someone who'd done one using CLR:

    SELECT ProductID,

    FIRST(Version) AS Version,

    FIRST(MinorVersion) AS MinorVersion,

    FIRST(ReleaseVersion) AS ReleaseVersion,

    FIRST(StandardCost) AS StandardCost

    FROM Production.ProductVersion

    ORDER BY ProductID, Version DESC, MinorVersion DESC, ReleaseVersion DESC

    GROUP BY ProductID

  • Thanks for a good article, I definitely learned what Row_Number() is - definitely had no idea before.

  • I have seen similar examples of ROW_NUMBER() usage and the following has always puzzled me:

    Why do we include columns in the ORDER BY when they are already in the PARTITION BY ?

    i.e. I would expect

    SELECT ROW_NUMBER() OVER(PARTITION BY ProductID,

    Version ORDER BY ProductID, Version, MinorVersion DESC, ReleaseVersion DESC) AS MaxVersion

    to be equivalent to

    SELECT ROW_NUMBER() OVER(PARTITION BY ProductID,

    Version ORDER BY MinorVersion DESC, ReleaseVersion DESC) AS MaxVersion

    I know I must be missing obvious with this but I'm hanged if I can figure out what it is... 🙂

  • UncleJimBob (5/12/2009)


    I have seen similar examples of ROW_NUMBER() usage and the following has always puzzled me:

    Why do we include columns in the ORDER BY when they are already in the PARTITION BY ?

    i.e. I would expect

    SELECT ROW_NUMBER() OVER(PARTITION BY ProductID,

    Version ORDER BY ProductID, Version, MinorVersion DESC, ReleaseVersion DESC) AS MaxVersion

    to be equivalent to

    SELECT ROW_NUMBER() OVER(PARTITION BY ProductID,

    Version ORDER BY MinorVersion DESC, ReleaseVersion DESC) AS MaxVersion

    I know I must be missing obvious with this but I'm hanged if I can figure out what it is... 🙂

    AFAIK, you are correct. There's no reason to include a column in the ORDER BY if it's already in the PARTITION BY, it has no effect.

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • William Hutton (5/12/2009)


    I think it's just a matter of ...elegance? To me using Row_Number() statements (and CTEs in general) are easier to write and easier to follow in specific situations (provided the one reading knows what they do). And I can tell you just from testing this morning that the Row_Number() query has a performance improvement over the left joins and isnull tests I had been using previously.

    Do they? How big is it? Is that before or after you make a clustered index on the four significant columns? In figure 4 it looks like a well-indexed table is faster when queried by the sub-query approach. And since this table is going to be used in lots and lots of joins besides the one in the article, it needs to be well-indexed anyway.

    The comparison of query plans in the article is skewed because the 14 nested subqueries can be replaced by a single select with 4 left outer joins, and it is that query plan comparison I'd be interested in seeing.

    Also, isn't that subquery getting the max release version and standard cost as well, while the Row_Number() version is not?

  • I just tried it on my query and both of them appear to have returned the same results, so I suspect that you're correct about them being the same.

  • wbrianwhite (5/12/2009)


    William Hutton (5/12/2009)


    I think it's just a matter of ...elegance? To me using Row_Number() statements (and CTEs in general) are easier to write and easier to follow in specific situations (provided the one reading knows what they do). And I can tell you just from testing this morning that the Row_Number() query has a performance improvement over the left joins and isnull tests I had been using previously.

    Do they? How big is it? Is that before or after you make a clustered index on the four significant columns? In figure 4 it looks like a well-indexed table is faster when queried by the sub-query approach. And since this table is going to be used in lots and lots of joins besides the one in the article, it needs to be well-indexed anyway.

    The comparison of query plans in the article is skewed because the 14 nested subqueries can be replaced by a single select with 4 left outer joins, and it is that query plan comparison I'd be interested in seeing.

    Also, isn't that subquery getting the max release version and standard cost as well, while the Row_Number() version is not?

    The MAX release version is inherently pulled by the ORDER BY ... DESC. The standard cost that is selected is the one that is associated with that particular minor version and release version for a product.

  • Well, exactly. I wouldn't compare this to subqueries, I'd compare it to using top and a sort. Is it faster than that? I'd expect them to be about the same -- I don't see what Row_Number() gets us.

  • wbrianwhite (5/12/2009)


    In what way was simple set logic unable to perform this operation? It looks like this could be solved with left outer joins and is null tests. I don't have the test db mentioned, so my > may be wrong, it could need to be a <, I can never remember without testing it. But something like this should do the trick

    There's always more than one way to skin a cat. The question is twofold: (1) What's the performance of the alternative solution and (2) is the alternative solution easier to manage? I think the demonstration here was that the performance and manageability are both improved with CTEs and ranking/windowing functions.

  • Mike C (5/13/2009)


    There's always more than one way to skin a cat. The question is twofold: (1) What's the performance of the alternative solution and (2) is the alternative solution easier to manage? I think the demonstration here was that the performance and manageability are both improved with CTEs and ranking/windowing functions.

    But the performance of the subquery approach was actually 100% - 500% BETTER than the performance of the ROW_NUMBER() approach, when the table in question was properly indexed (and tables should always be properly indexed). From Figure 4:

    Rows - Sub - Row_Number()

    100000 - 2.2258 - 5.97198

    1000000 -14.3881 - 83.7228

    Ignoring a 100% - 500% performance improvement to use a nifty new trick is what I would consider suicidal.

  • wbrianwhite (5/13/2009)


    Mike C (5/13/2009)


    There's always more than one way to skin a cat. The question is twofold: (1) What's the performance of the alternative solution and (2) is the alternative solution easier to manage? I think the demonstration here was that the performance and manageability are both improved with CTEs and ranking/windowing functions.

    But the performance of the subquery approach was actually 100% - 500% BETTER than the performance of the ROW_NUMBER() approach, when the table in question was properly indexed (and tables should always be properly indexed). From Figure 4:

    Rows - Sub - Row_Number()

    100000 - 2.2258 - 5.97198

    1000000 -14.3881 - 83.7228

    Ignoring a 100% - 500% performance improvement to use a nifty new trick is what I would consider suicidal.

    When you say "properly indexed" you're referring to the clustered index on the versioning columns, correct? I'm trying to think of situations where I would use my single clustered index per table on the versioning columns. I think a more realistic example would be no index on the versioning columns, or nonclustered index on them in which case you might have to contend with RID lookups which can be even more expensive (unless you create covering indexes of course).

  • Thanks for introducing the concept of Row_Number(). A good example to quickly understand the concept.

    Just out of curiosity I thought: "What if we use CTE?"

    ------------------------------------------------------------------------------------------------

    With MaxVer as

    (

    SELECT ProductID, MAX(Version) AS Version

    from Production.ProductVersion WITH (NOLOCK)

    Group by ProductID

    ),

    MaxMinorVer as

    (

    Select PV.ProductID, PV.Version, MAX(MinorVersion) as MinorVersion

    From Production.ProductVersion PV WITH (NOLOCK)

    Inner Join MaxVer MV on MV.ProductID = PV.ProductID and MV.Version = PV.Version

    Group by PV.ProductID, PV.Version

    ),

    MaxReleaseVersion as

    (

    Select PV.ProductID, PV.Version, PV.MinorVersion, MAX(ReleaseVersion) as ReleaseVersion

    From Production.ProductVersion PV WITH (NOLOCK)

    Inner Join MaxMinorVer MV on MV.ProductID = PV.ProductID and MV.Version = PV.Version and MV.MinorVersion = PV.MinorVersion

    Group by PV.ProductID, PV.Version, PV.MinorVersion

    )

    Select PV.* from Production.ProductVersion PV

    Inner Join MaxReleaseVersion MV WITH (NOLOCK) on MV.ProductID = PV.ProductID and MV.Version = PV.Version

    and MV.MinorVersion = PV.MinorVersion and MV.ReleaseVersion = PV.ReleaseVersion

    order by PV.ProductID

    -------------------------------------------------------------------------------------------------

    And I found out:

    Following is the subtree cost (CTE vs Row_Number()) for 100000 rows:

    CTE: 0.670628

    Row_Number(): 5.97198

    Just an other approach to the example.

    🙂

  • Pritesh Keniya (5/21/2009)


    Thanks for introducing the concept of Row_Number(). A good example to quickly understand the concept.

    Just out of curiosity I thought: "What if we use CTE?"

    ------------------------------------------------------------------------------------------------

    With MaxVer as

    (

    SELECT ProductID, MAX(Version) AS Version

    from Production.ProductVersion WITH (NOLOCK)

    Group by ProductID

    ),

    MaxMinorVer as

    (

    Select PV.ProductID, PV.Version, MAX(MinorVersion) as MinorVersion

    From Production.ProductVersion PV WITH (NOLOCK)

    Inner Join MaxVer MV on MV.ProductID = PV.ProductID and MV.Version = PV.Version

    Group by PV.ProductID, PV.Version

    ),

    MaxReleaseVersion as

    (

    Select PV.ProductID, PV.Version, PV.MinorVersion, MAX(ReleaseVersion) as ReleaseVersion

    From Production.ProductVersion PV WITH (NOLOCK)

    Inner Join MaxMinorVer MV on MV.ProductID = PV.ProductID and MV.Version = PV.Version and MV.MinorVersion = PV.MinorVersion

    Group by PV.ProductID, PV.Version, PV.MinorVersion

    )

    Select PV.* from Production.ProductVersion PV

    Inner Join MaxReleaseVersion MV WITH (NOLOCK) on MV.ProductID = PV.ProductID and MV.Version = PV.Version

    and MV.MinorVersion = PV.MinorVersion and MV.ReleaseVersion = PV.ReleaseVersion

    order by PV.ProductID

    -------------------------------------------------------------------------------------------------

    And I found out:

    Following is the subtree cost (CTE vs Row_Number()) for 100000 rows:

    CTE: 0.670628

    Row_Number(): 5.97198

    Just an other approach to the example.

    🙂

    This approach is fine. The estimate even indicates that it is faster. One of my main goals was to show that if changes needed to be made to the code, ROW_NUMBER() offers one place for the change to be made rather than cascading through different sections. Ultimately it is up to the programmers to decide what suits their needs the best.

  • I had come to this conclusion independently. It is nice to see collaboration.

    One difference is that in my test case, the ROW_NUMBER was always faster even when properly indexed.

    One other thing I noticed; my subquery returns ties, whereas ROW_NUMBER does not. In order to return ties (two or more emp_id with the same comm), I used RANK() instead of ROW_NUMBER.

    The RANK() solution took about twice as long as the ROW_NUMBER() solution, but returned the same results as the correlated subquery. It was still a faster solution, and the IO was much less

    -- create clustered index cicomm on comm(emp_id,comm desc)

    set statistics time on

    set statistics io on

    DBCC DROPCLEANBUFFERS WITH NO_INFOMSGS -- clear all data cache

    DBCC FREEPROCCACHE WITH NO_INFOMSGS -- clear stored procedure cache

    print '********************CORRELATED SUBQUERY'

    select *

    from comm a

    where a.comm in (select top 3 comm from comm cq

    where cq.emp_id = a.emp_id

    order by comm desc )

    order by emp_id, comm desc

    ;

    DBCC DROPCLEANBUFFERS WITH NO_INFOMSGS -- clear all data cache

    DBCC FREEPROCCACHE WITH NO_INFOMSGS -- clear stored procedure cache

    print '********************ROW_NUMBER'

    select *

    from (

    select *, ROW_NUMBER() over(partition by emp_id order by emp_id, comm desc ) rank

    from comm) as a

    where rank <= 3

    DBCC DROPCLEANBUFFERS WITH NO_INFOMSGS -- clear all data cache

    DBCC FREEPROCCACHE WITH NO_INFOMSGS -- clear stored procedure cache

    print '********************RANK'

    select *

    from (

    select *, RANK() over(partition by emp_id order by emp_id, comm desc ) rank

    from comm) as a

    where rank <= 3

    Excerpt from messages:

    ********************CORRELATED SUBQUERY

    (29999 row(s) affected)

    Table 'comm'. Scan count 430135, logical reads 1375435, physical reads 4, read-ahead reads 1119, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:

    CPU time = 983 ms, elapsed time = 1748 ms.

    ********************ROW_NUMBER

    (29997 row(s) affected)

    Table 'comm'. Scan count 1, logical reads 1123, physical reads 4, read-ahead reads 1119, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:

    CPU time = 156 ms, elapsed time = 381 ms.

    ********************RANK

    (29999 row(s) affected)

    Table 'comm'. Scan count 1, logical reads 1123, physical reads 3, read-ahead reads 1119, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    (1 row(s) affected)

    SQL Server Execution Times:

    CPU time = 343 ms, elapsed time = 567 ms.

  • Just another great article. I've recently discovered Window functions in SQL Server 2005 and they certainly eliminate some of the overhead caused by subqueries and allow me to get data out easier and quicker. 🙂

Viewing 15 posts - 16 through 30 (of 60 total)

You must be logged in to reply to this topic. Login to reply