ROW_NUMBER(): An Efficient Alternative to Subqueries

  • Thanks for comparing the window function to the subquery. Nice article.

    Jason...AKA CirqueDeSQLeil
    _______________________________________________
    I have given a name to my pain...MCM SQL Server, MVP
    SQL RNNR
Posting Performance Based Questions - Gail Shaw
    Learn Extended Events

  • A few alternatives (just for fun)...

--row_number() version [generally the preferred approach, I would imagine]

; with t1 as (
    select *,
           row_number() over (partition by ProductID
                              order by Version desc, MinorVersion desc,
                                       ReleaseVersion desc) as Row
    from Production.ProductVersion2
)
select * from t1 where Row = 1 order by ProductID

--/

--without row_number() [assuming all version numbers < 1000000, so the packed values can't collide]

; with t1 as (
    select *,
           Version * 10000000000000 + MinorVersion * 1000000 + ReleaseVersion as FullVersion
    from Production.ProductVersion2
)
, t2 as (
    select ProductID, max(FullVersion) as MaxFullVersion
    from t1
    group by ProductID
)
select * from t1
where exists (select * from t2
              where t2.ProductID = t1.ProductID   --correlate on the product, not just the version
                and t1.FullVersion = t2.MaxFullVersion)
order by ProductID

--/

--Equivalent for SQL 2000 [assuming all version numbers < 1000000]

select * from
    (select *,
            Version * 10000000000000 + MinorVersion * 1000000 + ReleaseVersion as FullVersion
     from Production.ProductVersion2) a
where exists
    (select * from
        (select ProductID, max(FullVersion) as MaxFullVersion
         from (select *,
                      Version * 10000000000000 + MinorVersion * 1000000 + ReleaseVersion as FullVersion
               from Production.ProductVersion2) b
         group by ProductID) c
     where c.ProductID = a.ProductID
       and a.FullVersion = c.MaxFullVersion)
order by ProductID

--/

--Alternative for SQL 2000 [assuming all version numbers < 1000000]

select * from Production.ProductVersion2 a
where Version * 10000000000000 + MinorVersion * 1000000 + ReleaseVersion =
    (select top 1 Version * 10000000000000 + MinorVersion * 1000000 + ReleaseVersion as FullVersion
     from Production.ProductVersion2
     where ProductID = a.ProductID
     order by Version desc, MinorVersion desc, ReleaseVersion desc)
order by ProductID

--/
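(To make the packing arithmetic concrete: version 2.5.10 encodes as 2 * 10000000000000 + 5 * 1000000 + 10 = 20000005000010, so comparing FullVersion values compares all three components at once.)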

    Ryan Randall

    Solutions are easy. Understanding the problem, now, that's the hard part.

  • RBarryYoung (5/12/2009)


    tony rogerson (5/12/2009)


Non-deterministic functions like ROW_NUMBER and NEWID need to be used carefully - my blog article: http://sqlblogcasts.com/blogs/tonyrogerson/archive/2008/06/12/bug-in-inline-expansion-of-non-deterministic-functions-in-derived-tables-and-cte-s-causes-incorrect-results.aspx explains the serious bug that exists in SQL Server 2005 when using them.

    Afraid that I do not understand this. AFAIK, CTEs are table expressions like non-indexed views. If you write them non-deterministically (which is what you do in your blog article), then they return non-deterministic results. And if you reference a non-deterministic table expression multiple times, you are not guaranteed to get the same results. That's straight ANSI SQL to the best of my knowledge and I see no bug in Transact-SQL demonstrated here.

And as far as ROW_NUMBER() goes, it only returns non-deterministic results if the OVER clause parameters are non-deterministic. If you use a deterministic ORDER BY (which also requires that the column set is uniquely orderable), then you will get deterministic results.

    All this really demonstrates is that if you input non-deterministic expressions without instantiating them, then you will get non-deterministic results back. Garbage In, Garbage Out.

    Couldn't agree more. A regular CTE is nothing more than a more flexible and better-laid-out derived table. Just because you can reference a side-effecting function like NEWID() in a CTE, reference the CTE twice and get two different results from each reference doesn't make it a bug! If you try to write the equivalent query using derived tables, you'll have to duplicate the code, making the coding error clear to see.

    It's no more surprising than the fact that the following poorly-written code (attempting to produce a random integer from 1 to 3) produces NULL quite frequently:

    SELECT CASE FLOOR(RAND(CHECKSUM(NEWID())) * 3 + 1) WHEN 1 THEN 'One' WHEN 2 THEN 'Two' WHEN 3 THEN 'Three' ELSE NULL END;

    SQL Server does not guarantee that scalar functions will be called a certain number of times, or in any particular order. It is up to the programmer to avoid potentially problematic use of such side-effecting functions.
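One reliable way to avoid the problem is to capture the non-deterministic value exactly once, in a variable, before branching on it. A minimal sketch:

DECLARE @n int;
SET @n = FLOOR(RAND(CHECKSUM(NEWID())) * 3 + 1);  -- evaluated exactly once
SELECT CASE @n WHEN 1 THEN 'One'
               WHEN 2 THEN 'Two'
               WHEN 3 THEN 'Three' END;           -- now never NULL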

    Paul

  • Great article!! We use this exact technique in our ETL and it's great to see such solid confirmation that we have chosen a good strategy.

    Two comments.

    1. Why did you include ProductID in your ORDER BY clause?

OVER(PARTITION BY ProductID
     ORDER BY ProductID,
              Version DESC,

    It doesn't seem necessary for determining the latest version?

2. I have seen about a 60% decrease in cost by sorting the index in the same direction as the ORDER BY clauses; in this case, DESC. Could you try that for us and report back on your results?

    Thanks a lot!

    ---------------------------
    |Ted Pin >>

  • DISCLAIMER: I do not consider myself to be a "DBA", and I don't have any sort of certification.

    I'm going to have to disagree with most of you that this was a great article. While ROW_NUMBER() certainly has its uses, the example given in the article seems contrived and is anything but efficient in my testing.

First off, the first query provided was a joke. If anyone wrote a monster like that I would be giving them that look. You know what look I'm talking about: the "Seriously? Who hired you?" look. In fact, I can't believe the author took the time to write something like that for the article. Essentially I'm saying that the comparison in the article is unfair: the author is comparing a crappy query to an even crappier query.

I used a similar approach to generate a random ProductVersion table with 50 versions for each of 1,000,000 products (50 million rows). I then compared his query to my own using SSMS's built-in "client statistics", taking the average values obtained from running each query 3 times (I could have done more, but as you will see later, I was getting aggravated waiting for his query to finish). I directed the output to a file each time, and when I was finished I compared the contents of the files using the command-line tools "sort" and "fc" to ensure they were giving the same results.

    Here is my version of the query:

SELECT x.*
FROM (
    SELECT DISTINCT ProductId
    FROM dbo.ProductVersion
) pv
CROSS APPLY (
    SELECT TOP (1) *
    FROM dbo.ProductVersion
    WHERE ProductId = pv.ProductId
    ORDER BY Version DESC, MinorVersion DESC, ReleaseVersion DESC
) x

First off, I find this MUCH easier to read. It's less than half the size and just as easy to modify, but the biggest reason I like this query more is that if you paraphrase what the query is doing, you sound like you are describing the problem: "Grab the ProductIds out of the ProductVersion table, and for each ProductId get the row with the most recent version." In the future, you could replace the query in the CROSS APPLY with a function like dbo.udf_latest_version_for_product(pv.ProductId) or something similar. This drops the total lines down to 6 and gives you a reusable function to work with.
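A sketch of what that function might look like as an inline table-valued function (the parameter type and column list are assumptions about the ProductVersion schema; an inline TVF is expanded into the calling query, unlike a multi-statement UDF, which matters to the performance discussion later in this thread):

CREATE FUNCTION dbo.udf_latest_version_for_product (@ProductId int)
RETURNS TABLE
AS
RETURN
(
    -- Latest version row for one product, newest first
    SELECT TOP (1) ProductId, Version, MinorVersion, ReleaseVersion
    FROM dbo.ProductVersion
    WHERE ProductId = @ProductId
    ORDER BY Version DESC, MinorVersion DESC, ReleaseVersion DESC
);

The CROSS APPLY block above would then collapse to CROSS APPLY dbo.udf_latest_version_for_product(pv.ProductId) x.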

And now, most importantly, here are the execution results I got running these queries on my 50 million row table. This was run on my work laptop (a Core 2 Duo with your typical not-spectacular HDD and 4 GB of RAM) on a local SQL Server 2008 database.

    The Author's ROW_NUMBER() query:

                              Trial 3   Trial 2   Trial 1      Average
Client processing time         233120    241950    241094  238721.3000
Total execution time           288418    359298    282119  309945.0000
Wait time on server replies     55298    117348     41025   71223.6600

My CROSS APPLY query:

                              Trial 3   Trial 2   Trial 1      Average
Client processing time           2075      2246      2379    2233.3330
Total execution time             8546      8796     14732   10691.3300
Wait time on server replies      6471      6550     12353    8458.0000

I do not know the units that SSMS uses for those measurements or what exactly each metric means. I looked mostly at "Wait time on server replies" as it sounds like a good indicator of how long it took the database engine to process the query. As far as results, I think it's enough to notice that there's almost an order of magnitude difference. I don't think I could possibly spin this in any way that would make the author's query seem "efficient". On average I waited about 15 seconds for my query to complete, whereas I waited about 5 minutes or more for the author's query. I did NOT try to tweak his query in any way. This post is only meant to point out the weaknesses in the article and show people a valid alternative.

    Why did I use a 50 million row test set? C'mon people, everyone knows that if you use a small enough test set everything seems efficient. When I'm dealing with a table that has a couple thousand rows in it, and I know this table won't get much larger, I focus all my energy on writing clear, concise, and easy to maintain queries. I only address efficiency if it somehow becomes a problem, although it rarely does.

    Again I will point out my disclaimer above and my tests were hardly scientific, but when there's such a large difference, it doesn't seem to matter much.

  • First, please note the subtitle of the article, An Efficient Alternative to Subqueries. There were no claims about it being the best, just an alternative to subqueries. From the results you discovered, it looks like you have found an alternative to ROW_NUMBER(). The purpose of this article was to demonstrate that there are different ways to achieve the same end.

    --edit

Also, the code was written on SQL Server 2005 on a laptop with 2 GB of RAM and a Core 2 CPU. Can you try your alternative on a 2005 instance and report the results?

    Matt Arpidone (1/29/2010)


    DISCLAIMER: I do not consider myself to be a "DBA", and I don't have any sort of certification.

    I'm going to have to disagree with most of you that this was a great article. While ROW_NUMBER() certainly has its uses, the example given in the article seems contrived and is anything but efficient in my testing.

First off, the first query provided was a joke. If anyone wrote a monster like that I would be giving them that look. You know what look I'm talking about: the "Seriously? Who hired you?" look. In fact, I can't believe the author took the time to write something like that for the article. Essentially I'm saying that the comparison in the article is unfair: the author is comparing a crappy query to an even crappier query.

I used a similar approach to generate a random ProductVersion table with 50 versions for each of 1,000,000 products (50 million rows). I then compared his query to my own using SSMS's built-in "client statistics", taking the average values obtained from running each query 3 times (I could have done more, but as you will see later, I was getting aggravated waiting for his query to finish). I directed the output to a file each time, and when I was finished I compared the contents of the files using the command-line tools "sort" and "fc" to ensure they were giving the same results.

    Here is my version of the query:

SELECT x.*
FROM (
    SELECT DISTINCT ProductId
    FROM dbo.ProductVersion
) pv
CROSS APPLY (
    SELECT TOP (1) *
    FROM dbo.ProductVersion
    WHERE ProductId = pv.ProductId
    ORDER BY Version DESC, MinorVersion DESC, ReleaseVersion DESC
) x

First off, I find this MUCH easier to read. It's less than half the size and just as easy to modify, but the biggest reason I like this query more is that if you paraphrase what the query is doing, you sound like you are describing the problem: "Grab the ProductIds out of the ProductVersion table, and for each ProductId get the row with the most recent version." In the future, you could replace the query in the CROSS APPLY with a function like dbo.udf_latest_version_for_product(pv.ProductId) or something similar. This drops the total lines down to 6 and gives you a reusable function to work with.

And now, most importantly, here are the execution results I got running these queries on my 50 million row table. This was run on my work laptop (a Core 2 Duo with your typical not-spectacular HDD and 4 GB of RAM) on a local SQL Server 2008 database.

    The Author's ROW_NUMBER() query:

                              Trial 3   Trial 2   Trial 1      Average
Client processing time         233120    241950    241094  238721.3000
Total execution time           288418    359298    282119  309945.0000
Wait time on server replies     55298    117348     41025   71223.6600

My CROSS APPLY query:

                              Trial 3   Trial 2   Trial 1      Average
Client processing time           2075      2246      2379    2233.3330
Total execution time             8546      8796     14732   10691.3300
Wait time on server replies      6471      6550     12353    8458.0000

I do not know the units that SSMS uses for those measurements or what exactly each metric means. I looked mostly at "Wait time on server replies" as it sounds like a good indicator of how long it took the database engine to process the query. As far as results, I think it's enough to notice that there's almost an order of magnitude difference. I don't think I could possibly spin this in any way that would make the author's query seem "efficient". On average I waited about 15 seconds for my query to complete, whereas I waited about 5 minutes or more for the author's query. I did NOT try to tweak his query in any way. This post is only meant to point out the weaknesses in the article and show people a valid alternative.

    Why did I use a 50 million row test set? C'mon people, everyone knows that if you use a small enough test set everything seems efficient. When I'm dealing with a table that has a couple thousand rows in it, and I know this table won't get much larger, I focus all my energy on writing clear, concise, and easy to maintain queries. I only address efficiency if it somehow becomes a problem, although it rarely does.

    Again I will point out my disclaimer above and my tests were hardly scientific, but when there's such a large difference, it doesn't seem to matter much.

The reason I had the index sorted in the opposite direction was to demonstrate the structure of the table. The query itself represents an actual requirement, and the structure of the table might not necessarily be modifiable. Sorting the index in the order of the requirement will definitely give you a boost in performance.

    As for the ORDER BY clause, you don't need the partitioning column in there. I included it just to demonstrate how the row numbers will be assigned.
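Putting both answers together, the trimmed-down query and a matching descending index might look like this (a sketch; the index name is illustrative):

SELECT *
FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY ProductID
                              ORDER BY Version DESC, MinorVersion DESC,
                                       ReleaseVersion DESC) AS Row
    FROM Production.ProductVersion2
) t
WHERE Row = 1;

-- Index sorted in the same direction as the ORDER BY, per Ted's suggestion
CREATE INDEX IX_ProductVersion2_LatestVersion
    ON Production.ProductVersion2
       (ProductID, Version DESC, MinorVersion DESC, ReleaseVersion DESC);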

    Ted Pin (1/29/2010)


    Great article!! We use this exact technique in our ETL and it's great to see such solid confirmation that we have chosen a good strategy.

    Two comments.

    1. Why did you include ProductID in your ORDER BY clause?

OVER(PARTITION BY ProductID
     ORDER BY ProductID,
              Version DESC,

    It doesn't seem necessary for determining the latest version?

2. I have seen about a 60% decrease in cost by sorting the index in the same direction as the ORDER BY clauses; in this case, DESC. Could you try that for us and report back on your results?

    Thanks a lot!

I gave it a shot in a VM we have at work running SQL Server 2005. I'm a little unclear on the exact hardware it's running on, but I believe it's a dual-core Xeon with 3 GB of RAM allocated to the VM. I expected the results to be a little slower and slightly more erratic since there are other VMs running on the same machine, although at the time I performed these tests the machine load should have been very low.

    Here's the data:

    ROW_NUMBER:

                              Trial 5   Trial 4   Trial 3   Trial 2   Trial 1      Average
Client processing time           4479      5322      5042      5011      4791    4929.0000
Total execution time           279921    281011    282715    273348    288659  281130.8000
Wait time on server replies    275442    275689    277673    268337    283868  276201.8000

CROSS APPLY:

                              Trial 5   Trial 4   Trial 3   Trial 2   Trial 1      Average
Client processing time          64775    104076     68224    101766    111740   90116.2000
Total execution time           117143    147420    111318    152149    166852  138976.4000
Wait time on server replies     52368     43344     43094     50383     55112   48860.2000

According to these numbers the CROSS APPLY query is many times faster than the ROW_NUMBER query; however, each time I ran it, the CROSS APPLY took around 2:20 and the ROW_NUMBER query took around 5:00, so by wall-clock time it was only about twice as fast. I'm not sure what the discrepancy is there.

I guess optimizations have been added in SQL Server 2008 that make APPLY an even more attractive option.

  • Matt Arpidone (1/29/2010)


    DISCLAIMER: I do not consider myself to be a "DBA", and I don't have any sort of certification.

    I'm going to have to disagree with most of you that this was a great article. While ROW_NUMBER() certainly has its uses, the example given in the article seems contrived and is anything but efficient in my testing.

First off, the first query provided was a joke. If anyone wrote a monster like that I would be giving them that look. You know what look I'm talking about: the "Seriously? Who hired you?" look. In fact, I can't believe the author took the time to write something like that for the article. Essentially I'm saying that the comparison in the article is unfair: the author is comparing a crappy query to an even crappier query.

I know Steve is always looking for smart, talented members of the community who have found a better way to do x, y, or z to submit articles to him for publication. It's very easy to criticize; it's a bit harder to put your own work up for peer review and harsh criticism from the Matt Arpidones of the world. This article dealt with simplifying and flattening multiple levels of subqueries into a format that's easier to read and understand, and one that proves more efficient than multiple nested subqueries (which is where many people first go when developing this kind of application). If you have a better way, I'd recommend publishing an article of your own.

I used a similar approach to generate a random ProductVersion table with 50 versions for each of 1,000,000 products (50 million rows). I then compared his query to my own using SSMS's built-in "client statistics", taking the average values obtained from running each query 3 times (I could have done more, but as you will see later, I was getting aggravated waiting for his query to finish). I directed the output to a file each time, and when I was finished I compared the contents of the files using the command-line tools "sort" and "fc" to ensure they were giving the same results.

    Here is my version of the query:

SELECT x.*
FROM (
    SELECT DISTINCT ProductId
    FROM dbo.ProductVersion
) pv
CROSS APPLY (
    SELECT TOP (1) *
    FROM dbo.ProductVersion
    WHERE ProductId = pv.ProductId
    ORDER BY Version DESC, MinorVersion DESC, ReleaseVersion DESC
) x

First off, I find this MUCH easier to read. It's less than half the size and just as easy to modify, but the biggest reason I like this query more is that if you paraphrase what the query is doing, you sound like you are describing the problem: "Grab the ProductIds out of the ProductVersion table, and for each ProductId get the row with the most recent version." In the future, you could replace the query in the CROSS APPLY with a function like dbo.udf_latest_version_for_product(pv.ProductId) or something similar. This drops the total lines down to 6 and gives you a reusable function to work with.

    In your article you might wish to discuss the internals of how CROSS APPLY works. It might surprise you, and just might change your mind about using a UDF. I'll leave it for you to research CROSS APPLY, as I don't want to pre-empt publication of what I expect to be an excellent article with your name on it in the near future.

And now, most importantly, here are the execution results I got running these queries on my 50 million row table. This was run on my work laptop (a Core 2 Duo with your typical not-spectacular HDD and 4 GB of RAM) on a local SQL Server 2008 database.

    The Author's ROW_NUMBER() query:

                              Trial 3   Trial 2   Trial 1      Average
Client processing time         233120    241950    241094  238721.3000
Total execution time           288418    359298    282119  309945.0000
Wait time on server replies     55298    117348     41025   71223.6600

My CROSS APPLY query:

                              Trial 3   Trial 2   Trial 1      Average
Client processing time           2075      2246      2379    2233.3330
Total execution time             8546      8796     14732   10691.3300
Wait time on server replies      6471      6550     12353    8458.0000

I do not know the units that SSMS uses for those measurements or what exactly each metric means. I looked mostly at "Wait time on server replies" as it sounds like a good indicator of how long it took the database engine to process the query. As far as results, I think it's enough to notice that there's almost an order of magnitude difference. I don't think I could possibly spin this in any way that would make the author's query seem "efficient". On average I waited about 15 seconds for my query to complete, whereas I waited about 5 minutes or more for the author's query. I did NOT try to tweak his query in any way. This post is only meant to point out the weaknesses in the article and show people a valid alternative.

    Interesting. I expect all the developers I work with to know and understand the client statistics that SQL Server generates. If one of them did not know the unit of measure or meaning of the client statistics, I would look at them like "Who hired you?" If you haven't already you might want to look into it; it is documented well in BOL and in several articles on the Web.

BTW I ran tests similar to yours on a SQL Server 2008 instance (x64, dual-core AMD 2.9 GHz, 8 GB, 3 TB HDD) and found your method to be about 25-30% faster when tested correctly. You can easily get invalid statistics if you don't clear the buffers and cache before you execute multiple queries. As a case in point, I ran your query and then ran the author's query against my sample data without clearing the buffers and cache in between runs, and found that the author's query responded about 5% faster than yours.
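For reference, this is the sort of thing I mean by clearing the buffers and cache between test runs (on a test system only, never in production):

CHECKPOINT;              -- flush dirty pages to disk first
DBCC DROPCLEANBUFFERS;   -- empty the buffer pool, so reads start cold
DBCC FREEPROCCACHE;      -- clear cached plans, forcing fresh compilations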

    Why did I use a 50 million row test set? C'mon people, everyone knows that if you use a small enough test set everything seems efficient. When I'm dealing with a table that has a couple thousand rows in it, and I know this table won't get much larger, I focus all my energy on writing clear, concise, and easy to maintain queries. I only address efficiency if it somehow becomes a problem, although it rarely does.

    Again I will point out my disclaimer above and my tests were hardly scientific, but when there's such a large difference, it doesn't seem to matter much.

    "Hardly scientific" puts quite a light touch on it after coming out swinging at the article author like the 'undisputed champion'. While the author did not address the CROSS APPLY method in this article, he did a good job of explaining how to simplify a specific class of problematic queries using ranking and windowing functions. There are other interesting tasks that can be accomplished with ranking and windowing functions that this particular article did not address (and hopefully a lot more once MS adds the additional ranking/windowing functionality that's been requested by the community). I'm personally looking forward to more articles on it, and I also look forward to improved performance from Microsoft in CTEs in the future.

  • Paul White (1/29/2010)


    RBarryYoung (5/12/2009)


    tony rogerson (5/12/2009)


Non-deterministic functions like ROW_NUMBER and NEWID need to be used carefully - my blog article: http://sqlblogcasts.com/blogs/tonyrogerson/archive/2008/06/12/bug-in-inline-expansion-of-non-deterministic-functions-in-derived-tables-and-cte-s-causes-incorrect-results.aspx explains the serious bug that exists in SQL Server 2005 when using them.

    Afraid that I do not understand this. AFAIK, CTEs are table expressions like non-indexed views. If you write them non-deterministically (which is what you do in your blog article), then they return non-deterministic results. And if you reference a non-deterministic table expression multiple times, you are not guaranteed to get the same results. That's straight ANSI SQL to the best of my knowledge and I see no bug in Transact-SQL demonstrated here.

And as far as ROW_NUMBER() goes, it only returns non-deterministic results if the OVER clause parameters are non-deterministic. If you use a deterministic ORDER BY (which also requires that the column set is uniquely orderable), then you will get deterministic results.

    All this really demonstrates is that if you input non-deterministic expressions without instantiating them, then you will get non-deterministic results back. Garbage In, Garbage Out.

    Couldn't agree more. A regular CTE is nothing more than a more flexible and better-laid-out derived table. Just because you can reference a side-effecting function like NEWID() in a CTE, reference the CTE twice and get two different results from each reference doesn't make it a bug! If you try to write the equivalent query using derived tables, you'll have to duplicate the code, making the coding error clear to see.

    It's no more surprising than the fact that the following poorly-written code (attempting to produce a random integer from 1 to 3) produces NULL quite frequently:

    SELECT CASE FLOOR(RAND(CHECKSUM(NEWID())) * 3 + 1) WHEN 1 THEN 'One' WHEN 2 THEN 'Two' WHEN 3 THEN 'Three' ELSE NULL END;

    SQL Server does not guarantee that scalar functions will be called a certain number of times, or in any particular order. It is up to the programmer to avoid potentially problematic use of such side-effecting functions.

    Paul

    What Tony is referring to is the fact that the CTE regenerates the GUID every time the "rn" column is referenced from the CTE. This is a bug. The GUID should be generated every time NEWID is referenced. NEWID is only directly referenced once (in the CTE, by the ROW_NUMBER function). Run Tony's example and look at the two "rn" columns. They should match up with one another in every row.
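For those who don't want to follow the link, the pattern in question boils down to something like this (a minimal sketch, not Tony's exact repro):

WITH Numbered AS (
    SELECT object_id,
           ROW_NUMBER() OVER (ORDER BY NEWID()) AS rn
    FROM sys.objects
)
SELECT a.rn AS rn1, b.rn AS rn2  -- rn1 and rn2 frequently differ, because each
FROM Numbered AS a               -- reference re-expands the CTE and therefore
JOIN Numbered AS b               -- re-evaluates NEWID()
    ON a.object_id = b.object_id;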

  • Mike C (1/29/2010)


    What Tony is referring to is the fact that the CTE regenerates the GUID every time the "rn" column is referenced from the CTE. This is a bug. The GUID should be generated every time NEWID is referenced. NEWID is only directly referenced once (in the CTE, by the ROW_NUMBER function). Run Tony's example and look at the two "rn" columns. They should match up with one another in every row.

    Yes I know - and I disagree, and previously explained why. Read Microsoft's replies to the Connect item entered by Itzik Ben-Gan for a fuller explanation of why this is not a bug - it's a sensible design decision.

  • Thanks Mike, I'll keep in mind that any criticism of articles or disagreement will be met with animosity. I guess my opinion isn't welcome unless I agree with the author, which, as I can clearly see now, makes total sense on a site that welcomes "peer review".

    I do not know the author at all, nor have I read any of his other work, and I was very careful in my post not to resort to empty ad hominem attacks. I did express concern about the content of the article, and if you would like to disagree with me on a particular point, feel free.

    Regarding the testing, I completely understand your concerns, although I find the aggravated tone of your post a little disturbing (maybe I'm reading it wrong, if so I apologize).

The reason I did not look up the unit of measure is that units are irrelevant when discussing relative speed differences. I would rather you not point out my deficiencies as a developer (of which there are many) unless they actually have relevance to my argument. Certainly, my lack of knowledge about "client statistics" doesn't discredit anything I've posted (unless I've missed something), and furthermore doesn't impair my ability to do my job at all - which makes me curious why you insist that every one of your developers know about them.

    I am a little worried that I seemed harsh in my post. I had previously typed up another post (in response to the author's post after mine) about how I thought his title was misleading and that caused me to misjudge the article - along with some additional disagreements - but then I decided to just drop it. I do still stand by what I've already presented, but I recognize that I seem to have missed the true purpose of the article.

    Also I find it distasteful that you point out weaknesses in my testing and indicate that you have done your own improved testing, and then you fail to post any details about what you have done. What do you mean when you say 25-30% faster when tested correctly? Faster than what? What do you mean "correctly"? Personally, I feel that my tests were adequate to get my point across. If you have numbers that disagree with mine, why don't you post them instead of giving nonconstructive retorts?

    In your article you might wish to discuss the internals of how CROSS APPLY works. It might surprise you, and just might change your mind about using a UDF. I'll leave it for you to research CROSS APPLY, as I don't want to pre-empt publication of what I expect to be an excellent article with your name on it in the near future.

    What did you mean here exactly? As I understand it, APPLY is similar to joining on a correlated subquery. Actually, I confess that my love of APPLY comes from it being so similar to a foreach loop in a typical programming language. I don't see what kind of crazy internal things could be going on that make it behave differently. I did try doing some research on how it works internally, and while I didn't find much helpful material, I did come across this article: http://explainextended.com/2009/07/16/inner-join-vs-cross-apply/. It explains a specific scenario where APPLY is more efficient than a query using ROW_NUMBER(), and it explains the minor pitfall with ROW_NUMBER(). I suspect something similar is happening with the product version query.

    Anyway, I'm not asking you to spoonfeed me a lecture all about how APPLY works, but a link to an article describing this surprising internal behavior would be appreciated since most of the stuff I'm finding gives basic overviews and is pretty much consistent with what I expected.
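For comparison, the closest plain correlated-subquery formulation of the same "latest version per product" requirement might look like this (a sketch; note it returns all tied rows, unlike TOP (1)):

SELECT pv.*
FROM dbo.ProductVersion AS pv
WHERE NOT EXISTS (
    -- no strictly newer version exists for the same product
    SELECT *
    FROM dbo.ProductVersion AS newer
    WHERE newer.ProductId = pv.ProductId
      AND (newer.Version > pv.Version
           OR (newer.Version = pv.Version
               AND newer.MinorVersion > pv.MinorVersion)
           OR (newer.Version = pv.Version
               AND newer.MinorVersion = pv.MinorVersion
               AND newer.ReleaseVersion > pv.ReleaseVersion)));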

  • Matt,

    I wouldn't take it too much to heart. There is always a risk of misunderstanding arising when communication occurs in this medium. Personally, I didn't find too much wrong with your first post, but that's just my impression.

    For what it's worth, I admit I cannot find the UDF Mike refers to in your APPLY solution. 🙂

    Paul

  • Matt Arpidone (1/30/2010)


    Thanks Mike, I'll keep in mind that any criticism of articles or disagreement will be met with animosity. I guess my opinion isn't welcome unless I agree with the author, which, as I can clearly see now, makes total sense on a site that welcomes "peer review".

    I do not know the author at all, nor have I read any of his other work, and I was very careful in my post not to resort to empty ad hominem attacks. I did express concern about the content of the article, and if you would like to disagree with me on a particular point, feel free.

    Regarding the testing, I completely understand your concerns, although I find the aggravated tone of your post a little disturbing (maybe I'm reading it wrong, if so I apologize).

The reason I did not look up the unit of measure is that units are irrelevant when discussing relative speed differences. I would rather you not point out my deficiencies as a developer (of which there are many) unless they actually have relevance to my argument. Certainly, my lack of knowledge about "client statistics" doesn't discredit anything I've posted (unless I've missed something), and furthermore doesn't impair my ability to do my job at all - which makes me curious why you insist that every one of your developers know about them.

    I am a little worried that I seemed harsh in my post. I had previously typed up another post (in response to the author's post after mine) about how I thought his title was misleading and that caused me to misjudge the article - along with some additional disagreements - but then I decided to just drop it. I do still stand by what I've already presented, but I recognize that I seem to have missed the true purpose of the article.

    Also I find it distasteful that you point out weaknesses in my testing and indicate that you have done your own improved testing, and then you fail to post any details about what you have done. What do you mean when you say 25-30% faster when tested correctly? Faster than what? What do you mean "correctly"? Personally, I feel that my tests were adequate to get my point across. If you have numbers that disagree with mine, why don't you post them instead of giving nonconstructive retorts?

    In your article you might wish to discuss the internals of how CROSS APPLY works. It might surprise you, and just might change your mind about using a UDF. I'll leave it for you to research CROSS APPLY, as I don't want to pre-empt publication of what I expect to be an excellent article with your name on it in the near future.

    What did you mean here exactly? As I understand it, APPLY is similar to joining on a correlated subquery. Actually, I confess that my love of APPLY comes from it being so similar to a foreach loop in a typical programming language. I don't see what kind of crazy internal things could be going on that make it behave differently. I did try doing some research on how it works internally, and while I didn't find much helpful material, I did come across this article: http://explainextended.com/2009/07/16/inner-join-vs-cross-apply/. It explains a specific scenario where APPLY is more efficient than a query using ROW_NUMBER(), and it explains the minor pitfall with ROW_NUMBER(). I suspect something similar is happening with the product version query.

    Anyway, I'm not asking you to spoonfeed me a lecture all about how APPLY works, but a link to an article describing this surprising internal behavior would be appreciated since most of the stuff I'm finding gives basic overviews and is pretty much consistent with what I expected.

    I was actually going through your post and responding point by point, then decided it wasn't worth it. If all you got out of my response was "criticism of articles or disagreement will be met with animosity", then perhaps you should consider an alternative reading. Maybe something along the lines of "animosity may be met with animosity"; or more accurately "someone might recommend I 'put my money where my mouth is'". Despite you considering my response as some sort of attack on you, my tone was considerably nicer than yours.

    After you called the author's work garbage because he did not discuss what you consider the optimal alternative, you proceeded to make suggestions about employability for the same reason. Nice. Then you mentioned that you don't even understand the client statistics which formed the basis of your argument! That's simply amazing.

    I merely suggested you take your issue with the author's work (and I assume you're focusing on the author's work despite the fact that you made suggestions about the author in your post) and put it on the front page with an article of your own. I then offered a few suggestions for possible topics your article could cover.

I also mentioned that all of my employees are trained to understand client statistics (and other optimization tools) that are necessary to write efficient code and optimize it properly. Whether writing and optimizing efficient code are requirements of your job or not is actually irrelevant to my statement, as it is a requirement of my job. And anyone who doesn't know how to do it will get that look. You know the one I'm talking about.

Also, as I mentioned, before running tests it's somewhat helpful to understand the role of buffers and cache, and how they can skew results severely. Oh, and here are some "relative number" results for you:

    ROW_NUMBER 1.00

    CROSS APPLY 0.73

    As you can plainly see, these "relative number" results prove that CROSS APPLY ran 27% faster than ROW_NUMBER. Nice. I won't even begin to address the kind of crazy results you can get on a poorly configured VM.

    If you're interested in learning more about CROSS APPLY get Itzik Ben-Gan's books - he gives a very good and detailed discussion of CROSS APPLY and OUTER APPLY.
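For anyone who hasn't met OUTER APPLY: it works like CROSS APPLY but keeps outer rows that produce no match, much as LEFT JOIN relates to INNER JOIN. A minimal sketch, assuming a hypothetical dbo.Product parent table:

SELECT p.ProductId, x.Version
FROM dbo.Product AS p        -- hypothetical parent table
OUTER APPLY (
    SELECT TOP (1) v.Version
    FROM dbo.ProductVersion AS v
    WHERE v.ProductId = p.ProductId
    ORDER BY v.Version DESC
) AS x;                      -- x.Version is NULL for products with no version rows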

  • Paul White (1/30/2010)


    Matt,

    I wouldn't take it too much to heart. There is always a risk of misunderstanding arising when communication occurs in this medium. Personally, I didn't find too much wrong with your first post, but that's just my impression.

    Agreed. Consider the following miscommunication as an example:

    Matt Arpidone - "In the future, you could replace the query in the CROSS APPLY with a function like dbo.udf_latest_version_for_product(pv.ProductId) or something similar. This drops the total lines down to 6 and gives you a reusable function to work with."

    Mike C - "In your article you might wish to discuss the internals of how CROSS APPLY works. It might surprise you, and just might change your mind about using a UDF. "

    Paul - "For what it's worth, I admit I cannot find the UDF Mike refers to in your APPLY solution. :-)"

    Hope that helps with the search for the UDF 🙂
