What is faster, MAX() with GROUP BY, OR CTE with ROW_NUMBER()?

  • I have to manage various kinds of "data" and maintain "versions" of that data. To simplify the problem I have created a dummy data table and a dummy version table that has foreign key to the data id. I could just type the rest in English but the SQL code speaks for itself. My questions is at the bottom in SQL comments. Need to figure out which is faster because I'm getting sporadic results. Sometimes one is faster and sometimes the other.

    One can simply copy / paste the SQL and run it

    Thanks in advance.

    ---------------------------------

    CREATE TABLE SomeData (

    SomeDataIdBIGINT IDENTITY(1, 1) NOT NULL PRIMARY KEY,

    SomeTextVARCHAR(64),

    CreateDtDATETIME NOT NULL DEFAULT (GETDATE()),

    )

    CREATE TABLE SomeDataVersion (

    SomeDataVersionIdBIGINT IDENTITY(1, 1) NOT NULL,

    SomeDataIdBIGINT NOT NULL REFERENCES SomeData(SomeDataId),

    CreateDtDATETIME NOT NULL DEFAULT (GETDATE())

    )

    DECLARE @Id BIGINT

    INSERT INTO SomeData (SomeText) VALUES ('DontCare'); SELECT @Id = SCOPE_IDENTITY();

    INSERT INTO SomeDataVersion(SomeDataId) VALUES (@Id);

    INSERT INTO SomeDataVersion(SomeDataId) VALUES (@Id);

    INSERT INTO SomeData (SomeText) VALUES ('DontCare'); SELECT @Id = SCOPE_IDENTITY();

    INSERT INTO SomeDataVersion(SomeDataId) VALUES (@Id);

    INSERT INTO SomeDataVersion(SomeDataId) VALUES (@Id);

    INSERT INTO SomeDataVersion(SomeDataId) VALUES (@Id);

    INSERT INTO SomeData (SomeText) VALUES ('DontCare'); SELECT @Id = SCOPE_IDENTITY();

    INSERT INTO SomeDataVersion(SomeDataId) VALUES (@Id);

    INSERT INTO SomeDataVersion(SomeDataId) VALUES (@Id);

    INSERT INTO SomeDataVersion(SomeDataId) VALUES (@Id);

    INSERT INTO SomeDataVersion(SomeDataId) VALUES (@Id);

    SELECT * FROM SomeData

    SELECT * FROM SomeDataVersion

    -- Which is FASTER?

    -- THIS?

    SELECT SomeDataId, MAX(SomeDataVersionId)

    FROM SomeDataVersion t

    GROUP BY SomeDataId

    -- OR THIS?

    ;WITH cte AS (

    SELECT

    t.SomeDataVersionId,

    t.SomeDataId,

    ROW_NUMBER() OVER (PARTITION BY t.SomeDataId ORDER BY t.SomeDataVersionId DESC) AS RowNum

    FROM SomeDataVersion t

    )

    SELECT cte.SomeDataId, cte.SomeDataVersionId FROM cte WHERE RowNum = 1

  • Row_Number() is rarely faster, and usually only due to bad indexing/heaps, or because the data is incredibly limited that it has to number before moving forward. On large data tests it has failed rather badly on a number of benchmarks I've built.

    Use MAX() or CrossApply as your default go-tos for logging table pulls. Use MAX() as your primary if you can't add supporting indexes for the Cross Apply (in particular because you need one in a DESC order for the TOP 1 clause).


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions[/url] | Forum Netiquette
    For index/tuning help, follow these directions.[/url] |Tally Tables[/url]

    Twitter: @AnyWayDBA

  • I think the experts would say "It Depends"[/url]. I found this article by The Dixie Flatline to be extremely informative.

  • Pagan DBA (8/1/2011)


    -- Which is FASTER?

    Do what any of us would do... TEST IT and see! 😉

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • You guys actually test your code? More power to you :hehe:.

  • Performance is dependant on the amount of data and indexing. Even with indexes on the columns you are using for the ROW_NUMBER(), you still need to pull all results and then rank them based on your partiioning and ordering. If you just need the top 1, as suggested before, I have found that a subquery with MAX and GROUP BY is much quicker. I have also used CROSS APPLY as someone else suggested and it was faster than the ROW_NUMBER() method. I would limit ROW_NUMBER() and it's siblings to when you actually need to rank something vs pulling top 1 type operations.

  • Seriously, I did say I was getting sporadic results. How would I know that if I hadn't tested?

    The idea behind the question was to get some insight into SQL and SQL Server.

    Thanks to the gents who actually gave useful answer regarding ROW_NUMBER() rarely being faster on larger datasets and setting me straight on MAX()/GROUP BY

    As an aside, I'm sorry but BIT(1) = 1/0 is my preferred choice to VARCHAR(1) = Y/N. This is because in spite of my efforts, I have always found a bozo who will circumvent whatever protection I offer using database constraints to but thinks like B (both?!?!?!) or U (unknown?!?!?!?) and mess up my application. Can't put say 9 in a BIT(1) can we now?

    I really don't want to have to TEST my code to find this out. Prevention is the best cure. Y/N is for report writers, and they can write the logic for display.

    Apologies for hijacking my own thread.

  • I assure you nobody wanted to insult you.

    The bottom line is that each server / app is different and that I rarely if ever see something really standard accross the board.

    The only rule I've seen stay true all those years is the less work the server does the better. As annoying as this explaination is it always hold true.

    With that in mind, you can read this to understand a little better how the server works and how to see what's going on with each of your queries :

    http://www.sqlservercentral.com/articles/books/65831/

  • Pagan DBA (8/2/2011)


    ...

    As an aside, I'm sorry but BIT(1) = 1/0 is my preferred choice to VARCHAR(1) = Y/N. This is because in spite of my efforts, I have always found a bozo who will circumvent whatever protection I offer using database constraints to but thinks like B (both?!?!?!) or U (unknown?!?!?!?) and mess up my application. Can't put say 9 in a BIT(1) can we now?

    ...

    First of all you can put 9 in a BIT(1) , it will just turn to 1 😀

    The second: it's not only 0 and 1 can be there... There is a third possible value: NULL

    So, you can always find a hardcore bozo who will circumvent whatever protection you can setup and mess up your application. You just need to look in right places :hehe:

    _____________________________________________
    "The only true wisdom is in knowing you know nothing"
    "O skol'ko nam otkrytiy chudnyh prevnosit microsofta duh!":-D
    (So many miracle inventions provided by MS to us...)

    How to post your question to get the best and quick help[/url]

  • Pagan DBA (8/2/2011)


    As an aside, I'm sorry but BIT(1) = 1/0 is my preferred choice to VARCHAR(1) = Y/N. This is because in spite of my efforts, I have always found a bozo who will circumvent whatever protection I offer using database constraints to but thinks like B (both?!?!?!) or U (unknown?!?!?!?) and mess up my application. Can't put say 9 in a BIT(1) can we now?

    If you're discussing this line from my signature:

    Space is cheap! nVARCHAR(1) all your BITs today!

    It's a joke. 😀 8 bits takes up one byte, 8 nVARCHAR(1)'s (Nullable) takes up 32 bytes. Per record. 😉


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions[/url] | Forum Netiquette
    For index/tuning help, follow these directions.[/url] |Tally Tables[/url]

    Twitter: @AnyWayDBA

  • Pagan DBA (8/2/2011)


    Seriously, I did say I was getting sporadic results. How would I know that if I hadn't tested?

    The idea behind the question was to get some insight into SQL and SQL Server.

    Seriously... the 9 rows you posted isn't a performance test. The reason why you're getting sporadic results is because both run so fast with that small bit of data that you tested with that even the heartbeat of SQL Server will interfer with such a paltry "test".

    It turns out that the gents gave the correct answer [font="Arial Black"]but YOU still don't really know [/font]because YOU haven't learned how to test and they provided no evidence, not even a link, that they're correct in what they say. What would you have done, as so often happens on forums, if they gave the wrong answer? Would you still believe them just because there were two of them? Of course you would because you just did that very same thing. 😉

    Since you still don't really know if they're right or not, do like I first suggested... test it.

    Here's your first lesson on how to make enough test data for this particular comparison. 😉 I'll let you play with indexes.

    USE TempDB

    ;

    IF OBJECT_ID('TempDB.dbo.SomeDataVersion','U') IS NOT NULL

    DROP TABLE TempDB.dbo.SomeDataVersion

    ;

    SELECT TOP (1000000)

    SomeDataVersionID = IDENTITY(INT,1,1),

    SomeDataID = ABS(CHECKSUM(NEWID()))%250000+1,

    CreateDt = (RAND(CHECKSUM(NEWID()))*DATEDIFF(dd,'2010','2020'))+CAST('2010' AS DATETIME)

    INTO dbo.SomeDataVersion

    FROM sys.all_columns ac1,

    sys.all_columns ac2

    ;

    Protect yourself. Believe no one when it comes to questions like "Which is faster?"... not even me. 😉 Always do your own performance testing and learn what performance testing entails. I assure you that it doesn't have anything to do with 9 rows like you posted or even 10,000 rows.

    Performance testing also has nothing to do with returning the data to the screen. 😉

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Just as a minor point you should never use varchar(1) anyway - it takes up more storage than a char(1), regardless of if they both have a value in them or not.

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • Err...the code sample I provided was to serve to explain my question. The sporadic results were on my REAL problem where I have several pieces of data EACH versioned and unpredictable data distributions in terms of how many versions each piece of data has. I couldn't possibly post sample code for all scenarios here. As has been explained to me in multiple replies, my problem is specific and I get that. Unfortunately I guess my code sample was perceived in a way one would think I'm asking a general question, which I wasn't.

    As to what I really don't know yet, it is true I'm not as educated as most people here as far as SQL is concerned. The reason I joined this forum - to learn.

    Finally, one gent DID provide a link on how to write a test. "It Depends" - where performance of TOP(1), MAX()/GROUP BY and ROW_NUMBER() is being compared. The single most succinct, useful and unpatronizing reply to my original question and which is what I plan to use by experimenting with various indexes on my tables.

    Thanks all.

  • Pagan DBA (8/3/2011)


    The reason I joined this forum - to learn.

    Not sure it will help at this point, but I recognized that. Looking back at what I posted, I can see why you might have thought so, but I wasn't trying to be patronizing.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • I published a series of tests on this subject for retrieving versioned data over here at Simple-Talk[/url]. It might prove useful. The answer, in a nutshell, it depends.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

Viewing 15 posts - 1 through 14 (of 14 total)

You must be logged in to reply to this topic. Login to reply