Performance of GROUP BY multiple columns

  • Comments posted to this topic are about the item Performance of GROUP BY multiple columns

    God is real, unless declared integer.

  • In this only specific example, you are right, but I think that the correct answer is "Depend" on the order of the query because of the data cache.
    The first and second query have the same execution plan, so the second query always runs better.
    Try to change the order of execution of the queries and you'll see the difference.
    The third query is missing over(PARTITION BY t.order_no)" for the COUNT. In any case it is the slowest one.
  • You are right about the missing OVER() in the third query - sorry, my fault,

    On the other side the execution plans for query one and two are not the same on my system, because it has to sort the second query by all columns in the GROUP BY, while it does not need this sort for the first query, regardless if I group by for the clustered index column(s) or some other random columns.

    Data cache should not really matter, as I asked for the CPU costs (and executed the queries several times to have a hot cache)

    God is real, unless declared integer.

  • Nice one, despite the missing OVER(.....) in Method 3.
    There are, however, instances where using windowing functions appear to be the best option to solve a specific issue, but the load  is high, with heavy parallelism..

    ____________________________________________
    Space, the final frontier? not any more...
    All limits henceforth are self-imposed.
    “libera tute vulgaris ex”

  • Ignoring the missing OVER() clause, this is still a good example.  I have seen way too many situations where DISTINCT was thrown into a query without regard to the consequences.  What really kills me is a DISTINCT placed on an outer query when it was already used on an inner query -- totally unnecessary ignoring the fact that it shouldn't have been used in the first place.  Whenever I see a query beginning with SELECT DISTINCT it tells me that the query is written poorly and DISTINCT is used to clean up the mess.

  • Aaron N. Cutshall - Tuesday, March 27, 2018 6:05 AM

    Ignoring the missing OVER() clause, this is still a good example.  I have seen way too many situations where DISTINCT was thrown into a query without regard to the consequences.  What really kills me is a DISTINCT placed on an outer query when it was already used on an inner query -- totally unnecessary ignoring the fact that it shouldn't have been used in the first place.  Whenever I see a query beginning with SELECT DISTINCT it tells me that the query is written poorly and DISTINCT is used to clean up the mess.

    Concur in 99.999% of all cases...
    The only case where i have found it necessary is where the data provided (usually by an external source, e.g. a web service) contains duplicates.
    Even in those situations, i would rather use group by...

    ____________________________________________
    Space, the final frontier? not any more...
    All limits henceforth are self-imposed.
    “libera tute vulgaris ex”

  • I like this question, thanks for submitting it.


    [font="Tahoma"]Personal blog relating fishing to database administration:[/font]

    [font="Comic Sans MS"]https://davegugg.wordpress.com[/url]/[/font]

  • Stewart "Arturius" Campbell - Tuesday, March 27, 2018 6:36 AM

    Concur in 99.999% of all cases...

    The only case where i have found it necessary is where the data provided (usually by an external source, e.g. a web service) contains duplicates.
    Even in those situations, i would rather use group by...

    True, but even then in those cases there are far more efficient methods of ignoring duplicates.  In most cases duplicates can be identified on key fields.  It's not very often that you have to compare each and every field like the DISTINCT operator does.

  • Aaron N. Cutshall - Tuesday, March 27, 2018 9:32 AM

    True, but even then in those cases there are far more efficient methods of ignoring duplicates.  In most cases duplicates can be identified on key fields.  It's not very often that you have to compare each and every field like the DISTINCT operator does.

    agreed...
    recently had to work on a very badly performing proc. it was running a distinct in a subquery and the main query, both returning several million rows.
    removing the distincts and adding group by's for the necessary columns drastically improved performance.

    ____________________________________________
    Space, the final frontier? not any more...
    All limits henceforth are self-imposed.
    “libera tute vulgaris ex”

  • A nice thought-provoking question!

    The answer really does depend, though. Under some conditions the first two queries got the same query plan on my system (and performed equivalently), and under some circumstances they got different plans (sometimes with the performance difference intended by the question, sometimes not).

    Still, it's interesting to think about and test. At least that the third option is substantially more costly seems clear; perhaps the question should have been "Which of these query forms should be avoided?" 🙂

    Cheers!

Viewing 10 posts - 1 through 9 (of 9 total)

You must be logged in to reply to this topic. Login to reply