Best way to do this kind of group by

  • What would be the best way to write a sql query where the result set contains a group by where one column is grouped by one set of columns and another column is grouped by another set of columns and still have the two different group by appear on the same row.

    Lets say there is a table such as follows:

    IF OBJECT_ID('tempdb..#a') IS NOT NULL DROP TABLE #a

    CREATE TABLE #a (Grp_A CHAR(2), Grp_B CHAR(2), Grp_C CHAR(2), Val_A TINYINT, Val_B TINYINT)

    Sample Data:

    INSERT INTO #a SELECT 'A1', 'B1', 'C1', 1, 1

    INSERT INTO #a SELECT 'A1', 'B1', 'C2', 1, 1

    INSERT INTO #a SELECT 'A1', 'B2', 'C3', 1, 1

    INSERT INTO #a SELECT 'A2', 'B2', 'C4', 1, 1

    The desired result set for the above would be:

    SELECT A1.Grp_A, A1.Grp_B, A1.Grp_C, SUM(Val_A) AS GrpBy_ABC, A2.GrpBy_AB

    FROM #a A1

    INNER JOIN (

    SELECT A.Grp_A, A.Grp_B, SUM(Val_B) AS GrpBy_AB FROM #a A GROUP BY A.Grp_A, A.Grp_B

    ) AS A2

    ON (A1.Grp_A = A2.Grp_A AND A1.Grp_B = A2.Grp_B)

    GROUP BY A1.Grp_A, A1.Grp_B, A1.Grp_C, A2.GrpBy_AB

    While the above query gives me the desired result, it basically involves two scans on the base table which gives me the following concern. For example, if the base table contains a few million rows and if there are around 6 Group by columns, it can be a pretty expensive query cos there would be two huge scans and a join involving 6 columns.

    Using grouping sets helps me to do one scan on the base table and get the aggregates.

    SELECT GROUPING_ID(A1.Grp_A, A1.Grp_B, A1.Grp_C) AS GroupId,

    A1.Grp_A, A1.Grp_B, A1.Grp_C,

    SUM(Val_A) AS GrpBy_ABC, SUM(Val_B) AS GrpBy_AB

    FROM #a A1

    GROUP BY GROUPING SETS(

    (A1.Grp_A, A1.Grp_B, A1.Grp_C),

    (A1.Grp_A, A1.Grp_B)

    )

    But the output is not in the desired format since the second GROUPING sET information is not returned on the same row but on a new row. And to get the output into desired format, I will have to do a self join the intermediate output obtained by GROUPING SETS which in essence results in two more 2 huge scans and ends up being worse than the above query using self join on the base table.

    So what can be the best way to approach this query? This would be part of a Load query, so while there is no requirement to have the query run in a few seconds, I would still like to do it in the best way possible.

    P.S. Do other DBMS or SQL dialects allow for such a query to be written? Basically something like this;

    SELECT A.Grp_A, A.Grp_B, A.Grp_C,

    SUM(Val_A GROUP BY A.Grp_A, A.Grp_B, A.Grp_C) AS GrpBy_ABC,

    SUM(Val_B GROUP BY A.Grp_A, A.Grp_B) AS GrpBy_AB

    FROM #a A

    In essence, there is only one scan and there is no need for a self join.

  • Probably not the best way, but seems to work

    SELECT DISTINCT

    Grp_A, Grp_B, Grp_C,

    SUM(Val_A) OVER(PARTITION BY Grp_A, Grp_B, Grp_C) AS GrpBy_ABC,

    SUM(Val_B) OVER(PARTITION BY Grp_A, Grp_B) AS GrpBy_AB

    FROM #a

    ____________________________________________________

    Deja View - The strange feeling that somewhere, sometime you've optimised this query before

    How to get the best help on a forum

    http://www.sqlservercentral.com/articles/Best+Practices/61537
  • Mark-101232 (9/9/2013)


    Probably not the best way, but seems to work

    SELECT DISTINCT

    Grp_A, Grp_B, Grp_C,

    SUM(Val_A) OVER(PARTITION BY Grp_A, Grp_B, Grp_C) AS GrpBy_ABC,

    SUM(Val_B) OVER(PARTITION BY Grp_A, Grp_B) AS GrpBy_AB

    FROM #a

    So does this:

    SELECT

    d.Grp_A, d.Grp_B, d.Grp_C,

    d.SUM_Val_A,

    SUM_Val_B = SUM(d.PartSUM_Val_B) OVER (PARTITION BY d.Grp_A, d.Grp_B)

    FROM (

    SELECT

    A1.Grp_A, A1.Grp_B, A1.Grp_C,

    SUM(Val_A) AS SUM_Val_A,

    SUM(Val_B) AS PartSUM_Val_B

    FROM #a A1

    GROUP BY A1.Grp_A, A1.Grp_B, A1.Grp_C

    ) d

    both queries read the base table once then use table spools for the dirty work.

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • Mark and Chris...thank you for your input. I havent yet got a chance to test them; but they do look promising.

Viewing 4 posts - 1 through 3 (of 3 total)

You must be logged in to reply to this topic. Login to reply