Inner Join creating duplicates in SQL along with Group By

  • I have two datasets, where I would like to join and then group by a particular column

    df1

    version host date      name purpose
    pat a16 12/1/2019 pat hi
    fam a16 12/1/2019 fam hi
    emp a16 12/1/2019 emp hi
    dan a16 12/1/2019 dan hi

    df2

    pc  size free date
    a16 5 1 12/1/2019
    a40 4 3 12/1/2019
    a41 3 1 12/1/2019

    Desired

    host date      purpose pc  size free
    a16 12/1/2019 hi a16 5 1

    Doing

    SELECT 
    sum(size) AS size, sum(free) AS free,
    df1.purpose, df1.date, df1.host
    FROM df1
    JOIN df2 ON
    df1.host = df2.pc AND
    df1.date = df2.date
    GROUP BY df1.purpose, df1.date, df1.host

    The problem is that this join keeps giving me this below with the sum of size being

    size 20 and free 4, when in actuality, the sum of size should be 5 and free should be 1.

    I am trying to reflect the above in a query. Any suggestion is appreciated.

     


    host date purpose pc size free
    a16 12/1/2019 hi a16 5 1
    a16 12/1/2019 hi a16 5 1
    a16 12/1/2019 hi a16 5 1
    a16 12/1/2019 hi a16 5 1
  • Thats down to only joining df1 and df2 on the host/pc column.  You have 4 different versions/names in df1.

    What was to happen should the purpose change to "bye" for example, how is that supposed to be calculated.

    It sounds like your missing a crucial piece of information in df2 to be able to join the two pieces of information together

    But you would need to do is to group the values from df1 first before joining them to df2, something like the below

    SELECT 
    sum(size) AS size, sum(free) AS free,
    df1.purpose, df1.date, df1.host
    FROM (SELECT host, date, purpose FROM df1 GROUP BY host, date, purpose) as groupedDF1
    JOIN df2 ON
    groupeddf1.host = df2.pc AND
    groupeddf1.date = df2.date
    GROUP BY groupeddf1.purpose, groupeddf1.date, groupeddf1.host

Viewing 2 posts - 1 through 2 (of 2 total)

You must be logged in to reply to this topic. Login to reply