January 20, 2021 at 10:00 am
I have two datasets, where I would like to join and then group by a particular column
df1
version host date name purpose
pat a16 12/1/2019 pat hi
fam a16 12/1/2019 fam hi
emp a16 12/1/2019 emp hi
dan a16 12/1/2019 dan hi
df2
pc size free date
a16 5 1 12/1/2019
a40 4 3 12/1/2019
a41 3 1 12/1/2019
Desired
host date purpose pc size free
a16 12/1/2019 hi a16 5 1
Doing
SELECT
sum(size) AS size, sum(free) AS free,
df1.purpose, df1.date, df1.host
FROM df1
JOIN df2 ON
df1.host = df2.pc AND
df1.date = df2.date
GROUP BY df1.purpose, df1.date, df1.host
The problem is that this join keeps giving me this below with the sum of size being
size 20 and free 4, when in actuality, the sum of size should be 5 and free should be 1.
I am trying to reflect the above in a query. Any suggestion is appreciated.
host date purpose pc size free
a16 12/1/2019 hi a16 5 1
a16 12/1/2019 hi a16 5 1
a16 12/1/2019 hi a16 5 1
a16 12/1/2019 hi a16 5 1
January 20, 2021 at 10:49 am
Thats down to only joining df1 and df2 on the host/pc column. You have 4 different versions/names in df1.
What was to happen should the purpose change to "bye" for example, how is that supposed to be calculated.
It sounds like your missing a crucial piece of information in df2 to be able to join the two pieces of information together
But you would need to do is to group the values from df1 first before joining them to df2, something like the below
SELECT
sum(size) AS size, sum(free) AS free,
df1.purpose, df1.date, df1.host
FROM (SELECT host, date, purpose FROM df1 GROUP BY host, date, purpose) as groupedDF1
JOIN df2 ON
groupeddf1.host = df2.pc AND
groupeddf1.date = df2.date
GROUP BY groupeddf1.purpose, groupeddf1.date, groupeddf1.host
Viewing 2 posts - 1 through 1 (of 1 total)
You must be logged in to reply to this topic. Login to reply