I have the following df that I apply a transformation to. It only returns columns y and x, however. Is there a way to also return the corresponding values in columns a and b?
Sorry, I should have clarified: I don't want to group by :a and :b. I only want the corresponding values from those columns when I group by :y and then find the maximum value of :x for each unique value of :y.
using DataFrames, Statistics
df |>
x -> groupby(x, :y) |>
x -> subset(x, :x => (x -> x .== maximum(x)))
2×4 DataFrame
 Row │ y      x      a      b
     │ Int64  Int64  Int64  Int64
─────┼────────────────────────────
   1 │     1      4      1      2
   2 │     2      6      1      2
Or using TidierData:
using TidierData
using Statistics
@chain df begin
@group_by(y)
@filter(x == maximum(x))
@ungroup
end
# The ungroup is required because only `@summarize()` operations ungroup data frames automatically.
2×4 DataFrame
 Row │ y      x      a      b
     │ Int64  Int64  Int64  Int64
─────┼────────────────────────────
   1 │     1      4      1      2
   2 │     2      6      1      2
This will not work properly if the a and b fields are not constant as in this example (as the rows won't be replicas).
If multiple rows attain the maximum value of x, it might output more than one row per group. It isn't clear what the OP wants, but it may pose a problem.
Another method (which isn't sensitive to the above issues):
julia> DataFrame(last(g) for g in groupby(sort(df, [:y, :x]),:y))
2×4 DataFrame
 Row │ y      x      a      b
     │ Int64  Int64  Int64  Int64
─────┼────────────────────────────
   1 │     1      4      1      2
   2 │     2      6      1      2
(But this depends on groupby preserving the sorted row order within each group, which seems to work, but I'm not sure it is guaranteed.)
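If the goal is exactly one row per group without relying on row order after a sort, one option might be to index each subgroup at the argmax of x. This is only a minimal sketch on made-up data (the df below is hypothetical, not the OP's). Ties resolve to the first row that reaches the maximum, which is itself an arbitrary choice.

```julia
using DataFrames

# Hypothetical data (not the OP's): a and b vary within groups,
# and the group y == 1 has a tie at x == 4.
df = DataFrame(y = [1, 1, 1, 2, 2],
               x = [4, 2, 4, 6, 5],
               a = [10, 20, 30, 40, 50],
               b = [1, 2, 3, 4, 5])

# For each group, keep the single row at the first index where x is maximal.
# argmax returns the first such index, so ties go to the earliest row.
result = combine(sdf -> sdf[argmax(sdf.x), :], groupby(df, :y))
```

Here a and b come from the same row as the selected x, so they stay consistent with it even when they are not constant within a group.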
I believe that the basic operation of the combine(groupby(df, [:cols…]), …) function limits the output to the key columns and the columns acted upon by the function, because there is no unique criterion for choosing which values of the other columns to output.
In the case in question the problem does not exist since :a and :b have a constant value.
So you could do this to get the desired result.
Yes, certainly. This form is only for the specific case proposed as an example by the OP. This is exactly what I was referring to in the previous comment. Any choice of non-key, non-transformed column values is arbitrary.
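To illustrate the point about which columns combine returns, here is a small sketch on hypothetical data where :a and :b are constant within each group. Using `first` for :a and :b is an assumption made for illustration; any other reducer would be just as arbitrary.

```julia
using DataFrames

# Hypothetical data: a and b are constant, as in the OP's example output
df = DataFrame(y = [1, 1, 2, 2],
               x = [4, 3, 6, 5],
               a = [1, 1, 1, 1],
               b = [2, 2, 2, 2])

gdf = groupby(df, :y)

# Only the key column and the aggregated column come back
only_max = combine(gdf, :x => maximum; renamecols = false)

# Carrying a and b along needs an explicit rule; `first` is an arbitrary one
with_ab = combine(gdf, :x => maximum, [:a, :b] .=> first; renamecols = false)
```

If :a and :b were not constant within a group, `first` would return values that do not necessarily belong to the row holding the maximum of :x.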