I have a DataFrame df:
df = DataFrame(a = repeat([1, 2, 3, 4], outer=[2]),
b = repeat([2, 1], outer=[4]),
c = 1:8);
I want to apply the following operation:
operator=(:a => sum, :b => maximum)
to get the following result:
df_summary = DataFrame(a=sum(df.a), b=maximum(df.b))
My question: How can I apply the operator in a generic way to get the correct result? I cannot use the function “by”, because it fails when the group_by clause is empty.
nilshg
October 7, 2019, 3:13pm
2
I’m not sure I understand - you want to group reduce but the whole DataFrame is in the same group? Could you do
df = DataFrame(a = repeat([1, 2, 3, 4], outer=[2]),
b = repeat([2, 1], outer=[4]),
c = 1:8, d = 1)
by(df, :d, a_sum = :a => sum, b_max = :b => maximum)
This doesn’t work:
julia> by(df, :d, a_sum = :a => sum, b_max = :b => maximum)
ERROR: ArgumentError: column name :d not found in the data frame; existing most similar names are: :a, :b and :c
Stacktrace:
nilshg
October 7, 2019, 3:17pm
4
Sorry I should have been clearer in my post - I added d = 1
to the call to the constructor of df
, so you’d have to copy/paste my whole example!
Ok, this works, but it is not so nice. Firstly I need to extend the ordinal DataFrame with a column, choose a column name where I am sure it does not yet exist, than execute “by” and finally remove the redundant column.
Might cost some time and memory if the original dataframe has a million rows.
Using now this code:
df = DataFrame(a = repeat([1, 2, 3, 4], outer=[2]),
b = repeat([2, 1], outer=[4]),
c = 1:8);
operator=(:a => sum, :b => maximum)
df_summary = DataFrame()
for pair in operator
df_summary[!, pair[1]] = [pair[2](df[!, pair[1]])]
end
Seams to work, didn’t benchmark it yet.
nilshg
October 7, 2019, 3:39pm
7
Yes something similar would have been my next suggestion - I wrote something quite similar albeit with slightly more complex column naming:
function dfreduce(df, f_pairs)
r = DataFrame(x = f_pairs[1][2](df[!, f_pairs[1][1]]))
names!(r, [Symbol(f_pairs[1][1], "_", f_pairs[1][2])])
if length(f_pairs) > 1
for e ∈ f_pairs[2:end]
r[!, Symbol(e[1], "_", e[2])] .= e[2](df[!, e[1]])
end
end
return r
end
function add_col_reduce(df, f_pairs)
df[!, :for_grouping] .= 1
by(df, :for_grouping, f_pairs)
end
using BenchmarkTools
@btime dfreduce(df, operator) # 10.499 μs (71 allocations: 5.06 KiB)
@btime add_col_reduce(df, operator) # 35.200 μs (183 allocations: 13.30 KiB)
Thank you!
I am nearly happy with my solution, but discussion with you helped!
Only readability should be improved.
Nevertheless I think the “by” function should handle the case of an empty group_by automatically. I might create an issue for that.