Groupreduce of DataFrame with empty group

ufechner7 · October 7, 2019, 2:54pm

I have a DataFrame df:

df = DataFrame(a = repeat([1, 2, 3, 4], outer=[2]),
                      b = repeat([2, 1], outer=[4]),
                      c = 1:8);

I want to apply the following operation:

operator=(:a => sum, :b => maximum)

to get the following result:

df_summary = DataFrame(a=sum(df.a), b=maximum(df.b))

My question: How can I apply the operator in a generic way to get the correct result? I cannot use the function “by”, because it fails when the group_by clause is empty.

nilshg · October 7, 2019, 3:13pm

I’m not sure I understand - you want to group reduce but the whole DataFrame is in the same group? Could you do

df = DataFrame(a = repeat([1, 2, 3, 4], outer=[2]),
                      b = repeat([2, 1], outer=[4]),
                      c = 1:8, d = 1)

by(df, :d, a_sum = :a => sum, b_max = :b => maximum)

ufechner7 · October 7, 2019, 3:16pm

This doesn’t work:

julia> by(df, :d, a_sum = :a => sum, b_max = :b => maximum)
ERROR: ArgumentError: column name :d not found in the data frame; existing most similar names are: :a, :b and :c
Stacktrace:

nilshg · October 7, 2019, 3:17pm

Sorry I should have been clearer in my post - I added d = 1 to the call to the constructor of df, so you’d have to copy/paste my whole example!

ufechner7 · October 7, 2019, 3:21pm

Ok, this works, but it is not so nice. Firstly I need to extend the ordinal DataFrame with a column, choose a column name where I am sure it does not yet exist, than execute “by” and finally remove the redundant column.

Might cost some time and memory if the original dataframe has a million rows.

ufechner7 · October 7, 2019, 3:34pm

Using now this code:

df = DataFrame(a = repeat([1, 2, 3, 4], outer=[2]),
                      b = repeat([2, 1], outer=[4]),
                      c = 1:8);

operator=(:a => sum, :b => maximum)

df_summary = DataFrame()
for pair in operator
    df_summary[!, pair[1]] = [pair[2](df[!, pair[1]])]
end

Seams to work, didn’t benchmark it yet.

nilshg · October 7, 2019, 3:39pm

Yes something similar would have been my next suggestion - I wrote something quite similar albeit with slightly more complex column naming:

function dfreduce(df, f_pairs)
    r = DataFrame(x = f_pairs[1][2](df[!, f_pairs[1][1]]))
    names!(r, [Symbol(f_pairs[1][1], "_", f_pairs[1][2])])
    if length(f_pairs) > 1
        for e ∈ f_pairs[2:end]
            r[!, Symbol(e[1], "_", e[2])] .= e[2](df[!, e[1]])
        end
    end
    return r
end

function add_col_reduce(df, f_pairs)
    df[!, :for_grouping] .= 1
    by(df, :for_grouping, f_pairs)
end

using BenchmarkTools
@btime dfreduce(df, operator) # 10.499 μs (71 allocations: 5.06 KiB)
@btime add_col_reduce(df, operator) #   35.200 μs (183 allocations: 13.30 KiB)

ufechner7 · October 7, 2019, 3:52pm

Thank you!
I am nearly happy with my solution, but discussion with you helped!
Only readability should be improved.

Nevertheless I think the “by” function should handle the case of an empty group_by automatically. I might create an issue for that.

Topic		Replies	Views
Recommended equivalent to map / broadcast on GroupedDataFrame Data	4	517	April 5, 2024
DataFrame groupby-aggregate strategy General Usage question	7	5910	June 27, 2020
DataFrame `by` function error New to Julia	8	1258	June 1, 2018
Group DataFrames by a function of a column Data package	4	1206	December 11, 2019
DataFrames.jl: keep group by result in every row General Usage dataframes	4	1685	February 9, 2020

Groupreduce of DataFrame with empty group

Related topics