Summarize all or selected columns in grouped DataFrame

Dear community,

how can I summarize all columns of a GroupedDataFrame with DataFramesMeta?

Like this in DataFrames:

df = DataFrame(rand(20,5), :auto)
df[!, :gr] = repeat(1:4; inner=5)
dsum=combine(groupby(df, :gr), names(df) .=> (x->sum(log.(x))) .=> names(df).*"_logsum" )

Additionally (both for DataFrames and DataFramesMeta), how could I select columns by type for a summary (e.g. because character columns would not work with log and sum)?
Basically I am looking for equivalents of dplyr's summarize_if and summarize_at.

I am new to Julia and would be grateful for any hints beyond the standard documentation.

Thanks!

You can do names(df, Number) to get all numeric columns.

julia> df = DataFrame(a = [1], b = ["a string"]);

julia> combine(df, names(df, Number) => (t -> sum(log.(t))))
1×1 DataFrame
 Row │ a_function 
     │ Float64    
─────┼────────────
   1 │        0.0

You can also pass custom functions to describe:

julia> df = DataFrame(a = [1], b = ["a string"])
1×2 DataFrame
 Row │ a      b        
     │ Int64  String   
─────┼─────────────────
   1 │     1  a string

julia> describe(df, (t -> sum(log.(t))) => :sum_log)
2×2 DataFrame
 Row │ variable  sum_log 
     │ Symbol    Union…  
─────┼───────────────────
   1 │ a         0.0
   2 │ b                 

DataFramesMeta.jl was designed with the intention that column names are passed explicitly. If you want, you can also pass it an arbitrarily complex expression by escaping it with $ (though this is useful only in complex cases; here it is just noise, and using combine directly is probably better):

@combine(groupby(df, :gr), $(names(df) .=> (x->sum(log.(x))) .=> x -> x * "_logsum" ))

(Note that this also shows that you can pass a function that renames your input columns.)

As @pdeffebach commented, selecting columns of a specified type is done with the names function.
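To tie the two answers together, here is a minimal sketch of how the summarize_if and summarize_at patterns from the question might look in plain DataFrames.jl. The column names x1, x2, x3 and label, and the result names by_type and by_name, are made up for illustration. It assumes a DataFrames.jl version that supports the renaming-function form shown above.

```julia
using DataFrames

df = DataFrame(rand(20, 3), :auto)      # Float64 columns x1, x2, x3
df.gr = repeat(1:4; inner = 5)          # integer grouping column
df.label = string.("id", 1:20)          # a String column that log cannot handle

# summarize_if-like: pick every Float64 column by element type,
# so neither the grouping column nor the String column is touched
by_type = combine(groupby(df, :gr),
                  names(df, Float64) .=> (x -> sum(log.(x))) .=> (n -> n * "_logsum"))

# summarize_at-like: pass an explicit list of column names instead
by_name = combine(groupby(df, :gr),
                  ["x1", "x3"] .=> (x -> sum(log.(x))) .=> (n -> n * "_logsum"))
```

Both calls return one row per group; only the way the source columns are chosen differs.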

Thanks a lot! - I learned at least 5 helpful things, also the hint with the function to rename was very useful.

I should have looked at the names documentation more closely. The possibilities with using eachcol seem powerful. More than R?

In the pull request "add an option to pass multiple column selectors to names" by bkamins (Pull Request #3224, JuliaData/DataFrames.jl, GitHub) we are discussing how to improve this functionality in the future.

Thanks! - I do like your suggestion: add an option to pass multiple column selectors to names by bkamins · Pull Request #3224 · JuliaData/DataFrames.jl · GitHub

You could also have a look at DataFrameMacros.jl (GitHub - jkrumbiegel/DataFrameMacros.jl: Macros that simplify working with DataFrames.jl), which can run expressions over column sets. In your case that would be @combine(gdf, "{}_logsum" = sum(log.({All()}))), I think.


It seems DataFramesMeta and DataFrameMacros are exclusive, i.e. one cannot use them together?
What is nice about DataFramesMeta, IMO, is that it is not always row-wise, allowing something like x .- mean(x),
while what you describe and the @group_by(iseven(:b)) are nice in DataFrameMacros.

If they are currently exclusive, is there a chance to merge them?
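For readers unfamiliar with the distinction, a small sketch of the column-wise behaviour mentioned above, using DataFramesMeta. The data and the names x_centered and x_double are invented for illustration. @transform works on whole columns, while @rtransform is the row-wise variant.

```julia
using DataFrames, DataFramesMeta, Statistics

df = DataFrame(gr = repeat(1:2; inner = 3),
               x  = [1.0, 2.0, 3.0, 10.0, 20.0, 30.0])

# Column-wise: :x is the whole column of each group, so mean(:x) is the group mean
centered = @transform(groupby(df, :gr), :x_centered = :x .- mean(:x))

# Row-wise: inside @rtransform, :x stands for a single value of the current row
doubled = @rtransform(df, :x_double = 2 * :x)
```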

You can switch to by-column operation in DataFrameMacros as well. It's kind of the other way around because I found that, in my own use, transform-style commands are row-wise more often.

You cannot really merge them, as they make some fundamentally different choices in their implementation, even though the surface syntax is similar (as both just use the DataFrames verbs as macros). I think one of the biggest differences is that everything in DataFrameMacros is automatically broadcast, which gives you these multi-column operations.
