Multi Way Summary

bernhard · June 12, 2020, 2:00pm

All I am looking for a simple way to perform a summary on multiple levels

Attached is a screenshot from SAS, which shows what I am trying to achieve (consider the first row as well as the subsequent 26 rows).
Is there any package in Julia that provides this?
By default combine only aggregates on the most granular level, see example.

lnk="https://gist.githubusercontent.com/curran/a08a1080b88344b0c8a7/raw/639388c2cbc2120a14dcf466e85730eb8be498bb/iris.csv"
fi=download(lnk)

using DataFrames
using CSV

df=CSV.read(fi)
agg=combine(groupby(df,[:species,:petal_width]),:sepal_length=>sum)

nilshg · June 12, 2020, 2:54pm

Just to make sure I understand you correctly, your aim is to have

combine(groupby(df,[:species]),:sepal_length=>sum)
combine(groupby(df,[:petal_width]),:sepal_length=>sum)
combine(groupby(df,[:species,:petal_width]),:sepal_length=>sum)

all in one call, and output into one table? If so I’m unaware of a package that would provide this, but it should be relatively easy to roll your own?

bernhard · June 12, 2020, 2:58pm

Yes, this is what I want.
Indeed it is probably not too difficult, but I don’t want to reinvent the wheel.
I hope it is clear, that I plan to call this function with an arbitrary number of arguments which could be of different types.
I think I even may have a snippet somewhere that does this but I was hoping “proper code” exists for this somewhere online. (it has happened to me that I coded something that actually existed in some function I did not know…)

bernhard · June 12, 2020, 4:08pm

As per your suggestion I gave this a try.
I think this suffices for me (for now), although it certainly has potential for improvements.


using DataFrames
using CSV
using IterTools

function multiwayaggregation(df::DataFrame,v::Vector{Symbol},cs::Union{Pair, typeof(nrow), DataFrames.ColumnIndex, DataFrames.MultiColumnIndex}...)
    res=DataFrame()

    for c in v 
        @assert !(any(ismissing,df[!,c])) #otherwise the appending will not be meaningful, as we set the values to missing for columns which are not considered in the multi way summary
    end
    
    for subsetlength=length(v):-1:0
        for subs in IterTools.subsets(v,subsetlength)
            #@show subs
            if subsetlength==0 
                agg = DataFrames.combine(df,cs...)
            else 
                agg = DataFrames.combine(DataFrames.groupby(df,subs),cs...)
            end
            nonAggregatedVars=setdiff(v,subs)
            
            k=1
            DataFrames.insertcols!(agg,k,:_TYPE_ => repeat(vcat(subsetlength),size(agg,1)))
            k+=1
            for addcol in nonAggregatedVars 
                DataFrames.insertcols!(agg,k,addcol => repeat(vcat(missing),size(agg,1)))
                k+=1
            end 
            
            DataFrames.allowmissing!(agg)
            DataFrames.append!(res,agg)            
        end
    end
    
    sort!(res,vcat(:_TYPE_,v))
    return res 
end

lnk="https://gist.githubusercontent.com/curran/a08a1080b88344b0c8a7/raw/639388c2cbc2120a14dcf466e85730eb8be498bb/iris.csv"
fi=download(lnk)

df=CSV.read(fi)

v=[:species,:petal_width]
rs=multiwayaggregation(df,v,:sepal_length=>sum)

Topic		Replies	Views
Dataframes: Split combined result to different columns General Usage dataframes	3	298	December 13, 2021
Easier way to split-apply-combine in DataFrames.jl General Usage dataframes	5	1101	December 14, 2020
Translation groupby and agg and join python to julia General Usage	4	932	April 8, 2021
How to group by multiple data types? General Usage dataframes	4	375	February 22, 2022
Help generating an Expression General Usage metaprogramming , dataframes	3	320	September 25, 2020

Multi Way Summary

Related topics