All I am looking for a simple way to perform a summary on multiple levels
Attached is a screenshot from SAS, which shows what I am trying to achieve (consider the first row as well as the subsequent 26 rows).
Is there any package in Julia that provides this?
By default combine only aggregates on the most granular level, see example.
lnk="https://gist.githubusercontent.com/curran/a08a1080b88344b0c8a7/raw/639388c2cbc2120a14dcf466e85730eb8be498bb/iris.csv"
fi=download(lnk)
using DataFrames
using CSV
df=CSV.read(fi)
agg=combine(groupby(df,[:species,:petal_width]),:sepal_length=>sum)
Yes, this is what I want.
Indeed it is probably not too difficult, but I don’t want to reinvent the wheel.
I hope it is clear, that I plan to call this function with an arbitrary number of arguments which could be of different types.
I think I even may have a snippet somewhere that does this but I was hoping “proper code” exists for this somewhere online. (it has happened to me that I coded something that actually existed in some function I did not know…)
As per your suggestion I gave this a try.
I think this suffices for me (for now), although it certainly has potential for improvements.
using DataFrames
using CSV
using IterTools
function multiwayaggregation(df::DataFrame,v::Vector{Symbol},cs::Union{Pair, typeof(nrow), DataFrames.ColumnIndex, DataFrames.MultiColumnIndex}...)
res=DataFrame()
for c in v
@assert !(any(ismissing,df[!,c])) #otherwise the appending will not be meaningful, as we set the values to missing for columns which are not considered in the multi way summary
end
for subsetlength=length(v):-1:0
for subs in IterTools.subsets(v,subsetlength)
#@show subs
if subsetlength==0
agg = DataFrames.combine(df,cs...)
else
agg = DataFrames.combine(DataFrames.groupby(df,subs),cs...)
end
nonAggregatedVars=setdiff(v,subs)
k=1
DataFrames.insertcols!(agg,k,:_TYPE_ => repeat(vcat(subsetlength),size(agg,1)))
k+=1
for addcol in nonAggregatedVars
DataFrames.insertcols!(agg,k,addcol => repeat(vcat(missing),size(agg,1)))
k+=1
end
DataFrames.allowmissing!(agg)
DataFrames.append!(res,agg)
end
end
sort!(res,vcat(:_TYPE_,v))
return res
end
lnk="https://gist.githubusercontent.com/curran/a08a1080b88344b0c8a7/raw/639388c2cbc2120a14dcf466e85730eb8be498bb/iris.csv"
fi=download(lnk)
df=CSV.read(fi)
v=[:species,:petal_width]
rs=multiwayaggregation(df,v,:sepal_length=>sum)