StatsPlots & DataFrames: grouping by multiple columns

I want to create a box plot of data in a DataFrame grouped by more than one variable (column value). Is there a compact high-level way to do this? The following does not work:

using DataFrames, StatsPlots

dfa = DataFrame(a = [0, 1, 0, 1, 0, 1, 0, 1],
                b = [1, 1, 1, 1, 2, 2, 2, 2],
                dat = randn(8))

@df dfa boxplot(:dat, group=([:a, :b]))

Not sure if this is what you need, but just in case:

using DataFrames, StatsPlots

dfa = DataFrame(a = [0, 1, 0, 1, 0, 1, 0, 1],
b = [1, 1, 1, 1, 2, 2, 2, 2],
dat = randn(8))

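gdf = groupby(dfa, [:a, :b])
plot(legend=:outertopright)
# Add one box per (a, b) group, labelled with its key, e.g. "(a = 0, b = 1)".
# Using the key directly avoids a global counter, which fails inside a loop in script files.
for (k, v) in pairs(gdf)
    lbl = string(NamedTuple(k))
    @df v boxplot!(:dat, label=lbl)
end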
Plots.current()
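
In case it helps to see what the loop above iterates over, here is a rough Base-only sketch of the same split, with no DataFrames involved. It is only an illustration of the idea, not how groupby is implemented; the fixed `dat` values and the `groups` dictionary are my own stand-ins so the result is reproducible.

```julia
# Base-only sketch: split `dat` by the (a, b) pair of each row.
a   = [0, 1, 0, 1, 0, 1, 0, 1]
b   = [1, 1, 1, 1, 2, 2, 2, 2]
dat = collect(1.0:8.0)   # fixed values instead of randn(8) (assumption, for reproducibility)

# One entry per distinct (a, b) combination, keyed by a NamedTuple like the group keys above.
groups = Dict{NamedTuple{(:a, :b), Tuple{Int, Int}}, Vector{Float64}}()
for (ai, bi, di) in zip(a, b, dat)
    push!(get!(groups, (a = ai, b = bi), Float64[]), di)
end

# Each key can be turned into a legend label with string(key).
labels = sort([string(k) for k in keys(groups)])
```

Each of the 4 entries in `groups` corresponds to one box in the plot, and `string(key)` gives the same kind of label used in the legend.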


AFAICT @df dfa boxplot(:dat, group=(:a, :b)) should work, and it’s a bug if it doesn’t.

It does not work properly: although there are 4 categories (which is accurately reflected in the legend), it generates 8 bars, one for each data point.

Someone kindly gave an answer on GitHub - the first argument needs to be the labels of the groups:

@df dfa boxplot(string.(tuple.(:a, :b)), :dat, group=(:a, :b))
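
For anyone wondering what that first argument evaluates to: inside @df the symbols :a and :b stand for the columns, so the expression broadcasts over the rows. A small sketch with plain vectors in place of the columns (Base only, the variable names are just for illustration):

```julia
# Plain vectors standing in for the :a and :b columns of dfa.
a = [0, 1, 0, 1, 0, 1, 0, 1]
b = [1, 1, 1, 1, 2, 2, 2, 2]

# One label per row: pair up the two values, then convert the tuple to a string.
labels = string.(tuple.(a, b))

# Rows with the same (a, b) combination get the same label.
distinct = unique(labels)
```

Here `labels` has 8 entries but only 4 distinct values ("(0, 1)", "(1, 1)", "(0, 2)", "(1, 2)"), which matches the 4 groups in the legend.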
