How can I show the mean in a groupedboxplot?

Hi all! I have this kind of DataFrame containing data:

julia> df_filtered = filter(:mnemonic => m -> m ∈ selected_instructions, df_temp)
1990×4 DataFrame
  Row │ mnemonic  instruction     forwarding  value       
      │ String    String          Bool        Quantity…   
──────┼───────────────────────────────────────────────────
    1 │ add       add_r0_r2_#0         false  0.0829053 W
    2 │ add       add_r0_r2_#1         false  0.0830969 W
    3 │ add       add_r0_r2_#10        false  0.0831797 W
    4 │ add       add_r0_r2_#100       false   0.083226 W
    5 │ add       add_r0_r2_#101       false  0.0834463 W
    6 │ add       add_r0_r2_#102       false  0.0834131 W
    7 │ add       add_r0_r2_#103       false  0.0836367 W
    8 │ add       add_r0_r2_#104       false  0.0830944 W
    9 │ add       add_r0_r2_#105       false  0.0832901 W
   10 │ add       add_r0_r2_#106       false  0.0832738 W
   11 │ add       add_r0_r2_#107       false  0.0834709 W
   12 │ add       add_r0_r2_#108       false  0.0833354 W
   13 │ add       add_r0_r2_#109       false  0.0835491 W
   14 │ add       add_r0_r2_#11        false   0.083368 W
  ⋮   │    ⋮            ⋮             ⋮            ⋮
 1977 │ mvn       mvn_r6_r2            false  0.0863886 W
 1978 │ mvn       mvn_r6_r3            false  0.0866079 W
 1979 │ mvn       mvn_r6_r4            false  0.0864734 W
 1980 │ mvn       mvn_r6_r5            false  0.0866562 W
 1981 │ mvn       mvn_r6_r6             true   0.108087 W
 1982 │ mvn       mvn_r6_r7            false  0.0870755 W
 1983 │ mvn       mvn_r7_r0            false  0.0864812 W
 1984 │ mvn       mvn_r7_r1            false  0.0866842 W
 1985 │ mvn       mvn_r7_r2            false  0.0866603 W
 1986 │ mvn       mvn_r7_r3            false   0.086884 W
 1987 │ mvn       mvn_r7_r4            false  0.0867469 W
 1988 │ mvn       mvn_r7_r5            false  0.0869432 W
 1989 │ mvn       mvn_r7_r6            false  0.0869205 W
 1990 │ mvn       mvn_r7_r7             true   0.108796 W
                                         1962 rows omitted

I want to produce a boxplot comparing the values with and without “forwarding” active. So far the best I could get is with groupedboxplot(), like this:

p = @df df_filtered groupedboxplot(:mnemonic, :value;
    group = :forwarding,
    # xlabel = "Instructions",
    ylabel = "Measured power",
    label = ["not active" "active"],
    bar_width = 0.7,
    left_margin = 5mm,
    bottom_margin = 3mm,
    size = (1000, 500),
    tickfontsize = 14,
    guidefontsize = 16
)

This produces the grouped boxplot [figure not shown].

Now I’d like to show the mean for each group on the plot, like a scatter point. I see this can be done for single boxplots, however for grouped ones I don’t know how to do it.

The furthest I got is grouping the DataFrame to get the mean of each group:

julia> df_grouped = groupby(df_temp, [:mnemonic, :forwarding]; sort = true);


julia> df_means = combine(df_grouped, :value => mean)
90×3 DataFrame
 Row │ mnemonic  forwarding  value_mean  
     │ String    Bool        Quantity…   
─────┼───────────────────────────────────
   1 │ adc            false  0.0836535 W
   2 │ adc             true  0.0863453 W
   3 │ add            false  0.0839741 W
   4 │ add             true  0.0828457 W
   5 │ and            false  0.0834052 W
   6 │ and             true   0.082732 W
   7 │ asr            false  0.0861718 W
   8 │ asr             true  0.0865888 W
   9 │ b              false   0.170137 W
  10 │ bfc            false  0.0876756 W
  11 │ bfi            false  0.0854044 W
  12 │ bic            false  0.0830427 W
  13 │ bl             false   0.195068 W
  14 │ blx            false  0.0943027 W
  ⋮  │    ⋮          ⋮            ⋮
  77 │ teq            false  0.0844381 W
  78 │ tst            false  0.0842573 W
  79 │ ubfx           false  0.0855511 W
  80 │ ubfx            true  0.0843686 W
  81 │ udiv           false  0.0689451 W
  82 │ udiv            true   0.073053 W
  83 │ umlal          false  0.0848933 W
  84 │ umlal           true  0.0915934 W
  85 │ umull          false   0.083016 W
  86 │ umull           true  0.0831577 W
  87 │ uxtb           false  0.0744491 W
  88 │ uxtb            true  0.0745435 W
  89 │ uxth           false  0.0742926 W
  90 │ uxth            true  0.0743984 W
                          62 rows omitted

Is there a way I can do this? Thanks.

I don’t know of an automatic way, but if it helps, here’s a semi-automatic way.

CODE: add mean to groupedboxplot()
using StatsPlots, DataFrames, Random, Statistics  # Random for seed!, Statistics for mean

Random.seed!(123)
mnemonics = ["add","and","lsl","mvn","rsb"]
df = DataFrame(a=rand(Bool,200), b=rand(mnemonics,200), c=randn(200))

gdf = groupby(df, [:a,:b])
df_means = combine(gdf, :c => mean)

p = @df df groupedboxplot(:b, :c, group=:a, label=["not active" "active"])

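# Map each x tick label to its tick position
dic1 = Dict(xticks(p)[1][2] .=> xticks(p)[1][1])
# Shift each group's marker to the left or right of the tick
dic2 = Dict([false,true] .=> [-0.2, 0.2])
x = getindex.(Ref(dic1), df_means.b) .+ getindex.(Ref(dic2), df_means.a)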

scatter!(x, df_means.c_mean, c=:black, label = "mean")


Thanks, this works! It seems quite laborious though (it is well beyond my Julia knowledge), and it’s a pity that Plots.jl and the recipe ecosystem (like StatsPlots.jl) are IMO so poorly documented.
Since I also changed bar_width in my code, I found that a good solution is to set

scatter_offset = bar_width / 4
dic2 = Dict([false,true] .=> [-scatter_offset, scatter_offset])
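
For reference, the lookup plus the bar_width / 4 offset can be wrapped in a small helper. This is only a sketch: mean_marker_positions is a hypothetical name, not part of StatsPlots, and it assumes exactly two Bool groups and that the tick positions and labels are taken from xticks(p)[1] as in the code above (the values in the example call are made up).

```julia
# Hypothetical helper (not part of StatsPlots): x positions for the mean markers.
# tick_positions, tick_labels: as in xticks(p)[1][1] and xticks(p)[1][2]
# categories, groups: the category and the Bool group of each row of the means table
function mean_marker_positions(tick_positions, tick_labels, categories, groups;
                               bar_width = 0.8)
    offset = bar_width / 4                      # assumes two groups sharing bar_width
    pos = Dict(tick_labels .=> tick_positions)  # tick label => tick position
    return [pos[c] + (g ? offset : -offset) for (c, g) in zip(categories, groups)]
end

# Example with assumed tick values; "mvn" has only a `true` row here
x = mean_marker_positions([0.5, 1.5], ["add", "mvn"],
                          ["add", "add", "mvn"], [false, true, true];
                          bar_width = 0.8)
```

The result can then be passed to scatter!(x, df_means.c_mean, ...) as before. Because each row is looked up by label, this also works when a category is missing one of the two groups.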

I don’t know if you find the following scheme simpler: it pre-sorts the DataFrames so that the positions of the scatter points correspond to the positions of the boxes in a more direct way.

gdf = groupby(df, [:b,:a], sort=true)

df_means = combine(gdf, :c => mean)

p = @df sort(df,[:b,:a]) groupedboxplot(:b, :c, group=:a, label=["not active" "active"])

x = mapreduce(t -> t .+ [-0.2, 0.2], vcat, xticks(p)[1][1])

scatter!(x, df_means.c_mean, c=:black, label = "mean")
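
One caveat with this positional scheme: it emits two x positions per tick, so it only lines up if every category has both a false and a true group. A minimal sketch of the position arithmetic with a guard follows; the tick values and n_means are stand-ins chosen for illustration, not taken from a real plot.

```julia
# Tick positions as they might come from xticks(p)[1][1] (values assumed)
ticks = [0.5, 1.5, 2.5]
offsets = [-0.2, 0.2]        # false group to the left, true group to the right

# Two x positions per tick, in sorted (category, group) order
x = mapreduce(t -> t .+ offsets, vcat, ticks)

# Hypothetical guard: there must be one mean per (category, group) pair
n_means = 6                  # stand-in for nrow(df_means)
@assert length(x) == n_means "a category is missing a group; markers would be misaligned"
```

If some category has only one group, the dictionary lookup from the first answer is the safer choice.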

That is a possibility, thanks! It is a bit easier to move around if I don’t need a specific order for the bins.