Show significance on boxplots

I want to annotate the statistical significance of the difference between a pair of boxplots on my figure, as one of 'n.s.', '*', '**', '***'. An example image of what I want is below; the closest analogue I can find is ggsignif in R: https://github.com/const-ae/ggsignif.

[Image: example of the desired significance annotation]

Does a functionality like this exist in StatsPlots.boxplot?

If no such functionality exists, I can easily get one of 'n.s.', '*', '**', '***' from the p-value of a given hypothesis test in HypothesisTests. However, I’m struggling to annotate the plot correctly. I could use annotate!(x, y, text("*", :center, 8)), but I’m not sure how I’d know which x, y to pick so that the text sits correctly above the box. Does anyone have any suggestions?
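
For reference, the label step could look something like the sketch below. The function name `significance_label` is made up for illustration, and the cut-offs 0.05 / 0.01 / 0.001 are just the common convention, so adjust them to whatever your field uses. The p-value itself would come from something like `pvalue(test)` in HypothesisTests.

```julia
# Map a p-value to a significance label.
# The cut-offs 0.05 / 0.01 / 0.001 are an assumed (conventional) choice.
function significance_label(p::Real)
    0 <= p <= 1 || throw(ArgumentError("p-value must lie in [0, 1], got $p"))
    p < 0.001 && return "***"
    p < 0.01  && return "**"
    p < 0.05  && return "*"
    return "n.s."
end

significance_label(0.2)     # "n.s."
significance_label(0.03)    # "*"
significance_label(0.0004)  # "***"
```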

Wouldn’t a color with a legend be more intuitive for people who have never come across this “symbol jargon”? I have used boxplots for a while and have never come across ***, n.s., etc. in papers.

Yeah colour’s a really nice idea, I’ll look at that. Unfortunately this notation is the standard in my field (developmental biology), so I can’t really dismiss it.

You can annotate StatsPlots boxplot as follows:

using StatsPlots, DataFrames
using Statistics  # provides mean, used for the label position
theme(:ggplot2)

# INPUT DATA:
X = ["setosa", "versicolor","virginica"]
Y = [rand(-3:3, 10) for _ in 1:length(X)]
X2 = [fill(x,length(y)) for (x,y) in zip(X,Y)]
df = DataFrame(X = X2, Y = Y)

# PLOT DATA:
p = @df df boxplot(:X, :Y, c=:black, fillcolor=:white, legend=false)

# Leave headroom above the current y-range for the annotation
ymin, ymax = ylims(p)
dy = (ymax - ymin)/25
ymax += dy
xt = xticks(p[1])[1]    # x positions of the boxes
# Horizontal line spanning boxes 2 and 3, then the label centred above it
plot!(xt[2:3], [ymax,ymax], c=:black, ylims = (ymin, ymax + dy))
annotate!(mean(xt[2:3]), ymax + dy/2, text("***",  10))

The result is:
StatsPlot_Boxplot_annotate_pair

Thanks, and how could I get a second bar like this one in red?

[Image: boxplot with a second significance bar drawn in red]

Your solution’s great for annotating a comparison with the highest box, but I’d like to be able to annotate these bars at a consistent height above each box. Can I access the y values for the whiskers in each box?

Would something like this be to your liking?
StatsPlot_Boxplot_annotate_pair

Not really. If the virginica and setosa boxes are adjacent, I’d like the line to sit the same height above the taller of their two whiskers as the line sits above the whisker for versicolor.

Basically, for any pair of boxes b1, b2, I want the line to be at height max{whiskers(b1), whiskers(b2)} + dy, for some dy constant across the plot.
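
To make the rule concrete, here is a rough sketch that works from the raw data rather than from the plot. It assumes the whisker follows the usual 1.5 × IQR convention (largest observation no more than 1.5 × IQR above the third quartile), which is worth checking against whatever the plotting package actually draws. The names `upper_whisker` and `line_height` are made up for illustration.

```julia
using Statistics

# Upper whisker of one box: the largest observation that lies no more than
# k * IQR above the third quartile (k = 1.5 is an assumed convention).
function upper_whisker(y::AbstractVector{<:Real}; k::Real = 1.5)
    q1, q3 = quantile(y, [0.25, 0.75])
    limit = q3 + k * (q3 - q1)
    return maximum(filter(v -> v <= limit, y))
end

# Height of the significance line for a pair of boxes:
# dy above the taller of the two upper whiskers, outliers ignored.
line_height(y1, y2, dy) = max(upper_whisker(y1), upper_whisker(y2)) + dy

b1 = [1, 2, 3, 4, 5, 100]   # 100 is an outlier and is ignored
b2 = [1, 2, 3]
line_height(b1, b2, 0.5)
```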

Take my last plot and annotate it by hand, please.

[Image: the previous plot annotated by hand]

Here I’m assuming that setosa has an upper whisker >= virginica. In either case, the line is always dy units above the tallest whisker.

OK, now it is clear. So you do not care about the outlier points displayed beyond the whiskers on the boxplot.

I don’t, no

Annotating as you suggest can lead to situations like the one shown below:
StatsPlot_Boxplot_annotate_pair

If there is a web resource showing more examples it would be helpful.

This recent FEX contribution (Matlab) seems to cover many of those cases

In cases like this I’d re-order the x axis to prevent these things occurring. How did you alter your code to achieve this?

Joaquim, that is the Lamborghini of boxplots. Way beyond my skills and time.

The built-in logic seems to annotate mostly at the top, never going across the data, which doesn’t seem to be sorted:

I guess what I’m really asking for (to save you time) is if there’s a way to get the y values of the whiskers for each plot (excluding outliers)?

The answer is yes:

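p = @df df boxplot(:X, :Y, c=:black, fillcolor=:white, legend=false)
xt = xticks(p[1])[1]    # x positions of the boxes
# (lower, upper) whisker values per box, read from the plotted series.
# The index 3(i-1)+1 relies on the series layout produced by the boxplot recipe,
# which may differ between StatsPlots versions; NaNs are dropped before taking the extrema.
yminmax = [extrema(filter(!isnan, p[1][3(i-1)+1][:y])) for i in axes(xt,1)]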
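
From there, the line and label positions for any pair of boxes could be computed along these lines. This is only a sketch: `bracket_coords` is a made-up helper, and the numbers below are hypothetical stand-ins for the `xt` and `yminmax` obtained from the plot.

```julia
# Coordinates for a significance line between boxes i and j.
# xt: box x positions; yminmax: (lower, upper) whisker tuples per box;
# dy: constant gap above the taller of the two upper whiskers.
function bracket_coords(xt, yminmax, i::Integer, j::Integer, dy::Real)
    ytop = max(yminmax[i][2], yminmax[j][2]) + dy
    line_x = [xt[i], xt[j]]
    line_y = [ytop, ytop]
    label_xy = ((xt[i] + xt[j]) / 2, ytop + dy / 2)
    return line_x, line_y, label_xy
end

# Hypothetical values standing in for the plot-derived ones.
xt = [0.5, 1.5, 2.5]
yminmax = [(-3.0, 3.0), (-2.0, 1.0), (-3.0, 2.0)]
lx, ly, (tx, ty) = bracket_coords(xt, yminmax, 2, 3, 0.25)

# With the plot from above still current, these could then be drawn with:
#   plot!(lx, ly, c = :black)
#   annotate!(tx, ty, text("***", 10))
```

Since the line is placed dy above the taller whisker only, outlier points beyond the whiskers can still overlap it.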

Wonderful, thank you.