Hi there,
It seems like violin plots truncate the support of the data. My understanding is that violin plots are kernel density estimates of the data but I’m struggling to find more information about the actual details.
Thanks in advance!
Miguel.
Looking at the code, I think that by default the violin plots in StatsPlots.jl don’t go beyond the minimum and maximum values in the dataset. To change that default, we should pass the keyword argument trim=false to violin().
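To illustrate what trimming means for a kernel density estimate, here is a minimal sketch in plain Julia. It is not the StatsPlots.jl implementation: the Gaussian kernel, the Silverman bandwidth, the 3-bandwidth padding, and the names gaussian_kde, trimmed_grid, and full_grid are all assumptions made for illustration.

```julia
using Statistics  # standard library: std

# Gaussian kernel density estimate of `data`, evaluated at each point of `grid`.
# The bandwidth `h` defaults to Silverman's rule of thumb (an assumption of this sketch).
function gaussian_kde(data, grid; h = 1.06 * std(data) * length(data)^(-1/5))
    n = length(data)
    return [sum(exp(-0.5 * ((g - d) / h)^2) for d in data) / (n * h * sqrt(2π)) for g in grid]
end

# Made-up sample with one large value.
data = [0.0, 0.1, 0.2, 0.9, 1.0, 1.1, 5.0]
h = 1.06 * std(data) * length(data)^(-1/5)

# Trimmed grid: stops exactly at the extremes of the data.
trimmed_grid = range(minimum(data), maximum(data); length = 200)
# Untrimmed grid: extends three bandwidths past the extremes so the tails are drawn.
full_grid = range(minimum(data) - 3h, maximum(data) + 3h; length = 200)

trimmed_density = gaussian_kde(data, trimmed_grid)
full_density = gaussian_kde(data, full_grid)

# Approximate area under each curve (Riemann sum).
trimmed_area = sum(trimmed_density) * step(trimmed_grid)
full_area = sum(full_density) * step(full_grid)
println("area (trimmed) = ", trimmed_area, ", area (untrimmed) = ", full_area)
```

In this sketch the trimmed curve still starts at the minimum of the data, so trimming alone would not explain a violin that begins above the minimum.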
Thanks for the answer! I will look more closely at the code. The problem in my case is that it's truncating the lower support above the minimum. It might be an issue with outliers.
In a quick test comparison with boxplot I could not see any problem as the (default) violin plot extended up to the outliers.
Yep, it might only be failing in my case, where the data is not well behaved since I have some very large outliers. I cannot provide an MWE, but just to illustrate:
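using StatsPlots  # provides @df, boxplot, and violin!
# Boxplot without outlier markers, then the violins overlaid on the same axes
@df df boxplot(:time_period, :max_down_speed, legend = false, title = "CAF II model", showaxis = :y, outliers = false)
@df df violin!(:time_period, :max_down_speed)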
Gives:
The 4 violin plots on the right not overlapping much with the boxplots does seem strange. Check the ECDFs to see if either the violin or box plots are off.
I know for a fact that my data contains zeros for every category and this is clearly not reflected in the violin plots which I think are off.
The violin plots could be very thin rather than truncated if the point density drops off sharply there; note how the middle 4 have smaller diamonds above the main violin. However, that seems to be contradicted by the boxplot’s quartiles, so I suggest checking the ECDFs to eyeball the density and quartiles. ECDFs don’t look good, but a simple and transparent line plot of sort(x) against (1:length(x)) ./ length(x) dodges the options, cutoffs, and possible implementation issues of the boxplot or violin plot. You should be able to tell which is wrong, if either.
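The ECDF check suggested above could be sketched as follows. The helper name ecdf_points and the sample vector x are hypothetical; x only stands in for one category of the real data, which is said to contain zeros.

```julia
# Return the points of the empirical CDF of `x`: sorted values and cumulative fractions.
function ecdf_points(x)
    n = length(x)
    return sort(x), (1:n) ./ n
end

# Hypothetical stand-in for one category of the real data (which includes zeros).
x = [0.0, 0.0, 3.0, 10.0, 25.0, 25.0, 100.0, 1000.0]
xs, ps = ecdf_points(x)

# Fraction of observations at or below zero; the ECDF reaches this height at x = 0.
frac_zero = count(<=(0), x) / length(x)
println("minimum = ", first(xs), ", fraction <= 0: ", frac_zero)

# With Plots.jl loaded, the curve could be drawn with: plot(xs, ps, seriestype = :steppost)
```

If the ECDF already has visible height at zero while the violin starts well above it, that would point at the violin rather than the data.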
To complement the answer above, you could also try with Makie to compare?
using GLMakie
function plot_stats_line!(ax, data, xpos)
    xs = fill(xpos, length(data))
    violin!(ax, xs, data)
    boxplot!(ax, xs, data; color=Makie.wong_colors(0.7)[2])
end
function plot_stats(data) # plot the distribution of vector `data`
    f = Figure()
    ax = Axis(f[1,1])
    plot_stats_line!(ax, data, 1) # call this with different values of xpos to put multiple lines on the same plot
    f
end
If you take something with large outliers like
data = randn(2000)
append!(data, 30*rand() for _ in 1:10);
then calling plot_stats(data) will show:
where it may look like the violin truncates the support, but if you zoom in you will see that each outlier is actually covered:
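A rough way to see why the tail is nearly invisible without zooming is to compare how many points fall near the mode with how many fall near the outliers. The sketch below uses unit-width bins as a crude stand-in for the kernel density estimate; the seed and the cutoff of 5 are arbitrary assumptions.

```julia
using Random  # standard library: Random.seed!

Random.seed!(1)  # arbitrary seed so the sketch is reproducible
data = randn(2000)
append!(data, 30 * rand() for _ in 1:10)  # same construction as above

# Count points per unit-width bin between floor(min) and ceil(max).
lo, hi = floor(Int, minimum(data)), ceil(Int, maximum(data))
counts = [count(x -> b <= x < b + 1, data) for b in lo:(hi - 1)]

# Compare the most populated bin with the bins starting at 5 or above.
peak = maximum(counts)
tail = [c for (b, c) in zip(lo:(hi - 1), counts) if b >= 5]
println("peak bin count: ", peak)
println("largest bin count above 5: ", isempty(tail) ? 0 : maximum(tail))
```

Since the width of a violin is proportional to the estimated density, a region holding a handful of points next to a peak holding hundreds is drawn as a very thin line.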
Of course, if there is a bug in StatsPlots, it should be reported there regardless.