I am trying to draw a violin plot of forest structure (the distribution of volumes by diameter class).
The problem is that the violin plot tries to interpolate “too much” and draws too many waves:
using DataFrames, Plots, StatPlots
forestVols = wsv"""
diamClass year vol
10 2000 320.62
20 2000 614.55
30 2000 586.75
40 2000 467.26
50 2000 318.31
60 2000 191.39
70 2000 97.13
10 2001 156.65
20 2001 594.46
30 2001 820.27
40 2001 788.55
50 2001 640.59
60 2001 464.36
70 2001 307.13
10 2002 156.65
20 2002 594.46
30 2002 820.27
40 2002 788.55
50 2002 640.59
60 2002 464.36
70 2002 307.13
"""
# Creating a frequency df: one row per unit of volume
forestVols_freq = DataFrame(
    diamClass = [],
    year = []
)
for r in eachrow(forestVols)
    for i in range(1,Int(round(r[:vol])))
        push!(forestVols_freq, [r[:diamClass] r[:year]])
    end
end
forStructPlot = violin(forestVols_freq,:year,:diamClass, side=:both, marker=(0.2,:blue,stroke(0)))
I tried dividing diamClass by 10, but the result is the same (and the violin plot does not seem to work with a vector of strings for the y dimension, although strings are fine for the x dimension).
I did check my forestVols_freq dataframe…
It seems to be linked to the size of the data… if the frequency is very small, as in the example in the documentation, the waves are less pronounced (e.g. if I compute the frequency dataframe with for i in range(1,Int(round(r[:vol]/40)))). That may be OK for continuous data, but it’s a problem for categorical/integer data…
StatPlots.violin passes the density estimation to KernelDensity.kde with a default npoints = 200. Maybe raise the question on KernelDensity.jl if the density estimation is inappropriate?
KernelDensity.jl lets you set the bandwidth parameter manually if the value it finds automatically is inappropriate. In this case I guess the OP would like to plot with a larger bandwidth. It could be interesting to be able to pass keyword arguments to KernelDensity.kde directly from the plot call.
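To make the bandwidth point concrete, here is a minimal sketch of a Gaussian kernel density estimate written by hand. It is only an illustration of the idea and not how KernelDensity.jl computes its estimate; the names gauss, kde_at, narrow and wide are made up for this example, and the bandwidths 1.0 and 8.0 are arbitrary. The sample is built from the year-2000 volumes in the question, scaled down by 40 as in the OP's test.

```julia
# Hand-rolled Gaussian kernel density estimate (illustration only,
# not KernelDensity.jl's implementation).
gauss(u) = exp(-u^2 / 2) / sqrt(2π)

function kde_at(x, data, h)
    # Average of Gaussian bumps of width h centred on each observation.
    return sum(gauss((x - d) / h) for d in data) / (length(data) * h)
end

# Frequency-style sample: each diameter class repeated round(vol / 40) times,
# using the year-2000 volumes from the question.
classes = [10, 20, 30, 40, 50, 60, 70]
vols = [320.62, 614.55, 586.75, 467.26, 318.31, 191.39, 97.13]
data = Float64[]
for (c, v) in zip(classes, vols)
    append!(data, fill(c, round(Int, v / 40)))
end

# Evaluate on a grid that includes the midpoints between classes.
grid = 10:5:70
narrow = [kde_at(x, data, 1.0) for x in grid]  # small bandwidth
wide = [kde_at(x, data, 8.0) for x in grid]    # large bandwidth
```

With the small bandwidth the estimate collapses to almost zero halfway between two diameter classes, which is what shows up as waves in the plot; with the large bandwidth the midpoints stay close to their neighbours and the outline is smooth.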
On the other hand, there seems to be a conceptual issue as IMO kernel density methods don’t make a lot of sense for categorical data.
Passing keyword args to violin to adjust the values passed to kde should be an easy PR for an interested party, and adding extra keywords is appropriate for a series recipe placed in StatPlots.
I did some tests, but apart from the output being really sensitive to the npoints value (and in a nonlinear manner), I couldn’t find a single value for which the length of the horizontal segment is proportional to the frequency of the data…
My suggestion is: if you think StatPlots doesn’t cater sufficiently to the capabilities of kde, open a PR with the functionality you’d like; I know you know that code well. If you think there is a problem with kde, open an issue on KernelDensity.jl.
Which is why kernel density estimation is not what you want for categorical variables: your distribution doesn’t have an underlying smooth density function. This is a conceptual problem which has nothing to do with KernelDensity and StatPlots. In the limit of infinite data, your density would be degenerate, with spikes corresponding to the possible values of the categorical variable.
For distributions of categorical variables, you are probably better off doing a (normalized) histogram, with one bin per value of the categorical variable.
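As a rough sketch of that suggestion, the shares per diameter class can be computed directly from the volumes, without expanding them into a frequency dataframe. The example uses the year-2000 rows from the question; the variable names and the text-mode bar chart are just for illustration.

```julia
# Volumes by diameter class for year 2000, copied from the question.
diam_classes = [10, 20, 30, 40, 50, 60, 70]
vols_2000 = [320.62, 614.55, 586.75, 467.26, 318.31, 191.39, 97.13]

# Normalised histogram: one bin per diameter class, shares summing to 1.
shares = vols_2000 ./ sum(vols_2000)

# Crude text rendering: bar length proportional to the share of each class.
for (d, s) in zip(diam_classes, shares)
    println(rpad(string(d), 4), repeat("#", round(Int, 50 * s)))
end
```

Here the length of each bar is proportional to the frequency by construction, which is the property a kernel density estimate cannot give for discrete values.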
I agree with that (and that’s why I wrote that it was a problem with categorical data)… the violin plot does, however, have a meaning for categorical data as well… the ideal would be for the code to recognise the nature of the data in the y dimension and choose an appropriate algorithm for it: a kernel density estimator if it is Float, or something else (?) for Int/String/…
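One way this could look, purely as a sketch: let Julia's dispatch on the element type of the y data pick the method. Both width_method and relative_counts are hypothetical helpers invented for this example and are not part of StatPlots.

```julia
# Hypothetical helper, not part of StatPlots: pick how to compute the
# violin's widths from the element type of the y data.
width_method(::AbstractVector{<:AbstractFloat}) = :kde
width_method(::AbstractVector{<:Union{Integer,AbstractString}}) = :counts

# For discrete data, the width at each value is simply its relative frequency.
function relative_counts(y)
    counts = Dict{eltype(y),Int}()
    for v in y
        counts[v] = get(counts, v, 0) + 1
    end
    n = length(y)
    return Dict(k => c / n for (k, c) in counts)
end

width_method([1.5, 2.5])           # :kde
width_method([10, 20, 20])         # :counts
relative_counts([10, 10, 20, 20])  # each class gets a share of 0.5
```

Note that the columns created with diamClass = [] in the question have element type Any, so a real implementation would need either typed columns or a fallback method.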
In groupapply from the same package, I have an explicit keyword (see here) called axis_type that determines how to treat the data. If it’s not specified by the user, I’d recommend checking whether the column is a PooledArray to decide what to do.
I’m not a violin plot expert, but if it is well defined for categorical data you could simply add this option to violin_coords.