Violin plot interpolates too much

plotting

#1

I am trying to draw a violin plot of forest structure (the distribution of volumes by diameter class).
The problem is that the violin plot tries to interpolate “too much” and draws too many waves:

using DataFrames, Plots, StatPlots

forestVols = wsv"""
diamClass	year	vol
10	2000	320.62
20	2000	614.55
30	2000	586.75
40	2000	467.26
50	2000	318.31
60	2000	191.39
70	2000	97.13
10	2001	156.65
20	2001	594.46
30	2001	820.27
40	2001	788.55
50	2001	640.59
60	2001	464.36
70	2001	307.13
10	2002	156.65
20	2002	594.46
30	2002	820.27
40	2002	788.55
50	2002	640.59
60	2002	464.36
70	2002	307.13
"""
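# Build a frequency data frame: one row per unit of volume in each (diamClass, year) cell.
forestVols_freq = DataFrame(
  diamClass   = Int[],
  year        = Int[]
)
for r in eachrow(forestVols)
  # Repeat the (diamClass, year) pair round(vol) times.
  for i in 1:round(Int, r[:vol])
    push!(forestVols_freq, (r[:diamClass], r[:year]))
  end
end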
forStructPlot = violin(forestVols_freq,:year,:diamClass, side=:both, marker=(0.2,:blue,stroke(0)))

I tried dividing diamClass by 10, but the result is the same (and violin does not seem to work with a vector of strings for the y dimension, although strings are fine for the x dimension).
I did check my forestVols_freq dataframe…


#2

It seems to be linked to the size of the data… if the frequencies are very small, as in the example in the documentation, the waves are less pronounced (e.g. if I compute the frequency dataframe with for i in range(1,Int(round(r[:vol]/40)))). That may be OK for continuous data, but it’s a problem for categorical/integer data…
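One possible explanation for the size effect, sketched below: if the automatic bandwidth follows a Silverman-style rule of thumb (an assumption here, not something confirmed in this thread), it shrinks as n^(-1/5), so more rows mean a narrower kernel and sharper bumps at each diameter class. The helper name silverman_bandwidth and the constant 0.9 are illustrative choices.

```julia
using Statistics  # standard library: std, quantile

# Rule-of-thumb bandwidth (Silverman); ASSUMED here to resemble the KDE default.
function silverman_bandwidth(x; alpha = 0.9)
    q25, q75 = quantile(x, [0.25, 0.75])
    width = min(std(x), (q75 - q25) / 1.34)
    return alpha * width * length(x)^(-0.2)
end

classes = 10:10:70
vols    = [320.62, 614.55, 586.75, 467.26, 318.31, 191.39, 97.13]  # year 2000

# Expand volumes into one observation per unit, as the loop in post #1 does.
expand(scale) = [c for (c, v) in zip(classes, vols) for _ in 1:round(Int, v / scale)]

full  = expand(1)    # every unit of volume becomes a row
small = expand(40)   # the vol/40 variant tried above

h_full  = silverman_bandwidth(full)
h_small = silverman_bandwidth(small)
println("n = ", length(full),  ", bandwidth ≈ ", round(h_full;  digits = 2))
println("n = ", length(small), ", bandwidth ≈ ", round(h_small; digits = 2))
```

Under this assumption the larger data set gets the smaller bandwidth, which would be consistent with the waves becoming more pronounced as the frequencies grow, given that the classes are 10 units apart.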


#3

StatPlots.violin passes the density estimation to KernelDensity.kde with a default npoints = 200. Maybe raise the question on KernelDensity.jl if the density estimation is inappropriate?


#4

KernelDensity.jl lets you set the bandwidth parameter manually if the value it finds automatically is inappropriate. In this case I guess the OP would like to plot with a larger bandwidth. It might be useful to be able to pass keyword arguments to KernelDensity.kde directly from the plot call.

On the other hand, there seems to be a conceptual issue as IMO kernel density methods don’t make a lot of sense for categorical data.
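To illustrate how the bandwidth controls the waves, here is a minimal plain-Julia sketch of a weighted Gaussian kernel density estimate. It is not KernelDensity.jl's implementation, and the function names and the two bandwidth values (3.0 and 8.0) are arbitrary choices for illustration only.

```julia
# Plain-Julia weighted Gaussian KDE (illustrative; NOT KernelDensity.jl's code).
gauss(u) = exp(-u^2 / 2) / sqrt(2π)

function kde_on_grid(centers, weights, grid, h)
    total = sum(weights)
    return [sum(w * gauss((x - c) / h) for (c, w) in zip(centers, weights)) / (total * h)
            for x in grid]
end

# Number of strict interior local maxima, i.e. "waves" in the violin outline.
count_peaks(d) = count(i -> d[i] > d[i - 1] && d[i] > d[i + 1], 2:length(d) - 1)

classes = 10:10:70
vols    = [320.62, 614.55, 586.75, 467.26, 318.31, 191.39, 97.13]  # year 2000
grid    = range(0, stop = 80, length = 401)

narrow = kde_on_grid(classes, vols, grid, 3.0)  # bandwidth well below the class spacing
wide   = kde_on_grid(classes, vols, grid, 8.0)  # bandwidth close to the class spacing

println("peaks with h = 3.0: ", count_peaks(narrow))
println("peaks with h = 8.0: ", count_peaks(wide))
```

A larger bandwidth removes the waves but only by smearing each class into its neighbours, which is the conceptual issue raised above: the smooth outline no longer reflects the per-class frequencies directly.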


#5

Passing keyword args to violin to adjust the values passed to kde should be an easy PR for an interested party, and adding extra keywords is appropriate for a series recipe placed in StatPlots.


#6

I did some tests, but apart from the output being really sensitive to the npoints value (and in a nonlinear manner), I couldn’t find a single value where the length of the horizontal segment is proportional to the frequency of the data…


#7

My suggestion is: if you think StatPlots doesn’t cater sufficiently to the capabilities of kde, open a PR with the functionality you’d like - I know you know that code well :slight_smile: If you think there is a problem with kde, open an issue on KernelDensity.jl.


#8

Which is why kernel density estimation is not what you want for categorical variables: your distribution doesn’t have an underlying smooth density function. This is a conceptual problem which has nothing to do with KernelDensity and StatPlots. In the limit of infinite data, your density would be degenerate, with spikes corresponding to the possible values of the categorical variable.

For distributions of categorical variables, you are probably better off doing a (normalized) histogram, with one bin per value of the categorical variable.
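As a rough sketch of that suggestion, using only Base Julia and the first two years of the table from post #1: since vol already acts as the frequency, the normalized histogram is simply each class's share of the yearly total. The variable names and the text-bar rendering are illustrative, not part of any plotting package.

```julia
classes = 10:10:70
vols = Dict(
    2000 => [320.62, 614.55, 586.75, 467.26, 318.31, 191.39, 97.13],
    2001 => [156.65, 594.46, 820.27, 788.55, 640.59, 464.36, 307.13],
)

# One bin per diameter class: each share is that class's fraction of the year's volume.
shares = Dict(year => v ./ sum(v) for (year, v) in vols)

for year in sort(collect(keys(shares)))
    println(year)
    for (c, s) in zip(classes, shares[year])
        # Text bar whose length is proportional to the class share.
        println(lpad(c, 4), " ", rpad(round(s; digits = 3), 6), "#"^round(Int, 100s))
    end
end
```

Here the bar length is proportional to the frequency by construction, which is the property that no npoints or bandwidth setting can guarantee for a kernel density estimate.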


#9

I agree with that (and that’s why I wrote that it was a problem with categorical data)… the violin plot does, however, have a meaning for categorical data too… ideally the code would recognise the nature of the data in the y dimension and choose an appropriate algorithm for it: a kernel density estimator for Float, or something else (?) for Int/String/…


#10

In groupapply from the same package, I have an explicit keyword (see here) called axis_type that determines how to treat the data. If it’s not specified by the user, I’d recommend checking whether the column is a PooledArray to decide what to do.

I’m not a violin plot expert, but if it is well defined for categorical data you could simply add this option to violin_coords.
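A minimal sketch of how such an option could be wired up, assuming nothing about the actual StatPlots internals: the helper names detect_axis_type and violin_method and the symbols :continuous, :discrete, :kde and :histogram are all hypothetical, and the automatic detection uses the element type (as suggested in post #9) rather than a PooledArray check, to keep the example free of package dependencies.

```julia
# Hypothetical helpers, NOT StatPlots API: pick how to summarise the y values.
# Floating-point columns are treated as continuous, everything else as discrete.
detect_axis_type(y::AbstractVector{<:AbstractFloat}) = :continuous
detect_axis_type(y::AbstractVector) = :discrete   # Int, String, ...

# `axis_type` can be given explicitly; otherwise it is detected from the data.
function violin_method(y; axis_type = detect_axis_type(y))
    axis_type == :continuous && return :kde
    axis_type == :discrete && return :histogram
    throw(ArgumentError("unknown axis_type: $axis_type"))
end

println(violin_method([1.5, 2.0, 3.7]))                       # Float64 column
println(violin_method([10, 20, 30]))                          # Int column
println(violin_method([10, 20, 30]; axis_type = :continuous)) # explicit override
```

Keeping the explicit keyword matters for cases like the diameter classes here, where an integer column could reasonably be treated either way and the user should have the final say.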