Violin plot interpolates too much

sylvaticus · May 18, 2017, 7:57am

I am trying to draw a violin plot of forest structure (distribution of volumes by diameter class).
The problem is that the violin plot tries to interpolate “too much” and draws too many waves:

using DataFrames, Plots, StatPlots

forestVols = wsv"""
diamClass	year	vol
10	2000	320.62
20	2000	614.55
30	2000	586.75
40	2000	467.26
50	2000	318.31
60	2000	191.39
70	2000	97.13
10	2001	156.65
20	2001	594.46
30	2001	820.27
40	2001	788.55
50	2001	640.59
60	2001	464.36
70	2001	307.13
10	2002	156.65
20	2002	594.46
30	2002	820.27
40	2002	788.55
50	2002	640.59
60	2002	464.36
70	2002	307.13
"""
# Creating a frequency df..
forestVols_freq = DataFrame(
  diamClass   = [],
  year        = []
)
for r in eachrow(forestVols)
  for i in range(1,Int(round(r[:vol])))
    push!(forestVols_freq, [r[:diamClass] r[:year]])
  end
end
forStructPlot = violin(forestVols_freq,:year,:diamClass, side=:both, marker=(0.2,:blue,stroke(0)))

I tried dividing diamClass by 10, but the result is the same (and the violin plot does not seem to work with a vector of strings for the y dimension; strings are OK for the x one).
I did check my forestVols_freq dataframe…

sylvaticus · May 18, 2017, 8:40am

It seems to be linked to the size of the data… if the frequencies are very small, as in the example in the documentation, the waves are less pronounced (e.g. if I compute the frequency dataframe with for i in range(1,Int(round(r[:vol]/40)))). That may be OK for continuous data, but it’s a problem for categorical/integer data…

mkborregaard · May 18, 2017, 9:33am

StatPlots.violin passes the density estimation to KernelDensity.kde with a default npoints = 200. Maybe raise the question on KernelDensity.jl if the density estimation is inappropriate?

piever · May 18, 2017, 9:44am

KernelDensity.jl lets you set the bandwidth parameter manually if the value it finds automatically is inappropriate. In this case I guess the OP would like to plot with a larger bandwidth. It might be useful to be able to pass keyword arguments to KernelDensity.kde directly from the plot call.
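To illustrate the effect of the bandwidth, here is a minimal sketch in plain Julia. It is a hand-rolled Gaussian kernel estimate, not the KernelDensity.jl implementation, and the names weighted_kde and count_peaks are made up for this example. It uses the year 2000 volumes (rounded) as weights on the diameter classes and counts how many bumps the estimated curve has for a narrow and a wide bandwidth.

```julia
# Hand-rolled Gaussian kernel density estimate (illustration only,
# NOT the KernelDensity.jl implementation).
gauss(u) = exp(-u^2 / 2) / sqrt(2π)

# Density at point x for sample values `values` with integer `weights`
# (how often each value occurs) and bandwidth h.
function weighted_kde(x, values, weights, h)
    total = sum(weights)
    return sum(w * gauss((x - v) / h) for (v, w) in zip(values, weights)) / (total * h)
end

# Year 2000 from the original post: diameter classes and rounded volumes.
classes = [10, 20, 30, 40, 50, 60, 70]
vols    = [321, 615, 587, 467, 318, 191, 97]

# Evaluation grid along the diameter axis.
grid = 5:0.5:75

# Count interior local maxima ("waves") of the estimated density on the grid.
function count_peaks(h)
    d = [weighted_kde(x, classes, vols, h) for x in grid]
    return count(i -> d[i] > d[i-1] && d[i] > d[i+1], 2:length(d)-1)
end

narrow_peaks = count_peaks(1.0)   # bandwidth much smaller than the class spacing
wide_peaks   = count_peaks(10.0)  # bandwidth equal to the class spacing

println("bandwidth 1.0:  ", narrow_peaks, " peaks")
println("bandwidth 10.0: ", wide_peaks, " peaks")
```

With a bandwidth far below the spacing between classes, every class shows up as its own bump; with a bandwidth comparable to the spacing, the bumps merge into a single smooth curve. Automatic bandwidth rules typically shrink the bandwidth as the sample grows, which would be consistent with the waves getting stronger for larger frequencies.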

On the other hand, there seems to be a conceptual issue as IMO kernel density methods don’t make a lot of sense for categorical data.

mkborregaard · May 18, 2017, 9:57am

Passing keyword args to violin to adjust the values passed to kde should be an easy PR for an interested party, and adding extra keywords is appropriate for a series recipe placed in StatPlots.

sylvaticus · May 18, 2017, 11:41am

I did some tests, but, apart from the output being really sensitive to the npoints value (and in a nonlinear manner), I couldn’t find a single value where the length of the horizontal segment is proportional to the frequency of the data…

mkborregaard · May 18, 2017, 11:52am

My suggestion is: if you think StatPlots doesn’t cater sufficiently to the capabilities of kde, open a PR with the functionality you’d like. I know you know that code well. If you think there is a problem with kde, open an issue on KernelDensity.jl.

piever · May 18, 2017, 12:33pm

Which is why kernel density estimation is not what you want for categorical variables: your distribution doesn’t have an underlying smooth density function. This is a conceptual problem which has nothing to do with KernelDensity and StatPlots. In the limit of infinite data, your density would be degenerate, with spikes corresponding to the possible values of the categorical variable.

For distributions of categorical variables, you are probably better off doing a (normalized) histogram, with one bin per value of the categorical variable.
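As a rough sketch of that suggestion in plain Julia (no plotting package; the helper bar is made up for this example), the share of each diameter class can be computed directly from the volumes, without expanding them into a frequency dataframe first. The bar length is then exactly proportional to the share, which is the property asked for above.

```julia
# Volumes by diameter class for year 2000 (from the original post).
classes = [10, 20, 30, 40, 50, 60, 70]
vols    = [320.62, 614.55, 586.75, 467.26, 318.31, 191.39, 97.13]

# One bin per class: normalise so that the shares sum to 1.
shares = vols ./ sum(vols)

# Text bar whose length is proportional to the share
# (the largest share gets `maxchars` characters).
function bar(share; maxchars = 40, scale = maximum(shares))
    n = round(Int, maxchars * share / scale)
    return repeat("#", n)
end

# Print the largest diameter class at the top, as on a violin's y axis.
for (c, s) in zip(reverse(classes), reverse(shares))
    println(lpad(c, 3), " | ", bar(s))
end
```

Mirroring each bar around a vertical axis would give a violin-like shape for one year; repeating this per year gives the side-by-side comparison.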

sylvaticus · May 18, 2017, 12:46pm

I agree with that (and that’s why I wrote that it was a problem with categorised data)… the violin plot does, however, also have a meaning for categorised data… the ideal would be for the code to recognise the nature of the data in the y dimension and choose an appropriate algorithm for it: a kernel density estimator if Float, or something else (?) for Int/String/…

piever · May 18, 2017, 1:03pm

In groupapply, from the same package, I have an explicit keyword called axis_type (see here) that determines how to treat the data. If it’s not specified by the user, I’d recommend checking whether the column is a PooledArray to decide what to do.

I’m not a violin plot expert, but if it is well defined for categorical data you could simply add this option to violin_coords.
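One way such a switch could look, as a sketch only: the functions density_method and choose_method below are hypothetical and not part of StatPlots, and the symbols :auto, :kde and :histogram are placeholders rather than the actual axis_type values. The guess here is based on the element type, as suggested earlier in the thread; a PooledArray check could be added as another method in the same way.

```julia
# Hypothetical helper (not part of StatPlots): guess how to summarise the
# y values from their element type.
density_method(::AbstractVector{<:AbstractFloat}) = :kde        # continuous data
density_method(::AbstractVector)                  = :histogram  # Int, String, ...

# An explicit keyword overrides the guess. The keyword name follows the
# axis_type idea mentioned above; its values here are placeholders.
choose_method(y; axis_type = :auto) =
    axis_type == :auto ? density_method(y) : axis_type

println(choose_method([12.3, 45.6, 30.1]))                     # floating-point column
println(choose_method([10, 20, 30]))                           # integer column
println(choose_method([10.0, 20.0]; axis_type = :histogram))   # user override
```

The override matters for cases like the one in this thread, where class midpoints might be stored as Float values even though they are really categories.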
