Changing the smoothness of density function

Torkel · August 18, 2021, 4:14pm

Hi, I wish to plot a density function (to compare to another density function I have plotted).
Unfortunatley, all values are integers, typically between 0 and 5. This gives a very jagged, and ugly looking plot. Is there some option where I can tune this, making it smoother?

density_problem
(Note, the axis labels are totally misgiving, not that it matter much, but still. x-axis should be molecule numbers and y-axis probably frequency)

jkbest2 · August 18, 2021, 5:27pm

I’m sure there’s a way to change the smoothness of the density estimate (but don’t know it off the top of my head sorry). But it seems like a histogram might be more appropriate here. You can still overlay your other density function for comparison.

hdavid16 · August 18, 2021, 5:40pm

How are you making that plot?

using StatsPlots

x=[0,0,1,1,1,1,2,2,2,2,2,3,3,3,4,5,6]
density(x)

Capture

George9000 · August 18, 2021, 6:52pm

If this is a discrete distribution, perhaps a PMF plot is more appropriate? See
this draft, p. 93 - 96, and this corresponding to the PMF plot on this page.

Torkel · August 19, 2021, 2:07pm

Yes, but there are about 50,000 values in the array.

Torkel · August 19, 2021, 2:07pm

That’s a really extensive guide, thanks for sharing! Will look at it.

hdavid16 · August 19, 2021, 2:22pm

You could use KernelDensity.jl:

using StatsPlots, KernelDensity

x=rand(1:6,50000)
density(x, label="using StatsPlots")

f=kde(x,0:7) #specify the points at which to evaluate the kernel density estimation
plot!(f.x,f.density, label="using KernelDensity")

Torkel · August 19, 2021, 2:39pm

Thanks, that did work, but only partially. If I use an integer grid, I get a curve very similar to what I would have gotten if I would have plotted the values 0->10, against the count of each value in my data. It works, but looks jagged and rough (which I would rather want to avoid in this case, since I make a comparison to a smooth curve, and don’t want people distracted by what in this case is an artificial difference). If I try changing the grid, however, I get the same phenomena as when using density:

f=kde(no_stress_wt_vals,0:1:10) 
p1 = plot(f.x,f.density)
f=kde(no_stress_wt_vals,0:0.25:10) 
p2 = plot(f.x,f.density)
plot(p1,p2,size=(1000,350))

hdavid16 · August 19, 2021, 2:41pm

Yeah…

It’s probably best to plot a histogram where the bins are the integer values.
Then you can plot other smooth curves on top of it.
I don’t think density functions make much sense for discrete distributions since you are fitting the data to a continuous distribution (kernel).

Torkel · August 19, 2021, 2:41pm

Got it, thanks for the help everyone

genkuroki · August 20, 2021, 8:47am

I agree with hdavid16-san.

Moreover, if you want to compare a sample of a discrete distribution with a continuous distribution, you can plot the ecdf of the sample and the cdf of the continuous distribution on top of each other.

The L^1-metric of cdf’s on \mathbb{R} coinsides with the Wasserstein metric for p=1.

Sample working code:

using Plots
using Distributions

"""empirical cumulative distribution function"""
ecdf(sample, x) = count(≤(x), sample)/length(sample)

λ = 3
dist_true = Poisson(λ)
n = 2^10
sample = rand(dist_true, n)
m, u² = mean(sample), var(sample)
gamma = Gamma(m^2/u², u²/m)
a, b = 0, 10

P = plot();
histogram!(sample; norm=true, alpha=0.3,
    bin=a-0.5:b+0.5, xtick=a:b,
    label="size-$n sample of Poisson($λ)");
plot!(x -> pdf(gamma, x), a, b;
    label="Gamma distribution approximation");

Q = plot(; legend=:bottomright, xtick=a:b);
plot!(x -> ecdf(sample, x), a-0.5, b+0.5;
    label="ecdf of size-$n sample of Poisson($λ)");
plot!(x -> cdf(gamma, x), a-0.5, b+0.5;
    label="cdf of Gamma distribution approximation");

plot(P, Q; size=(500, 500), layout=(2, 1))

mschauer · August 20, 2021, 11:45am

The key question is: Why are the numbers integers? Are they random, but rounded in some way? Or are they not random at all but set by the experimenter?

Torkel · August 20, 2021, 11:50am

Basically, I am comparing the presence of a protein in experiments vs simulations. The experiments use fluorescence as a proxy, which gives continuous data. By simulated data simulates the actual copy-numbers of how many proteins there are (discrete).

The units of the two categories don’t really correspond to each other, but I wanted to create an image to show that the shapes were similar. I have more details elsewhere, but in this case I figured if one case was obviously continuous and one discrete, that would be what people note first, while I want them to focus on the shapes.

What I ended up doing was to use Interpolations.jl to interpolate the data, and then plot it continuous (which I agree might seem like cheating a bit, but in this case I think it is OK).

mschauer · August 20, 2021, 5:37pm

No, my question is why the time is in n minutes where n is an integer and not a real number (as expected if you actually measure time of physical processes).

Torkel · August 20, 2021, 6:23pm

Ahh, I’m sorry, my misstake.

I was using default() in the script I ran (because most plots are time plots). This is the only exception, but seem I forgot to clean the axises. The x-axis should be number of molecules, and the y-axis fraction or something.

cjdoris · August 20, 2021, 6:34pm

One thing you can do if you have a distribution over the integers is to un-discretize your samples by adding uniform random noise in [0,1] to the location x of each sample.

This way, the KDE is actually estimating the histogram of your distribution, which is a bona-fide continuous PDF. That is, it is estimating the density with PDF f(x)=p(\lfloor x \rfloor) where p is the PMF of the discrete data distribution.

dlakelan · August 20, 2021, 7:05pm

what you want to do is specify the bandwidth for the density estimation in the density command.

x=rand(1:6,50000)
density(x,label="default")
density!(x,label="bandwidth=1",bandwidth=1)

Torkel · August 20, 2021, 8:59pm

Thanks, yes, something like that is what I was thinking about. Thanks

genkuroki · August 20, 2021, 9:48pm

hdavid16-san is right:

Please don’t apply the kernel density estimation to a sample of a discrete distribution with values in 1, 2, 3, 4, 5, and 6. The kernel density estimation is a method for estimating the density function of the continuous distribution from a sample.

For a sample with values 1, 2, 3, 4, 5, and 6, both its histogram with integer bins and its ecdf keep the true sample information, but the kernel density estimation does not.

using StatsPlots
x = rand(1:6, 10^5)
histogram(x; norm=true, alpha=0.3, bin=0.5:6.5, label="histogram of true sample")
density!(x; label="kde with bandwidth=1", bandwidth=1, lw=2)
plot!(; ylim=(-0.005, 0.22), xtick=-10:10)

dlakelan · August 20, 2021, 9:58pm

There are plenty of good reasons to want to treat discrete distributions as if they were continuous. Whether it’s appropriate or not depends entirely on the application. I use KDE plots for discrete distributions all the time, particularly when the discreteness is basically uninteresting to the application.

Topic		Replies	Views
Kernel Density Estimate boundary problems Statistics question	3	270	June 5, 2024
On-the-flight kernel density estimation? Statistics	15	1701	March 13, 2023
Plotting a line histogram New to Julia question , plotting	9	8120	January 9, 2021
Help to perform scatter density plot with Plots.jl Visualization plots	9	2676	October 17, 2023
Kernel Density Estimate for cdf General Usage question , statistics	11	1648	September 26, 2022

Changing the smoothness of density function

Related topics