Changing the smoothness of density function

Hi, I wish to plot a density function (to compare to another density function I have plotted).
Unfortunatley, all values are integers, typically between 0 and 5. This gives a very jagged, and ugly looking plot. Is there some option where I can tune this, making it smoother?

density_problem
(Note, the axis labels are totally misgiving, not that it matter much, but still. x-axis should be molecule numbers and y-axis probably frequency)

I’m sure there’s a way to change the smoothness of the density estimate (but don’t know it off the top of my head sorry). But it seems like a histogram might be more appropriate here. You can still overlay your other density function for comparison.

2 Likes

How are you making that plot?

using StatsPlots

x=[0,0,1,1,1,1,2,2,2,2,2,3,3,3,4,5,6]
density(x)

Capture

2 Likes

If this is a discrete distribution, perhaps a PMF plot is more appropriate? See
this draft, p. 93 - 96, and this corresponding to the PMF plot on this page.

2 Likes

Yes, but there are about 50,000 values in the array.

That’s a really extensive guide, thanks for sharing! Will look at it.

You could use KernelDensity.jl:

using StatsPlots, KernelDensity

x=rand(1:6,50000)
density(x, label="using StatsPlots")

f=kde(x,0:7) #specify the points at which to evaluate the kernel density estimation
plot!(f.x,f.density, label="using KernelDensity")

1 Like

Thanks, that did work, but only partially. If I use an integer grid, I get a curve very similar to what I would have gotten if I would have plotted the values 0->10, against the count of each value in my data. It works, but looks jagged and rough (which I would rather want to avoid in this case, since I make a comparison to a smooth curve, and don’t want people distracted by what in this case is an artificial difference). If I try changing the grid, however, I get the same phenomena as when using density:

f=kde(no_stress_wt_vals,0:1:10) 
p1 = plot(f.x,f.density)
f=kde(no_stress_wt_vals,0:0.25:10) 
p2 = plot(f.x,f.density)
plot(p1,p2,size=(1000,350))

Yeah…

It’s probably best to plot a histogram where the bins are the integer values.
Then you can plot other smooth curves on top of it.
I don’t think density functions make much sense for discrete distributions since you are fitting the data to a continuous distribution (kernel).

2 Likes

Got it, thanks for the help everyone :slight_smile:

1 Like

I agree with hdavid16-san.

Moreover, if you want to compare a sample of a discrete distribution with a continuous distribution, you can plot the ecdf of the sample and the cdf of the continuous distribution on top of each other.

The L^1-metric of cdf’s on \mathbb{R} coinsides with the Wasserstein metric for p=1.

Sample working code:

using Plots
using Distributions

"""empirical cumulative distribution function"""
ecdf(sample, x) = count(≤(x), sample)/length(sample)

λ = 3
dist_true = Poisson(λ)
n = 2^10
sample = rand(dist_true, n)
m, u² = mean(sample), var(sample)
gamma = Gamma(m^2/u², u²/m)
a, b = 0, 10

P = plot();
histogram!(sample; norm=true, alpha=0.3,
    bin=a-0.5:b+0.5, xtick=a:b,
    label="size-$n sample of Poisson($λ)");
plot!(x -> pdf(gamma, x), a, b;
    label="Gamma distribution approximation");

Q = plot(; legend=:bottomright, xtick=a:b);
plot!(x -> ecdf(sample, x), a-0.5, b+0.5;
    label="ecdf of size-$n sample of Poisson($λ)");
plot!(x -> cdf(gamma, x), a-0.5, b+0.5;
    label="cdf of Gamma distribution approximation");

plot(P, Q; size=(500, 500), layout=(2, 1))

3 Likes

The key question is: Why are the numbers integers? Are they random, but rounded in some way? Or are they not random at all but set by the experimenter?

1 Like

Basically, I am comparing the presence of a protein in experiments vs simulations. The experiments use fluorescence as a proxy, which gives continuous data. By simulated data simulates the actual copy-numbers of how many proteins there are (discrete).

The units of the two categories don’t really correspond to each other, but I wanted to create an image to show that the shapes were similar. I have more details elsewhere, but in this case I figured if one case was obviously continuous and one discrete, that would be what people note first, while I want them to focus on the shapes.

What I ended up doing was to use Interpolations.jl to interpolate the data, and then plot it continuous (which I agree might seem like cheating a bit, but in this case I think it is OK).

No, my question is why the time is in n minutes where n is an integer and not a real number (as expected if you actually measure time of physical processes).

Ahh, I’m sorry, my misstake.

I was using default() in the script I ran (because most plots are time plots). This is the only exception, but seem I forgot to clean the axises. The x-axis should be number of molecules, and the y-axis fraction or something.

One thing you can do if you have a distribution over the integers is to un-discretize your samples by adding uniform random noise in [0,1] to the location x of each sample.

This way, the KDE is actually estimating the histogram of your distribution, which is a bona-fide continuous PDF. That is, it is estimating the density with PDF f(x)=p(\lfloor x \rfloor) where p is the PMF of the discrete data distribution.

what you want to do is specify the bandwidth for the density estimation in the density command.

x=rand(1:6,50000)
density(x,label="default")
density!(x,label="bandwidth=1",bandwidth=1)

image

2 Likes

Thanks, yes, something like that is what I was thinking about. Thanks :slight_smile:

hdavid16-san is right:

Please don’t apply the kernel density estimation to a sample of a discrete distribution with values in 1, 2, 3, 4, 5, and 6. The kernel density estimation is a method for estimating the density function of the continuous distribution from a sample.

For a sample with values 1, 2, 3, 4, 5, and 6, both its histogram with integer bins and its ecdf keep the true sample information, but the kernel density estimation does not.

using StatsPlots
x = rand(1:6, 10^5)
histogram(x; norm=true, alpha=0.3, bin=0.5:6.5, label="histogram of true sample")
density!(x; label="kde with bandwidth=1", bandwidth=1, lw=2)
plot!(; ylim=(-0.005, 0.22), xtick=-10:10)

1 Like

There are plenty of good reasons to want to treat discrete distributions as if they were continuous. Whether it’s appropriate or not depends entirely on the application. I use KDE plots for discrete distributions all the time, particularly when the discreteness is basically uninteresting to the application.