Hi, I wish to plot a density function (to compare to another density function I have plotted).
Unfortunatley, all values are integers, typically between 0 and 5. This gives a very jagged, and ugly looking plot. Is there some option where I can tune this, making it smoother?
(Note, the axis labels are totally misgiving, not that it matter much, but still. x-axis should be molecule numbers and y-axis probably frequency)
I’m sure there’s a way to change the smoothness of the density estimate (but don’t know it off the top of my head sorry). But it seems like a histogram might be more appropriate here. You can still overlay your other density function for comparison.
If this is a discrete distribution, perhaps a PMF plot is more appropriate? See this draft, p. 93 - 96, and this corresponding to the PMF plot on this page.
using StatsPlots, KernelDensity
x=rand(1:6,50000)
density(x, label="using StatsPlots")
f=kde(x,0:7) #specify the points at which to evaluate the kernel density estimation
plot!(f.x,f.density, label="using KernelDensity")
Thanks, that did work, but only partially. If I use an integer grid, I get a curve very similar to what I would have gotten if I would have plotted the values 0->10, against the count of each value in my data. It works, but looks jagged and rough (which I would rather want to avoid in this case, since I make a comparison to a smooth curve, and don’t want people distracted by what in this case is an artificial difference). If I try changing the grid, however, I get the same phenomena as when using density:
It’s probably best to plot a histogram where the bins are the integer values.
Then you can plot other smooth curves on top of it.
I don’t think density functions make much sense for discrete distributions since you are fitting the data to a continuous distribution (kernel).
Moreover, if you want to compare a sample of a discrete distribution with a continuous distribution, you can plot the ecdf of the sample and the cdf of the continuous distribution on top of each other.
The key question is: Why are the numbers integers? Are they random, but rounded in some way? Or are they not random at all but set by the experimenter?
Basically, I am comparing the presence of a protein in experiments vs simulations. The experiments use fluorescence as a proxy, which gives continuous data. By simulated data simulates the actual copy-numbers of how many proteins there are (discrete).
The units of the two categories don’t really correspond to each other, but I wanted to create an image to show that the shapes were similar. I have more details elsewhere, but in this case I figured if one case was obviously continuous and one discrete, that would be what people note first, while I want them to focus on the shapes.
What I ended up doing was to use Interpolations.jl to interpolate the data, and then plot it continuous (which I agree might seem like cheating a bit, but in this case I think it is OK).
No, my question is why the time is in n minutes where n is an integer and not a real number (as expected if you actually measure time of physical processes).
I was using default() in the script I ran (because most plots are time plots). This is the only exception, but seem I forgot to clean the axises. The x-axis should be number of molecules, and the y-axis fraction or something.
One thing you can do if you have a distribution over the integers is to un-discretize your samples by adding uniform random noise in [0,1] to the location x of each sample.
This way, the KDE is actually estimating the histogram of your distribution, which is a bona-fide continuous PDF. That is, it is estimating the density with PDF f(x)=p(\lfloor x \rfloor) where p is the PMF of the discrete data distribution.
Please don’t apply the kernel density estimation to a sample of a discrete distribution with values in 1, 2, 3, 4, 5, and 6. The kernel density estimation is a method for estimating the density function of the continuous distribution from a sample.
For a sample with values 1, 2, 3, 4, 5, and 6, both its histogram with integer bins and its ecdf keep the true sample information, but the kernel density estimation does not.
using StatsPlots
x = rand(1:6, 10^5)
histogram(x; norm=true, alpha=0.3, bin=0.5:6.5, label="histogram of true sample")
density!(x; label="kde with bandwidth=1", bandwidth=1, lw=2)
plot!(; ylim=(-0.005, 0.22), xtick=-10:10)
There are plenty of good reasons to want to treat discrete distributions as if they were continuous. Whether it’s appropriate or not depends entirely on the application. I use KDE plots for discrete distributions all the time, particularly when the discreteness is basically uninteresting to the application.