On-the-fly kernel density estimation?

I have the following type of problem: imagine a Monte Carlo simulation generating some data, and I want to estimate the density function for that data. I cannot store the data for each Monte Carlo step, which would be prohibitive, but I want to do something better than simply constructing a histogram.

To provide a concrete example:

  1. In theory a nice way to obtain the density would be:
julia> using Plots, KernelDensity

julia> data = randn(10^4);

julia> k = kde(data);

julia> plot(k.x, k.density)

which gives me the pretty density:

[figure: smooth KDE curve for the randn sample]

  2. However, I cannot really store the data. So my approach is to build a histogram on the fly:
julia> function mc_histogram(nsteps)
           hist = zeros(100)                # counts for 100 bins over (-5, 5)
           limits = (-5, 5)
           domain = range(limits[1], limits[2]; length=100)
           n = 0                            # number of samples inside the limits
           for _ in 1:nsteps
               x = randn()
               if limits[1] < x < limits[2]
                   n += 1
                   bin = floor(Int, 100*(x - limits[1]) / (limits[2]-limits[1])) + 1
                   hist[bin] += 1
               end
           end
           return domain, hist ./ n         # bin positions and fraction of samples per bin
       end
mc_histogram (generic function with 1 method)

julia> x, h = mc_histogram(10^4);

julia> plot(x,h)

Histograms, though, are ugly:

[figure: the ragged normalized histogram of the same sample]

Thus, what I’m searching for is a way to build, without storing the data, a kernel density estimate that is a better estimate of the density than the histogram.

One alternative I was told about was the use of average shifted histograms (which this package implements). That strategy seems adequate, though I’m unsure how much more justified it is relative to just computing a running average or other smoothing function on the final discrete histogram.

I’m quite unfamiliar with the world of these kernel estimators, so I’m looking for alternatives. Any ideas, suggestions? Are the average shifted histograms the best alternative out there?

The alternative with AverageShiftedHistograms gives:

julia> function mc_ash(nsteps)
           # initialize the estimator on a fixed grid with a first sample
           a = ash(randn(); rng=range(-5, 5, length=100))
           for _ in 1:nsteps
               x = randn()
               ash!(a, x)                   # update the estimate with the new sample
           end
           return a
       end
mc_ash (generic function with 1 method)

julia> a = mc_ash(10^4);

julia> plot(a.density)

[figure: ASH density estimate, still fairly rough]

This doesn’t look much smoother than the histogram (there are parameters to adjust, though), but at least it allows incremental improvement of the estimate.


OnlineStats.jl has the ASH method plus a few others; I would try those out.

https://joshday.github.io/OnlineStats.jl/latest/

If that fails, you can use a histogram + kernel smoothing. There are methods like kde that operate on histograms rather than the underlying data. I do not know if they have been implemented in Julia. Note that for 1D density estimation, this is probably pretty close to ASH anyway.
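For example, a minimal sketch of that idea in plain Julia (not from any package; the Gaussian kernel and the bandwidth h = 0.3 are arbitrary choices) is to place a kernel at each bin position, weighted by the fraction of samples that landed in that bin:

# Kernel-smooth a stored histogram instead of the raw data.
# `centers` are the bin positions and `weights` the normalized bin
# counts (summing to 1), e.g. the output of mc_histogram above.
function smooth_histogram(centers, weights; h = 0.3)
    density = zeros(length(centers))
    for (j, w) in enumerate(weights), (i, x) in enumerate(centers)
        # Gaussian kernel of bandwidth h centered at bin j
        density[i] += w * exp(-0.5 * ((x - centers[j]) / h)^2) / (h * sqrt(2π))
    end
    return density
end

# x, w = mc_histogram(10^4)
# plot(x, smooth_histogram(x, w))

The bandwidth plays the same role as a KDE bandwidth, so it still has to be chosen by hand.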


I spent a day searching for any of those and could not find a single one :frowning:. Maybe because of my lack of familiarity with the terminology. That’s what I would prefer, so that it could just be plugged in as a final step in my analysis.

I think the R package KernSmooth does it that way. I had coded up some related functionality in Julia, but not with binning (https://github.com/tbeason/NonparametricRegression.jl, simple local constant and local linear regressions in Julia). Maybe you can modify what I started.


LOESS (I got it from the final references of your package) seems to do what I want (at least I get a smoothed distribution from the histogram, as expected). Does it have any obvious pitfalls?

You can implement this yourself pretty easily if you have an idea of the range of possible values that the variable of interest can take. Here’s an algorithm sketch (a Julia version of it follows the list):

  1. Define the range of x-values you want to plot:
    • xs = range(start=lower_limit, stop=upper_limit, length=N)
    • N is the number of x-axis values at which you want to plot, e.g. 100.
  2. Initialize your density estimate with zeros:
    • d = zeros(N)
  3. For each new data point x_i, do the following:
    a) Calculate the kernel value K(x - x_i) for each x in xs.
    b) Add the values calculated in part a) to the vector d.
  4. After all the data points have been visited, divide d by n (the number of data points):
    • d = d/n
  5. Make your plot:
    • plot(xs, d)
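A minimal Julia sketch of the above, assuming a Gaussian kernel (the bandwidth h = 0.3, the limits, and randn() as the data source are placeholders):

function streaming_kde(nsteps; lower_limit = -5.0, upper_limit = 5.0, N = 100, h = 0.3)
    xs = range(start = lower_limit, stop = upper_limit, length = N)    # step 1
    d = zeros(N)                                                       # step 2
    for _ in 1:nsteps
        x_i = randn()                             # new data point (then discarded)
        @. d += exp(-0.5 * ((xs - x_i) / h)^2) / (h * sqrt(2π))        # step 3
    end
    return xs, d ./ nsteps                        # step 4: divide by n
end

# xs, d = streaming_kde(10^4)
# plot(xs, d)                                     # step 5

Only the N-element vector d is kept in memory; each data point is folded in and then thrown away.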

Why do you evaluate the density at all data points? Pick a grid (say, 101 values), evaluate, and store. I am pretty sure that you wouldn’t be able to tell the difference. Then interpolate afterwards if necessary.

What exactly would that mean? What is the “kernel value”? I’m not sure that is clear from my initial post: I do not know the shape of the distribution (the randn() was just an example to make things simple).

I tried to simplify here, but the data is not from a known distribution; it is, let’s say, experimental data. I have to follow the simulation and compute from it the property for which I need a PDF.

For example, this is the shape of one of such distributions:

[figure: histogram estimate of a radial distribution function from the simulation]

(it is sort of a radial distribution function between particles in a simulation). I want to improve the estimate at each point, which is now simply a normalized histogram. Since the density is rather smooth, the noise there seems avoidable with a better estimator than a simple histogram. I would like something that makes statistical sense, not just a simple running average.

One alternative is to apply some kernel smoothing function, as suggested above, to the final histogram. But I do have access to the raw data, except that it cannot be stored, so I could eventually use something that works on it directly.

Not sure if you were responding to my post. I edited to clarify that the N in the range call is the number of grid points on the x-axis at which you wish to plot. So, yes, 101 would be a typical value for N.

A KDE is actually quite simple. The kernel is most commonly a Gaussian distribution. So the KDE is just the average of a bunch of Gaussian distributions with each distribution centered at one of the data points.

The tricky part is that you need to choose the width of the kernel in advance.
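In formula form, with bandwidth h and data points x_1, …, x_n, the Gaussian-kernel estimator is

$$\hat f(x) = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{h\sqrt{2\pi}} \exp\!\left(-\frac{(x - x_i)^2}{2h^2}\right),$$

which is exactly the quantity the grid-based algorithm sketched above accumulates one x_i at a time.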


In your previous Googling, you might have missed the keyword “streaming”. Here are some articles that I just found about KDE on streaming data:

http://alumni.cs.ucr.edu/~wli/publications/deosd.pdf

https://www.researchgate.net/publication/221324940_Towards_Kernel_Density_Estimation_over_Streaming_Data

(I haven’t read them.)

But the real question is what do you want to do with your PDF? For example, if you want to compare to a theoretical distribution, then you might want a “streaming Kolmogorov-Smirnov test”:


No, I was responding to the original post. (Yes, I agree with the N.)


Thanks for all the references, I’ll study them.

Concerning what I want to do with the PDF: this is data we get from simulations, which tell us about the nature of molecular interactions in a complex molecular system. I’ll use it to interpret the physics of the system; that’s it. There is no comparison to anything expected in theory; I just want to obtain the best estimate possible of the distribution given the data samples we can generate and save.

Oh yes, that’s what I do. I build that histogram on the fly, from the hundreds of millions of data points that are generated. The histogram itself is an array of a few hundred positions.

The histogram itself is an estimate of the PDF function’s value at the midpoint of each bin.
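In symbols, with n_j the count in bin j, n the total number of samples, and Δ the bin width:

$$\hat f(c_j) \approx \frac{n_j}{n\,\Delta}.$$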

You can use generic function approximation methods to smooth this. I’d take a look at ApproxFun.jl

The biggest issue is that an approximating function won’t necessarily satisfy the constraint that it remains nonnegative everywhere. But I’d suggest something like calculating a least-squares solution on either the density or the log density.

You can use ApproxFun to create a Chebyshev polynomial representation of the PDF with, say, 10-20 coefficients, and then use something like LsqFit.jl to find the coefficients that minimize the sum of squares across the midpoints of the bins. Note that you’ll want to map the radius interval, say 0-10, into the range [-1, 1] for the Chebyshev basis.

If that doesn’t work well (mainly if it results in polynomials that take negative density values in various places), then you could try a least-squares fit to the logarithm of the density at the bin midpoints, but beware of bins with zero counts. Another option would be to least-squares fit a Chebyshev polynomial to the CUMULATIVE distribution (i.e., sum the bins).

If you fit to the cumulative distribution you can then differentiate that to get an estimate of the density.
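A rough sketch of that least-squares step in plain Julia (no ApproxFun or LsqFit; the 0-10 interval and the 15 coefficients are placeholders to adjust):

# Design matrix of Chebyshev polynomials T_0 … T_{ncoef-1} evaluated at
# the points x, after mapping [a, b] onto [-1, 1].
function cheb_design(x, a, b, ncoef)
    t = 2 .* (x .- a) ./ (b - a) .- 1
    T = ones(length(t), ncoef)
    ncoef > 1 && (T[:, 2] .= t)
    for k in 3:ncoef                       # recurrence: T_k = 2t*T_{k-1} - T_{k-2}
        T[:, k] .= 2 .* t .* T[:, k-1] .- T[:, k-2]
    end
    return T
end

# mids: bin midpoints, dens: histogram density values at the midpoints
cheb_coefs(mids, dens; a = 0.0, b = 10.0, ncoef = 15) =
    cheb_design(mids, a, b, ncoef) \ dens  # ordinary least squares

# Evaluate the fitted (possibly negative!) density on a fine grid:
# c = cheb_coefs(mids, dens)
# fine = range(0, 10, length = 500)
# plot(fine, cheb_design(fine, 0.0, 10.0, 15) * c)

Fitting the cumulative distribution instead, as suggested above, amounts to passing the cumulative sums of the bins and differentiating the fitted polynomial at the end.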


It sounds like this bit might need clarifying:

“Average Shifted Histogram” is a misnomer. While the method was originally derived by averaging histograms using slightly displaced bin locations, it’s mathematically equivalent to building one finer-partition histogram and throwing KDE at it (with, if I recall correctly, a triangular kernel function).
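A minimal illustration of that equivalence (not the package’s actual code): take a fine histogram of counts and replace each bin by a triangular-kernel weighted average of its m neighbors on each side, which is what averaging m shifted histograms works out to.

# `counts` is a fine histogram; m is the smoothing parameter.
function ash_smooth(counts, m)
    out = zeros(length(counts))
    for i in eachindex(counts), j in max(1, i - m + 1):min(length(counts), i + m - 1)
        out[i] += (1 - abs(i - j) / m) * counts[j]   # triangular weights
    end
    return out ./ sum(out)                           # renormalize to sum to 1
end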

AverageShiftedHistograms.jl implements the generalized variety of this where you can use any kernel function.


I like ASH because 1) you can build it online, and 2) it’s much faster than KDE. My general rule is to use “too many” bins and then increase the smoothing parameter m as desired.
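In practice that rule of thumb looks something like this (assuming m can be passed as a keyword, like rng is in the code above):

# many bins plus a larger smoothing parameter m
a = ash(randn(10^4); rng = range(-5, 5, length = 500), m = 10)
plot(a.density)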

AverageShiftedHistograms.jl currently has a super simple heuristic for picking the default m that could certainly be improved. I just never bothered to improve it since it’s pretty subjective anyway.

The ASH functionality in OnlineStats.jl is essentially the same as in AverageShiftedHistograms.jl, but gives you access to other underlying histogram types. I typically use ExpandingHist, which automatically expands the bin range to ensure every data point is captured. The downside, of course, is that one bad data point can muck things up.


Thanks @joshday for the clarifications. I think I’ll be using that, indeed.