Kernel density estimation status

juliohm · September 17, 2017, 3:30am

What is the de facto standard package for KDE in Julia?

My attempts with the packages listed above were not very successful.

Tamas_Papp · September 17, 2017, 8:01am

I have used KernelDensity.jl recently. The following runs with the current version just fine:

using KernelDensity
using Plots; gr()
x = randn(100)
y = kde(x)
# `linspace` (Julia 0.6) was removed in Julia 1.0; `range` is its replacement
plot(range(extrema(x)...; length=100), z->pdf(y, z))

Being more specific than “not successful” would make it easier to help you. What did you try? What happened? How does it not conform to your expectations?

mcreel · September 17, 2017, 9:52am

I have a function that does basic multivariate local constant, linear and quadratic nonparametric fitting in https://github.com/mcreel/Econometrics.jl; just call npreg() to see an example. I use this code for teaching, and will try to make it reasonably accessible, but maybe not fully featured.

In the package https://github.com/mcreel/QuantileIV.jl there is an extended version which does nonparametric quantile regression, using code from https://github.com/pkofod/QuantileRegression.jl. The stuff in my QuantileIV.jl package needs to be cleaned up a bit for public use, but it works well for me.

joshday · September 17, 2017, 1:34pm

The first Julia package I wrote was AverageShiftedHistograms.jl (GitHub: joshday/AverageShiftedHistograms.jl, “Lightning fast density estimation in Julia”), which essentially does KDE over a fine-partition histogram. It’s cool if you’re working with large datasets, as it uses constant memory and can be estimated on-line.
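To illustrate the idea of KDE over a fine-partition histogram, here is a minimal one-dimensional sketch of an averaged shifted histogram. It is not the package's implementation: the function name `ash1d`, the keywords `delta` (fine bin width) and `m` (number of shifts), and the choice to derive the grid from the data range are all assumptions made for illustration. Fine-bin counts are smoothed with triangular weights `1 - |i|/m`, which corresponds to averaging `m` histograms of bin width `h = m * delta`.

```julia
# Minimal 1D averaged shifted histogram (ASH) sketch. All names are hypothetical.
function ash1d(x; delta, m=5)
    n = length(x)
    lo = minimum(x) - m * delta                     # pad m fine bins on the left
    nb = ceil(Int, (maximum(x) - minimum(x)) / delta) + 2m + 1
    counts = zeros(Int, nb)                         # fixed-size state: one counter per fine bin
    for xi in x
        counts[floor(Int, (xi - lo) / delta) + 1] += 1
    end
    h = m * delta                                   # effective smoothing width
    dens = zeros(nb)
    for k in 1:nb, i in (1 - m):(m - 1)
        j = k + i
        if 1 <= j <= nb
            dens[k] += (1 - abs(i) / m) * counts[j] # triangular weights
        end
    end
    dens ./= n * h
    centers = lo .+ delta .* ((1:nb) .- 0.5)        # midpoints of the fine bins
    return centers, dens
end

centers, dens = ash1d(randn(10_000); delta=0.05, m=5)
```

Once the grid is fixed, only the `counts` vector has to be kept, so new observations can be added by incrementing a counter. That is the sense in which this kind of estimator can use constant memory and be updated on-line.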

juliohm · September 17, 2017, 3:20pm

For KernelDensity.jl, I was concerned about the theoretical aspect of it. Is it common practice to use spline interpolation for evaluating the density at new locations? It currently uses Interpolations.jl, but I am not sure this is the correct thing to do. Plus, I need to evaluate the KDE object thousands of times, and my experience with the package is that it is somewhat slow. I will double check my code to make sure the slowness is coming from the KDE part and not somewhere else.

For KernelEstimator.jl, it currently only supports 1D KDEs. My use case requires 2D.

For KernelDensityEstimate.jl, the repository only has 2 stars on GitHub, so I wanted to double check whether it is being used.

juliohm · September 17, 2017, 3:32pm

Thanks @joshday! It looks like a very nice package, both in terms of performance and API! I will definitely give it a try.

Do you recommend a paper on average shifted histograms? Or a publication on which you based your implementation?

joshday · September 17, 2017, 3:43pm

I used some course notes from David Scott (who invented ASH) which I can no longer find online. He has a few papers on the subject.

juliohm · September 17, 2017, 4:08pm

Thanks, I’ve downloaded a bunch of papers by him for my dessert

Tamas_Papp · September 17, 2017, 4:24pm

If you are using kernel densities for data visualization, there is no “correct” approach. Do what you like. If interpolations give a speedup, use them.

If you are using it for nonparametric estimation, then most results are asymptotic anyway. So the spline-interpolated density may not coincide with some theoretical definitions, but this should not matter.

It is hard to say more without some context.
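To make the trade-off between exact evaluation and interpolation concrete, here is a minimal sketch that needs no packages. It is not how KernelDensity.jl is implemented: the names `kde_exact`, `tabulate` and `kde_interp`, the Gaussian kernel, the fixed bandwidth `h`, and the use of linear rather than spline interpolation are all assumptions chosen to keep the example short. The exact estimate costs O(n) per query, while the tabulated version costs O(1) per query after a one-time setup.

```julia
# Standard Gaussian kernel.
gauss(u) = exp(-u^2 / 2) / sqrt(2pi)

# Exact Gaussian KDE at a point t: O(n) work per evaluation.
kde_exact(t, x, h) = sum(gauss((t - xi) / h) for xi in x) / (length(x) * h)

# Tabulate the exact KDE once on a regular grid covering the data plus 4h margins.
function tabulate(x, h; npoints=2048)
    grid = range(minimum(x) - 4h, maximum(x) + 4h; length=npoints)
    return grid, [kde_exact(t, x, h) for t in grid]
end

# Answer later queries by linear interpolation between grid nodes: O(1) per evaluation.
function kde_interp(t, grid, vals)
    (t <= first(grid) || t >= last(grid)) && return 0.0  # treat the density as zero off-grid
    s = (t - first(grid)) / step(grid)
    k = min(floor(Int, s) + 1, length(vals) - 1)         # index of the left grid node
    w = s - (k - 1)                                      # fractional position between nodes
    return (1 - w) * vals[k] + w * vals[k + 1]
end

x = randn(1_000)
h = 0.25
grid, vals = tabulate(x, h)
err = abs(kde_exact(0.1, x, h) - kde_interp(0.1, grid, vals))
```

With a grid this fine, `err` should be small compared with the sampling variability of the estimate itself, which is one way to check on your own data whether the interpolation error matters.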

juliohm · September 17, 2017, 4:31pm

I agree that results are always dependent on the number of samples, but this doesn’t mean we can play with arbitrary algorithms. When I think kernel density estimation, I don’t think of visualization. If I was thinking of visualization, I would use splines.

Tamas_Papp · September 17, 2017, 4:48pm

I am under the impression that it pretty much does, as long as we have asymptotic results and the differences vanish asymptotically. People do this all the time: they derive new algorithms with different small-sample properties and argue about their merits, while preserving the same asymptotics.

The more you know (or are willing to assume) about your estimated density, the more guidance the literature gives you about nonparametric estimation. But again, most of those results are based on heuristics, MC or theoretical results for a particular distribution, or asymptotics.

Kernel densities are commonly used for data visualization and exploration. Probably one of the best tools for that.

Lanfeng · September 17, 2017, 5:05pm

juliohm: “For KernelEstimator.jl, it currently only supports 1D KDEs. My use case requires 2D.”

KernelEstimator.jl does support multivariate kernel density estimation and multivariate local constant regression.
The only issue is bandwidth selection for the multivariate kernel density. Users are responsible for choosing the right bandwidth. The default bandwidth is chosen via likelihood cross validation, which has some known disadvantages. I don’t have a good solution yet. Comments and pull requests are welcome.

Regression and univariate KDE do not suffer from this problem. Their bandwidths are chosen by least squares cross validation.
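For readers unfamiliar with likelihood cross validation, the following is a minimal univariate sketch of the idea, not KernelEstimator.jl's code. The function names `loo_loglik` and `lcv_bandwidth`, the Gaussian kernel, and the search over a fixed list of candidate bandwidths are assumptions made for illustration. Each point is scored by the density estimated from all the other points, and the bandwidth with the highest total log-likelihood wins.

```julia
# Standard Gaussian kernel.
gauss(u) = exp(-u^2 / 2) / sqrt(2pi)

# Leave-one-out log-likelihood of a Gaussian KDE with bandwidth h.
function loo_loglik(x, h)
    n = length(x)
    total = 0.0
    for i in 1:n
        s = 0.0
        for j in 1:n
            j == i && continue                 # leave observation i out
            s += gauss((x[i] - x[j]) / h)
        end
        total += log(s / ((n - 1) * h))        # log density at x[i] from the other points
    end
    return total
end

# Return the candidate bandwidth with the highest leave-one-out log-likelihood.
function lcv_bandwidth(x, candidates)
    scores = [loo_loglik(x, h) for h in candidates]
    return candidates[argmax(scores)]
end

x = randn(200)
h = lcv_bandwidth(x, 0.05:0.05:1.0)
```

The double loop is O(n²) per candidate bandwidth, so this is only practical for small samples as written.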

juliohm · September 17, 2017, 7:11pm

Thank you @Lanfeng! I think the README is outdated:

“The Julia package for nonparametric kernel density estimate and regression. This package currently includes univariate kernel density estimate, local constant regression (Nadaraya-Watson regression) and local linear regression. It can also compute the Bootstrap confidence band [4].”

I will give it a try as well, thanks for the package!

Elrod · September 18, 2017, 12:14am

julia> using KernelDensity, BenchmarkTools

julia> x = randn(100_000);

julia> d = kde(x);

julia> @benchmark pdf($d, 0.1)
BenchmarkTools.Trial:
  memory estimate:  265.28 KiB
  allocs estimate:  164
  --------------
  minimum time:     612.651 µs (0.00% GC)
  median time:      652.279 µs (0.00% GC)
  mean time:        700.002 µs (4.56% GC)
  maximum time:     3.875 ms (78.17% GC)
  --------------
  samples:          7117
  evals/sample:     1

julia> itp_d = InterpKDE(d);

julia> @benchmark pdf($itp_d, 0.1)
BenchmarkTools.Trial:
  memory estimate:  16 bytes
  allocs estimate:  1
  --------------
  minimum time:     40.197 ns (0.00% GC)
  median time:      41.052 ns (0.00% GC)
  mean time:        42.637 ns (1.10% GC)
  maximum time:     769.449 ns (90.14% GC)
  --------------
  samples:          10000
  evals/sample:     1000

juliohm · September 18, 2017, 12:49am

Thanks @Elrod, I am using 2D KDEs though. In any case, I am looking forward to the new release of AverageShiftedHistograms.jl that @joshday is working on.

The papers by David Scott are really great. ASH is very powerful: it has statistical properties comparable to those of kernel-based estimators and is as fast as simple frequency histograms. The best of both worlds. It has become my favorite density estimation technique today.

Nectarineimp · September 18, 2017, 8:04pm

I will have to check that out for myself! I found the other packages slow for my taste. With 10,000 observations it took something like 10 or 30 seconds; I forget exactly what it was, but I had millions of observations. I ended up randomly sampling 10,000 observations so it wouldn’t kill my entire process.

Thanks for sharing this.

Nectarineimp · September 20, 2017, 6:47pm

OK, it looks like AverageShiftedHistograms isn’t working well under Julia 0.6. At least the example code I tried is broken. It also gives quite a few deprecation warnings. I will try to find some time in the next two weeks to do some PRs to help make it 0.6 (and hopefully 1.0) compatible. FYI: @joshday

juliohm · September 20, 2017, 6:49pm

@Nectarineimp, you need to check out the master branch; @joshday is still working on a new release.

Nectarineimp · September 20, 2017, 6:54pm

OK that sounds great. Maybe he has some particular tasks he wants help with. I’m all for volunteering for this. I used KDE quite a bit in the past.

joshday · September 20, 2017, 8:14pm

If there’s any missing functionality you want added, PRs are gladly welcome! There is an open issue for a pdf method for bivariate Ash if you’re up for tackling that.
