Kernel density estimation status

question
package

#1

What is the de facto standard package for KDE in Julia?

My attempts with the packages listed above were not very successful.


#2

I have used KernelDensity.jl recently. The following runs with the current version just fine:

using KernelDensity
using Plots; gr()
x = randn(100)
y = kde(x)
# linspace was removed in Julia 1.0; range is its replacement
plot(range(extrema(x)...; length=100), z -> pdf(y, z))

Being more specific than “not successful” would make it easier to help you. What did you try? What happened? How does it not conform to your expectations?


#3

I have a function that does basic multivariate local constant, linear, and quadratic nonparametric fitting in https://github.com/mcreel/Econometrics.jl; just call npreg() to see an example. I use this code for teaching and will try to make it reasonably accessible, though maybe not fully featured.

In the package https://github.com/mcreel/QuantileIV.jl there is an extended version which does nonparametric quantile regression, using code from https://github.com/pkofod/QuantileRegression.jl. The stuff in my QuantileIV.jl package needs to be cleaned up a bit for public use, but it works well for me.


#4

The first Julia package I wrote was https://github.com/joshday/AverageShiftedHistograms.jl, which essentially does KDE over a fine-partition histogram. It’s cool if you’re working with large datasets, as it uses constant memory and can be estimated on-line.
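To make the idea concrete, here is a minimal univariate sketch of an averaged shifted histogram using only Base Julia. It is not the AverageShiftedHistograms.jl API: the function name `ash_density` and the parameters `nbins` and `m` are made up for illustration, and a triangular smoothing weight is assumed. The data are touched only once, to fill a fixed number of bin counts, which is why memory stays constant and the counts can be updated as new observations arrive.

```julia
# Minimal univariate ASH sketch (illustrative; not the package implementation).
# `nbins` fine bins of width δ are smoothed over m neighbours on each side,
# which corresponds to a bandwidth h = m * δ.
function ash_density(x::AbstractVector{<:Real}; nbins::Int = 200, m::Int = 5)
    nbins > 2m || throw(ArgumentError("nbins must exceed 2m"))
    lo, hi = extrema(x)
    hi > lo || throw(ArgumentError("data must not be constant"))

    # Leave m empty bins on each side so smoothing never runs off the grid.
    δ = (hi - lo) / (nbins - 2m)
    a = lo - m * δ

    # Single pass over the data: only the bin counts are kept.
    counts = zeros(Int, nbins)
    for xi in x
        k = clamp(floor(Int, (xi - a) / δ) + 1, 1, nbins)
        counts[k] += 1
    end

    # Smooth the counts with triangular weights w(i) = 1 - |i|/m.
    n = length(x)
    h = m * δ
    dens = zeros(nbins)
    for k in 1:nbins, i in -(m - 1):(m - 1)
        j = k + i
        if 1 <= j <= nbins
            dens[k] += (1 - abs(i) / m) * counts[j]
        end
    end
    dens ./= n * h

    centers = [a + (k - 0.5) * δ for k in 1:nbins]
    return centers, dens, δ
end

centers, dens, δ = ash_density(randn(10_000))
```

The returned `dens` values are density estimates at the bin `centers`; summing `dens .* δ` should give 1 up to rounding, because the triangular weights add up to `m`.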


#5

For KernelDensity.jl, I was concerned about the theoretical side. Is it common practice to use spline interpolation to evaluate the density at new locations? The package currently uses Interpolations.jl, and I am not sure that is the correct thing to do. I also need to evaluate the KDE object thousands of times, and in my experience the package is somewhat slow. I will double-check my code to make sure the slowness comes from the KDE part and not from somewhere else.
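For anyone weighing the same trade-off, the sketch below contrasts the two ways of evaluating a density at a new location, using only Base Julia. It is not how KernelDensity.jl is implemented; `kde_exact`, `GridKDE`, `build_grid_kde`, and `interp_pdf` are hypothetical names, a Gaussian kernel with a fixed bandwidth `h` is assumed, and plain linear interpolation stands in for splines.

```julia
# Exact Gaussian KDE: every query sums over all n observations.
gauss(u) = exp(-u^2 / 2) / sqrt(2π)
kde_exact(x, h, t) = sum(gauss((t - xi) / h) for xi in x) / (length(x) * h)

# Grid-based alternative: pay the O(n) cost once per grid node,
# then answer each query by interpolating between two stored values.
struct GridKDE
    lo::Float64            # left end of the grid
    dx::Float64            # grid spacing
    vals::Vector{Float64}  # density at the grid nodes
end

function build_grid_kde(x, h; npts::Int = 2048, pad = 4h)
    lo, hi = minimum(x) - pad, maximum(x) + pad
    grid = range(lo, hi; length = npts)
    return GridKDE(lo, step(grid), [kde_exact(x, h, t) for t in grid])
end

function interp_pdf(g::GridKDE, t::Real)
    s = (t - g.lo) / g.dx                        # fractional grid position
    (s < 0 || s > length(g.vals) - 1) && return 0.0
    i = min(floor(Int, s), length(g.vals) - 2)   # left node (0-based)
    w = s - i
    return (1 - w) * g.vals[i + 1] + w * g.vals[i + 2]
end

x = randn(1_000)
g = build_grid_kde(x, 0.3)
interp_pdf(g, 0.1), kde_exact(x, 0.3, 0.1)       # should be very close
```

With a fine enough grid the interpolation error is far smaller than the sampling error of the estimate itself, so the interpolated version mainly buys speed when the density is evaluated many times.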

For KernelEstimator.jl, it currently only supports 1D KDEs. My use case requires 2D.

For KernelDensityEstimate.jl, the repository has only 2 stars on GitHub, so I wanted to double-check whether it is being used.


#6

Thanks @joshday! It looks like a very nice package, both in terms of performance and API! I will definitely give it a try.

Do you recommend a paper on average shifted histograms? Or a publication on which you based your implementation?


#7

I used some course notes from David Scott (who invented ASH) which I can no longer find online. He has a few papers on the subject.


#8

Thanks, I’ve downloaded a bunch of papers by him for my dessert 🙂


#9

If you are using kernel densities for data visualization, there is no “correct” approach. Do what you like. If interpolations give a speedup, use them.

If you are using it for nonparametric estimation, then most results are asymptotic anyway. So the spline-interpolated density may not coincide with some theoretical definitions, but this should not matter.

It is hard to say more without some context.


#10

I agree that results are always dependent on the number of samples, but this doesn’t mean we can play with arbitrary algorithms. When I think kernel density estimation, I don’t think of visualization. If I was thinking of visualization, I would use splines.


#11

I am under the impression that it pretty much does, as long as we have asymptotic results and the differences vanish asymptotically. People do this all the time: they derive new algorithms with different small-sample properties and argue about their merits, while preserving the same asymptotics.

The more you know (or are willing to assume) about your estimated density, the more guidance the literature gives you about nonparametric estimation. But again, most of those results are based on heuristics, MC or theoretical results for a particular distribution, or asymptotics.

Kernel densities are commonly used for data visualization and exploration. Probably one of the best tools for that.


#12

“For KernelEstimator.jl, it currently only supports 1D KDEs. My use case requires 2D.”

KernelEstimator.jl does support multivariate kernel density estimation and multivariate local constant regression.
The only issue is bandwidth selection for the multivariate kernel density: the user is responsible for choosing the right bandwidth. The default bandwidth is chosen via likelihood cross-validation, which has some known disadvantages. I don’t have a good solution yet. Comments and pull requests are welcome.

Regression and univariate KDE do not suffer from this problem. Their bandwidths are chosen by least-squares cross-validation.
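For readers unfamiliar with likelihood cross-validation, here is a rough univariate sketch in Base Julia. It is not KernelEstimator.jl's code: `loo_loglik` and `lcv_bandwidth` are hypothetical names, a Gaussian kernel is assumed, and a simple search over candidate bandwidths replaces whatever optimizer the package uses. Each observation is scored under the density estimated from all the other observations, and the bandwidth with the highest total log-likelihood wins.

```julia
gauss(u) = exp(-u^2 / 2) / sqrt(2π)

# Leave-one-out log-likelihood of bandwidth h: each x[i] is evaluated
# under the kernel density built from the remaining n - 1 points.
function loo_loglik(x, h)
    n = length(x)
    total = 0.0
    for i in 1:n
        s = 0.0
        for j in 1:n
            j == i && continue
            s += gauss((x[i] - x[j]) / h)
        end
        total += log(s / ((n - 1) * h))
    end
    return total
end

# Pick the candidate bandwidth with the highest leave-one-out score.
function lcv_bandwidth(x, candidates)
    scores = [loo_loglik(x, h) for h in candidates]
    return candidates[argmax(scores)]
end

x = randn(200)
h = lcv_bandwidth(x, range(0.05, 1.0; length = 40))
```

One commonly cited weakness shows up directly in this form: an isolated observation contributes a very negative `log` term unless `h` is large, so outliers and heavy tails can push the selected bandwidth upward.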


#13

Thank you @Lanfeng! I think the README is outdated:

The Julia package for nonparametric kernel density estimate and regression. This package currently includes univariate kernel density estimate, local constant regression (Nadaraya-Watson regression) and local linear regression. It can also compute the Bootstrap confidence band [4].

I will give it a try as well, thanks for the package!


#14

julia> using KernelDensity, BenchmarkTools

julia> x = randn(100_000);

julia> d = kde(x);

julia> @benchmark pdf($d, 0.1)
BenchmarkTools.Trial:
  memory estimate:  265.28 KiB
  allocs estimate:  164
  --------------
  minimum time:     612.651 µs (0.00% GC)
  median time:      652.279 µs (0.00% GC)
  mean time:        700.002 µs (4.56% GC)
  maximum time:     3.875 ms (78.17% GC)
  --------------
  samples:          7117
  evals/sample:     1

julia> itp_d = InterpKDE(d);

julia> @benchmark pdf($itp_d, 0.1)
BenchmarkTools.Trial:
  memory estimate:  16 bytes
  allocs estimate:  1
  --------------
  minimum time:     40.197 ns (0.00% GC)
  median time:      41.052 ns (0.00% GC)
  mean time:        42.637 ns (1.10% GC)
  maximum time:     769.449 ns (90.14% GC)
  --------------
  samples:          10000
  evals/sample:     1000


#15

Thanks @Elrod, I am using 2D KDEs though. In any case, I am looking forward to the new release of AverageShiftedHistograms.jl that @joshday is working on.

The papers by David Scott are really great. ASH is very powerful: its statistical properties are comparable to those of kernel-based estimators, and it is as fast as simple frequency histograms. The best of both worlds. It has become my favorite density estimation technique.


#16

I will have to check that out for myself! I found the other packages too slow for my taste: with 10,000 observations they took something like 10 or 30 seconds. I forget the exact figure, but I had millions of observations. I ended up randomly sampling 10,000 observations so it wouldn’t kill my entire process.

Thanks for sharing this.


#17

OK, it looks like AverageShiftedHistograms isn’t working well under Julia 0.6. At least the example code I tried is broken, and it also gives quite a few deprecation warnings. I will try to find some time in the next two weeks to do some PRs to help make it 0.6 (and hopefully 1.0) compatible. FYI @joshday


#18

@Nectarineimp, you need to check out the master branch; @joshday is still working on a new release.


#19

OK that sounds great. Maybe he has some particular tasks he wants help with. I’m all for volunteering for this. I used KDE quite a bit in the past.


#20

If there’s any missing functionality you want added, PRs are gladly welcome! There is an open issue for a pdf method for bivariate Ash if you’re up for tackling that.