Use a Histogram as a function

I would like to see built-in functionality that allows me to use a Histogram as a function that returns the weight of the histogram at a given point in state space. The closest thing i can find is the kde() function from KernelDensity.jl which produces a distribution, say p , from data d via p=kde(d) . The distribution can then be evaluated at a point x in state space via pdf(p,x) . However, kde() takes too long for my purposes, but h=fit(Histogram,d) is rather quick in comparison. Furthermore, my state space is multivariate, so kde() will only let me approximate the marginal distributions while fit(Histogram,d) lets me approximate multivariate distributions. It would be great if i could take h and compute the weight at x via something like pdf(h,x) . I understand Histograms come with edges and weights, so all the necessary ingredients are there.

The reason i want this is for fast Bayesian inference. I’m using an MCMC-like approach to approximate the likelihood surface. Instead of producing a function that approximates the likelihood, this approach produces a sample whose histogram is proportional to an approximation of the likelihood.

You might be interested in my answer here: histogram2d - Return the frequency in a bin of a 2D histogram in Julia - Stack Overflow

If you want to make this built-in probably best to do a PR to StatsBase

3 Likes

Thanks for the quick reply. Your get_freq() function looks like exactly what I need. Do you think this is the fastest way to perform this task in julia?

How do i do a PR to StatsBase?

Not sure it’s the fastest, but it’s surely orders of magnitude faster than what you were doing:

using StatsBase, BenchmarkTools

# Example data 
data = (randn(10_000), randn(10_000))

h2d = fit(Histogram, data)

function get_freq(h, xval, yval)
    x = searchsortedfirst(h.edges[1], xval)
    y = searchsortedfirst(h.edges[2], yval)
    h.weights[x, y]
end

@btime fit(Histogram, data) # 929.900 μs (3 allocations: 1.47 KiB)

@btime get_freq(h2d, 1.4, 0.6) #  70.832 ns (0 allocations: 0 bytes)

using KernelDensity

p = kde(data)

@btime kde(data) # 4.011 ms (166 allocations: 2.67 MiB)

@btime pdf(p, 1.4, 0.6) # 1.717 ms (208 allocations: 1.67 MiB)

Maybe try and profile this within your actual application, if it turns out to be a bottleneck I’m sure there are people on this forum that will manage to beat my suggestion quite handily :slight_smile:

As for the PR, maybe start off by creating an issue on the StatsBase repo to see what the maintainers think about having this functionality, how it should integrate into the existing API etc.

2 Likes

thanks for your suggestions! this is very helpful.