Use a Histogram as a function

bobweek · October 18, 2019, 7:01pm

I would like to see built-in functionality that lets me use a Histogram as a function returning the weight of the histogram at a given point in state space. The closest thing I can find is the kde() function from KernelDensity.jl, which produces a distribution p from data d via p = kde(d). The distribution can then be evaluated at a point x in state space via pdf(p, x). However, kde() takes too long for my purposes, whereas h = fit(Histogram, d) is rather quick in comparison. Furthermore, my state space is multivariate, so kde() will only let me approximate the marginal distributions, while fit(Histogram, d) lets me approximate multivariate distributions. It would be great if I could take h and compute the weight at x via something like pdf(h, x). I understand Histograms come with edges and weights, so all the necessary ingredients are there.
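To make the request concrete, here is a minimal sketch of what such a lookup could look like for a histogram of any dimension. The name hist_weight is hypothetical (it is not part of StatsBase), and the sketch assumes the default left-closed bins, so each point is assigned to the bin whose left edge is the last one not exceeding it. Points outside the binned region get a weight of zero.

```julia
using StatsBase

# Hypothetical helper, not a StatsBase function: return the weight of the bin
# containing `x`, or zero when `x` falls outside the histogram's edges.
# Assumes left-closed bins [edge_i, edge_{i+1}), the default for fit(Histogram, ...).
function hist_weight(h::Histogram{T,N}, x::NTuple{N,Real}) where {T,N}
    # One binary search per dimension over that dimension's edges.
    idx = ntuple(d -> searchsortedlast(h.edges[d], x[d]), Val(N))
    checkbounds(Bool, h.weights, idx...) ? h.weights[idx...] : zero(T)
end

d = (randn(10_000), randn(10_000), randn(10_000))  # three-dimensional sample
h = fit(Histogram, d)

hist_weight(h, (0.1, -0.3, 0.2))    # count in the bin containing this point
hist_weight(h, (100.0, 0.0, 0.0))   # 0, because the point lies outside every bin
```

The weights returned here are raw counts, so they are only proportional to a density.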

The reason I want this is for fast Bayesian inference. I’m using an MCMC-like approach to approximate the likelihood surface. Instead of producing a function that approximates the likelihood, this approach produces a sample whose histogram is proportional to an approximation of the likelihood.
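Since the histogram is only proportional to the target, it may help to rescale the weights so they behave like a density. A rough sketch, assuming StatsBase's normalize method for Histogram (which extends LinearAlgebra.normalize) with mode = :pdf, and assuming evenly spaced bins as produced by the default fit; the variable names are illustrative only:

```julia
using StatsBase, LinearAlgebra

draws = (randn(10_000), randn(10_000))  # stand-in for the MCMC-like sample
h = fit(Histogram, draws)

# Rescale counts so the weights integrate to one over the binned region.
hpdf = normalize(h, mode = :pdf)

# Sanity check: sum of (weight × bin area) should be close to 1.
# step() assumes the edges are evenly spaced ranges.
binarea = step(hpdf.edges[1]) * step(hpdf.edges[2])
total = sum(hpdf.weights) * binarea
```

For inference that only needs ratios of likelihoods, the raw counts would serve equally well, so this step is optional.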

nilshg · October 18, 2019, 7:30pm

You might be interested in my Stack Overflow answer to "Return the frequency in a bin of a 2D histogram in Julia".

If you want to make this built-in, it's probably best to open a PR against StatsBase.

bobweek · October 20, 2019, 8:53pm

Thanks for the quick reply. Your get_freq() function looks like exactly what I need. Do you think this is the fastest way to perform this task in Julia?

How do I open a PR against StatsBase?

nilshg · October 21, 2019, 9:05am

Not sure it’s the fastest, but it’s surely orders of magnitude faster than what you were doing:

using StatsBase, BenchmarkTools

# Example data 
data = (randn(10_000), randn(10_000))

h2d = fit(Histogram, data)

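function get_freq(h, xval, yval)
    # searchsortedlast finds the last edge <= the value, i.e. the left edge of
    # the containing bin (bins are left-closed by default); searchsortedfirst
    # would return the neighbouring bin instead
    x = searchsortedlast(h.edges[1], xval)
    y = searchsortedlast(h.edges[2], yval)
    h.weights[x, y]  # throws a BoundsError if the point is outside the edges
end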

@btime fit(Histogram, data) # 929.900 μs (3 allocations: 1.47 KiB)

@btime get_freq(h2d, 1.4, 0.6) #  70.832 ns (0 allocations: 0 bytes)

using KernelDensity

p = kde(data)

@btime kde(data) # 4.011 ms (166 allocations: 2.67 MiB)

@btime pdf(p, 1.4, 0.6) # 1.717 ms (208 allocations: 1.67 MiB)

Maybe try profiling this within your actual application. If it turns out to be a bottleneck, I’m sure there are people on this forum who will manage to beat my suggestion quite handily.
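One variant worth including in such a comparison is StatsBase's own bin lookup. The sketch below assumes the unexported function StatsBase.binindex is available in the installed version and respects the histogram's closed setting; get_weight is a hypothetical name, and out-of-range points are mapped to zero rather than raising an error.

```julia
using StatsBase

data = (randn(10_000), randn(10_000))
h2d = fit(Histogram, data)

# Hypothetical wrapper. StatsBase.binindex is unexported, so its availability
# is an assumption about the installed StatsBase version.
function get_weight(h::Histogram, x::Tuple)
    idx = StatsBase.binindex(h, x)
    checkbounds(Bool, h.weights, idx...) ? h.weights[idx...] : zero(eltype(h.weights))
end

get_weight(h2d, (1.4, 0.6))
```

Timing it with @btime get_weight($h2d, (1.4, 0.6)) would show whether it is competitive with the hand-written search on your machine.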

As for the PR, maybe start off by creating an issue on the StatsBase repo to see what the maintainers think about having this functionality, how it should integrate into the existing API, etc.

bobweek · October 21, 2019, 4:51pm

Thanks for your suggestions! This is very helpful.
