# Speeding up `fit(Histogram)`

As an offshoot of a thread about Makie histograms, I’ve tried to benchmark StatsBase `fit(Histogram)` like so:

``````julia
julia> using StatsBase, BenchmarkTools, Random  # @btime needs BenchmarkTools, seed! needs Random

julia> @btime fit(Histogram, v, 0.0:0.1:1.0) setup=(Random.seed!(1234); v = rand(1_000_000))
31.720 ms (2 allocations: 224 bytes)
Histogram{Int64, 1, Tuple{StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.TwicePrecision{Float64}, Int64}}}
edges:
0.0:0.1:1.0
weights: [99542, 99942, 99836, 100537, 99777, 99271, 99582, 100479, 100476, 100558]
closed: left
isdensity: false
``````

As a comparison point, here is a simple `for`-loop implementation:

``````julia
julia> function fastest_hist(v, r)
           hist = zeros(Int, length(r)-1)
           f, s, l = first(r), step(r), last(r)
           for i in eachindex(v)
               a = v[i]
               a < f && (hist[1] += 1; continue)
               a >= l && (hist[end] += 1; continue)
               hist[1+floor(Int, (a - f)/s)] += 1
           end
           hist
       end
fastest_hist (generic function with 1 method)

julia> @btime fastest_hist(v, 0.0:0.1:1.0) setup=(Random.seed!(1234); v = rand(1_000_000))
8.933 ms (1 allocation: 144 bytes)
10-element Vector{Int64}:
99542
99942
99836
100537
99777
99271
99582
100479
100476
100558
``````

The gap in performance on my machine is about 3x. Is there a way to improve performance of `fit(Histogram)` or the way I’ve used it?
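One more tweak worth trying on the loop above (a sketch, not benchmarked here; the function name is mine): hoisting the reciprocal of the step out of the loop replaces a per-element division with a multiplication, which is often cheaper on uniform-range bins.

``````julia
# Variant of the loop above for uniform-range bins: multiply by the
# precomputed 1/step instead of dividing inside the hot loop.
function fast_hist_invstep(v, r)
    hist = zeros(Int, length(r) - 1)
    f, l = first(r), last(r)
    inv_s = inv(step(r))   # one division up front, multiplications thereafter
    @inbounds for a in v
        if a < f
            hist[1] += 1           # underflow goes to the first bin
        elseif a >= l
            hist[end] += 1         # overflow goes to the last bin
        else
            hist[1 + floor(Int, (a - f) * inv_s)] += 1
        end
    end
    hist
end
``````

Note that `(a - f) * inv_s` and `(a - f) / s` can round differently for values sitting exactly on a bin edge, so the two versions are not guaranteed to agree bit-for-bit.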

Maybe check out FHist.jl?


Yep, it’s even faster than the `for` loop version:

``````julia
julia> using FHist

julia> @btime Hist1D(v, 0.0:0.1:1.0) setup=(Random.seed!(1234); v = rand(1_000_000))
6.645 ms (6 allocations: 544 bytes)
edges: 0.0:0.1:1.0
bin counts: [99542, 99942, 99836, 100537, 99777, 99271, 99582, 100479, 100476, 100558]
total count: 1000000
``````

That’s really good. Maybe it should replace the StatsBase implementation.

Slightly off-topic - I think the JuliaStats ecosystem has matured nicely, but it also needs new contributors to forge the way forward. The challenge has always been that the existing owners are too busy to review everything, and they do need to keep the ecosystem working.

This would be exactly the kind of thing that can help bring new owners for these core packages. I’ll be happy to help make that happen.


`convert()` should already work, so you can take my histogram, discard some information, and go back to `StatsBase.Histogram`.

My understanding is that HEP use cases are fairly fringe in the grand scheme of things; for example, in addition to the bin count (which can be weighted), we also keep track of:

• sum of wgt^2 in each bin
• total number of times this histogram has been filled
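The extra bookkeeping described above can be sketched for uniform bins like this (my own illustrative function, not FHist.jl's implementation): alongside each bin's weighted count we accumulate the sum of squared weights, so the per-bin statistical error can be estimated as `sqrt(sumw2)`.

``````julia
# Weighted histogram sketch: track sum of weights and sum of squared
# weights per bin, for uniform bins given by the range `r`.
function weighted_hist(v, w, r)
    nbins = length(r) - 1
    sumw  = zeros(Float64, nbins)
    sumw2 = zeros(Float64, nbins)
    f, s, l = first(r), step(r), last(r)
    for (x, wt) in zip(v, w)
        f <= x < l || continue          # drop over/underflow in this sketch
        b = 1 + floor(Int, (x - f) / s)
        sumw[b]  += wt
        sumw2[b] += wt^2
    end
    sumw, sumw2
end
``````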

I don’t think it is so fringe. HEP practitioners might be fringe, but the statistics/computation needs of different fields converge surprisingly often.
I agree that keeping the squared weights is important in a lot of use cases. Choosing the right bins is also interesting: it takes a complete pass over the data, and in some cases even partial sorting (e.g. to calculate the inter-quartile range).

Maybe we should delineate the algorithms according to memory complexity:

• on-line algos (dynamic histogram)
• 1-pass algos
• 2-pass algos
• random access…
And have a histogram version for each of these complexities.
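The "on-line" flavour above could look something like this minimal sketch (the `StreamHist` name and fields are hypothetical, for illustration only): a mutable histogram with fixed uniform edges that is filled one observation at a time, keeping a total fill count as mentioned earlier in the thread.

``````julia
# On-line (streaming) histogram sketch: fixed uniform edges, filled
# observation by observation via push!.
mutable struct StreamHist
    edges::AbstractRange{Float64}
    counts::Vector{Int}
    nfills::Int                      # total number of push!es, including out-of-range
end
StreamHist(r) = StreamHist(r, zeros(Int, length(r) - 1), 0)

function Base.push!(h::StreamHist, x)
    h.nfills += 1
    f, s, l = first(h.edges), step(h.edges), last(h.edges)
    # Out-of-range values are counted in nfills but dropped from the bins.
    f <= x < l && (h.counts[1 + floor(Int, (x - f) / s)] += 1)
    h
end
``````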

That’s really a bummer, seeing all the effort you have put in.

I also work in a field where histograms are breakfast, lunch and dinner (astroparticle physics), and I do everything histogram-related with `FHist.jl`; it’s awesome. I wish JuliaStats/StatsBase would appreciate the need for a proper, fast implementation.

The first exists in: GitHub - joshday/OnlineStats.jl: ⚡ Single-pass algorithms for statistics

and for other binning, we actually have a cool Bayesian blocks implementation: https://github.com/Moelf/FHist.jl/blob/fd65d4b93f4d4bfca07f4f3bec481c7650c9f1db/src/hist1d.jl#L333

Anyway, in HEP you actually make a histogram and `push!()` into it event by event most of the time… which is why I initially missed this optimization, and even now I don’t have it for 2D and 3D histograms.

In an ideal world I’d love to have the best histogram ecosystem in Julia; there are many interesting things going on.


Another niche where it’s important to be best is `countmap`, also in StatsBase. Investing in StatsBase is important to draw in and keep power users.
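For readers unfamiliar with it, `countmap` is StatsBase’s discrete analogue of a histogram: it tallies exact values into a `Dict` rather than binning into ranges.

``````julia
using StatsBase

# countmap tallies how many times each distinct value occurs.
cm = countmap([1, 1, 2, 3, 3, 3])
# cm is a Dict mapping 1 => 2, 2 => 1, 3 => 3
``````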
