# Speeding up `fit(Histogram)`

As an offshoot of a thread about Makie histograms, I’ve tried to benchmark StatsBase `fit(Histogram)` like so:

``````julia
julia> using StatsBase, BenchmarkTools, Random  # @btime needs BenchmarkTools, seed! needs Random

julia> @btime fit(Histogram, v, 0.0:0.1:1.0) setup=(Random.seed!(1234); v = rand(1_000_000))
31.720 ms (2 allocations: 224 bytes)
Histogram{Int64, 1, Tuple{StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.TwicePrecision{Float64}, Int64}}}
edges:
0.0:0.1:1.0
weights: [99542, 99942, 99836, 100537, 99777, 99271, 99582, 100479, 100476, 100558]
closed: left
isdensity: false
``````

As a comparison point, here is a simple `for`-loop implementation:

``````julia
julia> function fastest_hist(v, r)
           hist = zeros(Int, length(r)-1)
           f, s, l = first(r), step(r), last(r)
           for i in eachindex(v)
               a = v[i]
               a < f && (hist[1] += 1; continue)
               a >= l && (hist[end] += 1; continue)
               hist[1+floor(Int, (a - f)/s)] += 1
           end
           hist
       end
fastest_hist (generic function with 1 method)

julia> @btime fastest_hist(v, 0.0:0.1:1.0) setup=(Random.seed!(1234); v = rand(1_000_000))
8.933 ms (1 allocation: 144 bytes)
10-element Vector{Int64}:
99542
99942
99836
100537
99777
99271
99582
100479
100476
100558
``````

The gap in performance on my machine is about 3x. Is there a way to improve performance of `fit(Histogram)` or the way I’ve used it?
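One more tweak worth trying on the loop above (a sketch, not benchmarked here; the function name is mine): hoisting the reciprocal of the step out of the loop replaces a per-element division with a multiplication, which is often cheaper on uniform-range bins.

``````julia
# Variant of the loop above for uniform-range bins: multiply by the
# precomputed 1/step instead of dividing inside the hot loop.
function fast_hist_invstep(v, r)
    hist = zeros(Int, length(r) - 1)
    f, l = first(r), last(r)
    inv_s = inv(step(r))   # one division up front, multiplications thereafter
    @inbounds for a in v
        if a < f
            hist[1] += 1           # underflow goes to the first bin
        elseif a >= l
            hist[end] += 1         # overflow goes to the last bin
        else
            hist[1 + floor(Int, (a - f) * inv_s)] += 1
        end
    end
    hist
end
``````

Note that `(a - f) * inv_s` and `(a - f) / s` can round differently for values sitting exactly on a bin edge, so the two versions are not guaranteed to agree bit-for-bit.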

Maybe check out FHist.jl?


Yep, it’s even faster than the `for` loop version:

``````julia
julia> using FHist

julia> @btime Hist1D(v, 0.0:0.1:1.0) setup=(Random.seed!(1234); v = rand(1_000_000))
6.645 ms (6 allocations: 544 bytes)
edges: 0.0:0.1:1.0
bin counts: [99542, 99942, 99836, 100537, 99777, 99271, 99582, 100479, 100476, 100558]
total count: 1000000
``````

That’s really good. Maybe it should replace the StatsBase implementation.

Slightly off-topic - I think the JuliaStats ecosystem has matured nicely, but it also needs new contributors to forge the way forward. The challenge has always been that the existing owners are too busy to review everything, and they do need to keep the ecosystem working.

This would be exactly the kind of thing that can help bring new owners for these core packages. I’ll be happy to help make that happen.


`convert()` should already work, so you can take my histogram, discard some information, and go back to `StatsBase.Histogram`.

My understanding is that HEP use cases are fairly fringe in the grand scheme of things; for example, in addition to the bin count (which can be weighted), we also keep track of:

• sum of wgt^2 in each bin
• total number of times this histogram has been filled
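The extra bookkeeping described above can be sketched for uniform bins like this (my own illustrative function, not FHist.jl's implementation): alongside each bin's weighted count we accumulate the sum of squared weights, so the per-bin statistical error can be estimated as `sqrt(sumw2)`.

``````julia
# Weighted histogram sketch: track sum of weights and sum of squared
# weights per bin, for uniform bins given by the range `r`.
function weighted_hist(v, w, r)
    nbins = length(r) - 1
    sumw  = zeros(Float64, nbins)
    sumw2 = zeros(Float64, nbins)
    f, s, l = first(r), step(r), last(r)
    for (x, wt) in zip(v, w)
        f <= x < l || continue          # drop over/underflow in this sketch
        b = 1 + floor(Int, (x - f) / s)
        sumw[b]  += wt
        sumw2[b] += wt^2
    end
    sumw, sumw2
end
``````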

I don’t think it is so fringe. HEP practitioners might be fringe, but the statistics/computation needs of different fields converge surprisingly often.
I agree that keeping the squared weights is important in a lot of use cases. Choosing the right bins is also interesting: it takes a complete pass over the data, and in some cases even partial sorting (e.g. to calculate the inter-quartile range).

Maybe we should delineate the algorithms according to memory complexity:

• on-line algos (dynamic histogram)
• 1-pass algos
• 2-pass algos
• random access…
And have a histogram version for each of these complexities.
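The "on-line" flavour above could look something like this minimal sketch (the `StreamHist` name and fields are hypothetical, for illustration only): a mutable histogram with fixed uniform edges that is filled one observation at a time, keeping a total fill count as mentioned earlier in the thread.

``````julia
# On-line (streaming) histogram sketch: fixed uniform edges, filled
# observation by observation via push!.
mutable struct StreamHist
    edges::AbstractRange{Float64}
    counts::Vector{Int}
    nfills::Int                      # total number of push!es, including out-of-range
end
StreamHist(r) = StreamHist(r, zeros(Int, length(r) - 1), 0)

function Base.push!(h::StreamHist, x)
    h.nfills += 1
    f, s, l = first(h.edges), step(h.edges), last(h.edges)
    # Out-of-range values are counted in nfills but dropped from the bins.
    f <= x < l && (h.counts[1 + floor(Int, (x - f) / s)] += 1)
    h
end
``````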

That’s really a bummer, seeing all the effort you have put in.

I also work in a field where histograms are breakfast, lunch and dinner (astroparticle physics), and I do everything histogram-related with `FHist.jl`; it’s awesome. I wish JuliaStats/StatsBase would appreciate the need for a proper, fast implementation.

The first exists in: GitHub - joshday/OnlineStats.jl: ⚡ Single-pass algorithms for statistics

and for other binning, we actually have a cool Bayesian blocks implementation: https://github.com/Moelf/FHist.jl/blob/fd65d4b93f4d4bfca07f4f3bec481c7650c9f1db/src/hist1d.jl#L333

Anyway, in HEP you actually make a histogram and `push!()` into it event by event most of the time… which is why I initially missed this optimization, and even now I don’t have it for 2D and 3D histograms.

In an ideal world I’d love to have the best histogram ecosystem in Julia; there are many interesting things going on.


Another niche where it’s important to be best is `countmap`, also in StatsBase. Investing in StatsBase is important to draw in and keep power users.
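For readers unfamiliar with it, `countmap` is StatsBase’s discrete analogue of a histogram: it tallies exact values into a `Dict` rather than binning into ranges.

``````julia
using StatsBase

# countmap tallies how many times each distinct value occurs.
cm = countmap([1, 1, 2, 3, 3, 3])
# cm is a Dict mapping 1 => 2, 2 => 1, 3 => 3
``````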
