Speeding up `fit(Histogram)`

As an offshoot of a thread about Makie histograms, I’ve tried to benchmark StatsBase fit(Histogtam) like so:

julia> using StatsBase

julia> @btime fit(Histogram, v, 0.0:0.1:1.0) setup=(Random.seed!(1234); v = rand(1_000_000))
  31.720 ms (2 allocations: 224 bytes)
Histogram{Int64, 1, Tuple{StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.TwicePrecision{Float64}, Int64}}}
edges:
  0.0:0.1:1.0
weights: [99542, 99942, 99836, 100537, 99777, 99271, 99582, 100479, 100476, 100558]
closed: left
isdensity: false

As a comparison point, here is a for loop simple implementation:

julia> function fastest_hist(v, r)
       hist = zeros(Int, length(r)-1)
       f, s, l = first(r), step(r), last(r)
       for i in eachindex(v)
           a = v[i]
           a < f && (hist[1] += 1; continue)
           a >= l && (hist[end] += 1; continue)
           hist[1+floor(Int, (a - f)/s)] += 1
       end
       hist
       end
fastest_hist (generic function with 1 method)

julia> @btime fastest_hist(v, 0.0:0.1:1.0) setup=(Random.seed!(1234); v = rand(1_000_000))
  8.933 ms (1 allocation: 144 bytes)
10-element Vector{Int64}:
  99542
  99942
  99836
 100537
  99777
  99271
  99582
 100479
 100476
 100558

The gap in performance on my machine is about 3x. Is there a way to improve performance of fit(Histogram) or the way I’ve used it?

Maybe check out this?

7 Likes

Yep, it’s even faster than the for loop version:

julia> @btime Hist1D(v, 0.0:0.1:1.0) setup=(Random.seed!(1234); v = rand(1_000_000))
  6.645 ms (6 allocations: 544 bytes)
edges: 0.0:0.1:1.0
bin counts: [99542, 99942, 99836, 100537, 99777, 99271, 99582, 100479, 100476, 100558]
total count: 1000000

That’s really good. Maybe it should replace the StatsBase implementation.

Slightly off-topic - I think the JuliaStats ecosystem has matured nicely, but also needs some new contributors to forge the way forward. The challenge has always been that existing owners are too busy to review everything and they do need to keep the ecosystem working.

This would be exactly the kind of thing that can help bring new owners for these core packages. I’ll be happy to help make that happen.

2 Likes

gotta go fast: Inline and optimize `push!()` when one-shot histogramming by Moelf · Pull Request #87 · Moelf/FHist.jl · GitHub

1 Like

I tried and failed:

1 Like

convert() should already work, so you can take my histogram and disard some information and go back to StatsBase.Histogram.

My understanding is that hep use cases are very fringe in the grand scheme of things, for example, in addition to the bin count (can be weighted) we also keep track of:

  • sum of wgt^2 in each bin
  • total number of times this histogram has been filled

Don’t think it is so fringe. HEP practitioners might be more fringe, but the statistics/computation needs of every field seem to converge quite surprisingly.
I agree that keeping the square weights is also important in a lot of use cases. Also choosing the right bins is interesting and takes a complete pass on the data and even partial sorting in some cases (i.e. calculating inter-quartile range).

Maybe we should delineate the algorithms according to memory complexity:

  • on-line algos (dynamic histogram)
  • 1-pass algos
  • 2-pass algos
  • random access…
    And have a histogram version for each of these complexities.

That’s really a bummer, seeing all the efforts you have put in.

I also work in a field where histograms are breakfast, lunch and dinner (astroparticle physics) and I do everything related with FHist.jl, it’s awesome. I wish JuliaStats/StatsBase would appreciate the need for a proper and fast implementation.

1 Like

The first exists in: GitHub - joshday/OnlineStats.jl: ⚡ Single-pass algorithms for statistics

and for other binning, we actually have a cool Bayesian block: https://github.com/Moelf/FHist.jl/blob/fd65d4b93f4d4bfca07f4f3bec481c7650c9f1db/src/hist1d.jl#L333

anyway, but in hep actually most of the time you make a histogram and push!() into it event by event… which is why initially I missed this optimization, and even now I don’t have this optimization for 2D and 3D histogram.


In an ideal world I’d love to have the best histogram ecosystem in Julia, there are many interesting things going on:

2 Likes

Another niche important to be best in is the countmap area, also in StatsBase. Investing in StatsBase is important to draw/keep power users.

3 Likes