As a comparison point, here is a for loop simple implementation:
julia> function fastest_hist(v, r)
hist = zeros(Int, length(r)-1)
f, s, l = first(r), step(r), last(r)
for i in eachindex(v)
a = v[i]
a < f && (hist[1] += 1; continue)
a >= l && (hist[end] += 1; continue)
hist[1+floor(Int, (a - f)/s)] += 1
end
hist
end
fastest_hist (generic function with 1 method)
julia> @btime fastest_hist(v, 0.0:0.1:1.0) setup=(Random.seed!(1234); v = rand(1_000_000))
8.933 ms (1 allocation: 144 bytes)
10-element Vector{Int64}:
99542
99942
99836
100537
99777
99271
99582
100479
100476
100558
The gap in performance on my machine is about 3x. Is there a way to improve performance of fit(Histogram) or the way I’ve used it?
Slightly off-topic - I think the JuliaStats ecosystem has matured nicely, but also needs some new contributors to forge the way forward. The challenge has always been that existing owners are too busy to review everything and they do need to keep the ecosystem working.
This would be exactly the kind of thing that can help bring new owners for these core packages. I’ll be happy to help make that happen.
convert() should already work, so you can take my histogram and disard some information and go back to StatsBase.Histogram.
My understanding is that hep use cases are very fringe in the grand scheme of things, for example, in addition to the bin count (can be weighted) we also keep track of:
sum of wgt^2 in each bin
total number of times this histogram has been filled
Don’t think it is so fringe. HEP practitioners might be more fringe, but the statistics/computation needs of every field seem to converge quite surprisingly.
I agree that keeping the square weights is also important in a lot of use cases. Also choosing the right bins is interesting and takes a complete pass on the data and even partial sorting in some cases (i.e. calculating inter-quartile range).
Maybe we should delineate the algorithms according to memory complexity:
on-line algos (dynamic histogram)
1-pass algos
2-pass algos
random access…
And have a histogram version for each of these complexities.
That’s really a bummer, seeing all the efforts you have put in.
I also work in a field where histograms are breakfast, lunch and dinner (astroparticle physics) and I do everything related with FHist.jl, it’s awesome. I wish JuliaStats/StatsBase would appreciate the need for a proper and fast implementation.
anyway, but in hep actually most of the time you make a histogram and push!() into it event by event… which is why initially I missed this optimization, and even now I don’t have this optimization for 2D and 3D histogram.
In an ideal world I’d love to have the best histogram ecosystem in Julia, there are many interesting things going on: