Weighted histogram ~2x as slow in Julia vs. Python

I’m working on a project where I need to compute the weighted histogram of something many times. Basically I need the sum of each of the bins of the histogram, and I’ve been using PyCall to call the SciPy “binned statistic” method until this point as there wasn’t an obvious equivalent I could find in Julia.

While this is fine, it’s always annoyed me, and I recently decided I could just write it myself in native Julia and stop being annoyed by it. So I took this naive stab at it, which works, but is twice as slow as calling the Python version (~30 vs. ~60 ms):

function histSum(x::Array{Float64,},y::Array{Float64,},bins::Int=100) #61 ms
    x = vec(x); y = vec(y)
    binMax = maximum(x); binMin = minimum(x)
    binEdges = range(binMin,stop=binMax,length=bins+1)
    result = zeros(length(binEdges)-1)
    for i=1:bins-1
        result[i] = sum(y[(x .>= binEdges[i]) .& (x .< binEdges[i+1])])
    end
    result[end] = sum(y[(x .>= binEdges[end-1]) .& (x .<= binEdges[end])])
    return binEdges, result
end

I posted the problem on the Julia slack and there were some great suggestions from Dave MacMahon, including this much more elegant solution that uses the built-in StatsBase histogram function:

using StatsBase
function histSum(x::Array,y::Array,bins::Int=100)
    h = fit(Histogram, vec(x), aweights(y), nbins=bins)
    h.edges, h.weights
end

But this way is nearly identical in speed to my first naive way, so still twice as slow as using PyCall. The relevant Python source code for binned_statistic is here, which in the summation case relies on the numpy bincount function which is calling some C code I can’t find.

One thought is that maybe the numpy version is automatically multi-threaded behind the scenes and this is responsible for the discrepancy? If I place Threads.@threads in front of the for loop in my first version of the code I can beat the Python version by a few ms. Any suggestions much appreciated!

When you have performance issues, looking for unnecessary allocations is often a good idea. In this line you are creating one BitArray and one regular array for each bin.

If you can write this out using loop, using only scalar operations, there should be a decent speedup, I think. Maybe for each element search for the right bin using one of the sorted search functions.

This was a suggestion I got on the Slack as well (to try sorting instead) and we came up with this implementation:

function histSumSort(x::Array{Float64,},y::Array{Float64,},bins::Int=100) #87 ms
    x = vec(x); y = vec(y)
    p = sortperm(x)
    #sortedX = x[p]; sortedY = y[p]
    bins = range(minimum(x),stop=maximum(x),length=bins+1)
    I = 1
    result = zeros(length(bins)-1)
    for i in p
        while x[i] > bins[I+1]
            I += 1
        end
        result[I] += y[i]
    end
    return bins,result
end

Unfortunately this way is actually slower (by about 1/3) than my initial naive guess…but maybe you can think of a better way to write this?

Try this:

function histSum(x::Array{Float64,}, y::Array{Float64,}, bins::Int=100)
    binMin, binMax = extrema(x)
    result = zeros(bins)
    α = bins / (binMax - binMin)
    for (x, y) in zip(x, y)
        i = min(bins, 1 + floor(Int, α * (x - binMin)))
        result[i] += y
    end
    return range(binMin, stop=binMax, length=bins+1), result
end
4 Likes

This does it! What about this makes it so much faster? I notice if I place the @threads macro in front of this for loop I don’t get any speedup, and the time it takes is basically identical to my first way with the @threads macro, so I’m wondering if somehow this formulation is by default multi-threaded or something?

1 Like

Since the bins are uniform you can directly compute into which bin each element go, which is much faster than searching for the bin.

2 Likes

Thanks so much!

I didn’t actually mean that you should sort, but exploit that the bins are sorted. @GunnarFarneback’s solution is an even faster version of that.

2 Likes

Interestingly, changing extrema back to minimum and maximum seems quite a bit faster on my system (Julia 1.7.1)? Otherwise very nice!

1 Like

give this a try if you want

2 Likes

I can see a similar difference here, also 1.7.1. Seems like something must be wrong with the extrema implementation?

julia> a = randn(1000);

julia> @btime extrema($a)
  5.849 μs (0 allocations: 0 bytes)
(-3.155119094596879, 3.0526046427540967)

julia> @btime minimum($a), maximum($a)
  1.744 μs (0 allocations: 0 bytes)
(-3.155119094596879, 3.0526046427540967)
2 Likes

I’m only using 1.6.1, and there’s less of a difference but indeed the minimum / maximum way is faster by ~40% which does seem strange.

1 Like
4 Likes

The fact that extrema is slow compared to separate calls of minimum and maximum has had multiple github issues raised I believe.

1 Like