Performance scaling of broadcasting compared to looping

I’m writing a simulation where speed is very important. The program consists of a main loop in which I need to accumulate a scalar by multiplying floats according to the connections in an adjacency matrix. As a very simplified example, a large part of the main loop of my simulation comes down to this:

using BenchmarkTools

const state = rand(1000)
const n_connections = 10

# (connections, weights)
const adj = (sort!(rand(1:1000, n_connections)), rand(n_connections))

function main_algo_loop(state, adj)
    e = 0
    for (loop_idx, connection_idx) in enumerate(adj[1])
        e += state[connection_idx]*adj[2][loop_idx]
    end
    return e
end

function main_algo_broadcast(state, adj)
    return sum((@view state[adj[1]]) .* adj[2])
end

If n_connections = 10 I get

julia> @btime main_algo_loop($state, $adj)
  19.224 ns (0 allocations: 0 bytes)

julia> @btime main_algo_broadcast($state, $adj)
  31.910 ns (1 allocation: 144 bytes)

however, for n_connections = 100

@btime main_algo_loop($state, $adj)
  446.758 ns (0 allocations: 0 bytes)

julia> @btime main_algo_broadcast($state, $adj)
  98.558 ns (1 allocation: 896 bytes)

So I can understand that broadcasting can be optimised, which is why it is faster, but why is it quite a bit slower for smaller vectors? Right now the number of connections is quite low in most cases, so looping is faster. However, in some cases I would like to run simulations with higher connectivity, where broadcasting would be much faster. I could switch algorithms when the average number of connections gets high, but that would be cumbersome and not very elegant. Can I write something that is always optimally fast, no matter what the value of n_connections is?

The problem is not broadcasting (which is implemented internally with loops, after all), it’s allocation.

This allocates a new array (for the result of the .*) and then sums it, which is quite a bit slower for small arrays than looping in-place for the summation.

(Even if you pre-allocated the output array, you would lose in memory bandwidth simply by having another array to read/write from.)
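For illustration, a pre-allocated variant would look something like this (a hypothetical helper, not from the thread; it removes the per-call allocation but still pays for the extra buffer traffic):

buf = similar(adj[2])  # allocate the scratch buffer once, outside the hot loop

function main_algo_prealloc!(buf, state, adj)
    buf .= @view(state[adj[1]]) .* adj[2]  # fused in-place broadcast, no allocation
    return sum(buf)
end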

main_algo_sum(state, adj) = 
  sum(i -> state[adj[1][i]]*adj[2][i], 1:length(adj[1]))

is faster and shorter than the for-loop version (at least on my system).
You are welcome to report from yours.

Added: Add @inbounds for extra speed:

main_algo_sum(state, adj) = 
  sum(i -> @inbounds( state[adj[1][i]]*adj[2][i] ), 1:length(adj[1]))

Why not use the built-in dot function?

Also an option:

using LinearAlgebra
main_algo_dot(state, adj) = dot(@view(state[adj[1]]), adj[2])

Clear. Around 2x slower than the sum(i -> ...) version.


Thanks for the answer. I see, it makes sense that the allocation causes overhead, I didn’t think of that. However, if it’s implemented with loops, I don’t understand what makes it faster for longer arrays. Can you elaborate?

Interesting, for n_connections = 10

julia> @btime main_algo_sum($state, $adj)
  24.724 ns (0 allocations: 0 bytes)

n_connections = 100

@btime main_algo_sum($state, $adj)
  45.248 ns (0 allocations: 0 bytes)

However, for n_connections = 4 (which is the most common for my simulation)

@btime main_algo_loop($state, $adj)
  6.666 ns (0 allocations: 0 bytes)

@btime main_algo_broadcast($state, $adj)
  41.961 ns (1 allocation: 96 bytes)

@btime main_algo_sum($state, $adj)
  22.818 ns (0 allocations: 0 bytes)

So for larger arrays this is blazing fast, which is very useful. However, I still can’t use it for smaller arrays, since a speed decrease of almost 4x is too much. So first of all I would like to understand why this is slower for smaller arrays, given that this time it does not allocate, and also why it is so much faster for larger arrays.

Furthermore it would be useful to know if it can be optimized for small arrays.

Edit: This is all on an M1 Max chip btw.

Edit 2: I don’t know what happened, but first of all it’s faster without macro interpolation, and now, testing it again, it’s faster in the first place:

@btime main_algo_sum(state, adj)
  6.417 ns (0 allocations: 0 bytes)

(with macro interpolation it would be around 6.7 ns, why?)

So, this is great! It’s faster in all cases. I still would like to understand why it’s faster for large arrays, however.

For very small time-scales (<100 cycles) measurement might get trickier. Perhaps this should be repeated many times.

Also, if n_connections = 4 mostly, it might be worth it to special case it with StaticArrays.
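For example, something along these lines (just a sketch, with made-up names; the fixed size lets the compiler unroll the whole accumulation):

using StaticArrays

# hypothetical fixed-size storage for the common 4-connection case
const adj_static = (SVector{4}(sort!(rand(1:1000, 4))...), SVector{4}(rand(4)...))

function main_algo_static(state, adj)
    e = zero(eltype(state))
    @inbounds for i in 1:4  # length is known at compile time
        e += state[adj[1][i]]*adj[2][i]
    end
    return e
end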


sum has some tricks in its loops to squeeze out a bit more performance (as well as using pairwise summation for better accuracy).
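For intuition, the pairwise part looks roughly like this (a simplified sketch, not Base’s actual implementation): below a cutoff the block is summed with a plain loop, above it the range is split in half and summed recursively, which also keeps the rounding error small.

function pairwise_sum(f, idxs)
    n = length(idxs)
    if n <= 16  # small block: plain accumulation loop
        s = f(idxs[1])
        for i in 2:n
            s += f(idxs[i])
        end
        return s
    else        # large block: split in half and recurse
        mid = n >> 1
        return pairwise_sum(f, idxs[1:mid]) + pairwise_sum(f, idxs[mid+1:end])
    end
end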


That’s good to know. One weird thing I’m seeing, though, is that sometimes main_algo_sum will still report 20ns, and after some restarts of the REPL it sometimes reports 6ns, whereas the one with the loop always reports 6ns… Why is this?

I guess I celebrated too soon. For n_connections = 4

and

# Call each version 100 times so the few-ns per-call timings are easier to compare
function loopalgo_sum(state, adj)
    for _ in 1:100
        main_algo_sum(state, adj)
    end
end

function testalgo_loop(state, adj)
    for _ in 1:100
        main_algo_loop(state, adj)
    end
end

I get

julia> @btime loopalgo_sum(state, adj)
  471.939 ns (0 allocations: 0 bytes)

julia> @btime testalgo_loop(state, adj)
  210.510 ns (0 allocations: 0 bytes)

So it does seem quite a bit slower still for small arrays, why?

Without @inbounds, the loop won’t SIMD. Short arrays won’t SIMD either.
It’s faster not to have code paths that aren’t taken: here that means it’s faster not to have @inbounds, because the SIMD code it allows the compiler to generate isn’t being executed and just gets in the way.

For big arrays, that code does get executed and is faster.
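One way to check whether a given loop actually vectorized is to look at the generated code for vector types (such as <4 x double> or <2 x double>), e.g.:

julia> @code_llvm debuginfo=:none main_algo_sum(state, adj)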


I don’t understand, the code seems slower for any size without @inbounds (from testing). Also, this doesn’t tell me why it’s still slower for small arrays than looping.

So again, is there any way to get the optimal speed for any size? Or is the only way to switch algorithms in the simulation when the connectivity gets higher?

There is nothing wrong with the loop implementation (if it achieves good performance) and there is nothing wrong with switching algos according to parameters (most algos do this internally).

The best way to get good answers is to post another version of the code, with the test, and with parameters which are representative of actual use case. Such running code will get the best ‘treatment’ by the community (much better than asking for optimal speed for any size).

julia> @btime main_algo_inbounds_simd($state, $adj10)
  4.916 ns (0 allocations: 0 bytes)
3.611952005832202

julia> @btime main_algo_inbounds($state, $adj10)
  5.541 ns (0 allocations: 0 bytes)
3.611952005832202

julia> @btime main_algo_loop($state, $adj10)
  7.416 ns (0 allocations: 0 bytes)
3.611952005832202

julia> @btime main_algo_turbo($state, $adj10)
  6.166 ns (0 allocations: 0 bytes)
3.611952005832202

julia> @btime main_algo_inbounds_simd($state, $adj100)
  40.573 ns (0 allocations: 0 bytes)
26.671416641549378

julia> @btime main_algo_inbounds($state, $adj100)
  57.774 ns (0 allocations: 0 bytes)
26.67141664154938

julia> @btime main_algo_loop($state, $adj100)
  68.818 ns (0 allocations: 0 bytes)
26.67141664154938

julia> @btime main_algo_turbo($state, $adj100)
  29.146 ns (0 allocations: 0 bytes)
26.671416641549374

These are all the original loop, but with different macros. For example:

julia> function main_algo_inbounds_simd(state, adj)
           e = zero(eltype(state))
           @inbounds @simd for loop_idx in eachindex(adj[1])
               connection_idx = adj[1][loop_idx]
               e += state[connection_idx]*adj[2][loop_idx]
           end
           return e
       end
main_algo_inbounds_simd (generic function with 1 method)
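The @turbo version follows the same pattern with LoopVectorization.jl (a sketch; the exact definition benchmarked above may differ slightly):

using LoopVectorization

function main_algo_turbo(state, adj)
    e = zero(eltype(state))
    @turbo for loop_idx in eachindex(adj[1])
        connection_idx = adj[1][loop_idx]
        e += state[connection_idx]*adj[2][loop_idx]
    end
    return e
end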

There were examples shared before where @inbounds made very short loops slower, but that wasn’t the case here.

Oops… sorry @Elrod, I thought this was a post from question writer :face_with_hand_over_mouth:.

The main_algo_inbounds_simd is probably a good choice, and is faster than all the non-@turbo versions for large sizes.
Whether @turbo’s advantage at large sizes is enough to offset its disadvantage at small sizes plus its extra compile time latency will require knowing more about the distribution.

If we have 80% length-4 and 20% length-100 connections

julia> adj4 = (sort!(rand(1:1000, 4)), rand(4));

julia> @btime main_algo_inbounds_simd($state, $adj4)
  3.375 ns (0 allocations: 0 bytes)
0.8052880397380788

julia> @btime main_algo_inbounds($state, $adj4)
  3.666 ns (0 allocations: 0 bytes)
0.8052880397380788

julia> @btime main_algo_loop($state, $adj4)
  4.291 ns (0 allocations: 0 bytes)
0.8052880397380788

julia> @btime main_algo_turbo($state, $adj4)
  4.625 ns (0 allocations: 0 bytes)
0.8052880397380788

then we have average runtimes in ns of

julia> 0.8 * 3.375 + 0.2 * 40.573 # @inbounds @simd
10.814600000000002

julia> 0.8 * 4.625 + 0.2 * 29.146 # @turbo
9.5292

and @turbo wins ignoring compile times, thanks to its advantage for larger inputs.

The above was on an M1.

On a 9940X, I get
julia> @btime main_algo_loop($state, $adj4)
  5.044 ns (0 allocations: 0 bytes)
0.8847923117356615

julia> @btime main_algo_inbounds($state, $adj4)
  4.089 ns (0 allocations: 0 bytes)
0.8847923117356615

julia> @btime main_algo_inbounds_simd($state, $adj4)
  4.078 ns (0 allocations: 0 bytes)
0.8847923117356615

julia> @btime main_algo_turbo($state, $adj4)
  7.801 ns (0 allocations: 0 bytes)
0.8847923117356615

julia> @btime main_algo_loop($state, $adj100)
  98.579 ns (0 allocations: 0 bytes)
24.71532662119185

julia> @btime main_algo_inbounds($state, $adj100)
  89.646 ns (0 allocations: 0 bytes)
24.71532662119185

julia> @btime main_algo_inbounds_simd($state, $adj100)
  32.077 ns (0 allocations: 0 bytes)
24.715326621191846

julia> @btime main_algo_turbo($state, $adj100)
  30.409 ns (0 allocations: 0 bytes)
24.71532662119185

On this CPU @inbounds @simd and @turbo are around 3x or more faster than the alternatives for adj100 and (not shown) 4x or more faster for length 1000.


After a bit of a hiatus I’m back on this project. I hope people are still interested, because I’m a bit stumped.

To give a bit of a better idea of how my program is set up I will sketch an outline:

struct g
  state::Vector{Int8}
  adj::Vector{Vector{Tuple{Float32,Int32}}}  # per site: (weight, connection index) pairs
end

function mainLoop(state, adj, someBoolRef, loopCounterRef, sumfunc)
  while someBoolRef[]
    metropolisalgo(state, adj, sumfunc)
    loopCounterRef[] += 1
  end
end

function metropolisalgo(state, adj, sumfunc)
  # Do some stuff
  idx = rand(1:length(state))

  energy = -state[idx]*sumfunc(state, adj, idx)

  # do more stuff
end

function sumfunc(state, adj, idx)
  e = 0.0f0  # Float32 accumulator, matching the weights
  # either include @simd or not
  @inbounds for adj_idx in eachindex(adj[idx])
    e += state[adj[idx][adj_idx][2]] * adj[idx][adj_idx][1]  # state[index] * weight
  end
  return e
end

I made the energy function an argument to be able to restart the simulation with a different one easily, but before that I tried the same thing by modifying the code by hand. The weird part is the following: if I run sumfunc from the REPL, it’s noticeably faster if I include @simd. I’m testing now for a scenario where every adj[idx] has 224 entries. With @simd it takes about 120ns, whereas without it takes 180ns to complete. If I actually try to use it in my simulation, I can see absolutely no difference at all in performance. I’m tracking the performance through loopCounterRef (actually it’s a field of some other struct in my real program, but that shouldn’t matter), and also just by eye. This makes me think the @simd macro is not being used somehow. Does anyone have any idea why this would be?


I am not sure how much this applies to your actual code, but here e = 0 builds an Int and then you add floats, since state = rand(1000) makes floats. This is type unstable. Try e = zero(eltype(state)) instead; you might win big.
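Applied to the original example, that would be:

function main_algo_loop(state, adj)
    e = zero(eltype(state))  # Float64 accumulator, matching eltype(state)
    for (loop_idx, connection_idx) in enumerate(adj[1])
        e += state[connection_idx]*adj[2][loop_idx]
    end
    return e
end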

Yep, I noticed that before and changed it already. Thanks for the help though!
