# Performance scaling of broadcasting compared to looping

I’m writing a simulation where speed is very important. The program consists of a main loop where I need to collect some scalar by multiplying some floats based on connections in an adjacency matrix. As a very simplified example, a large part of the main loop of my simulation comes down to this:

```julia
using BenchmarkTools

const state = rand(1000)
const n_connections = 10

# (connections, weights)
const adj = (sort!(rand(1:1000, n_connections)), rand(n_connections))

function main_algo_loop(state, adj)
    e = 0
    for loop_idx in eachindex(adj[1])
        e += state[adj[1][loop_idx]] * adj[2][loop_idx]
    end
    return e
end

# broadcasting version: materializes `state[adj[1]] .* adj[2]`, then sums it
main_algo_broadcast(state, adj) = sum(state[adj[1]] .* adj[2])
```

If `n_connections = 10` I get

```julia
julia> @btime main_algo_loop($state, $adj)
19.224 ns (0 allocations: 0 bytes)

julia> @btime main_algo_broadcast($state, $adj)
31.910 ns (1 allocation: 144 bytes)
```

however, for `n_connections = 100`

```julia
julia> @btime main_algo_loop($state, $adj)
446.758 ns (0 allocations: 0 bytes)

julia> @btime main_algo_broadcast($state, $adj)
98.558 ns (1 allocation: 896 bytes)
```

So I can understand that broadcasting can be optimised, which is why it is faster, but why is it quite a bit slower for smaller vectors? Right now the number of connections is quite low in most cases, so looping is faster. However, in some cases I would like to run simulations with higher connectivity, where broadcasting would be much faster. I could switch algorithms when the average number of connections gets high, but that would be cumbersome and not very elegant. Can I write something that is always optimally fast, no matter what the value of `n_connections` is?

The problem is not broadcasting (which is implemented internally with loops, after all), it’s allocation.

The broadcast version allocates a new array (for the result of the `.*`) and then sums it, which is quite a bit slower for small arrays than accumulating in place in a loop.

(Even if you pre-allocated the output array, you would lose in memory bandwidth simply by having another array to read/write from.)

```julia
main_algo_sum(state, adj) =
    sum(i -> state[adj[1][i]] * adj[2][i], eachindex(adj[1]))
```

is faster and shorter than the for-loop version (at least on my system).
You are welcome to report from yours.
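To see the allocation difference concretely, here is a small self-contained sketch (the names `sum_broadcast`, `sum_fused`, and `measure` are made up for this illustration):

```julia
x = rand(100)
y = rand(100)

sum_broadcast(x, y) = sum(x .* y)                      # materializes a temporary array
sum_fused(x, y) = sum(i -> x[i] * y[i], eachindex(x))  # accumulates directly, no temporary

# Measure from inside a function so compilation and global-scope effects
# don't distort the numbers; the first call warms everything up.
function measure(x, y)
    a = @allocated sum_broadcast(x, y)
    b = @allocated sum_fused(x, y)
    return a, b
end
measure(x, y)          # warm-up
a, b = measure(x, y)   # a: one heap allocation for the temporary; b: 0 bytes
```

The broadcast form pays for one heap allocation per call; the fused form feeds each product straight into the accumulator.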

Added: Add `@inbounds` for extra speed:

```julia
main_algo_sum(state, adj) =
    sum(i -> @inbounds(state[adj[1][i]] * adj[2][i]), eachindex(adj[1]))
```
1 Like

Why not use the built-in `dot` function?

Also an option:

```julia
using LinearAlgebra

# reconstructed sketch: dot product of the selected states with the weights
main_algo_dot(state, adj) = dot(view(state, adj[1]), adj[2])
```

Clear. Around 2x slower than the `sum(i -> …)` version.

1 Like

Thanks for the answer. I see, it makes sense that the allocation causes overhead, I didn’t think of that. However, if it’s implemented with loops, I don’t understand what makes it faster for longer arrays. Can you elaborate?

Interesting, for `n_connections = 10`

```julia
julia> @btime main_algo_sum($state, $adj)
24.724 ns (0 allocations: 0 bytes)
```

`n_connections = 100`

```julia
julia> @btime main_algo_sum($state, $adj)
45.248 ns (0 allocations: 0 bytes)
```

However, for `n_connections = 4` (which is the most common for my simulation)

```julia
julia> @btime main_algo_loop($state, $adj)
6.666 ns (0 allocations: 0 bytes)

julia> @btime main_algo_broadcast($state, $adj)
41.961 ns (1 allocation: 96 bytes)

julia> @btime main_algo_sum($state, $adj)
22.818 ns (0 allocations: 0 bytes)
```

So for larger arrays this is blazing fast, which is very useful. However, I still can’t use it for smaller arrays, since a speed decrease of almost 4x is too much. First of all I would like to understand why this is slower for smaller arrays, since this time it does not allocate, and also why it is so much faster for larger arrays.

Furthermore it would be useful to know if it can be optimized for small arrays.

Edit: This is all on an M1 Max chip btw.

Edit 2: I don’t know what happened, but first of all it’s faster without macro interpolation, and now, testing it again, it’s faster in the first place:

```julia
julia> @btime main_algo_sum(state, adj)
6.417 ns (0 allocations: 0 bytes)
```

(with macro interpolation it would be around 6.7 ns, why?)

So, this is great! It’s faster in all cases. I still would like to understand why it’s faster for large arrays, however.

For very small timescales (<100 cycles), measurement can get tricky. Perhaps this should be repeated many times.

Also, if `n_connections = 4` mostly, it might be worth it to special case it with StaticArrays.
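A sketch of that idea (the suggestion above is StaticArrays.jl; the tuple version below gives the same effect with no dependency, since the length 4 lives in the type and the reduction can fully unroll; all names here are made up):

```julia
state = rand(1000)

# NTuple{4, ...}: the length is part of the type, much like SVector{4} from
# StaticArrays.jl, so the compiler can fully unroll the reduction below.
conns   = ntuple(_ -> rand(1:1000), 4)
weights = ntuple(_ -> rand(), 4)
adj4    = (conns, weights)

static_sum(state, adj) = sum(i -> state[adj[1][i]] * adj[2][i], 1:length(adj[1]))
e = static_sum(state, adj4)
```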

1 Like

`sum` has some tricks in its loops to squeeze out a bit more performance (as well as using pairwise summation for better accuracy).
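The pairwise-summation part is easy to see in `Float32`, where naive left-to-right accumulation stalls once the running total gets large (a small illustration, not code from this thread):

```julia
x = fill(1.0f0, 2^25)   # 33_554_432 ones in Float32

# Left-to-right accumulation: once the total reaches 2^24 = 16777216.0f0,
# adding 1.0f0 rounds back to the same value, so the sum stalls there.
naive = foldl(+, x)

# Base's `sum` uses pairwise summation, which stays exact here.
pairwise = sum(x)
```

`naive` ends up at `2f0^24`, while `pairwise` gives the exact `2f0^25`.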

2 Likes

That’s good to know. One weird thing I’m seeing, though, is that sometimes `main_algo_sum` will still report 20 ns, and after some restarts of the REPL it will sometimes do 6 ns, whereas the one with the loop always reports 6 ns… Why is this?

I guess I celebrated too soon. For `n_connections = 4`

```julia
function loopalgo_sum(state, adj)
    for _ in 1:100
        main_algo_sum(state, adj)
    end
end
```

and

```julia
function loopalgo_loop(state, adj)
    for _ in 1:100
        main_algo_loop(state, adj)
    end
end
```

I get

```julia
julia> @btime loopalgo_sum(state, adj)
471.939 ns (0 allocations: 0 bytes)

julia> @btime loopalgo_loop(state, adj)
210.510 ns (0 allocations: 0 bytes)
```

So it does seem quite a bit slower still for small arrays, why?

Without `@inbounds`, the loop won’t SIMD. Short arrays won’t benefit from SIMD either.
It’s faster not to have code paths that aren’t taken: for short arrays it’s faster not to have `@inbounds`, because the SIMD code it allows the compiler to create isn’t being executed and just gets in the way.

For big arrays, that code does get executed and is faster.
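One way to check whether a given version actually compiled to SIMD code (a generic sketch, not code from this thread):

```julia
using InteractiveUtils  # provides code_llvm

function sumloop(x)
    e = 0.0
    @inbounds @simd for i in eachindex(x)
        e += x[i]
    end
    return e
end

# Vectorized code shows up as vector types like `<4 x double>` in the IR;
# interactively you would call: @code_llvm debuginfo=:none sumloop(rand(1000))
ir = sprint(code_llvm, sumloop, Tuple{Vector{Float64}})
occursin("x double", ir)   # true when the loop was vectorized
```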

3 Likes

I don’t understand; the code seems slower for any size without `@inbounds` (from testing). Also, this doesn’t tell me why it’s still slower for small arrays than looping.

So again, is there any way to get optimal speed for any size? Or is the only way to switch algorithms in the simulation when the connectivity gets higher?

There is nothing wrong with the loop implementation (if it achieves good performance) and there is nothing wrong with switching algos according to parameters (most algos do this internally).
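Such a switch can be hidden behind a single entry point so the rest of the simulation never sees it (a sketch; the function names and the cutoff of 32 are assumptions to be tuned by benchmarking):

```julia
function main_algo_plain(state, adj)
    e = 0.0
    for i in eachindex(adj[1])
        e += state[adj[1][i]] * adj[2][i]
    end
    return e
end

function main_algo_simd(state, adj)
    e = 0.0
    @inbounds @simd for i in eachindex(adj[1])
        e += state[adj[1][i]] * adj[2][i]
    end
    return e
end

# One entry point: the branch costs far less than the difference between
# the two implementations at either size extreme.
main_algo(state, adj) =
    length(adj[1]) < 32 ? main_algo_plain(state, adj) : main_algo_simd(state, adj)
```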

The best way to get good answers is to post another version of the code, with the test, and with parameters that are representative of the actual use case. Such running code will get the best ‘treatment’ by the community (much better than asking for optimal speed for any size).

```julia
julia> @btime main_algo_inbounds_simd($state, $adj10)
4.916 ns (0 allocations: 0 bytes)
3.611952005832202

5.541 ns (0 allocations: 0 bytes)
3.611952005832202

7.416 ns (0 allocations: 0 bytes)
3.611952005832202

6.166 ns (0 allocations: 0 bytes)
3.611952005832202

40.573 ns (0 allocations: 0 bytes)
26.671416641549378

57.774 ns (0 allocations: 0 bytes)
26.67141664154938

68.818 ns (0 allocations: 0 bytes)
26.67141664154938

29.146 ns (0 allocations: 0 bytes)
26.671416641549374
``````

These are all the original loop, but with different macros. For example:

```julia
julia> function main_algo_inbounds_simd(state, adj)
           e = zero(eltype(state))
           @inbounds @simd for loop_idx in eachindex(adj[1])
               e += state[adj[1][loop_idx]] * adj[2][loop_idx]
           end
           return e
       end
main_algo_inbounds_simd (generic function with 1 method)
```

There were examples shared before where `@inbounds` made very short loops slower, but that wasn’t the case here.

Oops… sorry @Elrod, I thought this was a post from the question’s author.

The `main_algo_inbounds_simd` is probably a good choice, and is faster than all the non-`@turbo` versions for large sizes.
Whether `@turbo`’s advantage at large sizes is enough to offset its disadvantage at small sizes plus its extra compile time latency will require knowing more about the distribution.

If we have 80% length-4 and 20% length-100 inputs

```julia
julia> adj4 = (sort!(rand(1:1000, 4)), rand(4));

3.375 ns (0 allocations: 0 bytes)
0.8052880397380788

3.666 ns (0 allocations: 0 bytes)
0.8052880397380788

4.291 ns (0 allocations: 0 bytes)
0.8052880397380788

4.625 ns (0 allocations: 0 bytes)
0.8052880397380788
``````

then we have average runtimes in `ns` of

```julia
julia> 0.8 * 3.375 + 0.2 * 40.573 # @inbounds @simd
10.814600000000002

julia> 0.8 * 4.625 + 0.2 * 29.146 # @turbo
9.5292
```

and `@turbo` wins ignoring compile times, thanks to its advantage for larger inputs.

The above was on an M1.

On a 9940X, I get
```julia
julia> @btime main_algo_loop($state, $adj4)
5.044 ns (0 allocations: 0 bytes)
0.8847923117356615

4.089 ns (0 allocations: 0 bytes)
0.8847923117356615

4.078 ns (0 allocations: 0 bytes)
0.8847923117356615

7.801 ns (0 allocations: 0 bytes)
0.8847923117356615

98.579 ns (0 allocations: 0 bytes)
24.71532662119185

89.646 ns (0 allocations: 0 bytes)
24.71532662119185

32.077 ns (0 allocations: 0 bytes)
24.715326621191846

30.409 ns (0 allocations: 0 bytes)
24.71532662119185
``````

On this CPU `@inbounds @simd` and `@turbo` are around 3x or more faster than the alternatives for `adj100`, and (not shown) 4x or more faster for length 1000.

5 Likes

After a bit of a hiatus I’m back on this project. I hope people are still interested, because I’m a bit stumped.

To give a bit of a better idea of how my program is set up I will sketch an outline:

```julia
struct g
    state::Vector{Int8}
    # plus adjacency data, counters, etc.
end

function mainLoop(state, adj, someBoolRef, loopCounterRef, sumfunc)
    while someBoolRef[]
        loopCounterRef[] += 1

        # Do some stuff
        idx = rand(1:length(state))

        energy = -state[idx] * sumfunc(state, adj, idx)

        # do more stuff
    end
end

# (sketch: the exact indexing of the per-site adjacency data is simplified)
function sumfunc(state, adj, idx)
    e = 0
    # either include @simd or not
    for loop_idx in eachindex(adj[idx][1])
        e += state[adj[idx][1][loop_idx]] * adj[idx][2][loop_idx]
    end
    return e
end
```

I made the energy function an argument to be able to restart the simulation with a different one easily; before, I tried the same thing by modifying the code by hand. The weird part is the following: if I run the `sumfunc` from the REPL, it’s noticeably faster if I include `@simd`. I’m testing a scenario where every `adj[idx]` has 224 entries. With `@simd` it takes about 120 ns, whereas without it takes 180 ns to complete. But if I actually use it in my simulation, I can see absolutely no difference in performance at all. I’m tracking the performance through the `loopCounterRef` (actually it’s a field of some other struct in my real program, but that shouldn’t matter), and also just by eye. This makes me think the `@simd` macro is somehow not taking effect. Does anyone have any idea why this would be?
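One thing that might be worth ruling out (a guess, not a confirmed diagnosis of the code above): if the function argument travels through layers where Julia chooses not to specialize on it, inner annotations like `@simd` can end up in a slower generic path. A type parameter forces specialization. The names below only mirror the sketch above; the bodies are made up:

```julia
# Hypothetical sketch: `sumfunc::F` with `where {F}` guarantees a method
# specialized for the concrete function passed in, so its loop is compiled
# (and vectorized) for that exact function.
function run_steps(state, adj, nsteps, sumfunc::F) where {F}
    e = 0.0
    for _ in 1:nsteps
        idx = rand(1:length(state))
        e -= state[idx] * sumfunc(state, adj, idx)
    end
    return e
end

# toy energy function; `idx` is unused in this simplified version
mysum(state, adj, idx) = sum(i -> state[adj[1][i]] * adj[2][i], eachindex(adj[1]))

state = rand(1000)
adj = (sort!(rand(1:1000, 10)), rand(10))
e = run_steps(state, adj, 100, mysum)
```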

1 Like

I am not sure how much this applies to your actual code, but here `e = 0` creates an `Int`, and then you add floats to it, since `state = rand(1000)` makes floats. This is type-unstable. Try `e = zero(eltype(state))` instead; you might win big.
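The instability can be seen with `@code_warntype` (a generic illustration; `unstable_sum` and `stable_sum` are made-up names):

```julia
function unstable_sum(x)
    e = 0                  # Int: after the first iteration e becomes Float64,
    for v in x             # so e is inferred as Union{Int64, Float64}
        e += v
    end
    return e
end

function stable_sum(x)
    e = zero(eltype(x))    # accumulator matches the element type from the start
    for v in x
        e += v
    end
    return e
end

# @code_warntype unstable_sum(rand(10))  # highlights e::Union{Int64, Float64}
# @code_warntype stable_sum(rand(10))    # everything is Float64
```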

Yep, I noticed that before and changed it already. Thanks for the help though!

1 Like