I’m writing a simulation where speed is very important. Its main loop accumulates a scalar by multiplying floating-point state values with weights, following the connections in an adjacency matrix. As a very simplified example, a large part of the main loop of my simulation comes down to this:
using BenchmarkTools
const state = rand(1000)
const n_connections = 10
# (connections, weights)
const adj = (sort!(rand(1:1000, n_connections)), rand(n_connections))
function main_algo_loop(state, adj)
    e = 0.0  # accumulate in Float64 so the loop stays type-stable
    # adj[1] holds the connection indices, adj[2] the matching weights
    for (loop_idx, connection_idx) in enumerate(adj[1])
        e += state[connection_idx]*adj[2][loop_idx]
    end
    return e
end
function main_algo_broadcast(state, adj)
    # the view avoids copying state[adj[1]], but the broadcast still allocates a temporary
    return sum((@view state[adj[1]]) .* adj[2])
end
For n_connections = 10 I get:
julia> @btime main_algo_loop($state, $adj)
19.224 ns (0 allocations: 0 bytes)
julia> @btime main_algo_broadcast($state, $adj)
31.910 ns (1 allocation: 144 bytes)
However, for n_connections = 100:
julia> @btime main_algo_loop($state, $adj)
446.758 ns (0 allocations: 0 bytes)
julia> @btime main_algo_broadcast($state, $adj)
98.558 ns (1 allocation: 896 bytes)
So I can understand that broadcasting can be optimised, which is why it is faster for the larger case, but why is it quite a bit slower for small vectors? Right now the number of connections is quite low in most cases, so looping is faster. However, in some cases I would like to run simulations with higher connectivity, where broadcasting would be much faster. I could switch between the two implementations when the average number of connections gets high (sketched below), but that is a bit cumbersome and not very elegant. Can I write something that is always optimally fast, no matter what the value of n_connections is?
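For reference, this is roughly what I mean by switching; the cutoff of 50 is just an arbitrary placeholder I have not tuned:

# Rough sketch of the switching idea; 50 is a placeholder cutoff, not a measured crossover point.
function main_algo_switch(state, adj)
    if length(adj[1]) < 50
        return main_algo_loop(state, adj)       # loop wins for few connections
    else
        return main_algo_broadcast(state, adj)  # broadcast wins for many connections
    end
end

It works, but it means maintaining two implementations and guessing at the crossover point, which is exactly what I would like to avoid.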