Unrolling loops over tuple elements can bring easy performance gains. Here’s a trivial example of a 250x speed-up, just by adding an @unroll macro to the code:
import Unroll.@unroll
using BenchmarkTools
function nounroll(args...)
    total = 0
    for i = 1:5
        total += args[i]
    end
    return total
end

function withunroll(args...)
    total = 0
    @unroll for i = 1:5
        total += args[i]
    end
    return total
end
args = (UInt8(1), Int16(1), UInt32(1), Float32(1.0), Float64(1.0))
@assert nounroll(args...) == withunroll(args...)
@btime nounroll(($args)...);    # 378.676 ns (4 allocations: 64 bytes)
@btime withunroll(($args)...);  # 1.500 ns (0 allocations: 0 bytes)
These gains are not surprising. When the tuple elements have different types, the loop body cannot be specialized on the type of args[i], so each iteration pays for dynamic dispatch and the accumulator total is type-unstable. But the element types are all known at compile time, so if the loop is unrolled the compiler can specialize each iteration individually, producing much more efficient code (in the right circumstances).
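For intuition, the expansion is roughly equivalent to writing out each iteration by hand (a sketch only; the exact code Unroll.jl generates may differ):

function manual_unroll(args...)
    total = 0
    total += args[1]   # each index is a literal constant, so every
    total += args[2]   # element access has a concrete, statically
    total += args[3]   # known type and the additions compile to
    total += args[4]   # specialized, dispatch-free code
    total += args[5]
    return total
end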
I’ve noticed some code in the Julia stdlib that would benefit from this (it can be fairly slow, for several reasons, but this is one of them). I’m sure there’s loads more. My own code would also (and does, in places) benefit from this. However, there are two problems. Firstly, I’m using an external package for the unroll macro, so it can’t be used in the stdlib. Secondly, the macro requires hard-coded loop ranges (notice I cheated and set the loop range to 1:5, not 1:length(args), knowing there were 5 inputs). I think I see a way round the latter with @generated functions (not ideal, but likely unavoidable; see the sketch below), but what can be done about the former?
I’m aware of the various ntuple macros and functions. However, I’ve found them to be neither flexible enough nor always performant, and found loop unrolling to be both.
Why isn’t it easier to unroll loops over tuples, in particular in stdlib code? Isn’t this something we should be doing? It seems to me the whole value of tuples is that their lengths and (usually) element types are known at compile time. Why not exploit this?