Interesting observation! Looking at the assembly, we can tell what’s going on. As you’re hinting, the steprange version has a non-inlined call to calculate the range length before iteration starts. The unit range on the other hand is fully inlined. The loops themselves are identical between the two versions, so the overhead will be a fixed 25 - 30 clock cycles (~8-15 ns) regardless of vector length. At least that’s what I’m seeing on my system.
However, the inner loops between the two steps-of-1 functions and the two steps-of-4s are not the same. The former unrolls the loop 4 times (does 4 additions, then loops), while the latter has no unrolling (single addition, then loop). Interestingly, this is the complete opposite to how it’s written in the functions (former not unrolled, latter manually unrolled).
The unrolling makes a huge difference, because it accumulates in 4 separate registers, which means that data dependency can be avoided, and the speedup is close to 4x on my system (for large enough vectors that the function overhead is not relevant):
julia> x = rand(10000);
julia> @btime iter1($x)
736.047 ns (0 allocations: 0 bytes)
5010.054710248716
julia> @btime iter5($x)
2.824 μs (0 allocations: 0 bytes)
5010.054710248716
julia> 2824/736
3.8369565217391304
I’m sure that this can be achieved in the steps-of-4 version as well, perhaps with the use of SIMD.jl
.