Making iterators as fast as loops

Do the simplification manually:

function test_loop_formula(n)
    k = 0
    for i in 1:n
        k += (i^3 - i^2 + 1) * (i^3 + i^2 + 2i) ÷ 2
    end
    return k
end

julia> test_loop(100)
7396369649165

julia> test_loop_formula(100)
7396369649165

Or benchmark on something closer to your actual problem, which presumably can’t be simplified this way?

1 Like

Nope, an immutable struct is identified by its value, so === still compares the contents.
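For example (a quick REPL check with a hypothetical two-field struct):

julia> struct Point; x::Int; y::Int; end

julia> Point(1, 2) === Point(1, 2)
true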

2 Likes

I have tried this:

abstract type MyState end

struct MyEmptyState <: MyState end

struct MyNonEmptyState <: MyState
    i::Int
    ist::Int
    jrange::UnitRange{Int}
    j::Int
    jst::Int
end

And it didn’t ease the performance issue much.

@btime test_loop(10)
@btime test_iter(10)

  21.257 ns (0 allocations: 0 bytes)
  13.603 μs (0 allocations: 0 bytes)

Looking over the posted code again, I suspect that the biggest problem here is that the nextstate function relies on dynamic dispatch to decide which MyState constructor to call. Putting dynamic dispatch in your innermost loop is going to be very slow.
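One way around that is to use a single concrete state and signal termination with nothing, so the return type is a small union the compiler can turn into a branch. A minimal sketch (assuming MyIter yields pairs (i, j) with j ∈ i^2:i^3, like the loop version — this is not your actual code):

struct MyIter
    irange::UnitRange{Int}
end

# One concrete Tuple{Int,Int} state, with `nothing` as the end marker.
# `iterate` then returns Union{Nothing, Tuple{Tuple{Int,Int}, Tuple{Int,Int}}},
# a small union that compiles to a branch rather than a dynamic dispatch,
# and there is no endstate struct to compare against.
function Base.iterate(it::MyIter, state = (first(it.irange), first(it.irange)^2))
    i, j = state
    i > last(it.irange) && return nothing
    next = j < i^3 ? (i, j + 1) : (i + 1, (i + 1)^2)
    return (i, j), next
end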

3 Likes

Yes, although the comparison of six numbers is removed, @code_warntype shows that the type is still not stable. In fact, mine is slower than his original code.

Benchmarks of his original code on my computer:

@btime test_loop(10)
@btime test_iter(10)

  21.575 ns (0 allocations: 0 bytes)
  3.616 μs (0 allocations: 0 bytes)

After performing these evil tricks:

const endstate = MyState(-1, 0, 0:0, 0, 0)
s.i === endstate.i && return nothing

I got:

  21.581 ns (0 allocations: 0 bytes)
  2.215 μs (0 allocations: 0 bytes)

Let me reiterate – the biggest problem here is that in the loop version, the compiler is able to simplify the inner cubic-complexity loop to a constant formula, similar to k += (i^3 - i^2 + 1) * (i^3 + i^2 + 2i) ÷ 2. This does not happen in the iterator version. So the loop version does O(n) operations, while the iterator version does O(n^4) operations. The test is flawed.
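(That formula is just the arithmetic-series closed form of the inner sum, assuming the inner loop body was k += i + j; a quick spot check:)

julia> i = 7; sum(i + j for j in i^2:i^3) == (i^3 - i^2 + 1) * (i^3 + i^2 + 2i) ÷ 2
true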

For a more accurate test, we can for example do array indexing in the inner loop. (Don’t worry about the memory access, once the memory is cached, it has very low overhead.)

function test_loop(n, v)
    k = zero(eltype(v))
    for i = 1:n, j = i^2:i^3
        k += @inbounds v[i + j]
    end
    return k
end

function test_iter(n, v)
    k = zero(eltype(v))
    for (i, j) = MyIter(1:n)
        k += @inbounds v[i + j]
    end
    return k
end

Comparing performance:

julia> n = 10;

julia> v = rand(n^3 + 10n);

julia> @btime test_loop($n, $v)
  3.032 μs (0 allocations: 0 bytes)
1314.111504915699

julia> @btime test_iter($n, $v)
  3.953 μs (0 allocations: 0 bytes)
1314.111504915699

So the difference is not very big. As has been pointed out a few times above, a secondary issue is that the endstate comparison is inefficient. Let’s fix that:

-    s == endstate && return nothing
+    s.i == endstate.i && return nothing

Timing again:

julia> @btime test_iter($n, $v)
  3.134 μs (0 allocations: 0 bytes)
1314.111504915699

This is around 3 % slower than the loop version. If the iterator is a better fit for your problem than a loop, I think this should be good enough. To sanity check, note that the inner loop is hit 2650 times for n = 10, which corresponds to 3134 / 2650 ≈ 1.2 ns per inner loop, or 3.4 clock cycles on my 2.9 GHz system. The overhead of the iterator version over the loop version is 0.11 clock cycles per iteration.
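(The 2650 count is easy to verify:)

julia> sum(i -> length(i^2:i^3), 1:10)
2650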

(A word of caution: The above benchmarks assume that the runtime is dominated by the inner iteration (i^2 : i^3). If this is not the case, the iterator version will be a bit slower.)

10 Likes

That’s really useful @bennedich. Indeed, that MWE seems not to be representative of my actual code, which I suspect is closer to your example above. However, my real loop and iterator codes do not perform equally. I don’t have time now to delve deeper (family calls!), but I’ll investigate more later and post again. Thanks!

It turns out @bennedich and @stevengj were exactly right. Apart from the constant-folding artefact in the original example, the real-world code with the iterator was actually doing more work than the loop version (the loop version didn’t need most of the values generated by the iterator, so they never got computed). Doing an apples-to-apples comparison, it is indeed true: iterators and loops are equally fast, as @bennedich’s example demonstrates.

(One thing I didn’t fully understand is @stevengj’s statement that nextstate needs dynamic dispatch. It’s true that at different points in the execution it dispatches to different methods, but is that the same as dynamic dispatch? It is now clear that this does not incur any performance penalty compared to the loop.)

Good lessons, thanks to all.

EDIT: Actually, with this insight I was now able to make the real-world iterator version even marginally faster than the loop version (the iterator allows to cleanly shave off some computations), so I’m happy!

That’s dynamic dispatch: which method it dispatches to is determined at runtime, not compile time.

However, since in this case there are only two possible methods to dispatch to, I guess the compiler is able to turn this into a single branch, rather than using slow generic dispatch.
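Roughly like this (a toy sketch, not the thread’s code, to illustrate union splitting):

# When inference narrows a result to a small union such as Union{A,B},
# the call site is compiled with an `isa` branch instead of a runtime
# lookup in the method table.
struct A end
struct B end
pick(flag::Bool) = flag ? A() : B()   # inferred return type: Union{A,B}
val(::A) = 1
val(::B) = 2
f(flag) = val(pick(flag))   # ≈ r = pick(flag); r isa A ? val(r::A) : val(r::B)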

3 Likes

Hmm. As an experiment, I just tried:

julia> function iter1(x)
           s = zero(eltype(x))
           @inbounds @simd for i ∈ 1:length(x)
               s += x[i]
           end
           s
       end
iter1 (generic function with 1 method)

julia> function iter2(x)
           s = zero(eltype(x))
           @inbounds @simd for i ∈ 1:1:length(x)
               s += x[i]
           end
           s
       end
iter2 (generic function with 1 method)

julia> x = rand(200);

julia> using BenchmarkTools

julia> @btime iter1($x)
  36.869 ns (0 allocations: 0 bytes)
98.7796609023857

julia> @btime iter2($x)
  51.419 ns (0 allocations: 0 bytes)
98.7796609023857

Curious about the case of “what if we actually want to take steps larger than 1?” – would it be faster to use 1:step:N, or 1:(N÷step) with iter = i * step?

julia> using Base.Cartesian

julia> function iter4(x) # requires length(x) % 4 == 0
           @nexprs 4 j -> s_j = zero(eltype(x))
           @inbounds @fastmath for i ∈ 0:4:length(x)-4
               @nexprs 4 j -> s_j += x[i+j]
           end
           s_1 + s_2 + s_3 + s_4
       end
iter4 (generic function with 1 method)

julia> function iter5(x)
           @nexprs 4 j -> s_j = zero(eltype(x))
           @inbounds @fastmath for i ∈ 0:(length(x)>>2)-1
               k = 4i
               @nexprs 4 j -> s_j += x[k+j]
           end
           s_1 + s_2 + s_3 + s_4
       end
iter5 (generic function with 1 method)

julia> @btime iter4($x)
  73.623 ns (0 allocations: 0 bytes)
98.7796609023857

julia> @btime iter5($x)
  58.758 ns (0 allocations: 0 bytes)
98.7796609023857

I’m on a six-day-old master:

julia> versioninfo()
Julia Version 1.2.0-DEV.377
Commit 4b01e6a60f (2019-02-26 16:06 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: Intel(R) Core(TM) i3-4010U CPU @ 1.70GHz

All four @code_warntype outputs showed ::Union{Nothing, Tuple{Int64,Int64}}, but the StepRanges generated a little more code before the loop body. In all cases the loop condition compiled down to a simple jne jump (but the @simd loop was more unrolled).

Something I may keep in mind.

3 Likes

Interesting observation! Looking at the assembly, we can tell what’s going on. As you’re hinting, the steprange version has a non-inlined call to calculate the range length before iteration starts. The unit range on the other hand is fully inlined. The loops themselves are identical between the two versions, so the overhead will be a fixed 25 - 30 clock cycles (~8-15 ns) regardless of vector length. At least that’s what I’m seeing on my system.

However, the inner loops between the two steps-of-1 functions and the two steps-of-4 functions are not the same. The former unrolls the loop 4 times (does 4 additions, then loops), while the latter has no unrolling (single addition, then loop). Interestingly, this is the complete opposite of how it’s written in the functions (former not unrolled, latter manually unrolled).

The unrolling makes a huge difference, because it accumulates in 4 separate registers, which means that data dependencies can be avoided, and the speedup is close to 4x on my system (for vectors large enough that the function overhead is not relevant):

julia> x = rand(10000);

julia> @btime iter1($x)
  736.047 ns (0 allocations: 0 bytes)
5010.054710248716

julia> @btime iter5($x)
  2.824 μs (0 allocations: 0 bytes)
5010.054710248716

julia> 2824/736
3.8369565217391304

I’m sure that this can be achieved in the steps-of-4 version as well, perhaps with the use of SIMD.jl.

5 Likes

With avx2, it’s actually unrolled by 16 for iter1 and iter2, and by 4 (as written) for iter4 and iter5.
The ymm registers are each 256 bits, containing 4 Float64.
I’m impressed by that level of improvement from out-of-order execution.
(EDIT: if you have avx512, that may help explain why @simd is so much faster. Then it’d be unrolling by 32 vs 4, and each vadd in the former would do 8 additions. If you’re on avx2, I’m really impressed by how much the extra unrolling bought you!)

I’m not at a computer at the moment, but you could try manually unrolling by 16 instead (don’t forget to correct the loop bounds). I’d expect it to give similar performance.
You’d have to either add another loop to catch the rest, or benchmark on vectors with lengths that are a multiple of 16, when comparing to the autovectorized version.

EDIT:
Okay, back on a computer. I now defined iter6:

function iter6(x)
    @nexprs 16 j -> s_j = zero(eltype(x))
    @inbounds @fastmath for i ∈ 0:(length(x)>>4)-1
        k = 16i
        @nexprs 16 j -> s_j += x[k+j]
    end
    @nexprs 4 j -> ss_j = s_j + s_{j+4} + s_{j+8} + s_{j+12}
    ss_1 + ss_2 + ss_3 + ss_4
end

Now the loop bodies are as follows. iter1 (@simd):

L112:
	vaddpd	-96(%rsi), %ymm0, %ymm0
	vaddpd	-64(%rsi), %ymm1, %ymm1
	vaddpd	-32(%rsi), %ymm2, %ymm2
	vaddpd	(%rsi), %ymm3, %ymm3
; │└
; │┌ @ simdloop.jl:74 within `macro expansion'
; ││┌ @ int.jl:53 within `+'
	subq	$-128, %rsi
	addq	$-16, %rdi
	jne	L112
and iter6:

L64:
	vaddpd	(%rcx), %ymm0, %ymm0
	vaddpd	32(%rcx), %ymm1, %ymm1
	vaddpd	64(%rcx), %ymm3, %ymm3
	vaddpd	96(%rcx), %ymm2, %ymm2
; │└
; │┌ @ range.jl:595 within `iterate'
; ││┌ @ promotion.jl:403 within `=='
	subq	$-128, %rcx
	decq	%rax
; │└└
	jne	L64

Similarly, the run times are almost the same:

julia> x = rand(1024);

julia> @btime iter1($x)
  68.709 ns (0 allocations: 0 bytes)
530.6541640341687

julia> @btime iter6($x)
  69.549 ns (0 allocations: 0 bytes)
530.6541640341687

julia> @btime iter5($x)
  173.132 ns (0 allocations: 0 bytes)
530.6541640341687

julia> @btime iter1($x)
  68.361 ns (0 allocations: 0 bytes)
530.6541640341687

julia> @btime iter6($x)
  69.549 ns (0 allocations: 0 bytes)
530.6541640341687

julia> @btime iter5($x)
  174.080 ns (0 allocations: 0 bytes)
530.6541640341687

The slight advantage of iter1 over iter6 seems consistent.

This was on Ryzen, and I got about a 2.5x speedup from the 4x unrolling.
Now…
Moving to Skylake-X, with avx512:

julia> x = rand(1024);

julia> @btime iter1($x)
  29.732 ns (0 allocations: 0 bytes)
528.0621589149741

julia> @btime iter5($x)
  185.889 ns (0 allocations: 0 bytes)
528.0621589149741

julia> @btime iter6($x)
  55.177 ns (0 allocations: 0 bytes)
528.062158914974

julia> @btime iter1($x)
  27.871 ns (0 allocations: 0 bytes)
528.0621589149741

julia> @btime iter5($x)
  186.176 ns (0 allocations: 0 bytes)
528.0621589149741

julia> @btime iter6($x)
  55.062 ns (0 allocations: 0 bytes)
528.062158914974

and I still see about a 2x speedup from iter6 to iter1, while iter5 is a full 6.5x slower.

3 Likes

Nice trick of manually unrolling by x16 to generate efficient code (on AVX2)!

I differentiate between “unrolling” and “vectorizing” btw – with vectorizing I refer to using SIMD to operate on multiple data in a single instruction, and with unrolling I refer to how many such instructions are carried out per loop iteration. So AVX2 lets us operate on 4x 64-bit floats at a time, and unrolling the loop by a factor of 4 means that a total of 16 floats are handled per loop iteration (by 4 SIMD instructions). (Perhaps there are more established terms for this?)

The main advantage of unrolling, as I see it, is not to do fewer iterations (since the loop overhead is cheap due to correctly predicted jumps), but to avoid pipeline stalls due to instruction latency.
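A minimal scalar illustration (assuming an even-length vector): with two independent accumulators, the adds form two dependency chains and can overlap in the pipeline instead of serializing on a single register.

function sum2acc(x)   # assumes iseven(length(x))
    s1 = s2 = zero(eltype(x))
    @inbounds for i in 1:2:length(x)
        s1 += x[i]       # chain 1
        s2 += x[i + 1]   # chain 2, independent of s1
    end
    s1 + s2
end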

2 Likes

Nice trick of manually unrolling by x16 to generate efficient code (on AVX2)!

FWIW, I was interested in UnitRange vs StepRange because I have a library called LoopVectorization.jl, which vectorizes, and optionally unrolls, loops.
The idea is that it’s a blunt, brute-force instrument.
(LoopVectorization also depends on the unregistered VectorizationBase, SIMDPirates, and the SIMDPirates fork of SLEEF.jl – I’ll eventually make a new repo called SLEEFPirates to avoid having to git fetch origin && git checkout -b SIMDPirates remotes/origin/SIMDPirates).

Sometimes, no matter how much you try and argue with the autovectorizer, it may refuse to cooperate. Especially when pointers are involved. For example:

function axpy!(z, a, x, y)
    @inbounds @simd ivdep for i ∈ eachindex(z, x, y)
        z[i] = a * x[i] + y[i]
    end
end
using StaticArrays, BenchmarkTools
function randmvector(::Val{N}, T = Float64) where N
    # Compiles quickly, regardless of size of MVector
    # unlike @MVector rand(N)
    mv = MVector{N,T}(undef)
    @inbounds for i ∈ 1:N
        mv[i] = rand()
    end
    mv
end
x = randmvector(Val(512));
y = randmvector(Val(512));
z = MVector{512,Float64}(undef);
@code_native axpy!(z, 2.0, x, y)

This yields:

julia> @code_native axpy!(z, 2.0, x, y)
	.text
; ┌ @ REPL[1]:2 within `axpy!'
	movq	$-4096, %rax            # imm = 0xF000
	nopw	(%rax,%rax)
; │┌ @ simdloop.jl:73 within `macro expansion' @ REPL[1]:3
; ││┌ @ float.jl:399 within `*'
L16:
	vmulsd	4096(%rsi,%rax), %xmm0, %xmm1
; ││└
; ││ @ simdloop.jl:73 within `macro expansion' @ float.jl:395
	vaddsd	4096(%rdx,%rax), %xmm1, %xmm1
; ││ @ simdloop.jl:73 within `macro expansion' @ REPL[1]:3
; ││┌ @ MArray.jl:130 within `setindex!'
; │││┌ @ gcutils.jl:87 within `macro expansion'
; ││││┌ @ pointer.jl:118 within `unsafe_store!'
	vmovsd	%xmm1, 4096(%rdi,%rax)
; │└└└└
; │┌ @ int.jl:49 within `macro expansion'
	addq	$8, %rax
; └└
; ┌ @ simdloop.jl:71 within `axpy!'
	jne	L16
; └
; ┌ @ REPL[1]:2 within `axpy!'
	retq
	nopw	%cs:(%rax,%rax)
; └

The loop is just:

L16:
	vmulsd	4096(%rsi,%rax), %xmm0, %xmm1
	vaddsd	4096(%rdx,%rax), %xmm1, %xmm1
	vmovsd	%xmm1, 4096(%rdi,%rax)
	addq	$8, %rax
	jne	L16

Note that vmulsd and vaddsd multiply and add a single double at a time (hence the sd, “scalar double”, suffix instead of pd, “packed double”).
Yikes!
It’s naturally very slow:

julia> @btime axpy!($z, 2.0, $x, $y)
  228.304 ns (0 allocations: 0 bytes)

While

using LoopVectorization
function axpy2!(z, a, x, y)
    @vectorize for i ∈ eachindex(z, x, y)
        z[i] = a * x[i] + y[i]
    end
end
@code_native axpy2!(z, 2.0, x, y)

Yields:

L16:
	vmovupd	4096(%rsi,%rax), %zmm1
	vfmadd213pd	4096(%rdx,%rax), %zmm0, %zmm1
	vmovupd	%zmm1, 4096(%rdi,%rax)
	addq	$64, %rax
	jne	L16

I also added an unroll argument:

function axpy3!(z, a, x, y)
    @vectorize 2 for i ∈ eachindex(z, x, y)
        z[i] = a * x[i] + y[i]
    end
end
function axpy4!(z, a, x, y)
    @vectorize 4 for i ∈ eachindex(z, x, y)
        z[i] = a * x[i] + y[i]
    end
end
@code_native axpy3!(z, 2.0, x, y)
@code_native axpy4!(z, 2.0, x, y)

Note that it currently gets rounded down to the nearest power of 2.
axpy3 looks like expected:

L16:
	vmovupd	4096(%rsi,%rax), %zmm1
	vmovupd	4160(%rsi,%rax), %zmm2
	vfmadd213pd	4096(%rdx,%rax), %zmm0, %zmm1
	vfmadd213pd	4160(%rdx,%rax), %zmm0, %zmm2
	vmovupd	%zmm2, 4160(%rdi,%rax)
	vmovupd	%zmm1, 4096(%rdi,%rax)
	addq	$128, %rax
	jne	L16

while for axpy4, we unrolled the loop enough (512 elements ÷ (8 Float64 per zmm × 4-way unroll) = only 16 iterations left) that LLVM actually decided to completely unroll the loop.
Results:

julia> @btime axpy2!($z, 2.0, $x, $y)
  23.885 ns (0 allocations: 0 bytes)

julia> @btime axpy3!($z, 2.0, $x, $y)
  23.793 ns (0 allocations: 0 bytes)

julia> @btime axpy4!($z, 2.0, $x, $y)
  24.066 ns (0 allocations: 0 bytes)

Much better!
Of course, my actual motivation was more along the lines of:

using SLEEF
function special!(z, a, x, y)
    @inbounds for i ∈ eachindex(z, x, y)
        w = a * x[i] + y[i]
        s, c = sincos(w)
        z[i] = exp(s + log(c*c))
    end
end
function vectorized_special!(z, a, x, y)
    @vectorize for i ∈ eachindex(z, x, y)
        w = a * x[i] + y[i]
        s, c = sincos(w)
        z[i] = exp(s + log(c*c))
    end
end
@btime special!($z, 2.0, $x, $y)
@btime vectorized_special!($z, 2.0, $x, $y)

Yielding an almost 8x speedup:

julia> @btime special!($z, 2.0, $x, $y)
  15.081 μs (0 allocations: 0 bytes)

julia> @btime vectorized_special!($z, 2.0, $x, $y)
  1.955 μs (0 allocations: 0 bytes)

3 Likes

Thanks, this is very impressive! I’m interested by such use cases as well, but when I try to reproduce your example, I get the following error:

julia> vectorized_special!(z, 2.0, x, y)
ERROR: MethodError: no method matching sincos_fast(::SVec{4,Float64})
Closest candidates are:
  sincos_fast(::Float16) at /home/user/.julia/packages/SLEEF/vUBln/src/SLEEF.jl:130
  sincos_fast(::Union{Float64, Vec{#s12,Float64} where #s12}) at /home/user/.julia/packages/SLEEF/vUBln/src/trig.jl:352                                                                                                              
  sincos_fast(::Union{Float32, Vec{#s12,Float32} where #s12}) at /home/user/.julia/packages/SLEEF/vUBln/src/trig.jl:400                                                                                                              
  ...
Stacktrace:
 [1] vectorized_special!(::MArray{Tuple{512},Float64,1,512}, ::Float64, ::MArray{Tuple{512},Float64,1,512}, ::MArray{Tuple{512},Float64,1,512}) at /home/user/.julia/packages/LoopVectorization/ct9Bk/src/LoopVectorization.jl:209   
 [2] top-level scope at none:0

Do you have any idea about what I’m doing wrong? FWIW, I Pkg.added the relevant packages directly from their respective master branches on github:

  1. SLEEF had to be on the SIMDPirates branch. The ::SVec methods are only defined on that branch, while master supports SIMD.jl. In a few hours I’ll create a fork of the library to make it easier to get set up with the correct libraries. I don’t expect to ever merge my modifications into my master, and especially not while there is still an open PR from my master into the official master.
  2. I’ve been using all my libraries from ~/.julia/dev. Before, when getting someone set up, there were errors whenever the packages were added instead of deved. That seems to have been fixed, since the @vectorize macro obviously applied correctly and you only ended up with a method error.
1 Like

Thanks! I can confirm that this fixes my problems. There does not seem to be any issue anymore with adding packages instead of deving them.



Self-contained example

In case anyone would be interested to fully reproduce this:

#] add https://github.com/chriselrod/LoopVectorization.jl https://github.com/chriselrod/VectorizationBase.jl https://github.com/chriselrod/SIMDPirates.jl https://github.com/chriselrod/SLEEF.jl#SIMDPirates


using LoopVectorization
using SLEEF

function special!(z, a, x, y)
    @inbounds for i ∈ eachindex(z, x, y)
        w = a * x[i] + y[i]
        s, c = sincos(w)
        z[i] = exp(s + log(c*c))
    end
end

function vectorized_special!(z, a, x, y)
    @vectorize for i ∈ eachindex(z, x, y)
        w = a * x[i] + y[i]
        s, c = sincos(w)
        z[i] = exp(s + log(c*c))
    end
end


using StaticArrays, BenchmarkTools
function randmvector(::Val{N}, T = Float64) where N
    mv = MVector{N,T}(undef)
    @inbounds for i ∈ 1:N
        mv[i] = rand()
    end
    mv
end
x = randmvector(Val(512));
y = randmvector(Val(512));
z = MVector{512,Float64}(undef);

@btime special!($z, 2.0, $x, $y)
@btime vectorized_special!($z, 2.0, $x, $y)

2 Likes

Then what is the reason to use iterators instead of loops?

In fact, I don’t understand iterators well. The package Iterators.jl doesn’t even have documentation, and neither does IterTools.jl.

You’ve resurrected two old threads posting basically the same message in each, which didn’t really add to the original content. In a situation like this, please just make a new thread.

Also, see https://docs.julialang.org/en/v1/base/iterators/
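For a quick taste of why they can be worth it – lazy, composable transformations that still compile down to ordinary loops:

julia> collect(Iterators.take(Iterators.filter(iseven, 1:100), 3))
3-element Array{Int64,1}:
 2
 4
 6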

2 Likes