Making iterators as fast as loops

I’m trying to build an efficient iterator for a bunch of nested loops, where the inner loops depend on the outer loops. I’m having trouble getting the iterator approach to perform comparably to the loop approach.

This is a minimal example of the kind of loop I’m interested in

function test_loop(n)
    k = 0
    for i in 1:n, j in i^2:i^3
        k += i + j
    end
    return k
end

I would like to write it as

function test_iter(n)
    k = 0
    for (i, j) in MyIter(1:n)
        k += i + j
    end
    return k
end

This is what I’ve tried

struct MyIter
    irange::UnitRange{Int}
end
struct MyState
    i::Int
    ist::Int
    jrange::UnitRange{Int}
    j::Int
    jst::Int
end
const endstate = MyState(0, 0, 0:0, 0, 0)  # Sentinel for type stability of state
MyState(it::MyIter) = MyState(it, iterate(it.irange))
MyState(it::MyIter, ::Nothing) = endstate
MyState(it::MyIter, (i, ist)) = (jrange = i^2:i^3; MyState(it, i, ist, jrange, iterate(jrange)))
MyState(it::MyIter, i, ist, jrange, ::Nothing) = MyState(it, iterate(it.irange, ist))
MyState(it::MyIter, i, ist, jrange, (j, jst)) = MyState(i, ist, jrange, j, jst)

nextstate(it::MyIter, s::MyState) = MyState(it, s.i, s.ist, s.jrange, iterate(s.jrange, s.jst))

function Base.iterate(it::MyIter, s = MyState(it))
    s == endstate && return nothing
    newstate = nextstate(it, s)
    return (s.i, s.j), newstate
end

While this code is clear enough to me, and it seems perfectly type-stable to the best of my understanding, I get very poor performance with the iterator approach

julia> using BenchmarkTools

julia> @btime test_loop(10)
  17.948 ns (0 allocations: 0 bytes)
1000604

julia> @btime test_iter(10)
  3.111 μs (0 allocations: 0 bytes)
1000604

What am I doing wrong? What is a good, performant approach for this?

2 Likes

I don’t know a lot about this, but have you considered Transducers.jl? It tends to be quite performant compared to iterators.

First of all, realize that loops like for i = 1:10 are implemented with pure-Julia iterators. So the problem is not “loops vs. iterators,” it is that your iterator is slow compared to the Range iterators because your iterate function does a lot of operations.
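For reference, here is roughly what a for loop lowers to (per the interfaces chapter of the manual):

next = iterate(itr)
while next !== nothing
    (i, state) = next
    # body of the loop
    next = iterate(itr, state)
end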

For example, your termination condition is s == endstate, which requires six Int comparisons, compared to one i > 10 comparison required for 1:10 iteration.
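Compare with unit-range iteration, which in Base is essentially this simple (a simplified sketch, not the exact Base source):

iterate_sketch(r::UnitRange{Int}, i::Int = first(r)) =
    i > last(r) ? nothing : (i, i + 1)  # one comparison, then bump the counter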

4 Likes

The main reason for the difference is that large parts of your test_loop are constant-folded. It’s easy to see: The complexity of your loop is O(n^4), right? So growing the problem size by 10x, you’d expect roughly a 10,000x longer runtime. Yet:

julia> @btime test_loop(10)
  23.441 ns (0 allocations: 0 bytes)
1000604

julia> @btime test_loop(100)
  231.024 ns (0 allocations: 0 bytes)
7396369649165

Compare with the iterator:

julia> @btime test_iter(10)
  3.929 μs (0 allocations: 0 bytes)
1000604

julia> @btime test_iter(100)
  36.426 ms (0 allocations: 0 bytes)
7396369649165

36 ms / 3.9 μs ≈ 10,000x slower

3 Likes

True! So how can I modify MyIter to make it do constant folding? Or better still, where in Base should I look for inspiration to solve this optimally?

For example, your termination condition is s == endstate, which requires six Int comparisons, compared to one i > 10 comparison required for 1:10 iteration.

I tried changing to s === endstate (which AFAIU doesn’t compare all fields), but that doesn’t improve performance

Do the simplification manually:

function test_loop_formula(n)
    k = 0
    for i in 1:n
        k += (i^3 - i^2 + 1) * (i^3 + i^2 + 2i) ÷ 2
    end
    return k
end
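(Where the formula comes from: the inner range i^2:i^3 has i^3 - i^2 + 1 terms, so the inner sum is (i^3 - i^2 + 1)*i + (i^2 + i^3)*(i^3 - i^2 + 1) ÷ 2, which factors into the expression above.)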

julia> test_loop(100)
7396369649165

julia> test_loop_formula(100)
7396369649165

Or benchmark on something closer to your actual problem, which presumably can’t be simplified this way?

1 Like

Nope, an immutable struct is identified by its value, so === still compares the contents.
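A quick REPL check of that, with a throwaway struct P (just for illustration):

julia> struct P; a::Int; b::Int; end

julia> P(1, 2) === P(1, 2)  # egal: immutable plain-data structs compare by content
true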

2 Likes

I have tried this:

abstract type MyState end

struct MyEmptyState <: MyState end

struct MyNonEmptyState <: MyState
    i::Int
    ist::Int
    jrange::UnitRange{Int}
    j::Int
    jst::Int
end

And the performance didn’t improve much.

@btime test_loop(10)
@btime test_iter(10)

  21.257 ns (0 allocations: 0 bytes)
  13.603 μs (0 allocations: 0 bytes)

Looking over the posted code again, I suspect that the biggest problem here is that the nextstate function relies on dynamic dispatch to decide which MyState constructor to call. Putting dynamic dispatch in your innermost loop is going to be very slow.
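For illustration, here’s a minimal sketch of a dispatch-free version that keeps the whole state in a plain tuple and branches explicitly (MyIter2 is a made-up name; the sketch assumes every inner range is nonempty, which holds here since i^2 ≤ i^3 for i ≥ 1):

struct MyIter2
    irange::UnitRange{Int}
end

function Base.iterate(it::MyIter2)
    ii = iterate(it.irange)
    ii === nothing && return nothing
    i, ist = ii
    jrange = i^2:i^3
    jj = iterate(jrange)
    jj === nothing && return nothing  # can't happen for i >= 1
    j, jst = jj
    return (i, j), (i, ist, jrange, jst)
end

function Base.iterate(it::MyIter2, (i, ist, jrange, jst))
    jj = iterate(jrange, jst)
    if jj === nothing                 # inner range exhausted: step the outer range
        ii = iterate(it.irange, ist)
        ii === nothing && return nothing
        i, ist = ii
        jrange = i^2:i^3
        jj = iterate(jrange)
        jj === nothing && return nothing  # again impossible for i >= 1
    end
    j, jst = jj
    return (i, j), (i, ist, jrange, jst)
end

For comparison, a nested generator, ((i, j) for i in 1:n for j in i^2:i^3), expresses the same iteration with Base’s own machinery and can be a useful baseline to benchmark against.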

3 Likes

Yes, although the six-number comparison is removed, @code_warntype shows that the type is not all that stable. In fact, mine is slower than his original code.

Benchmarks of his original code on my computer:

@btime test_loop(10)
@btime test_iter(10)

  21.575 ns (0 allocations: 0 bytes)
  3.616 μs (0 allocations: 0 bytes)

After performing these evil tricks:

const endstate = MyState(-1, 0, 0:0, 0, 0)
s.i === endstate.i && return nothing

I got:

  21.581 ns (0 allocations: 0 bytes)
  2.215 μs (0 allocations: 0 bytes)

Let me reiterate – the biggest problem here is that in the loop version, the compiler is able to simplify the inner cubic complexity loop to a constant formula. Similar to k += (i^3 - i^2 + 1) * (i^3 + i^2 + 2i) ÷ 2. This does not happen in the iterator version. So the loop version does O(n) operations, while the iterator version does O(n^4) operations. The test is flawed.
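One way to confirm this is to inspect the compiled code (output elided here, since it’s long); if the inner loop has been folded away, you won’t see a second nested loop in the IR:

julia> @code_llvm test_loop(10)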

For a more accurate test, we can for example do array indexing in the inner loop. (Don’t worry about the memory access; once the memory is cached, it has very low overhead.)

function test_loop(n, v)
    k = zero(eltype(v))
    for i = 1:n, j = i^2:i^3
        k += @inbounds v[i + j]
    end
    return k
end

function test_iter(n, v)
    k = zero(eltype(v))
    for (i, j) = MyIter(1:n)
        k += @inbounds v[i + j]
    end
    return k
end

Comparing performance:

julia> n = 10;

julia> v = rand(n^3 + 10n);

julia> @btime test_loop($n, $v)
  3.032 μs (0 allocations: 0 bytes)
1314.111504915699

julia> @btime test_iter($n, $v)
  3.953 μs (0 allocations: 0 bytes)
1314.111504915699

So the difference is not very big. As has been pointed out a few times above, a secondary issue is that the endstate comparison is inefficient. Let’s fix that:

-    s == endstate && return nothing
+    s.i == endstate.i && return nothing

Timing again:

julia> @btime test_iter($n, $v)
  3.134 μs (0 allocations: 0 bytes)
1314.111504915699

This is around 3 % slower than the loop version. If the iterator is a better fit for your problem than a loop, I think this should be good enough. To sanity check, note that the inner loop is hit 2650 times for n = 10, which corresponds to 3134 / 2650 ≈ 1.2 ns per inner loop, or 3.4 clock cycles on my 2.9 GHz system. The overhead of the iterator version over the loop version is 0.11 clock cycles per iteration.
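(The 2650 count is just sum(i -> i^3 - i^2 + 1, 1:10), the total number of inner-loop iterations.)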

(A word of caution: The above benchmarks assume that the runtime is dominated by the inner iteration (i^2 : i^3). If this is not the case, the iterator version will be a bit slower.)

10 Likes

That’s really useful @bennedich. Indeed, that MWE seems not to be representative of my actual code, which I suspect is closer to your example above. However, my real loop and iterator code do not perform equally. I don’t have time now to delve deeper (family calls!), but I’ll investigate more later and post again. Thanks!

It turns out @bennedich and @stevengj were exactly right. Apart from the constant-folding artefact in the original example, the real-world iterator code was actually doing more work than the loop version (the loop version didn’t need most of the values generated by the iterator, so it never computed them). Doing an apples-to-apples comparison, it is indeed true: iterators and loops are equally fast, as @bennedich’s example demonstrates.

(One thing I didn’t fully understand is @stevengj’s statement that nextstate needs dynamic dispatch. It’s true that at different points in the execution it dispatches to different methods, but is that the same as dynamic dispatch? It is now clear that this does not incur any performance penalty compared to the loop.)

Good lessons, thanks to all.

EDIT: Actually, with this insight I was now able to make the real-world iterator version even marginally faster than the loop version (the iterator allows to cleanly shave off some computations), so I’m happy!

That’s dynamic dispatch: which method it dispatches to is determined at runtime, not compile time.

However, since in this case there are only two possible methods to dispatch to, I guess the compiler is able to turn this into a single branch, rather than using slow generic dispatch.
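In miniature, that optimization (union splitting) looks like this; f and g here are made-up examples:

f(x) = x > 0 ? 1 : 1.0   # return type inferred as Union{Int64, Float64}
g(x) = f(x) + 1          # the call to f compiles to an if/else on the
                         # concrete type, not a dynamic method lookup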

3 Likes

Hmm. As an experiment, I just tried:

julia> function iter1(x)
           s = zero(eltype(x))
           @inbounds @simd for i ∈ 1:length(x)
               s += x[i]
           end
           s
       end
iter1 (generic function with 1 method)

julia> function iter2(x)
           s = zero(eltype(x))
           @inbounds @simd for i ∈ 1:1:length(x)
               s += x[i]
           end
           s
       end
iter2 (generic function with 1 method)

julia> x = rand(200);

julia> using BenchmarkTools

julia> @btime iter1($x)
  36.869 ns (0 allocations: 0 bytes)
98.7796609023857

julia> @btime iter2($x)
  51.419 ns (0 allocations: 0 bytes)
98.7796609023857

Curious about the case where we actually want to take steps larger than 1: would it be faster to use 1:step:N, or 1:(N÷step) with iter = i * step?

julia> using Base.Cartesian

julia> function iter4(x) # requires length(x) % 4 == 0
           @nexprs 4 j -> s_j = zero(eltype(x))
           @inbounds @fastmath for i ∈ 0:4:length(x)-4
               @nexprs 4 j -> s_j += x[i+j]
           end
           s_1 + s_2 + s_3 + s_4
       end
iter4 (generic function with 1 method)

julia> function iter5(x)
           @nexprs 4 j -> s_j = zero(eltype(x))
           @inbounds @fastmath for i ∈ 0:(length(x)>>2)-1
               k = 4i
               @nexprs 4 j -> s_j += x[k+j]
           end
           s_1 + s_2 + s_3 + s_4
       end
iter5 (generic function with 1 method)

julia> @btime iter4($x)
  73.623 ns (0 allocations: 0 bytes)
98.7796609023857

julia> @btime iter5($x)
  58.758 ns (0 allocations: 0 bytes)
98.7796609023857

I’m on a six-day-old master:

julia> versioninfo()
Julia Version 1.2.0-DEV.377
Commit 4b01e6a60f (2019-02-26 16:06 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: Intel(R) Core(TM) i3-4010U CPU @ 1.70GHz

All four @code_warntype outputs showed ::Union{Nothing, Tuple{Int64,Int64}}, but the step ranges generated a little more code before the loop body. In all cases the loop condition compiled down to a simple jne jump (but the @simd loop was more unrolled).

Something I may keep in mind.

3 Likes

Interesting observation! Looking at the assembly, we can tell what’s going on. As you’re hinting, the steprange version has a non-inlined call to calculate the range length before iteration starts. The unit range on the other hand is fully inlined. The loops themselves are identical between the two versions, so the overhead will be a fixed 25 - 30 clock cycles (~8-15 ns) regardless of vector length. At least that’s what I’m seeing on my system.

However, the inner loops of the two steps-of-1 functions and the two steps-of-4 functions are not the same. The former pair unrolls the loop 4 times (does 4 additions, then loops), while the latter has no unrolling (a single addition, then loop). Interestingly, this is the complete opposite of how they’re written in the source (the former pair is not manually unrolled, the latter is).

The unrolling makes a huge difference, because it accumulates into 4 separate registers, which means that data dependencies can be avoided, and the speedup is close to 4x on my system (for large enough vectors that the function overhead is not relevant):

julia> x = rand(10000);

julia> @btime iter1($x)
  736.047 ns (0 allocations: 0 bytes)
5010.054710248716

julia> @btime iter5($x)
  2.824 μs (0 allocations: 0 bytes)
5010.054710248716

julia> 2824/736
3.8369565217391304

I’m sure that this can be achieved in the steps-of-4 version as well, perhaps with the use of SIMD.jl.

5 Likes

With avx2, it’s actually unrolled by 16 for iter1 and iter2, and by 4 (as written) for iter4 and iter5.
The ymm registers are each 256 bits, containing 4 Float64.
I’m impressed by that level of improvement from out-of-order execution.
(EDIT: if you have avx512, that may help explain why @simd is so much faster. Then it’d be unrolling by 32 vs 4, and each vadd in the former would do 8 additions. If you’re on avx2, I’m really impressed by how much the extra unrolling bought you!)

I’m not at a computer at the moment, but you could try manually unrolling by 16 instead (don’t forget to correct the loop bounds). I’d expect it to give similar performance.
You’d have to either add another loop to catch the rest, or benchmark on vectors with lengths that are a multiple of 16, when comparing to the autovectorized version.

EDIT:
Okay, back on a computer. I now defined iter6:

function iter6(x)
    @nexprs 16 j -> s_j = zero(eltype(x))
    @inbounds @fastmath for i ∈ 0:(length(x)>>4)-1
        k = 16i
        @nexprs 16 j -> s_j += x[k+j]
    end
    @nexprs 4 j -> ss_j = s_j + s_{j+4} + s_{j+8} + s_{j+12}
    ss_1 + ss_2 + ss_3 + ss_4
end

Now the loop bodies are nearly identical. iter1 (@simd):

L112:
	vaddpd	-96(%rsi), %ymm0, %ymm0
	vaddpd	-64(%rsi), %ymm1, %ymm1
	vaddpd	-32(%rsi), %ymm2, %ymm2
	vaddpd	(%rsi), %ymm3, %ymm3
; │└
; │┌ @ simdloop.jl:74 within `macro expansion'
; ││┌ @ int.jl:53 within `+'
	subq	$-128, %rsi
	addq	$-16, %rdi
	jne	L112
and iter6:

L64:
	vaddpd	(%rcx), %ymm0, %ymm0
	vaddpd	32(%rcx), %ymm1, %ymm1
	vaddpd	64(%rcx), %ymm3, %ymm3
	vaddpd	96(%rcx), %ymm2, %ymm2
; │└
; │┌ @ range.jl:595 within `iterate'
; ││┌ @ promotion.jl:403 within `=='
	subq	$-128, %rcx
	decq	%rax
; │└└
	jne	L64

Similarly, the run times are almost the same:

julia> x = rand(1024);

julia> @btime iter1($x)
  68.709 ns (0 allocations: 0 bytes)
530.6541640341687

julia> @btime iter6($x)
  69.549 ns (0 allocations: 0 bytes)
530.6541640341687

julia> @btime iter5($x)
  173.132 ns (0 allocations: 0 bytes)
530.6541640341687

julia> @btime iter1($x)
  68.361 ns (0 allocations: 0 bytes)
530.6541640341687

julia> @btime iter6($x)
  69.549 ns (0 allocations: 0 bytes)
530.6541640341687

julia> @btime iter5($x)
  174.080 ns (0 allocations: 0 bytes)
530.6541640341687

The slight advantage of iter1 over iter6 seems consistent.

This was on Ryzen, and I got about a 2.5x speedup from the 4x unrolling.
Now…
Moving to Skylake-X, with avx512:

julia> x = rand(1024);

julia> @btime iter1($x)
  29.732 ns (0 allocations: 0 bytes)
528.0621589149741

julia> @btime iter5($x)
  185.889 ns (0 allocations: 0 bytes)
528.0621589149741

julia> @btime iter6($x)
  55.177 ns (0 allocations: 0 bytes)
528.062158914974

julia> @btime iter1($x)
  27.871 ns (0 allocations: 0 bytes)
528.0621589149741

julia> @btime iter5($x)
  186.176 ns (0 allocations: 0 bytes)
528.0621589149741

julia> @btime iter6($x)
  55.062 ns (0 allocations: 0 bytes)
528.062158914974

and I still see about a 2x speedup from iter6 to iter1, while iter5 is a full 6.5x slower.

3 Likes

Nice trick of manually unrolling by 16 to generate efficient code (on AVX2)!

I differentiate between “unrolling” and “vectorizing”, btw: with vectorizing I refer to using SIMD to operate on multiple data in a single instruction, and with unrolling I refer to how many such instructions are carried out per loop iteration. So AVX2 lets us operate on 4 64-bit floats at a time, and unrolling the loop by a factor of 4 means that a total of 16 floats are handled per loop iteration (by 4 SIMD instructions). (Perhaps there are more established terms for this?)

The main advantage of unrolling, as I see it, is not to do fewer iterations (since the loop overhead is cheap due to correctly predicted jumps), but to avoid pipeline stalls due to instruction latency.
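A scalar-code miniature of that idea (sum2acc is a made-up helper; it assumes length(x) is even):

function sum2acc(x)
    s1 = zero(eltype(x))
    s2 = zero(eltype(x))
    @inbounds for i in 1:2:length(x)
        s1 += x[i]      # the two accumulation chains are independent,
        s2 += x[i+1]    # so the CPU can overlap their latencies
    end
    return s1 + s2
end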

2 Likes