Making iterators as fast as loops

Do the simplification manually:

function test_loop_formula(n)
    k = 0
    for i in 1:n
        k += (i^3 - i^2 + 1) * (i^3 + i^2 + 2i) ÷ 2
    end
    return k
end

julia> test_loop(100)
7396369649165

julia> test_loop_formula(100)
7396369649165

Or benchmark on something closer to your actual problem, which presumably can’t be simplified this way?

1 Like

Nope, an immutable struct is identified by its value, so === still compares the contents.
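For example (a quick REPL check with a hypothetical two-field struct):

julia> struct Point; x::Int; y::Int; end

julia> Point(1, 2) === Point(1, 2)
true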

2 Likes

I have tried this:

abstract type MyState end

struct MyEmptyState <: MyState end

struct MyNonEmptyState <: MyState
    i::Int
    ist::Int
    jrange::UnitRange{Int}
    j::Int
    jst::Int
end

And it didn’t ease the performance issue much.

@btime test_loop(10)
@btime test_iter(10)

  21.257 ns (0 allocations: 0 bytes)
  13.603 μs (0 allocations: 0 bytes)

Looking over the posted code again, I suspect that the biggest problem here is that the nextstate function relies on dynamic dispatch to decide which MyState constructor to call. Putting dynamic dispatch in your innermost loop is going to be very slow.
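One way around that is to use a single concrete state and signal termination with nothing, so the return type is a small union the compiler can turn into a branch. A minimal sketch (assuming MyIter yields pairs (i, j) with j ∈ i^2:i^3, like the loop version — this is not your actual code):

struct MyIter
    irange::UnitRange{Int}
end

# One concrete Tuple{Int,Int} state, with `nothing` as the end marker.
# `iterate` then returns Union{Nothing, Tuple{Tuple{Int,Int}, Tuple{Int,Int}}},
# a small union that compiles to a branch rather than a dynamic dispatch,
# and there is no endstate struct to compare against.
function Base.iterate(it::MyIter, state = (first(it.irange), first(it.irange)^2))
    i, j = state
    i > last(it.irange) && return nothing
    next = j < i^3 ? (i, j + 1) : (i + 1, (i + 1)^2)
    return (i, j), next
end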

3 Likes

Yes, although the comparison of six numbers is removed, @code_warntype shows that the type is still not stable. In fact, mine is slower than his original code.

Benchmarks of his original code on my computer:

@btime test_loop(10)
@btime test_iter(10)

  21.575 ns (0 allocations: 0 bytes)
  3.616 μs (0 allocations: 0 bytes)

After performing these evil tricks:

const endstate = MyState(-1, 0, 0:0, 0, 0)
s.i === endstate.i && return nothing

I got:

  21.581 ns (0 allocations: 0 bytes)
  2.215 μs (0 allocations: 0 bytes)

Let me reiterate – the biggest problem here is that in the loop version, the compiler is able to simplify the inner cubic-complexity loop to a constant formula, similar to k += (i^3 - i^2 + 1) * (i^3 + i^2 + 2i) ÷ 2. This does not happen in the iterator version. So the loop version does O(n) operations, while the iterator version does O(n^4) operations. The test is flawed.
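(That formula is just the arithmetic-series closed form of the inner sum, assuming the inner loop body was k += i + j; a quick spot check:)

julia> i = 7; sum(i + j for j in i^2:i^3) == (i^3 - i^2 + 1) * (i^3 + i^2 + 2i) ÷ 2
true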

For a more accurate test, we can for example do array indexing in the inner loop. (Don’t worry about the memory access, once the memory is cached, it has very low overhead.)

function test_loop(n, v)
    k = zero(eltype(v))
    for i = 1:n, j = i^2:i^3
        k += @inbounds v[i + j]
    end
    return k
end

function test_iter(n, v)
    k = zero(eltype(v))
    for (i, j) = MyIter(1:n)
        k += @inbounds v[i + j]
    end
    return k
end

Comparing performance:

julia> n = 10;

julia> v = rand(n^3 + 10n);

julia> @btime test_loop($n, $v)
  3.032 μs (0 allocations: 0 bytes)
1314.111504915699

julia> @btime test_iter($n, $v)
  3.953 μs (0 allocations: 0 bytes)
1314.111504915699

So the difference is not very big. As has been pointed out a few times above, a secondary issue is that the endstate comparison is inefficient. Let’s fix that:

-    s == endstate && return nothing
+    s.i == endstate.i && return nothing

Timing again:

julia> @btime test_iter($n, $v)
  3.134 μs (0 allocations: 0 bytes)
1314.111504915699

This is around 3 % slower than the loop version. If the iterator is a better fit for your problem than a loop, I think this should be good enough. To sanity check, note that the inner loop is hit 2650 times for n = 10, which corresponds to 3134 / 2650 ≈ 1.2 ns per inner loop, or 3.4 clock cycles on my 2.9 GHz system. The overhead of the iterator version over the loop version is 0.11 clock cycles per iteration.
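(The 2650 count is easy to verify:)

julia> sum(i -> length(i^2:i^3), 1:10)
2650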

(A word of caution: The above benchmarks assume that the runtime is dominated by the inner iteration (i^2 : i^3). If this is not the case, the iterator version will be a bit slower.)

10 Likes

That’s really useful @bennedich. Indeed, that MWE seems not to be representative of my actual code, which I suspect is closer to your example above. However, my real loop and iterator codes do not perform equally. I don’t have time now to delve deeper (family calls!), but I’ll investigate more later and post again. Thanks!

It turns out @bennedich and @stevengj were exactly right. Apart from the constant-folding artefact in the original example, the real-world code with the iterator was actually doing more work than the loop version (the loop version didn’t need most of the values generated by the iterator, so they never got computed). Doing an apples-to-apples comparison, it is indeed true: iterators and loops are equally fast, as @bennedich’s example demonstrates.

(One thing I didn’t fully understand is @stevengj’s statement that nextstate needs dynamic dispatch. It’s true that at different points in the execution it dispatches to different methods, but is that the same as dynamic dispatch? It is now clear that this does not incur any performance penalty compared to the loop.)

Good lessons, thanks to all.

EDIT: Actually, with this insight I was now able to make the real-world iterator version even marginally faster than the loop version (the iterator allows to cleanly shave off some computations), so I’m happy!

That’s dynamic dispatch: which method it dispatches to is determined at runtime, not compile time.

However, since in this case there are only two possible methods to dispatch to, I guess the compiler is able to turn this into a single branch, rather than using slow generic dispatch.
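Roughly like this (a toy sketch, not the thread’s code, to illustrate union splitting):

# When inference narrows a result to a small union such as Union{A,B},
# the call site is compiled with an `isa` branch instead of a runtime
# lookup in the method table.
struct A end
struct B end
pick(flag::Bool) = flag ? A() : B()   # inferred return type: Union{A,B}
val(::A) = 1
val(::B) = 2
f(flag) = val(pick(flag))   # ≈ r = pick(flag); r isa A ? val(r::A) : val(r::B)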

3 Likes

Hmm. As an experiment, I just tried:

julia> function iter1(x)
           s = zero(eltype(x))
           @inbounds @simd for i ∈ 1:length(x)
               s += x[i]
           end
           s
       end
iter1 (generic function with 1 method)

julia> function iter2(x)
           s = zero(eltype(x))
           @inbounds @simd for i ∈ 1:1:length(x)
               s += x[i]
           end
           s
       end
iter2 (generic function with 1 method)

julia> x = rand(200);

julia> using BenchmarkTools

julia> @btime iter1($x)
  36.869 ns (0 allocations: 0 bytes)
98.7796609023857

julia> @btime iter2($x)
  51.419 ns (0 allocations: 0 bytes)
98.7796609023857

Curious about the case of “what if we actually want to take steps larger than 1?” – would it be faster to use 1:step:N, or 1:(N÷step) with iter = i * step?

julia> using Base.Cartesian

julia> function iter4(x) # requires length(x) % 4 == 0
           @nexprs 4 j -> s_j = zero(eltype(x))
           @inbounds @fastmath for i ∈ 0:4:length(x)-4
               @nexprs 4 j -> s_j += x[i+j]
           end
           s_1 + s_2 + s_3 + s_4
       end
iter4 (generic function with 1 method)

julia> function iter5(x)
           @nexprs 4 j -> s_j = zero(eltype(x))
           @inbounds @fastmath for i ∈ 0:(length(x)>>2)-1
               k = 4i
               @nexprs 4 j -> s_j += x[k+j]
           end
           s_1 + s_2 + s_3 + s_4
       end
iter5 (generic function with 1 method)

julia> @btime iter4($x)
  73.623 ns (0 allocations: 0 bytes)
98.7796609023857

julia> @btime iter5($x)
  58.758 ns (0 allocations: 0 bytes)
98.7796609023857

I’m on a six-day-old master:

julia> versioninfo()
Julia Version 1.2.0-DEV.377
Commit 4b01e6a60f (2019-02-26 16:06 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: Intel(R) Core(TM) i3-4010U CPU @ 1.70GHz

All four @code_warntype outputs showed ::Union{Nothing, Tuple{Int64,Int64}}, but the StepRanges generated a little more code before the loop body. In all cases the loop condition compiled down to a simple jne jump (but the @simd loop was more unrolled).

Something I may keep in mind.

3 Likes

Interesting observation! Looking at the assembly, we can tell what’s going on. As you’re hinting, the steprange version has a non-inlined call to calculate the range length before iteration starts. The unit range on the other hand is fully inlined. The loops themselves are identical between the two versions, so the overhead will be a fixed 25 - 30 clock cycles (~8-15 ns) regardless of vector length. At least that’s what I’m seeing on my system.

However, the inner loops between the two steps-of-1 functions and the two steps-of-4 functions are not the same. The former unrolls the loop 4 times (does 4 additions, then loops), while the latter has no unrolling (single addition, then loop). Interestingly, this is the complete opposite of how it’s written in the functions (former not unrolled, latter manually unrolled).

The unrolling makes a huge difference, because it accumulates in 4 separate registers, which means that data dependencies can be avoided, and the speedup is close to 4x on my system (for vectors large enough that the function overhead is not relevant):

julia> x = rand(10000);

julia> @btime iter1($x)
  736.047 ns (0 allocations: 0 bytes)
5010.054710248716

julia> @btime iter5($x)
  2.824 μs (0 allocations: 0 bytes)
5010.054710248716

julia> 2824/736
3.8369565217391304

I’m sure that this can be achieved in the steps-of-4 version as well, perhaps with the use of SIMD.jl.

5 Likes

With avx2, it’s actually unrolled by 16 for iter1 and iter2, and by 4 (as written) for iter4 and iter5.
The ymm registers are each 256 bits, containing 4 Float64.
I’m impressed by that level of improvement from out-of-order execution.
(EDIT: if you have avx512, that may help explain why @simd is so much faster. Then it’d be unrolling by 32 vs 4, and each vadd in the former would do 8 additions. If you’re on avx2, I’m really impressed by how much the extra unrolling bought you!)

I’m not at a computer at the moment, but you could try manually unrolling by 16 instead (don’t forget to correct the loop bounds). I’d expect it to give similar performance.
You’d have to either add another loop to catch the rest, or benchmark on vectors with lengths that are a multiple of 16, when comparing to the autovectorized version.

EDIT:
Okay, back on a computer. I now defined iter6:

function iter6(x)
    @nexprs 16 j -> s_j = zero(eltype(x))
    @inbounds @fastmath for i ∈ 0:(length(x)>>4)-1
        k = 16i
        @nexprs 16 j -> s_j += x[k+j]
    end
    @nexprs 4 j -> ss_j = s_j + s_{j+4} + s_{j+8} + s_{j+12}
    ss_1 + ss_2 + ss_3 + ss_4
end

Now the loop bodies are as follows. iter1 (@simd):

L112:
	vaddpd	-96(%rsi), %ymm0, %ymm0
	vaddpd	-64(%rsi), %ymm1, %ymm1
	vaddpd	-32(%rsi), %ymm2, %ymm2
	vaddpd	(%rsi), %ymm3, %ymm3
; │└
; │┌ @ simdloop.jl:74 within `macro expansion'
; ││┌ @ int.jl:53 within `+'
	subq	$-128, %rsi
	addq	$-16, %rdi
	jne	L112
and iter6:

L64:
	vaddpd	(%rcx), %ymm0, %ymm0
	vaddpd	32(%rcx), %ymm1, %ymm1
	vaddpd	64(%rcx), %ymm3, %ymm3
	vaddpd	96(%rcx), %ymm2, %ymm2
; │└
; │┌ @ range.jl:595 within `iterate'
; ││┌ @ promotion.jl:403 within `=='
	subq	$-128, %rcx
	decq	%rax
; │└└
	jne	L64

Similarly, the run times are almost the same:

julia> x = rand(1024);

julia> @btime iter1($x)
  68.709 ns (0 allocations: 0 bytes)
530.6541640341687

julia> @btime iter6($x)
  69.549 ns (0 allocations: 0 bytes)
530.6541640341687

julia> @btime iter5($x)
  173.132 ns (0 allocations: 0 bytes)
530.6541640341687

julia> @btime iter1($x)
  68.361 ns (0 allocations: 0 bytes)
530.6541640341687

julia> @btime iter6($x)
  69.549 ns (0 allocations: 0 bytes)
530.6541640341687

julia> @btime iter5($x)
  174.080 ns (0 allocations: 0 bytes)
530.6541640341687

The slight advantage of iter1 over iter6 seems consistent.

This was on Ryzen, and I got about a 2.5x speedup from the 4x unrolling.
Now…
Moving to Skylake-X, with avx512:

julia> x = rand(1024);

julia> @btime iter1($x)
  29.732 ns (0 allocations: 0 bytes)
528.0621589149741

julia> @btime iter5($x)
  185.889 ns (0 allocations: 0 bytes)
528.0621589149741

julia> @btime iter6($x)
  55.177 ns (0 allocations: 0 bytes)
528.062158914974

julia> @btime iter1($x)
  27.871 ns (0 allocations: 0 bytes)
528.0621589149741

julia> @btime iter5($x)
  186.176 ns (0 allocations: 0 bytes)
528.0621589149741

julia> @btime iter6($x)
  55.062 ns (0 allocations: 0 bytes)
528.062158914974

and I still see about a 2x speedup from iter6 to iter1, while iter5 is a full 6.5x slower.

3 Likes

Nice trick of manually unrolling by x16 to generate efficient code (on AVX2)!

I differentiate between “unrolling” and “vectorizing” btw – with vectorizing I refer to using SIMD to operate on multiple data in a single instruction, and with unrolling I refer to how many such instructions are carried out per loop iteration. So AVX2 lets us operate on 4x 64-bit floats at a time, and unrolling the loop by a factor of 4 means that a total of 16 floats are handled per loop iteration (by 4 SIMD instructions). (Perhaps there are more established terms for this?)

The main advantage of unrolling, as I see it, is not to do fewer iterations (since the loop overhead is cheap due to correctly predicted jumps), but to avoid pipeline stalls due to instruction latency.
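A minimal scalar illustration (assuming an even-length vector): with two independent accumulators, the adds form two dependency chains and can overlap in the pipeline instead of serializing on a single register.

function sum2acc(x)   # assumes iseven(length(x))
    s1 = s2 = zero(eltype(x))
    @inbounds for i in 1:2:length(x)
        s1 += x[i]       # chain 1
        s2 += x[i + 1]   # chain 2, independent of s1
    end
    s1 + s2
end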

2 Likes

Nice trick of manually unrolling by x16 to generate efficient code (on AVX2)!

FWIW, I was interested in UnitRange vs StepRange because I have a library called LoopVectorization.jl, which vectorizes, and optionally unrolls, loops.
The idea is that it’s a blunt, brute-force instrument.
(LoopVectorization also depends on the unregistered VectorizationBase, SIMDPirates, and the SIMDPirates fork of SLEEF.jl – I’ll eventually make a new repo called SLEEFPirates to avoid having to git fetch origin && git checkout -b SIMDPirates remotes/origin/SIMDPirates).

Sometimes, no matter how much you try and argue with the autovectorizer, it may refuse to cooperate. Especially when pointers are involved. For example:

function axpy!(z, a, x, y)
    @inbounds @simd ivdep for i ∈ eachindex(z, x, y)
        z[i] = a * x[i] + y[i]
    end
end
using StaticArrays, BenchmarkTools
function randmvector(::Val{N}, T = Float64) where N
    # Compiles quickly, regardless of size of MVector
    # unlike @MVector rand(N)
    mv = MVector{N,T}(undef)
    @inbounds for i ∈ 1:N
        mv[i] = rand()
    end
    mv
end
x = randmvector(Val(512));
y = randmvector(Val(512));
z = MVector{512,Float64}(undef);
@code_native axpy!(z, 2.0, x, y)

This yields:

julia> @code_native axpy!(z, 2.0, x, y)
	.text
; ┌ @ REPL[1]:2 within `axpy!'
	movq	$-4096, %rax            # imm = 0xF000
	nopw	(%rax,%rax)
; │┌ @ simdloop.jl:73 within `macro expansion' @ REPL[1]:3
; ││┌ @ float.jl:399 within `*'
L16:
	vmulsd	4096(%rsi,%rax), %xmm0, %xmm1
; ││└
; ││ @ simdloop.jl:73 within `macro expansion' @ float.jl:395
	vaddsd	4096(%rdx,%rax), %xmm1, %xmm1
; ││ @ simdloop.jl:73 within `macro expansion' @ REPL[1]:3
; ││┌ @ MArray.jl:130 within `setindex!'
; │││┌ @ gcutils.jl:87 within `macro expansion'
; ││││┌ @ pointer.jl:118 within `unsafe_store!'
	vmovsd	%xmm1, 4096(%rdi,%rax)
; │└└└└
; │┌ @ int.jl:49 within `macro expansion'
	addq	$8, %rax
; └└
; ┌ @ simdloop.jl:71 within `axpy!'
	jne	L16
; └
; ┌ @ REPL[1]:2 within `axpy!'
	retq
	nopw	%cs:(%rax,%rax)
; └

The loop is just:

L16:
	vmulsd	4096(%rsi,%rax), %xmm0, %xmm1
	vaddsd	4096(%rdx,%rax), %xmm1, %xmm1
	vmovsd	%xmm1, 4096(%rdi,%rax)
	addq	$8, %rax
	jne	L16

Note that vmulsd and vaddsd multiply and add a single double at a time (hence the sd, “scalar double”, suffix instead of pd, “packed double”).
Yikes!
It’s naturally very slow:

julia> @btime axpy!($z, 2.0, $x, $y)
  228.304 ns (0 allocations: 0 bytes)

While

using LoopVectorization
function axpy2!(z, a, x, y)
    @vectorize for i ∈ eachindex(z, x, y)
        z[i] = a * x[i] + y[i]
    end
end
@code_native axpy2!(z, 2.0, x, y)

Yields:

L16:
	vmovupd	4096(%rsi,%rax), %zmm1
	vfmadd213pd	4096(%rdx,%rax), %zmm0, %zmm1
	vmovupd	%zmm1, 4096(%rdi,%rax)
	addq	$64, %rax
	jne	L16

I also added an unroll argument:

function axpy3!(z, a, x, y)
    @vectorize 2 for i ∈ eachindex(z, x, y)
        z[i] = a * x[i] + y[i]
    end
end
function axpy4!(z, a, x, y)
    @vectorize 4 for i ∈ eachindex(z, x, y)
        z[i] = a * x[i] + y[i]
    end
end
@code_native axpy3!(z, 2.0, x, y)
@code_native axpy4!(z, 2.0, x, y)

Note that it currently gets rounded down to the nearest power of 2.
axpy3 looks like expected:

L16:
	vmovupd	4096(%rsi,%rax), %zmm1
	vmovupd	4160(%rsi,%rax), %zmm2
	vfmadd213pd	4096(%rdx,%rax), %zmm0, %zmm1
	vfmadd213pd	4160(%rdx,%rax), %zmm0, %zmm2
	vmovupd	%zmm2, 4160(%rdi,%rax)
	vmovupd	%zmm1, 4096(%rdi,%rax)
	addq	$128, %rax
	jne	L16

while for axpy4, we unrolled the loop enough (512 elements ÷ (8 Float64 per zmm × 4-way unroll) = only 16 iterations left) that LLVM actually decided to completely unroll the loop.
Results:

julia> @btime axpy2!($z, 2.0, $x, $y)
  23.885 ns (0 allocations: 0 bytes)

julia> @btime axpy3!($z, 2.0, $x, $y)
  23.793 ns (0 allocations: 0 bytes)

julia> @btime axpy4!($z, 2.0, $x, $y)
  24.066 ns (0 allocations: 0 bytes)

Much better!
Of course, my actual motivation was more along the lines of:

using SLEEF
function special!(z, a, x, y)
    @inbounds for i ∈ eachindex(z, x, y)
        w = a * x[i] + y[i]
        s, c = sincos(w)
        z[i] = exp(s + log(c*c))
    end
end
function vectorized_special!(z, a, x, y)
    @vectorize for i ∈ eachindex(z, x, y)
        w = a * x[i] + y[i]
        s, c = sincos(w)
        z[i] = exp(s + log(c*c))
    end
end
@btime special!($z, 2.0, $x, $y)
@btime vectorized_special!($z, 2.0, $x, $y)

Yielding an almost 8x speedup:

julia> @btime special!($z, 2.0, $x, $y)
  15.081 μs (0 allocations: 0 bytes)

julia> @btime vectorized_special!($z, 2.0, $x, $y)
  1.955 μs (0 allocations: 0 bytes)

3 Likes

Thanks, this is very impressive! I’m interested by such use cases as well, but when I try to reproduce your example, I get the following error:

julia> vectorized_special!(z, 2.0, x, y)
ERROR: MethodError: no method matching sincos_fast(::SVec{4,Float64})
Closest candidates are:
  sincos_fast(::Float16) at /home/user/.julia/packages/SLEEF/vUBln/src/SLEEF.jl:130
  sincos_fast(::Union{Float64, Vec{#s12,Float64} where #s12}) at /home/user/.julia/packages/SLEEF/vUBln/src/trig.jl:352                                                                                                              
  sincos_fast(::Union{Float32, Vec{#s12,Float32} where #s12}) at /home/user/.julia/packages/SLEEF/vUBln/src/trig.jl:400                                                                                                              
  ...
Stacktrace:
 [1] vectorized_special!(::MArray{Tuple{512},Float64,1,512}, ::Float64, ::MArray{Tuple{512},Float64,1,512}, ::MArray{Tuple{512},Float64,1,512}) at /home/user/.julia/packages/LoopVectorization/ct9Bk/src/LoopVectorization.jl:209   
 [2] top-level scope at none:0

Do you have any idea about what I’m doing wrong? FWIW, I Pkg.added the relevant packages directly from their respective master branches on github:

  1. SLEEF had to be on the SIMDPirates branch. The ::SVec methods are only defined on that branch, while master supports SIMD.jl. In a few hours I’ll create a fork of the library to make it easier to get set up with the correct libraries. I don’t expect to ever merge my modifications into my master, and especially not while there is still an open PR from my master into the official master.
  2. I’ve been using all my libraries from ~/.julia/dev. Before, when getting someone set up, there were errors whenever the packages were added instead of deved. That seems to have been fixed, since the @vectorize macro obviously applied correctly and you only ended up with a method error.
1 Like

Thanks! I can confirm that this fixes my problems. There does not seem to be any issue anymore with adding packages instead of deving them.



Self-contained example

In case anyone would be interested to fully reproduce this:

#] add https://github.com/chriselrod/LoopVectorization.jl https://github.com/chriselrod/VectorizationBase.jl https://github.com/chriselrod/SIMDPirates.jl https://github.com/chriselrod/SLEEF.jl#SIMDPirates


using LoopVectorization
using SLEEF

function special!(z, a, x, y)
    @inbounds for i ∈ eachindex(z, x, y)
        w = a * x[i] + y[i]
        s, c = sincos(w)
        z[i] = exp(s + log(c*c))
    end
end

function vectorized_special!(z, a, x, y)
    @vectorize for i ∈ eachindex(z, x, y)
        w = a * x[i] + y[i]
        s, c = sincos(w)
        z[i] = exp(s + log(c*c))
    end
end


using StaticArrays, BenchmarkTools
function randmvector(::Val{N}, T = Float64) where N
    mv = MVector{N,T}(undef)
    @inbounds for i ∈ 1:N
        mv[i] = rand()
    end
    mv
end
x = randmvector(Val(512));
y = randmvector(Val(512));
z = MVector{512,Float64}(undef);

@btime special!($z, 2.0, $x, $y)
@btime vectorized_special!($z, 2.0, $x, $y)

2 Likes

Then what is the reason to use iterators instead of loops?

In fact, I don’t understand iterators well. The package Iterators.jl doesn’t even have documentation, and neither does IterTools.jl.

You’ve resurrected two old threads posting basically the same message in each, which didn’t really add to the original content. In a situation like this, please just make a new thread.

Also, see https://docs.julialang.org/en/v1/base/iterators/
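For a quick taste of why they can be worth it – lazy, composable transformations that still compile down to ordinary loops:

julia> collect(Iterators.take(Iterators.filter(iseven, 1:100), 3))
3-element Array{Int64,1}:
 2
 4
 6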

2 Likes