Making iterators as fast as loops

Nice trick of manually unrolling by x16 to generate efficient code (on AVX2)!
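For anyone landing here mid-thread, a rough sketch of what manual 16-way unrolling of a reduction can look like (this is my own illustration, not the snippet being praised, which is upthread and may differ):

# Illustrative only: keep 16 independent partial sums so the compiler can keep
# several vector registers in flight, then combine them at the end.
@inline add16(acc, x, base) = ntuple(j -> acc[j] + x[base + j], Val(16))

function sum_unrolled16(x::AbstractVector{Float64})
    n = length(x)
    nblocks = n >>> 4                  # number of complete 16-element blocks
    acc = ntuple(_ -> 0.0, Val(16))    # 16 independent accumulators
    for b in 0:nblocks-1
        acc = add16(acc, x, 16b)
    end
    total = sum(acc)
    @inbounds for i in 16nblocks+1:n   # remainder elements
        total += x[i]
    end
    total
end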

FWIW, I was interested in UnitRange vs StepRange because I have a library called LoopVectorization.jl, which vectorizes, and optionally unrolls, loops.
The idea is that it’s a blunt, brute-force instrument.
(LoopVectorization also depends on the unregistered VectorizationBase, SIMDPirates, and the SIMDPirates fork of SLEEF.jl – I’ll eventually make a new repo called SLEEFPirates to avoid having to git fetch origin && git checkout -b SIMDPirates remotes/origin/SIMDPirates).

Sometimes, no matter how much you try to argue with the autovectorizer, it may refuse to cooperate, especially when pointers are involved. For example:

function axpy!(z, a, x, y)
    @inbounds @simd ivdep for i ∈ eachindex(z, x, y)
        z[i] = a * x[i] + y[i]
    end
end
using StaticArrays, BenchmarkTools
function randmvector(::Val{N}, T = Float64) where N
    # Compiles quickly, regardless of size of MVector
    # unlike @MVector rand(N)
    mv = MVector{N,T}(undef)
    @inbounds for i ∈ 1:N
        mv[i] = rand()
    end
    mv
end
x = randmvector(Val(512));
y = randmvector(Val(512));
z = MVector{512,Float64}(undef);
@code_native axpy!(z, 2.0, x, y)

This yields:

julia> @code_native axpy!(z, 2.0, x, y)
	.text
; ┌ @ REPL[1]:2 within `axpy!'
	movq	$-4096, %rax            # imm = 0xF000
	nopw	(%rax,%rax)
; │┌ @ simdloop.jl:73 within `macro expansion' @ REPL[1]:3
; ││┌ @ float.jl:399 within `*'
L16:
	vmulsd	4096(%rsi,%rax), %xmm0, %xmm1
; ││└
; ││ @ simdloop.jl:73 within `macro expansion' @ float.jl:395
	vaddsd	4096(%rdx,%rax), %xmm1, %xmm1
; ││ @ simdloop.jl:73 within `macro expansion' @ REPL[1]:3
; ││┌ @ MArray.jl:130 within `setindex!'
; │││┌ @ gcutils.jl:87 within `macro expansion'
; ││││┌ @ pointer.jl:118 within `unsafe_store!'
	vmovsd	%xmm1, 4096(%rdi,%rax)
; │└└└└
; │┌ @ int.jl:49 within `macro expansion'
	addq	$8, %rax
; └└
; ┌ @ simdloop.jl:71 within `axpy!'
	jne	L16
; └
; ┌ @ REPL[1]:2 within `axpy!'
	retq
	nopw	%cs:(%rax,%rax)
; └

The loop is just:

L16:
	vmulsd	4096(%rsi,%rax), %xmm0, %xmm1
	vaddsd	4096(%rdx,%rax), %xmm1, %xmm1
	vmovsd	%xmm1, 4096(%rdi,%rax)
	addq	$8, %rax
	jne	L16

Note that vmulsd and vaddsd multiply and add single (scalar) doubles, hence the sd suffix instead of pd (packed doubles).
Yikes!
It’s naturally very slow:

julia> @btime axpy!($z, 2.0, $x, $y)
  228.304 ns (0 allocations: 0 bytes)

While

using LoopVectorization
function axpy2!(z, a, x, y)
    @vectorize for i ∈ eachindex(z, x, y)
        z[i] = a * x[i] + y[i]
    end
end
@code_native axpy2!(z, 2.0, x, y)

Yields:

L16:
	vmovupd	4096(%rsi,%rax), %zmm1
	vfmadd213pd	4096(%rdx,%rax), %zmm0, %zmm1
	vmovupd	%zmm1, 4096(%rdi,%rax)
	addq	$64, %rax
	jne	L16

I also added an unroll argument:

function axpy3!(z, a, x, y)
    @vectorize 2 for i ∈ eachindex(z, x, y)
        z[i] = a * x[i] + y[i]
    end
end
function axpy4!(z, a, x, y)
    @vectorize 4 for i ∈ eachindex(z, x, y)
        z[i] = a * x[i] + y[i]
    end
end
@code_native axpy3!(z, 2.0, x, y)
@code_native axpy4!(z, 2.0, x, y)

Note that the unroll factor currently gets rounded down to the nearest power of 2.
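For example (illustrative only, based on the rounding behavior just described, and not taken from the original post), an unroll argument of 3 should behave the same as @vectorize 2:

function axpy_unroll3!(z, a, x, y)
    # 3 is not a power of 2, so it presumably rounds down to an unroll factor of 2
    @vectorize 3 for i ∈ eachindex(z, x, y)
        z[i] = a * x[i] + y[i]
    end
end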
axpy3 looks as expected:

L16:
	vmovupd	4096(%rsi,%rax), %zmm1
	vmovupd	4160(%rsi,%rax), %zmm2
	vfmadd213pd	4096(%rdx,%rax), %zmm0, %zmm1
	vfmadd213pd	4160(%rdx,%rax), %zmm0, %zmm2
	vmovupd	%zmm2, 4160(%rdi,%rax)
	vmovupd	%zmm1, 4096(%rdi,%rax)
	addq	$128, %rax
	jne	L16

while for axpy4, we unrolled the loop enough (512 elements / (8 doubles per zmm register × 4-way unroll) = only 16 iterations left) that LLVM actually decided to unroll the loop completely.
Results:

julia> @btime axpy2!($z, 2.0, $x, $y)
  23.885 ns (0 allocations: 0 bytes)

julia> @btime axpy3!($z, 2.0, $x, $y)
  23.793 ns (0 allocations: 0 bytes)

julia> @btime axpy4!($z, 2.0, $x, $y)
  24.066 ns (0 allocations: 0 bytes)

Much better!
Of course, my actual motivation was more along the lines of:

using SLEEF
function special!(z, a, x, y)
    @inbounds for i ∈ eachindex(z, x, y)
        w = a * x[i] + y[i]
        s, c = sincos(w)
        z[i] = exp(s + log(c*c))
    end
end
function vectorized_special!(z, a, x, y)
    @vectorize for i ∈ eachindex(z, x, y)
        w = a * x[i] + y[i]
        s, c = sincos(w)
        z[i] = exp(s + log(c*c))
    end
end
@btime special!($z, 2.0, $x, $y)
@btime vectorized_special!($z, 2.0, $x, $y)

Yielding an almost 8x speedup:

julia> @btime special!($z, 2.0, $x, $y)
  15.081 μs (0 allocations: 0 bytes)

julia> @btime vectorized_special!($z, 2.0, $x, $y)
  1.955 μs (0 allocations: 0 bytes)

Thanks, this is very impressive! I'm interested in such use cases as well, but when I try to reproduce your example, I get the following error:

julia> vectorized_special!(z, 2.0, x, y)
ERROR: MethodError: no method matching sincos_fast(::SVec{4,Float64})
Closest candidates are:
  sincos_fast(::Float16) at /home/user/.julia/packages/SLEEF/vUBln/src/SLEEF.jl:130
  sincos_fast(::Union{Float64, Vec{#s12,Float64} where #s12}) at /home/user/.julia/packages/SLEEF/vUBln/src/trig.jl:352                                                                                                              
  sincos_fast(::Union{Float32, Vec{#s12,Float32} where #s12}) at /home/user/.julia/packages/SLEEF/vUBln/src/trig.jl:400                                                                                                              
  ...
Stacktrace:
 [1] vectorized_special!(::MArray{Tuple{512},Float64,1,512}, ::Float64, ::MArray{Tuple{512},Float64,1,512}, ::MArray{Tuple{512},Float64,1,512}) at /home/user/.julia/packages/LoopVectorization/ct9Bk/src/LoopVectorization.jl:209   
 [2] top-level scope at none:0

Do you have any idea what I'm doing wrong? FWIW, I Pkg.added the relevant packages directly from their respective master branches on GitHub:

  1. SLEEF had to be on the SIMDPirates branch. The ::SVec methods are only defined on that branch, while master supports SIMD.jl. I'll create a fork of the library in a few hours to make it easier to get set up with the correct libraries. I don't expect to ever merge my modifications into my master, and especially not while there is still an open PR from my master into the official master.
  2. I've been using all my libraries from ~/.julia/dev. Before, when getting someone set up, there were errors whenever the packages were added instead of deved. It seems this has been fixed, because the @vectorize macro obviously applied correctly and you only ended up with a method error.

Thanks! I can confirm that this fixes my problems. There does not seem to be any issue anymore with adding packages instead of deving them.



Self-contained example

In case anyone is interested in fully reproducing this:

#] add https://github.com/chriselrod/LoopVectorization.jl https://github.com/chriselrod/VectorizationBase.jl https://github.com/chriselrod/SIMDPirates.jl https://github.com/chriselrod/SLEEF.jl#SIMDPirates


using LoopVectorization
using SLEEF

function special!(z, a, x, y)
    @inbounds for i ∈ eachindex(z, x, y)
        w = a * x[i] + y[i]
        s, c = sincos(w)
        z[i] = exp(s + log(c*c))
    end
end

function vectorized_special!(z, a, x, y)
    @vectorize for i ∈ eachindex(z, x, y)
        w = a * x[i] + y[i]
        s, c = sincos(w)
        z[i] = exp(s + log(c*c))
    end
end


using StaticArrays, BenchmarkTools
function randmvector(::Val{N}, T = Float64) where N
    mv = MVector{N,T}(undef)
    @inbounds for i ∈ 1:N
        mv[i] = rand()
    end
    mv
end
x = randmvector(Val(512));
y = randmvector(Val(512));
z = MVector{512,Float64}(undef);

@btime special!($z, 2.0, $x, $y)
@btime vectorized_special!($z, 2.0, $x, $y)

Then what is the reason to use iterators instead of loops?

In fact, I don't understand iterators well. The package Iterators.jl doesn't even have documentation, and neither does IterTools.jl.

You’ve resurrected two old threads posting basically the same message in each, which didn’t really add to the original content. In a situation like this, please just make a new thread.

Also, see Iteration utilities · The Julia Language
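For concreteness, a small sketch (mine, not from the linked docs or the original posts) showing a couple of Base.Iterators utilities next to an equivalent explicit loop; the iterator version composes lazily and allocates no intermediate arrays:

xs = 1:10

# Iterator version: lazy composition of zip, reverse, take, and enumerate.
for (i, (a, b)) in enumerate(Iterators.take(zip(xs, Iterators.reverse(xs)), 3))
    println(i, ": ", a, " + ", b, " = ", a + b)
end

# Equivalent explicit loop over indices.
for i in 1:3
    a = xs[i]
    b = xs[11 - i]
    println(i, ": ", a, " + ", b, " = ", a + b)
end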
