A simple SIMD.jl loop that is slower than a vanilla `@inbounds @simd`

I have a loop that performs some bit-fiddling operations. Because of the way some out-of-loop constants are used, `@simd` (and occasionally even LoopVectorization.jl) does not always give the fastest possible machine code. I have had good success writing custom SIMD.jl loops that are faster, but I was surprised to see that in this particular (overly simplified) case SIMD.jl is noticeably slower than vanilla Julia. Any suggestions on what I am doing wrong? Is this something SIMD.jl should handle, and should it be reported as a “performance bug”?

This is an overly simplified case that does not need SIMD.jl, but I am still surprised that SIMD.jl is actually slower than the other solutions. `@code_warntype` has some yellow in it, but I do not know how to address it.

Code:

using SIMD, LoopVectorization
using BenchmarkTools  # for @btime

r = rand(UInt64, 8000)
mask = ~(one(eltype(r))<<(2))

function simple_broadcast(r::Vector{UInt64}, mask)
    r .&= mask
end

function inboundssimd(r::Vector{UInt64}, mask::UInt64)
    @inbounds @simd for i in 1:length(r)
        r[i] &= mask
    end
    r
end

function simd_vec4(r::Vector{UInt64}, mask::UInt64)
    simd_mask = SIMD.Vec{4,UInt64}((mask,mask,mask,mask))
    lane = SIMD.VecRange{4}(0)
    @inbounds for i in 1:4:length(r)  # assumes length(r) is a multiple of 4
        r[i+lane] &= simd_mask
    end
    r
end

function lv_turbo(r::Vector{UInt64}, mask::UInt64)
    @turbo for i in 1:length(r)
        r[i] = r[i] & mask
    end
    r
end
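Before benchmarking, a quick self-contained sanity check (not in the original post) that the masking really does what is intended, i.e. clears bit 2 of every element:

```julia
# Compare the broadcast form against an explicit scalar loop; both should
# clear bit 2 of every element.
r0 = rand(UInt64, 8000)
mask0 = ~(one(UInt64) << 2)
expected = [x & mask0 for x in r0]
r1 = copy(r0)
r1 .&= mask0
@assert r1 == expected
@assert all(x -> iszero(x & (one(UInt64) << 2)), r1)  # bit 2 is cleared
```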

Benchmark:

julia> @btime simd_vec4($r, $mask);
  2.029 μs (0 allocations: 0 bytes)

julia> @btime inboundssimd($r, $mask);
  1.642 μs (0 allocations: 0 bytes)

julia> @btime simple_broadcast($r, $mask);
  1.641 μs (0 allocations: 0 bytes)

function lv_turbo(r::Vector{UInt64}, mask::UInt64)
    @turbo for i in 1:length(r)
        r[i] = r[i] & mask
    end
    r
end

(`@turbo` doesn’t currently accept `&=`, hence the `r[i] = r[i] & mask` spelling above.) I’ll add support for `&=`.

Results:

julia> @btime simd_vec4($r, $mask);
  894.049 ns (0 allocations: 0 bytes)

julia> @btime inboundssimd($r, $mask);
  751.833 ns (0 allocations: 0 bytes)

julia> @btime simple_broadcast($r, $mask);
  752.387 ns (0 allocations: 0 bytes)

julia> @btime lv_turbo($r, $mask);
  752.311 ns (0 allocations: 0 bytes)

oh, jeez, I feel silly (good silly though, I have learnt so many new tools from you guys over the last week or two)

julia> @btime inboundssimd($r, $mask);
  1.656 μs (0 allocations: 0 bytes)

julia> @btime lv_turbo($r, $mask);
  1.669 μs (0 allocations: 0 bytes)

julia> @btime simd_vec4($r, $mask);
  2.033 μs (0 allocations: 0 bytes)

The question about SIMD.jl still stands though, if anyone has an obvious idea.

I will fix the incorrect title and comment.

The obvious difference between SIMD.jl and the rest is that your SIMD.jl code is not unrolled.

However, that may not always make a difference, especially at these sizes (r is rather long for such a simple loop, meaning this should mostly be memory bound).
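A rough bandwidth estimate (my own arithmetic, plugging in the ~752 ns timings shown below) supports the memory-bound reading:

```julia
# 8000 UInt64s are each read once and written once per call.
bytes_moved = 8000 * sizeof(UInt64) * 2   # 128_000 bytes per call
time_s = 752e-9                           # ~752 ns per call (from @btime)
gb_per_s = bytes_moved / time_s / 1e9
println(round(gb_per_s; digits = 1))      # ≈ 170 GB/s, near cache bandwidth
```

At ~170 GB/s the loop is close to what the cache hierarchy can sustain, so extra unrolling has little room to help.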
On my computer, preventing LoopVectorization from unrolling didn’t really hurt its performance, for example:

julia> function lv_turbo_u1(r::Vector{UInt64}, mask::UInt64)
           @turbo unroll=1 for i in 1:length(r)
               r[i] = r[i] & mask
           end
           r
       end
lv_turbo_u1 (generic function with 1 method)

julia> @btime lv_turbo_u1($r, $mask);
  752.712 ns (0 allocations: 0 bytes)

But my computer has AVX512, so making SIMD.jl use vectors of width 8 helped (I also start Julia with a flag to make @simd & co do this by default, they won’t normally):

julia> function simd_vec8(r::Vector{UInt64}, mask::UInt64)
           simd_mask = SIMD.Vec{8,UInt64}(mask)
           lane = SIMD.VecRange{8}(0)
           @inbounds for i in 1:8:length(r)
               r[i+lane] &= simd_mask
           end
           r
       end
simd_vec8 (generic function with 1 method)

julia> @btime simd_vec8($r, $mask);
  760.114 ns (0 allocations: 0 bytes)
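The startup flag mentioned above is not spelled out in the thread; a hedged guess at the kind of invocation meant (assuming an AVX-512 CPU and LLVM’s `prefer-256-bit` subtarget feature) is:

```shell
# Hypothetical invocation: disable LLVM's prefer-256-bit so @simd and
# autovectorization default to 512-bit (zmm) vectors on AVX-512 CPUs.
julia -C"native,-prefer-256-bit"
```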

Do you know they were (or were not, respectively) unrolled by looking at some of the `@code_*` outputs? I am still not knowledgeable enough to read their output, and knowing what to look for would be great.

Well, I knew the SIMD.jl version wasn’t unrolled because your code didn’t unroll it.
And I knew that LLVM always unrolls simple loops like these 4x on top of vectorization, and that LoopVectorization will do the same.

But you can check with the `@code_*` macros or `Cthulhu.@descend`.
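For instance, here is a self-contained sketch (my own toy function mirroring `inboundssimd`, not from the thread) of dumping the native loop to inspect:

```julia
using InteractiveUtils  # provides code_native

# Toy @simd loop to inspect; mirrors the thread's inboundssimd.
function and_mask!(r::Vector{UInt64}, mask::UInt64)
    @inbounds @simd for i in eachindex(r)
        r[i] &= mask
    end
    return r
end

io = IOBuffer()
code_native(io, and_mask!, (Vector{UInt64}, UInt64); debuginfo = :none)
asm = String(take!(io))
print(asm)
# Look for the hot loop: a label followed by a conditional jump (jne) back
# to it. Several vector and/store pairs per iteration mean it was unrolled.
```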

I actually noticed another problem with the SIMD.jl code.
I can show the LLVM instead if you’d prefer, but I’m showing the assembly because it is much less verbose; I added comments on what each line is doing:

L64:
        mov     rdx, qword ptr [rbx]   # load the pointer to r
        vpand   ymm1, ymm0, ymmword ptr [rdx + 8*rcx] # ymm1 = load from `r` and `&`
        vmovdqu ymmword ptr [rdx + 8*rcx], ymm1 # store to r
        add     rcx, 4 # add 4 to loop counter
        cmp     rax, rcx # compare with end
        jne     L64 # conditionally jump to L64

Here, you can see:

  1. That the loop is not unrolled (a single vector operation per iteration), though it is vectorized.
  2. That the pointer to r is re-loaded from memory on every loop iteration.

Versus lv_turbo's loop:

L48:
        vpandq  zmm1, zmm0, zmmword ptr [rdx + 8*rcx] # zmm1 = load from `r` and `&`
        vpandq  zmm2, zmm0, zmmword ptr [rdx + 8*rcx + 64] # zmm2 = load from `r` and `&`
        vpandq  zmm3, zmm0, zmmword ptr [rdx + 8*rcx + 128] # zmm3 = load from `r` and `&`
        vpandq  zmm4, zmm0, zmmword ptr [rdx + 8*rcx + 192] # zmm4 = load from `r` and `&`
        vmovdqu64       zmmword ptr [rdx + 8*rcx], zmm1 # store
        vmovdqu64       zmmword ptr [rdx + 8*rcx + 64], zmm2 # store
        vmovdqu64       zmmword ptr [rdx + 8*rcx + 128], zmm3 # store
        vmovdqu64       zmmword ptr [rdx + 8*rcx + 192], zmm4 # store
        add     rcx, 32 # add 32 (width = 8, unrolled by 4)
        cmp     rdi, rcx # compare with end
        jne     L48 # conditionally jump to L48

We can look at the effect using LinuxPerf:
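(`foreachf` is not defined in the thread; a minimal helper consistent with how it is called here, which is my assumption, would be:)

```julia
# Call f(args...) n times so the per-call cost dominates the perf counters;
# returning the last result keeps the call from being optimized away.
function foreachf(f, n, args...)
    ret = nothing
    for _ in 1:n
        ret = f(args...)
    end
    return ret
end
```

e.g. `foreachf(inboundssimd, 1_000_000, r, mask)`.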

julia> @pstats "(cpu-cycles,task-clock),(instructions,branch-instructions,branch-misses), (L1-dcache-load-misses, L1-dcache-loads, cache-misses, cache-references)" begin
         foreachf(simd_vec4, 1_000_000, r, mask)
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
┌ cpu-cycles               4.49e+09  100.0%  #  4.7 cycles per ns
└ task-clock               9.66e+08  100.0%  # 965.9 ms
┌ instructions             1.26e+10  100.0%  #  2.8 insns per cycle
│ branch-instructions      2.10e+09  100.0%  # 16.7% of instructions
└ branch-misses            1.96e+06  100.0%  #  0.1% of branch instructions
┌ L1-dcache-load-misses    1.07e+09  100.0%  # 25.6% of dcache loads
│ L1-dcache-loads          4.18e+09  100.0%
│ cache-misses             6.17e+04  100.0%  # 26.4% of cache references
└ cache-references         2.34e+05  100.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

julia> @pstats "(cpu-cycles,task-clock),(instructions,branch-instructions,branch-misses), (L1-dcache-load-misses, L1-dcache-loads, cache-misses, cache-references)" begin
         foreachf(lv_turbo, 1_000_000, r, mask)
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
┌ cpu-cycles               3.57e+09  100.0%  #  4.6 cycles per ns
└ task-clock               7.81e+08  100.0%  # 781.2 ms
┌ instructions             3.29e+09  100.0%  #  0.9 insns per cycle
│ branch-instructions      3.49e+08  100.0%  # 10.6% of instructions
└ branch-misses            1.05e+06  100.0%  #  0.3% of branch instructions
┌ L1-dcache-load-misses    1.07e+09  100.0%  # 90.5% of dcache loads
│ L1-dcache-loads          1.18e+09  100.0%
│ cache-misses             7.07e+04  100.0%  # 25.8% of cache references
└ cache-references         2.74e+05  100.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

julia> @pstats "(cpu-cycles,task-clock),(instructions,branch-instructions,branch-misses), (L1-dcache-load-misses, L1-dcache-loads, cache-misses, cache-references)" begin
         foreachf(simd_vec8, 1_000_000, r, mask)
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
┌ cpu-cycles               3.68e+09  100.0%  #  4.6 cycles per ns
└ task-clock               8.07e+08  100.0%  # 806.9 ms
┌ instructions             6.61e+09  100.0%  #  1.8 insns per cycle
│ branch-instructions      1.11e+09  100.0%  # 16.8% of instructions
└ branch-misses            1.95e+06  100.0%  #  0.2% of branch instructions
┌ L1-dcache-load-misses    1.07e+09  100.0%  # 49.1% of dcache loads
│ L1-dcache-loads          2.19e+09  100.0%
│ cache-misses             6.65e+04  100.0%  # 25.2% of cache references
└ cache-references         2.64e+05  100.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

simd_vec8 was modified to use vectors of length 8, as my laptop supports that. The assembly is otherwise the same as simd_vec4 (just replace the 4 with an 8, and ymm with zmm).

Note how (on my laptop) simd_vec8 was nearly as fast as lv_turbo: a million iterations took around 800 ms for both.
However, lv_turbo required half the instructions to execute, so the CPU was executing just 0.9 instructions/cycle instead of 1.8 instructions/cycle; the loop is memory bound. Making the arrays smaller shows the advantage: we get a much larger than 10x speed-up by decreasing the size of r 10-fold:

julia> rshort = rand(UInt64, 800);

julia> @btime simd_vec8($rshort, $mask);
  42.574 ns (0 allocations: 0 bytes)

julia> @btime lv_turbo($rshort, $mask);
  24.629 ns (0 allocations: 0 bytes)

A function you can use to experiment:

function simd_vec_uv(r::Vector{UInt64}, mask::UInt64, ::Val{U}, ::Val{V}) where {U,V}
  simd_mask = SIMD.Vec{V,UInt64}(mask)  # broadcast the scalar to all V lanes
  lane = SIMD.VecRange{V}(0)
  for i in 1:U*V:length(r)
    masked = ntuple(Val(U)) do u
      Base.@_inline_meta
      @inbounds r[i+V*(u-1)+lane] & simd_mask
    end
    ntuple(Val(U)) do u
      Base.@_inline_meta
      @inbounds r[i+V*(u-1)+lane] = masked[u]
    end
  end
  r
end

Note that this produces a lot of those annoying “reload the pointer to r” loads:

L64:
        mov     rdx, qword ptr [rbx]
        vpandq  zmm1, zmm0, zmmword ptr [rdx + 8*rcx]
        vpandq  zmm2, zmm0, zmmword ptr [rdx + 8*rcx + 64]
        vpandq  zmm3, zmm0, zmmword ptr [rdx + 8*rcx + 128]
        vpandq  zmm4, zmm0, zmmword ptr [rdx + 8*rcx + 192]
        vmovdqu64       zmmword ptr [rdx + 8*rcx], zmm1
        mov     rdx, qword ptr [rbx]
        vmovdqu64       zmmword ptr [rdx + 8*rcx + 64], zmm2
        mov     rdx, qword ptr [rbx]
        vmovdqu64       zmmword ptr [rdx + 8*rcx + 128], zmm3
        mov     rdx, qword ptr [rbx]
        vmovdqu64       zmmword ptr [rdx + 8*rcx + 192], zmm4
        add     rcx, 32
        cmp     rax, rcx
        jne     L64

Which I think you could file an issue about over at SIMD.jl.
It doesn’t seem to make that much of a difference in performance, though. For the vectors of length 800, for example, I get about 28 ns for `simd_vec_uv(rshort, mask, Val(4), Val(8))`, which is the same unroll factor and vector width lv_turbo uses on my computer.
`simd_vec_uv(r, mask, Val(4), Val(4))` would probably be the right combination for yours.

EDIT:
Also, adding `&=` support: Support more updating assignments · JuliaSIMD/LoopVectorization.jl@5f62713 · GitHub


SIMD.jl is really quite simple; it is basically a convenient way to map Julia syntax to LLVM intrinsics. So it is a bit unclear what SIMD.jl could change here.

Yeah, there are a few different open Julia issues about aliasing, and a real/direct fix probably belongs there.
E.g., code using array wrappers often results in this sort of pointer reloading on every iteration. Those cases are particularly problematic, as the reloading blocks the loop vectorizer. It is less of an issue for SIMD.jl, as code using it presumably isn’t counting on any autovectorizer.

While a convenient fix with the current API might not be possible from within SIMD.jl, they could perhaps recommend a different API as a workaround.
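For example, one hypothetical workaround (my sketch, not from the thread) is to hoist the data pointer manually with `GC.@preserve` and SIMD.jl’s pointer-based `vload`/`vstore` methods, assuming those methods are available and that `length(r)` is a multiple of 4:

```julia
using SIMD

# Sketch: load the array's base pointer once, outside the loop, so LLVM sees
# a loop-invariant address instead of re-loading it every iteration.
function and_mask_ptr!(r::Vector{UInt64}, mask::UInt64)
    simd_mask = Vec{4,UInt64}(mask)
    GC.@preserve r begin
        base = pointer(r)            # loaded once, before the loop
        for i in 0:4:length(r)-4
            p = base + 8i            # 8 bytes per UInt64
            vstore(vload(Vec{4,UInt64}, p) & simd_mask, p)
        end
    end
    return r
end
```

Whether this actually removes the reloads would need checking against `@code_native`; it trades the array-indexing API for raw pointers, so it should be benchmarked and tested before use.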

Thank you, this was very illuminating!