A simple SIMD.jl loop that is slower than a vanilla `@inbounds @simd`

The obvious difference between SIMD.jl and the rest is that your SIMD.jl code is not unrolled.
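To make "unrolled" concrete, here is a rough sketch of what a manually unrolled SIMD.jl loop could look like. The function name `simd_unrolled4`, the vector width of 4 (i.e. assuming 256-bit vectors), and the unroll factor of 4 are all illustrative choices on my part, not something taken from the original code. To keep it short, it simply rejects lengths that are not a multiple of 16.

```julia
using SIMD

# Hypothetical example: width 4 and a 4x unroll are assumptions for illustration.
function simd_unrolled4(r::Vector{UInt64}, mask::UInt64)
    # Each iteration handles 4 vectors of 4 lanes, i.e. 16 elements.
    length(r) % 16 == 0 || throw(ArgumentError("length(r) must be a multiple of 16"))
    simd_mask = Vec{4,UInt64}(mask)   # broadcast the scalar mask to all lanes
    lane = VecRange{4}(0)
    @inbounds for i in 1:16:length(r)
        # Four independent load/and/store sequences per iteration.
        r[i + lane]      &= simd_mask
        r[i + 4 + lane]  &= simd_mask
        r[i + 8 + lane]  &= simd_mask
        r[i + 12 + lane] &= simd_mask
    end
    r
end
```

Whether this actually beats the non-unrolled version depends on the machine and on the size of `r`, so it is worth benchmarking rather than assuming.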

However, that may not always make a difference, especially at these sizes: `r` is rather long for such a simple loop, so this should mostly be memory bound.
On my computer, preventing LoopVectorization from unrolling didn’t really hurt its performance, for example:

julia> function lv_turbo_u1(r::Vector{UInt64}, mask::UInt64)
           @turbo unroll=1 for i in 1:length(r)
               r[i] = r[i] & mask
           end
           r
       end
lv_turbo_u1 (generic function with 1 method)

julia> @btime lv_turbo_u1($r, $mask);
  752.712 ns (0 allocations: 0 bytes)

But my computer has AVX512, so making SIMD.jl use vectors of width 8 helped. (I also start Julia with a flag that makes `@simd` & co use width-8 vectors by default, which they normally won't.)

julia> function simd_vec8(r::Vector{UInt64}, mask::UInt64)
           simd_mask = SIMD.Vec{8,UInt64}(mask)
           lane = SIMD.VecRange{8}(0)
           # Note: assumes length(r) is a multiple of 8; otherwise the last step runs past the end.
           @inbounds for i in 1:8:length(r)
               r[i+lane] &= simd_mask
           end
           r
       end
simd_vec8 (generic function with 1 method)

julia> @btime simd_vec8($r, $mask);
  760.114 ns (0 allocations: 0 bytes)
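Note that `simd_vec8` as written only works when `length(r)` is a multiple of 8. A minimal sketch of one way to handle arbitrary lengths is to run the vector loop over the full chunks and finish with a scalar loop. The name `simd_vec8_tail` is made up for this example, and I have not benchmarked this variant.

```julia
using SIMD

# Hypothetical variant of simd_vec8 that tolerates any length.
function simd_vec8_tail(r::Vector{UInt64}, mask::UInt64)
    simd_mask = Vec{8,UInt64}(mask)
    lane = VecRange{8}(0)
    n = length(r)
    nvec = n - n % 8              # last index covered by full 8-wide vectors
    @inbounds for i in 1:8:nvec
        r[i + lane] &= simd_mask  # vector load, and, vector store
    end
    @inbounds for i in nvec+1:n   # scalar cleanup for the remaining 0 to 7 elements
        r[i] &= mask
    end
    r
end
```

For long arrays the scalar tail touches at most 7 elements, so it should not matter much for the timings.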