The obvious difference between SIMD.jl and the rest is that your SIMD.jl code is not unrolled.
However, unrolling may not always make a difference, especially at these sizes: r is rather long for such a simple loop, so this should be mostly memory bound.
On my computer, preventing LoopVectorization from unrolling didn’t really hurt its performance. For example:
julia> function lv_turbo_u1(r::Vector{UInt64}, mask::UInt64)
    @turbo unroll=1 for i in 1:length(r)
        r[i] = r[i] & mask
    end
    r
end
lv_turbo_u1 (generic function with 1 method)
julia> @btime lv_turbo_u1($r, $mask);
752.712 ns (0 allocations: 0 bytes)
But my computer has AVX512, so making SIMD.jl use vectors of width 8 helped. (I also start Julia with a flag that makes @simd & co. do this by default; they won’t normally.)
julia> function simd_vec8(r::Vector{UInt64}, mask::UInt64)
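    simd_mask = SIMD.Vec{8,UInt64}(mask)  # broadcast mask to all 8 lanes
    lane = SIMD.VecRange{8}(0)
    # Assumes length(r) is a multiple of 8; otherwise the last iteration goes out of bounds.
    @inbounds for i in 1:8:length(r)
        r[i+lane] &= simd_mask
    end
    r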
end
simd_vec8 (generic function with 1 method)
julia> @btime simd_vec8($r, $mask);
760.114 ns (0 allocations: 0 bytes)
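One caveat with simd_vec8: it only works when length(r) is a multiple of 8, since the last vector load would otherwise run past the end of the array under @inbounds. If your real inputs can have arbitrary length, something along these lines should handle the leftover elements with a scalar loop. This is just a sketch; simd_vec8_tail is a name I made up here, and I haven't benchmarked it:

```julia
using SIMD

# Hypothetical variant of simd_vec8 that does not assume length(r) % 8 == 0.
function simd_vec8_tail(r::Vector{UInt64}, mask::UInt64)
    simd_mask = SIMD.Vec{8,UInt64}(mask)  # broadcast mask to all 8 lanes
    lane = SIMD.VecRange{8}(0)
    n = length(r)
    nvec = n - n % 8                      # number of elements covered by full 8-wide vectors
    @inbounds for i in 1:8:nvec
        r[i + lane] &= simd_mask
    end
    @inbounds for i in nvec+1:n           # scalar cleanup for the remaining 0 to 7 elements
        r[i] &= mask
    end
    r
end
```

For a long r the cleanup loop touches at most 7 elements, so I wouldn't expect it to change the timings above in any meaningful way.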