Looking at the SIMD.jl example vadd!(xs, ys, Vec{8,Float64}), a specific size of 8 is being set for the Vec width. How do I know whether to use 4, 8, or more? I assume it is hardware dependent, but presumably there is some way to choose this at runtime?
Don’t use SIMD.jl. It is a low-level library meant for implementing higher-level abstractions. Instead you should use LoopVectorization, which will automatically pick the right defaults for you.
As an answer to the question you asked, though: yes, it’s hardware dependent, and you can choose the right size by seeing which instruction sets the CPU supports (specifically SSE/AVX/AVX2/AVX-512 for x86 CPUs).
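For concreteness, the lane count is just the register width divided by the element size, assuming the usual x86 register widths of 128 bits for SSE, 256 for AVX/AVX2, and 512 for AVX-512:

julia> 128 ÷ (8 * sizeof(Float64))  # SSE
2

julia> 256 ÷ (8 * sizeof(Float64))  # AVX/AVX2
4

julia> 512 ÷ (8 * sizeof(Float64))  # AVX-512
8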
What I am trying to do does not really fit LoopVectorization. I need bit-wise logical operations with complicated intermediate bitwise formulas that LoopVectorization’s @turbo and @simd failed to infer can be vectorized.
Hence I am going back to my SIMD.jl question above.
Here is the fairly short kernel that I want to vectorize better
function mul_left!(r::AbstractVector{T}, l::AbstractVector{T}) where T<:Unsigned
    # Bitwise accumulators; each bit position is tracked independently.
    cnt1 = zero(T)
    cnt2 = zero(T)
    len = length(l)>>1
    @inbounds @simd for i in 1:len
        # First half of each vector holds the x bits, second half the z bits.
        x1, x2, z1, z2 = l[i], r[i], l[i+len], r[i+len]
        r[i] = newx1 = x1 ⊻ x2
        r[i+len] = newz1 = z1 ⊻ z2
        x1z2 = x1 & z2
        anti_comm = (x2 & z1) ⊻ x1z2
        cnt2 ⊻= (cnt1 ⊻ newx1 ⊻ newz1 ⊻ x1z2) & anti_comm
        cnt1 ⊻= anti_comm
    end
    # Collapse the per-bit counters into a single integer result.
    s = count_ones(cnt1)
    s ⊻= count_ones(cnt2) << 1
    s
end
More general advice on how to improve its vectorization would of course be appreciated!
IMO, this seems like a place where you should file a bug report. Nothing about this code should be that hard for LoopVectorization to vectorize.
Isn’t the fact that cnt1 and cnt2 are of fixed scalar size, instead of being SIMD vectors, making it impossible to vectorize the operations involving them?
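For what it’s worth, scalar accumulators are not a fundamental obstacle: SIMD.jl lets the accumulators themselves be Vecs that are only collapsed to a scalar after the loop. Below is a minimal sketch of that pattern for the plain xor part of the kernel (the anti_comm bits feeding cnt1). The function name xor_accumulate, the fixed width W = 8, and the chunking are illustrative assumptions, and the coupled cnt1/cnt2 update in the real kernel needs extra care when merging lanes, since cnt2 depends on the running value of cnt1.

using SIMD

# Sketch: accumulate the anti-commutation bits in a Vec instead of a scalar,
# then xor the lanes together at the end.
function xor_accumulate(l::Vector{UInt64}, r::Vector{UInt64})
    len = length(l) >> 1
    W = 8                              # illustrative; pick to match the hardware
    V = Vec{W,UInt64}
    acc = V(zero(UInt64))              # vector accumulator
    i = 1
    while i + W - 1 <= len
        x1 = vload(V, l, i)
        x2 = vload(V, r, i)
        z1 = vload(V, l, i + len)
        z2 = vload(V, r, i + len)
        acc ⊻= (x2 & z1) ⊻ (x1 & z2)   # same bitwise formula, W words per step
        i += W
    end
    cnt1 = zero(UInt64)
    for k in 1:W                       # horizontal xor across the lanes
        cnt1 ⊻= acc[k]
    end
    for j in i:len                     # scalar tail for the leftover words
        cnt1 ⊻= (r[j] & l[j+len]) ⊻ (l[j] & r[j+len])
    end
    cnt1
end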
VectorizationBase.jl has more features than SIMD.jl
julia> using VectorizationBase
julia> VectorizationBase.pick_vector_width(Float64)
static(8)
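If the kernel stays hand-written with SIMD.jl, one option (a sketch, assuming the returned StaticInt converts cleanly with Int, which it should since it is an Integer) is to let pick_vector_width choose the width and feed it to Vec as the type parameter; the 8 below matches the machine shown above, which evidently has 512-bit registers:

julia> using SIMD, VectorizationBase

julia> W = Int(VectorizationBase.pick_vector_width(UInt64))
8

julia> V = Vec{W,UInt64};  # lane type for the hand-written kernel

Inside a function you would want to pass W (or the StaticInt itself) through a function barrier so that the Vec width remains a compile-time constant.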