I think I found the answer myself in the LoopVectorization.jl documentation: https://github.com/JuliaSIMD/LoopVectorization.jl#dot-product
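using LoopVectorization, BenchmarkTools  # provide @turbo and @btime

function mydotavx(A)
    s = 0.0
    # @turbo vectorizes the loop; it may reorder the floating-point additions
    @turbo for i ∈ eachindex(A)
        s += A[i]*A[i]
    end
    s
end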
@btime mydotavx($A)
105.684 ns (0 allocations: 0 bytes)
1041.8233479785772
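
To sanity-check the result, here is a minimal sketch of a baseline that uses only Base's `@inbounds @simd` instead of `@turbo` and compares it against `dot` from the LinearAlgebra standard library. The name `mydot_simd` and the input `rand(1000)` are assumptions made for illustration, since the original `A` is not shown. The comparison uses `≈` because a vectorized loop may sum in a different order and differ in the last few bits.

```julia
using LinearAlgebra  # standard library, provides dot

# Hypothetical baseline: the same loop with Base's @simd instead of @turbo.
function mydot_simd(A)
    s = zero(eltype(A))
    @inbounds @simd for i in eachindex(A)
        s += A[i] * A[i]
    end
    return s
end

A = rand(1000)  # assumed input; the original A is not shown

# Both reference computations should agree up to floating-point rounding.
@assert mydot_simd(A) ≈ dot(A, A)
@assert mydot_simd(A) ≈ sum(abs2, A)
```

The same `≈` check can be applied to `mydotavx(A)`, and running `@btime mydot_simd($A)` next to the `@turbo` version should show how much of the speedup comes from LoopVectorization on a given machine.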