SIMD and dot vectorization

Does dot vectorization use SIMD explicitly or implicitly? If I look at the LLVM code for a call to `myadd!` in

julia> myadd(x, y) = x + y
myadd (generic function with 1 method)

julia> function myadd!(z::Vector{T}, x::Vector{T}, y::Vector{T}) where {T}
           @. z = myadd(x, y)
           return z
       end
myadd! (generic function with 1 method)

it has tags for `@ simdloop.jl` and it seems to contain multiple sections of very similar code for the inner function. I imagine these are the setup, SIMD, and cleanup sections. However, I have not seen it explicitly documented that the materialization of a dot vectorization will use SIMD when feasible. Did I miss something?
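For context, here is a sketch of what `@.` does before SIMD ever enters the picture: it rewrites the assignment into an in-place fused broadcast, which you can verify with `@macroexpand` (the variable name `ex` below is just for illustration):

```julia
myadd(x, y) = x + y

# `@.` turns every call and the assignment into their dotted forms, so the
# whole right-hand side fuses into a single in-place broadcast kernel.
# No arrays need to exist yet; @macroexpand only rewrites syntax.
ex = @macroexpand @. z = myadd(x, y)
println(ex)  # prints: z .= myadd.(x, y)
```

It is this single fused broadcast loop that LLVM then gets a chance to vectorize.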


Yes, this function should auto-vectorize. Reductions (e.g. `sum` or `accumulate`-style loops) won't vectorize without the explicit `@simd` macro, because vectorizing reorders the floating-point operations; in an elementwise operation like this one, the order of evaluation doesn't affect the output. Auto-vectorization can also be inhibited by bounds checks, which this code will probably emit; that usually just costs a little performance while still vectorizing. For example…

function myadd!(z, x, y)
    @assert length(z) == length(x) == length(y)  # lets LLVM elide per-element bounds checks
    for i in eachindex(z)
        z[i] = x[i] + y[i]
    end
    return z
end

… will auto-vectorize well and avoid bounds checks. You can check with either the @code_native or the @code_llvm macro. Here is the relevant block…

julia> @code_llvm myadd!(z, x, y)
;  @ REPL[1]:1 within `myadd!`
...
...
   %30 = getelementptr inbounds double, double* %23, i64 %index
   %31 = bitcast double* %30 to <2 x double>*
   %wide.load53 = load <2 x double>, <2 x double>* %31, align 8
   %32 = getelementptr inbounds double, double* %30, i64 2
   %33 = bitcast double* %32 to <2 x double>*
   %wide.load54 = load <2 x double>, <2 x double>* %33, align 8
; └
; ┌ @ float.jl:383 within `+`
   %34 = fadd <2 x double> %wide.load, %wide.load53
   %35 = fadd <2 x double> %wide.load52, %wide.load54
; └
; ┌ @ array.jl:966 within `setindex!`
   %36 = getelementptr inbounds double, double* %25, i64 %index
...
...

So the compiler is unrolling the loop by a factor of two (which helps hide memory latency in a memory-bound algorithm) and doing two `<2 x double>` SIMD adds within each iteration, i.e. four scalar adds per loop iteration.
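To illustrate the reduction caveat from above, here is a sketch (the function name is mine) of a sum loop where `@simd` gives the compiler permission to reassociate the floating-point additions, which it needs before it can vectorize the accumulator:

```julia
# Without @simd, LLVM must preserve the left-to-right order of the
# floating-point additions, which blocks vectorization of the reduction.
# @simd declares that reordering the accumulation is acceptable.
function mysum_simd(x)
    s = zero(eltype(x))
    @simd for i in eachindex(x)
        @inbounds s += x[i]
    end
    return s
end

mysum_simd(collect(1.0:100.0))  # 5050.0
```

Note that with `@simd` the result can differ in the last bits from the strictly sequential sum, since the partial sums are accumulated in a different order.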
