Does dot vectorization use SIMD explicitly or implicitly? If I look at the LLVM code for a function call to ‘myadd!’ in
julia> myadd(x, y) = x + y
myadd (generic function with 1 method)
julia> function myadd!(z::Vector{T}, x::Vector{T}, y::Vector{T}) where {T}
           @. z = myadd(x, y)
           return z
       end
myadd! (generic function with 1 method)
it has tags for @ simdloop and it seems to have multiple sections of very similar code for the inner function. I imagine these are the setup, SIMD, and cleanup sections. However, I have not seen it explicitly documented that the materialization of a dot vectorization will use SIMD when feasible. Did I miss something?
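For reference, here is a minimal sketch of what the `@.` line is shorthand for. As far as I understand it, the macro only rewrites the expression into dotted form, and the dotted assignment is an in-place broadcast. The array values and the names `z1`, `z2`, `z3` are made up for illustration.

```julia
myadd(x, y) = x + y

x = [1.0, 2.0, 3.0]
y = [10.0, 20.0, 30.0]

# `@.` adds dots to every call and assignment in the expression;
# printing the expansion shows the dotted form it produces.
expanded = @macroexpand @. z = myadd(x, y)
println(expanded)

# Three spellings of the same in-place elementwise operation.
z1 = similar(x)
@. z1 = myadd(x, y)

z2 = similar(x)
z2 .= myadd.(x, y)

z3 = similar(x)
broadcast!(myadd, z3, x, y)

println(z1 == z2 == z3)
```

So the question is really about how the in-place broadcast machinery writes its inner loop, not about anything specific to `@.` itself.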
Yes, this function should auto-vectorize. Floating-point accumulations (reductions) generally won't vectorize without explicit use of the @simd macro, because vectorizing them changes the order of the additions; in this case, though, each element is computed independently, so the order of evaluation won't affect the output. Auto-vectorization can sometimes be limited by bounds checks, which this function will probably perform. That should only decrease performance slightly, though, and the loop should still vectorize. For example…
function myadd!(z, x, y)
    @assert length(z) == length(x) == length(y)
    for i in eachindex(z)
        z[i] = x[i] + y[i]
    end
    return z
end
… will auto-vectorize well and avoid bounds checks. You can check with either the @code_native or the @code_llvm macro. Here is the relevant block…
julia> @code_llvm myadd!(z, x, y)
; @ REPL[1]:1 within `myadd!`
...
...
%30 = getelementptr inbounds double, double* %23, i64 %index
%31 = bitcast double* %30 to <2 x double>*
%wide.load53 = load <2 x double>, <2 x double>* %31, align 8
%32 = getelementptr inbounds double, double* %30, i64 2
%33 = bitcast double* %32 to <2 x double>*
%wide.load54 = load <2 x double>, <2 x double>* %33, align 8
; └
; ┌ @ float.jl:383 within `+`
%34 = fadd <2 x double> %wide.load, %wide.load53
%35 = fadd <2 x double> %wide.load52, %wide.load54
; └
; ┌ @ array.jl:966 within `setindex!`
%36 = getelementptr inbounds double, double* %25, i64 %index
...
...
So the compiler is actually unrolling the loop by a factor of two (which can help a memory-bound loop like this one) and doing two SIMD adds on <2 x double> vectors within each iteration, which is the equivalent of four scalar adds per loop iteration.
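If you would rather not read the IR by eye, here is a rough sketch of how the check could be scripted, and how the accumulation case mentioned earlier compares. The helper `has_vector_fadd` and the functions `mysum` and `mysum_simd` are hypothetical names of mine, not anything from Base, and the regex is only a heuristic that looks for an `fadd` on a vector type in the optimized IR.

```julia
using InteractiveUtils  # provides code_llvm outside the REPL

# Elementwise loop from the post above.
function myadd!(z, x, y)
    @assert length(z) == length(x) == length(y)
    for i in eachindex(z)
        z[i] = x[i] + y[i]
    end
    return z
end

# Floating-point reduction without @simd: the adds must happen in order.
function mysum(x)
    s = zero(eltype(x))
    for i in eachindex(x)
        s += x[i]
    end
    return s
end

# Same reduction, but @simd allows the compiler to reorder the adds.
function mysum_simd(x)
    s = zero(eltype(x))
    @simd for i in eachindex(x)
        s += x[i]
    end
    return s
end

# Hypothetical helper: true if the optimized LLVM IR contains an
# `fadd` on a vector type such as `<2 x double>` or `<4 x double>`.
function has_vector_fadd(f, types)
    ir = sprint(io -> code_llvm(io, f, types; debuginfo=:none))
    return occursin(r"fadd[^\n]*<\d+ x (double|float)>", ir)
end

V = Vector{Float64}
println("myadd!     : ", has_vector_fadd(myadd!, (V, V, V)))
println("mysum      : ", has_vector_fadd(mysum, (V,)))
println("mysum_simd : ", has_vector_fadd(mysum_simd, (V,)))
```

I would expect the elementwise loop and the @simd reduction to report vector adds and the plain reduction not to, but the result depends on your CPU, Julia version, and LLVM version, so treat it as a quick sanity check and not a guarantee.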