SIMD and dot vectorization

Does dot vectorization use SIMD explicitly or implicitly? If I look at the LLVM code for a call to `myadd!` in

julia> myadd(x, y) = x + y
myadd (generic function with 1 method)

julia> function myadd!(z::Vector{T}, x::Vector{T}, y::Vector{T}) where {T}
           @. z = myadd(x, y)
           return z
       end
myadd! (generic function with 1 method)

it has tags for `@ simdloop.jl` and it seems to contain multiple sections of very similar code for the inner function. I imagine these are the setup, SIMD, and cleanup sections. However, I have not seen it explicitly documented that the materialization of a dot vectorization will use SIMD when feasible. Did I miss something?
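For context, here is a sketch of what `@.` does before SIMD ever enters the picture: it rewrites the assignment into an in-place fused broadcast, which you can verify with `@macroexpand` (the variable name `ex` below is just for illustration):

```julia
myadd(x, y) = x + y

# `@.` turns every call and the assignment into their dotted forms, so the
# whole right-hand side fuses into a single in-place broadcast kernel.
# No arrays need to exist yet; @macroexpand only rewrites syntax.
ex = @macroexpand @. z = myadd(x, y)
println(ex)  # prints: z .= myadd.(x, y)
```

It is this single fused broadcast loop that LLVM then gets a chance to vectorize.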


Yes, this function should auto-vectorize. Reductions (e.g. `sum` or `accumulate`-style loops) won't vectorize without the explicit `@simd` macro, because vectorizing reorders the floating-point operations; in an elementwise operation like this one, the order of evaluation doesn't affect the output. Auto-vectorization can also be inhibited by bounds checks, which this code will probably emit; that usually just costs a little performance while still vectorizing. For example…

function myadd!(z, x, y)
    @assert length(z) == length(x) == length(y)  # lets LLVM elide per-element bounds checks
    for i in eachindex(z)
        z[i] = x[i] + y[i]
    end
    return z
end

… will auto-vectorize well and avoid bounds checks. You can check with either the @code_native or the @code_llvm macro. Here is the relevant block…

julia> @code_llvm myadd!(z, x, y)
;  @ REPL[1]:1 within `myadd!`
...
...
   %30 = getelementptr inbounds double, double* %23, i64 %index
   %31 = bitcast double* %30 to <2 x double>*
   %wide.load53 = load <2 x double>, <2 x double>* %31, align 8
   %32 = getelementptr inbounds double, double* %30, i64 2
   %33 = bitcast double* %32 to <2 x double>*
   %wide.load54 = load <2 x double>, <2 x double>* %33, align 8
; └
; ┌ @ float.jl:383 within `+`
   %34 = fadd <2 x double> %wide.load, %wide.load53
   %35 = fadd <2 x double> %wide.load52, %wide.load54
; └
; ┌ @ array.jl:966 within `setindex!`
   %36 = getelementptr inbounds double, double* %25, i64 %index
...
...

So the compiler is unrolling the loop by a factor of two (which helps hide memory latency in a memory-bound algorithm) and doing two `<2 x double>` SIMD adds within each iteration, i.e. four scalar adds per loop iteration.
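To illustrate the reduction caveat from above, here is a sketch (the function name is mine) of a sum loop where `@simd` gives the compiler permission to reassociate the floating-point additions, which it needs before it can vectorize the accumulator:

```julia
# Without @simd, LLVM must preserve the left-to-right order of the
# floating-point additions, which blocks vectorization of the reduction.
# @simd declares that reordering the accumulation is acceptable.
function mysum_simd(x)
    s = zero(eltype(x))
    @simd for i in eachindex(x)
        @inbounds s += x[i]
    end
    return s
end

mysum_simd(collect(1.0:100.0))  # 5050.0
```

Note that with `@simd` the result can differ in the last bits from the strictly sequential sum, since the partial sums are accumulated in a different order.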
