Hi,
running clang -Ofast -S -Wall -std=c11 -march=skylake vecfma.c on the C code below easily generates vectorized fma instructions (unrolled 4x):
    void vecfma(float * restrict c, float * restrict a, float * restrict b, const int n)
    {
        for (int i = 0; i < n; i++)
            c[i] += a[i] * b[i];
    }

    vfmadd213ps (%rdi,%rax), %ymm0, %ymm4    # ymm4 = (ymm0 * ymm4) + mem
    vfmadd213ps 32(%rdi,%rax), %ymm1, %ymm5  # ymm5 = (ymm1 * ymm5) + mem
    vfmadd213ps 64(%rdi,%rax), %ymm2, %ymm6  # ymm6 = (ymm2 * ymm6) + mem
    vfmadd213ps 96(%rdi,%rax), %ymm3, %ymm7  # ymm7 = (ymm3 * ymm7) + mem
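To make "unrolled by 4x" concrete, here is a minimal scalar sketch of the loop shape that the four vfmadd213ps lines correspond to, as I read them. The function name vecfma_unrolled4 is hypothetical, and the remainder loop is an assumption about how leftover elements are handled; it is not copied from the compiler output. Each statement in the main loop stands in for one vector instruction, which in the real code updates a ymm register of 8 floats rather than a single float.

```c
/* Hypothetical scalar analogue of the 4x-unrolled loop body.
   Each statement in the main loop stands in for one vfmadd213ps;
   the real instruction works on 8 floats (one ymm register) at a time. */
void vecfma_unrolled4(float * restrict c, const float * restrict a,
                      const float * restrict b, const int n)
{
    int i = 0;

    /* Main loop: four independent multiply-add updates per iteration. */
    for (; i <= n - 4; i += 4) {
        c[i]     += a[i]     * b[i];
        c[i + 1] += a[i + 1] * b[i + 1];
        c[i + 2] += a[i + 2] * b[i + 2];
        c[i + 3] += a[i + 3] * b[i + 3];
    }

    /* Remainder loop (assumed): handles the last n % 4 elements. */
    for (; i < n; i++)
        c[i] += a[i] * b[i];
}
```

With n = 10, the main loop runs twice (elements 0 to 7) and the remainder loop handles elements 8 and 9. The result should match the plain vecfma loop for inputs where the intermediate product is exactly representable.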
However, with julia --cpu-target=skylake I cannot generate these fma instructions for the Julia version of this function; I get vmulps and vaddps instead.
    function vecfma!(c, a, b, n)
        @inbounds @simd for i in 1:n
            c[i] += a[i] * b[i]
        end
        return nothing
    end
code_native(vecfma!, (Vector{Float32}, Vector{Float32}, Vector{Float32}, Int32,))
And the unrolling factor is 2x (not 4x).
    ; ││┌ @ float.jl:410 within `*`
        vmulps (%rdx,%rsi,4), %ymm0, %ymm0
        vmulps 32(%rdx,%rsi,4), %ymm1, %ymm1
    ; ││└
    ; ││┌ @ float.jl:408 within `+`
        vaddps (%rax,%rsi,4), %ymm0, %ymm0
        vaddps 32(%rax,%rsi,4), %ymm1, %ymm1
How can I generate the fma instructions, and how can I fine-tune the unrolling factor for loops in Julia?
Thanks in advance!
Xin