# How to enable vectorized fma instruction for multiply-add vectors?

Hi,

The vectorized fma instructions (unrolled 4x) are easily generated for the C code below via `clang -Ofast -S -Wall -std=c11 -march=skylake vecfma.c`.

```c
void vecfma(float * restrict c, float * restrict a, float * restrict b, const int n)
{
    for (int i = 0; i < n; i++)
        c[i] += a[i] * b[i];
}
```

```asm
vfmadd213ps     (%rdi,%rax), %ymm0, %ymm4 # ymm4 = (ymm0 * ymm4) + mem
vfmadd213ps     32(%rdi,%rax), %ymm1, %ymm5 # ymm5 = (ymm1 * ymm5) + mem
vfmadd213ps     64(%rdi,%rax), %ymm2, %ymm6 # ymm6 = (ymm2 * ymm6) + mem
vfmadd213ps     96(%rdi,%rax), %ymm3, %ymm7 # ymm7 = (ymm3 * ymm7) + mem
```

However, with `julia --cpu-target=skylake` I cannot get these fma instructions for the Julia version of this function (I get `vmulps` and `vaddps` instead).

```julia
function vecfma!(c, a, b, n)
    @inbounds @simd for i in 1:n
        c[i] += a[i] * b[i]
    end
    return nothing
end

code_native(vecfma!, (Vector{Float32}, Vector{Float32}, Vector{Float32}, Int32,))
```

And the unrolling factor is 2x (not 4x).

```asm
; ││┌ @ float.jl:410 within `*`
vmulps  (%rdx,%rsi,4), %ymm0, %ymm0
vmulps  32(%rdx,%rsi,4), %ymm1, %ymm1
; ││└
; ││┌ @ float.jl:408 within `+`
```

How can I generate the fma instructions, and also fine-tune the unrolling factor for loops in Julia?

Xin


Have you tried `LoopVectorization.jl`?
Using `@turbo unroll = 4` (instead of `@simd @inbounds`) should let you suggest an unrolling factor. Whether it will generate the desired fma instructions I don't know for sure, but I expect it will.


Usually LoopVectorization does a very good job, but if you want to investigate alternatives, some other options are:

• Add `@fastmath` to your loop.
• Explicitly call `fma` or `muladd` within the loop.
• Use the `SIMD.jl` package. This requires more work on your part but also gives you full control.
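For instance, the second option might look like the following sketch (the function name `vecfma_muladd!` is just illustrative, not from the thread):

```julia
# Sketch of the "explicitly call muladd" option: muladd(a, b, c)
# computes a*b + c and is allowed to contract to a single FMA
# instruction where the hardware supports it.
function vecfma_muladd!(c, a, b, n)
    @inbounds @simd for i in 1:n
        c[i] = muladd(a[i], b[i], c[i])
    end
    return nothing
end

a = rand(Float32, 64); b = rand(Float32, 64); c = zeros(Float32, 64)
vecfma_muladd!(c, a, b, 64)
```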

Julia doesn't allow floating-point reassociation by default. You need to give it a flag like `@fastmath` to tell it that the contraction is legal.
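At the expression level, this is a minimal sketch of what `@fastmath` does (the function names are hypothetical):

```julia
# @fastmath rewrites * and + to Base.FastMath.mul_fast / add_fast,
# which carry the fast-math flags that let LLVM contract the
# multiply and add into one fused multiply-add.
f_fast(a, b, c)  = @fastmath a * b + c
f_plain(a, b, c) = a * b + c
```

For most inputs the two functions agree; they may differ by one rounding when the intermediate product is not exactly representable.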


Complementing the point by @Oscar_Smith: `-Ofast` in the C compiler line is not the same as `-O3`, which enables most/all optimizations while respecting the standard. The `gcc`/`g++` manual will tell you that `-Ofast` enables fast math, which breaks some standards:

```
-Ofast
    Disregard strict standards compliance.  -Ofast enables all -O3
    optimizations.  It also enables optimizations that are not valid for all
    standard-compliant programs.  It turns on -ffast-math,
    -fallow-store-data-races and the Fortran-specific -fstack-arrays, unless
    -fmax-stack-var-size is specified, and -fno-protect-parens.  It turns
    off -fsemantic-interposition.
```

Don't write a vectorized function when a scalar function will suffice. Julia has broadcasting to make simple loops over scalar functions.

Your function (with FMA) can be written as any of the one-liners

```julia
c .= muladd.(a, b, c)  # expands singleton dimensions automatically
@. c = muladd(a, b, c) # the macro expands this to be identical to the above
```

`muladd` permits FMA operations and `fma` requires them (even if they must be painstakingly emulated on your hardware), so `muladd` should be used for speed and `fma` for correctness. When I use the broadcast definition `c .= ...`, I observe FMA instructions and 4x unrolling. I didn't try the `map!` version.
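The semantic difference is observable with values where the intermediate rounding matters. A small sketch (the constants are chosen so that `a*b` rounds back to `1.0f0` in `Float32`):

```julia
x = Float32(2.0^-13)
a, b, c = 1 + x, 1 - x, -1.0f0

a * b + c    # 0.0f0: the product 1 - x^2 rounds to 1.0f0 before the add
fma(a, b, c) # -x^2 exactly (single rounding): a tiny nonzero number
# muladd(a, b, c) is free to return either result, whichever is faster.
```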

But something using `LoopVectorization.jl` is usually the fastest option, because it takes extra effort to make sure the tail of the operation is fast (which can otherwise dominate runtime at small and medium sizes). If you care about achieving the highest speeds possible, it's a good candidate.


Thank you all!

After reading "The Julia Language v1.9.0" I tried to set `JULIA_LLVM_ARGS` to pass options to the LLVM backend, but the results were quite disappointing.

The Julia and Bash codes are:

```julia
using InteractiveUtils

function vecfma!(c, a, b, n)
    @fastmath @inbounds @simd for i in 1:n
        c[i] += a[i] * b[i]
    end
    return nothing
end

code_native(vecfma!, (Vector{Float32}, Vector{Float32}, Vector{Float32}, Int32,))
```

```shell
export JULIA_LLVM_ARGS=" -unroll-count=4 "
julia --cpu-target=skylake -O3 -t 1 vecfma.jl > as.out
```

The problems are (as can be seen in `as.out`):

1. 4x loop unrolling does not happen (in fact, no loop unrolling at all!)
2. vector instructions (`v*ps`) are replaced by scalar instructions (`v*ss`)

IMHO, these are very common optimization techniques, and a good compiler is expected to perform them automatically (without relying on other packages or manual verification from the programmers).

Indeed, I can confirm that running

```shell
export JULIA_LLVM_ARGS=" -unroll-count=4 "
julia --cpu-target=skylake -O3 -t 1 vecfma.jl > as.out
```

makes it worse.

But removing the LLVM flag and running with `--cpu-target=tigerlake` gives me the right thing (tested on `v1.8.5, v1.9.0, v1.9.1`):

```asm
...
; ││┌ @ essentials.jl:13 within `getindex`
vmovups (%rcx,%rsi,4), %ymm0
vmovups 32(%rcx,%rsi,4), %ymm1
vmovups 64(%rcx,%rsi,4), %ymm2
vmovups 96(%rcx,%rsi,4), %ymm3
vmovups (%rdx,%rsi,4), %ymm4
vmovups 32(%rdx,%rsi,4), %ymm5
vmovups 64(%rdx,%rsi,4), %ymm6
vmovups 96(%rdx,%rsi,4), %ymm7
; ││└
; ││┌ @ fastmath.jl:165 within `add_fast`
vfmadd213ps     (%rax,%rsi,4), %ymm0, %ymm4 # ymm4 = (ymm0 * ymm4) + mem
vfmadd213ps     32(%rax,%rsi,4), %ymm1, %ymm5 # ymm5 = (ymm1 * ymm5) + mem
vfmadd213ps     64(%rax,%rsi,4), %ymm2, %ymm6 # ymm6 = (ymm2 * ymm6) + mem
vfmadd213ps     96(%rax,%rsi,4), %ymm3, %ymm7 # ymm7 = (ymm3 * ymm7) + mem
; ││└
; ││┌ @ array.jl:969 within `setindex!`
vmovups %ymm4, (%rax,%rsi,4)
vmovups %ymm5, 32(%rax,%rsi,4)
vmovups %ymm6, 64(%rax,%rsi,4)
vmovups %ymm7, 96(%rax,%rsi,4)
...
```

Thanks for the verification. I just assume a decent compiler should give the programmers the option(s) to control how loops are unrolled, to optimize code performance.

Don't hesitate to submit your PRs if you think the compiler is not decent enough.

> … give the programmers the option(s) to control how the loops are unrolled to optimize the code performance.

As others said, you can use `@turbo` from `LoopVectorization.jl` (docs: API reference · LoopVectorization.jl) to do this manually (although it really should work for `skylake` out of the box):

```julia
using LoopVectorization
using InteractiveUtils

function vecfma!(c, a, b, n)
    @turbo unroll = 4 for i in 1:n
        c[i] += a[i] * b[i]
    end
    return nothing
end

code_native(vecfma!, (Vector{Float32}, Vector{Float32}, Vector{Float32}, Int32,))
```