# How to enable vectorized fma instruction for multiply-add vectors?

Hi,

The vectorized fma instructions (unrolled 4x) are easily generated for the C code below via `clang -Ofast -S -Wall -std=c11 -march=skylake vecfma.c`.

```c
void vecfma(float * restrict c, float * restrict a, float * restrict b, const int n)
{
    for (int i = 0; i < n; i++)
        c[i] += a[i] * b[i];
}
```

```asm
vfmadd213ps     (%rdi,%rax), %ymm0, %ymm4 # ymm4 = (ymm0 * ymm4) + mem
vfmadd213ps     32(%rdi,%rax), %ymm1, %ymm5 # ymm5 = (ymm1 * ymm5) + mem
vfmadd213ps     64(%rdi,%rax), %ymm2, %ymm6 # ymm6 = (ymm2 * ymm6) + mem
vfmadd213ps     96(%rdi,%rax), %ymm3, %ymm7 # ymm7 = (ymm3 * ymm7) + mem
```

However, with `julia --cpu-target=skylake` I cannot get these fma instructions for the Julia version of this function (I get `vmulps` and `vaddps` instead).

```julia
function vecfma!(c, a, b, n)
    @inbounds @simd for i in 1:n
        c[i] += a[i] * b[i]
    end
    return nothing
end

code_native(vecfma!, (Vector{Float32}, Vector{Float32}, Vector{Float32}, Int32,))
```

And the unrolling factor is 2x (not 4x).

```asm
; ││┌ @ float.jl:410 within `*`
vmulps  (%rdx,%rsi,4), %ymm0, %ymm0
vmulps  32(%rdx,%rsi,4), %ymm1, %ymm1
; ││└
; ││┌ @ float.jl:408 within `+`
```

How can I generate the fma instructions, and also fine-tune the unrolling factor for loops in Julia?

Xin


Have you tried `LoopVectorization.jl`?
Using `@turbo unroll = 4` (instead of `@simd @inbounds`) should let you suggest an unrolling factor. Whether it will generate the desired fma instructions I don't know for sure, but I expect it will.


Usually LoopVectorization does a very good job, but if you want to investigate alternatives, some other options are:

• Add `@fastmath` to your loop.
• Explicitly call `fma` or `muladd` within the loop.
• Use the `SIMD.jl` package. This requires more work on your part but also gives you full control.
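For instance, the second option might look like the following sketch (the function name `vecfma_muladd!` is just illustrative, not from the thread):

```julia
# Sketch of the "explicitly call muladd" option: muladd(a, b, c)
# computes a*b + c and is allowed to contract to a single FMA
# instruction where the hardware supports it.
function vecfma_muladd!(c, a, b, n)
    @inbounds @simd for i in 1:n
        c[i] = muladd(a[i], b[i], c[i])
    end
    return nothing
end

a = rand(Float32, 64); b = rand(Float32, 64); c = zeros(Float32, 64)
vecfma_muladd!(c, a, b, 64)
```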

Julia doesn't allow floating-point reassociation by default. You need to give it a flag like `@fastmath` to tell it that the contraction is legal.
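At the expression level, this is a minimal sketch of what `@fastmath` does (the function names are hypothetical):

```julia
# @fastmath rewrites * and + to Base.FastMath.mul_fast / add_fast,
# which carry the fast-math flags that let LLVM contract the
# multiply and add into one fused multiply-add.
f_fast(a, b, c)  = @fastmath a * b + c
f_plain(a, b, c) = a * b + c
```

For most inputs the two functions agree; they may differ by one rounding when the intermediate product is not exactly representable.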


Complementing the point by @Oscar_Smith: `-Ofast` in the C compiler line is not the same as `-O3`, which enables most/all optimizations while respecting the standard. The `gcc`/`g++` manual will tell you that `-Ofast` enables fast math, which breaks some standards:

```
-Ofast
    Disregard strict standards compliance.  -Ofast enables all -O3
    optimizations.  It also enables optimizations that are not valid for all
    standard-compliant programs.  It turns on -ffast-math,
    -fallow-store-data-races and the Fortran-specific -fstack-arrays, unless
    -fmax-stack-var-size is specified, and -fno-protect-parens.  It turns
    off -fsemantic-interposition.
```

Don't write a vectorized function when a scalar function will suffice. Julia has broadcasting to make simple loops over scalar functions.

Your function (with FMA) can be written as any of the one-liners

```julia
c .= muladd.(a, b, c)  # expands singleton dimensions automatically
@. c = muladd(a, b, c) # the macro expands this to be identical to the above
```

`muladd` permits FMA operations and `fma` requires them (even if they must be painstakingly emulated on your hardware), so `muladd` should be used for speed and `fma` for correctness. When I use the broadcast definition `c .= ...`, I observe FMA instructions and 4x unrolling. I didn't try the `map!` version.
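The semantic difference is observable with values where the intermediate rounding matters. A small sketch (the constants are chosen so that `a*b` rounds back to `1.0f0` in `Float32`):

```julia
x = Float32(2.0^-13)
a, b, c = 1 + x, 1 - x, -1.0f0

a * b + c    # 0.0f0: the product 1 - x^2 rounds to 1.0f0 before the add
fma(a, b, c) # -x^2 exactly (single rounding): a tiny nonzero number
# muladd(a, b, c) is free to return either result, whichever is faster.
```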

But something using `LoopVectorization.jl` is usually the fastest option, because it takes extra effort to make sure the tail of the operation is fast (which can otherwise dominate runtime at small and medium sizes). If you care about achieving the highest speeds possible, it's a good candidate.


Thank you all!

After reading "The Julia Language v1.9.0" I tried to set `JULIA_LLVM_ARGS` to pass options to the LLVM backend, but the results were quite disappointing.

The Julia and Bash codes are:

```julia
using InteractiveUtils

function vecfma!(c, a, b, n)
    @fastmath @inbounds @simd for i in 1:n
        c[i] += a[i] * b[i]
    end
    return nothing
end

code_native(vecfma!, (Vector{Float32}, Vector{Float32}, Vector{Float32}, Int32,))
```

```shell
export JULIA_LLVM_ARGS=" -unroll-count=4 "
julia --cpu-target=skylake -O3 -t 1 vecfma.jl > as.out
```

The problems are (as can be seen in `as.out`):

1. 4x loop unrolling does not happen (in fact, no loop unrolling at all!)
2. vector instructions (`v*ps`) are replaced by scalar instructions (`v*ss`)

IMHO, these are very common optimization techniques, and a good compiler is expected to perform them automatically (without relying on other packages or manual verification from the programmers).

Indeed, I can confirm that running

```shell
export JULIA_LLVM_ARGS=" -unroll-count=4 "
julia --cpu-target=skylake -O3 -t 1 vecfma.jl > as.out
```

makes it worse.

But removing the LLVM flag and running with `--cpu-target=tigerlake` gives me the right thing (tested on `v1.8.5, v1.9.0, v1.9.1`):

```asm
...
; ││┌ @ essentials.jl:13 within `getindex`
vmovups (%rcx,%rsi,4), %ymm0
vmovups 32(%rcx,%rsi,4), %ymm1
vmovups 64(%rcx,%rsi,4), %ymm2
vmovups 96(%rcx,%rsi,4), %ymm3
vmovups (%rdx,%rsi,4), %ymm4
vmovups 32(%rdx,%rsi,4), %ymm5
vmovups 64(%rdx,%rsi,4), %ymm6
vmovups 96(%rdx,%rsi,4), %ymm7
; ││└
; ││┌ @ fastmath.jl:165 within `add_fast`
vfmadd213ps     (%rax,%rsi,4), %ymm0, %ymm4 # ymm4 = (ymm0 * ymm4) + mem
vfmadd213ps     32(%rax,%rsi,4), %ymm1, %ymm5 # ymm5 = (ymm1 * ymm5) + mem
vfmadd213ps     64(%rax,%rsi,4), %ymm2, %ymm6 # ymm6 = (ymm2 * ymm6) + mem
vfmadd213ps     96(%rax,%rsi,4), %ymm3, %ymm7 # ymm7 = (ymm3 * ymm7) + mem
; ││└
; ││┌ @ array.jl:969 within `setindex!`
vmovups %ymm4, (%rax,%rsi,4)
vmovups %ymm5, 32(%rax,%rsi,4)
vmovups %ymm6, 64(%rax,%rsi,4)
vmovups %ymm7, 96(%rax,%rsi,4)
...
```

Thanks for the verification. I just assume a decent compiler should give the programmers the option(s) to control how loops are unrolled, to optimize code performance.

Don't hesitate to submit your PRs if you think the compiler is not decent enough.

> … give the programmers the option(s) to control how the loops are unrolled to optimize the code performance.

As others said, you can use `@turbo` from `LoopVectorization.jl` (docs: API reference · LoopVectorization.jl) to do this manually (although it really should work for `skylake` out of the box):

```julia
using LoopVectorization
using InteractiveUtils

function vecfma!(c, a, b, n)
    @turbo unroll = 4 for i in 1:n
        c[i] += a[i] * b[i]
    end
    return nothing
end

code_native(vecfma!, (Vector{Float32}, Vector{Float32}, Vector{Float32}, Int32,))
```