# How to enable vectorized fma instruction for multiply-add vectors?

Hi,

The vectorized fma instructions (unrolled 4x) are easily generated for the C code below via `clang -Ofast -S -Wall -std=c11 -march=skylake vecfma.c`.

```c
void vecfma(float * restrict c, float * restrict a, float * restrict b, const int n)
{
    for (int i = 0; i < n; i++)
        c[i] += a[i] * b[i];
}
```

```asm
vfmadd213ps     (%rdi,%rax), %ymm0, %ymm4 # ymm4 = (ymm0 * ymm4) + mem
vfmadd213ps     32(%rdi,%rax), %ymm1, %ymm5 # ymm5 = (ymm1 * ymm5) + mem
vfmadd213ps     64(%rdi,%rax), %ymm2, %ymm6 # ymm6 = (ymm2 * ymm6) + mem
vfmadd213ps     96(%rdi,%rax), %ymm3, %ymm7 # ymm7 = (ymm3 * ymm7) + mem
```

However, with `julia --cpu-target=skylake` I cannot get these fma instructions for the Julia version of this function (I get `vmulps` and `vaddps` instead).

```julia
function vecfma!(c, a, b, n)
    @inbounds @simd for i in 1:n
        c[i] += a[i] * b[i]
    end
    return nothing
end

code_native(vecfma!, (Vector{Float32}, Vector{Float32}, Vector{Float32}, Int32,))
```

And the unrolling factor is 2x (not 4x).

```asm
; ││┌ @ float.jl:410 within `*`
vmulps  (%rdx,%rsi,4), %ymm0, %ymm0
vmulps  32(%rdx,%rsi,4), %ymm1, %ymm1
; ││└
; ││┌ @ float.jl:408 within `+`
```

How can I generate the fma instructions, and also fine-tune the unrolling factor for loops in Julia?

Xin


Have you tried `LoopVectorization.jl`?
Using `@turbo unroll = 4` (instead of `@simd @inbounds`) should let you suggest an unrolling factor. Whether it will generate the desired fma instructions I don't know for sure, but I expect it will.


Usually LoopVectorization does a very good job, but if you want to investigate alternatives, some other options are:

• Add `@fastmath` to your loop.
• Explicitly call `fma` or `muladd` within the loop.
• Use the `SIMD.jl` package. This requires more work on your part but also gives you full control.
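For instance, the second option might look like the following sketch (the function name `vecfma_muladd!` is just illustrative, not from the thread):

```julia
# Sketch of the "explicitly call muladd" option: muladd(a, b, c)
# computes a*b + c and is allowed to contract to a single FMA
# instruction where the hardware supports it.
function vecfma_muladd!(c, a, b, n)
    @inbounds @simd for i in 1:n
        c[i] = muladd(a[i], b[i], c[i])
    end
    return nothing
end

a = rand(Float32, 64); b = rand(Float32, 64); c = zeros(Float32, 64)
vecfma_muladd!(c, a, b, 64)
```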

Julia doesn't allow floating-point reassociation by default. You need to give it a flag like `@fastmath` to tell it that the contraction is legal.
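At the expression level, this is a minimal sketch of what `@fastmath` does (the function names are hypothetical):

```julia
# @fastmath rewrites * and + to Base.FastMath.mul_fast / add_fast,
# which carry the fast-math flags that let LLVM contract the
# multiply and add into one fused multiply-add.
f_fast(a, b, c)  = @fastmath a * b + c
f_plain(a, b, c) = a * b + c
```

For most inputs the two functions agree; they may differ by one rounding when the intermediate product is not exactly representable.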


Complementing the point by @Oscar_Smith: `-Ofast` in the C compiler line is not the same as `-O3`, which enables most/all optimizations while respecting the standard. The `gcc`/`g++` manual will tell you that `-Ofast` enables fast math, which breaks some standards:

```
-Ofast
    Disregard strict standards compliance.  -Ofast enables all -O3
    optimizations.  It also enables optimizations that are not valid for all
    standard-compliant programs.  It turns on -ffast-math,
    -fallow-store-data-races and the Fortran-specific -fstack-arrays, unless
    -fmax-stack-var-size is specified, and -fno-protect-parens.  It turns
    off -fsemantic-interposition.
```

Don't write a vectorized function when a scalar function will suffice. Julia has broadcasting to make simple loops over scalar functions.

Your function (with FMA) can be written as any of the one-liners

```julia
c .= muladd.(a, b, c)  # expands singleton dimensions automatically
@. c = muladd(a, b, c) # the macro expands this to be identical to the above
```

`muladd` permits FMA operations and `fma` requires them (even if they must be painstakingly emulated on your hardware), so `muladd` should be used for speed and `fma` for correctness. When I use the broadcast definition `c .= ...`, I observe FMA instructions and 4x unrolling. I didn't try the `map!` version.
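The semantic difference is observable with values where the intermediate rounding matters. A small sketch (the constants are chosen so that `a*b` rounds back to `1.0f0` in `Float32`):

```julia
x = Float32(2.0^-13)
a, b, c = 1 + x, 1 - x, -1.0f0

a * b + c    # 0.0f0: the product 1 - x^2 rounds to 1.0f0 before the add
fma(a, b, c) # -x^2 exactly (single rounding): a tiny nonzero number
# muladd(a, b, c) is free to return either result, whichever is faster.
```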

But something using `LoopVectorization.jl` is usually the fastest option, because it takes extra effort to make sure the tail of the operation is fast (which can otherwise dominate runtime at small and medium sizes). If you care about achieving the highest speeds possible, it's a good candidate.


Thank you all!

After reading "The Julia Language v1.9.0" I tried to set `JULIA_LLVM_ARGS` to pass options to the LLVM backend, but the results were quite disappointing.

The Julia and Bash codes are:

```julia
using InteractiveUtils

function vecfma!(c, a, b, n)
    @fastmath @inbounds @simd for i in 1:n
        c[i] += a[i] * b[i]
    end
    return nothing
end

code_native(vecfma!, (Vector{Float32}, Vector{Float32}, Vector{Float32}, Int32,))
```

```shell
export JULIA_LLVM_ARGS=" -unroll-count=4 "
julia --cpu-target=skylake -O3 -t 1 vecfma.jl > as.out
```

The problems are (as can be seen in `as.out`):

1. 4x loop unrolling does not happen (in fact, no loop unrolling at all!)
2. vector instructions (`v*ps`) are replaced by scalar instructions (`v*ss`)

IMHO, these are very common optimization techniques, and a good compiler is expected to perform them automatically (without relying on other packages or manual verification from the programmers).

Indeed, I can confirm that running

```shell
export JULIA_LLVM_ARGS=" -unroll-count=4 "
julia --cpu-target=skylake -O3 -t 1 vecfma.jl > as.out
```

makes it worse.

But removing the LLVM flag and running with `--cpu-target=tigerlake` gives me the right thing (tested on `v1.8.5, v1.9.0, v1.9.1`):

```asm
...
; ││┌ @ essentials.jl:13 within `getindex`
vmovups (%rcx,%rsi,4), %ymm0
vmovups 32(%rcx,%rsi,4), %ymm1
vmovups 64(%rcx,%rsi,4), %ymm2
vmovups 96(%rcx,%rsi,4), %ymm3
vmovups (%rdx,%rsi,4), %ymm4
vmovups 32(%rdx,%rsi,4), %ymm5
vmovups 64(%rdx,%rsi,4), %ymm6
vmovups 96(%rdx,%rsi,4), %ymm7
; ││└
; ││┌ @ fastmath.jl:165 within `add_fast`
vfmadd213ps     (%rax,%rsi,4), %ymm0, %ymm4 # ymm4 = (ymm0 * ymm4) + mem
vfmadd213ps     32(%rax,%rsi,4), %ymm1, %ymm5 # ymm5 = (ymm1 * ymm5) + mem
vfmadd213ps     64(%rax,%rsi,4), %ymm2, %ymm6 # ymm6 = (ymm2 * ymm6) + mem
vfmadd213ps     96(%rax,%rsi,4), %ymm3, %ymm7 # ymm7 = (ymm3 * ymm7) + mem
; ││└
; ││┌ @ array.jl:969 within `setindex!`
vmovups %ymm4, (%rax,%rsi,4)
vmovups %ymm5, 32(%rax,%rsi,4)
vmovups %ymm6, 64(%rax,%rsi,4)
vmovups %ymm7, 96(%rax,%rsi,4)
...
```

Thanks for the verification. I just assume a decent compiler should give the programmers the option(s) to control how loops are unrolled, to optimize code performance.

Don't hesitate to submit your PRs if you think the compiler is not decent enough.

> … give the programmers the option(s) to control how the loops are unrolled to optimize the code performance.

As others said, you can use `@turbo` from `LoopVectorization.jl` (docs: API reference · LoopVectorization.jl) to do this manually (although it really should work for `skylake` out of the box):

```julia
using LoopVectorization
using InteractiveUtils

function vecfma!(c, a, b, n)
    @turbo unroll = 4 for i in 1:n
        c[i] += a[i] * b[i]
    end
    return nothing
end

code_native(vecfma!, (Vector{Float32}, Vector{Float32}, Vector{Float32}, Int32,))
```