Hi,

Vectorized FMA instructions (unrolled 4x) are easily generated for the C code below with `clang -Ofast -S -Wall -std=c11 -march=skylake vecfma.c`:

```
void vecfma(float * restrict c, float * restrict a, float * restrict b, const int n)
{
    for (int i = 0; i < n; i++)
        c[i] += a[i] * b[i];
}
```

```
vfmadd213ps (%rdi,%rax), %ymm0, %ymm4 # ymm4 = (ymm0 * ymm4) + mem
vfmadd213ps 32(%rdi,%rax), %ymm1, %ymm5 # ymm5 = (ymm1 * ymm5) + mem
vfmadd213ps 64(%rdi,%rax), %ymm2, %ymm6 # ymm6 = (ymm2 * ymm6) + mem
vfmadd213ps 96(%rdi,%rax), %ymm3, %ymm7 # ymm7 = (ymm3 * ymm7) + mem
```

However, with `julia --cpu-target=skylake` I cannot get these FMA instructions for the Julia version of this function; separate `vmulps` and `vaddps` instructions are generated instead.

```
function vecfma!(c, a, b, n)
    @inbounds @simd for i in 1:n
        c[i] += a[i] * b[i]
    end
    return nothing
end
code_native(vecfma!, (Vector{Float32}, Vector{Float32}, Vector{Float32}, Int32,))
```

The unrolling factor is also only 2x, not 4x:

```
; ││┌ @ float.jl:410 within `*`
vmulps (%rdx,%rsi,4), %ymm0, %ymm0
vmulps 32(%rdx,%rsi,4), %ymm1, %ymm1
; ││└
; ││┌ @ float.jl:408 within `+`
vaddps (%rax,%rsi,4), %ymm0, %ymm0
vaddps 32(%rax,%rsi,4), %ymm1, %ymm1
```

How can I generate the FMA instructions, and how can I fine-tune the loop unrolling factor in Julia?

Thanks in advance!

Xin