x * y + z does not automatically use the FMA instruction

EDIT: That’s just not true in all cases. Clang does it by default depending on the arch, e.g. it uses fmadd.d for RISC-V rv64gc with clang 14.0.0 (a change from older versions; the same change applies to 32-bit). And not only at -O1: also at -O0, or with no options, which I assume is the same. [I had simply overlooked it in the boilerplate; the adds you see there are not the floating-point ones.]

That’s not entirely true, depending on what you mean by “asked for” it and “by default”. I see I can get it with only -O1 or -O2, depending on compiler/arch, e.g. for “power64 AT 13.0 (gcc3)”.

Nobody uses -O0 for production builds (as far as I know; it’s the fastest compile, thus the slowest code, used for development/debugging), and since -O0, or at least -O1, triggers FMA only in some cases, it seems clear clang wants to do it whenever it can get away with it (i.e. when it can assume FMA hardware).

What I wrote before was for that Godbolt result, and yes, it has -march=haswell (which isn’t a default, yet…) because:

Haswell CPUs are already over 10 years old, and I kind of expect that arch to become implicit at some point. I certainly don’t expect that asking for tuning for that arch means asking for an “unsafe non-IEEE compliant math mode”, which I see it does down to -O0 depending on the compiler (I only found that one combination generating FMA at -O0, and it went away if -march=haswell was taken out).

I think the writing is on the wall: FMA (Haswell) will be assumed available, as with e.g. the zig c++ compiler at -O0 for x86 (though it generates long boilerplate unless -O1 is used).

About “unsafe non-IEEE compliant”: you may be right about “non-IEEE compliant”, but actually IEEE may not prevent it; I’m not sure. I know IEEE 754 specifies FMA and allows it as a separate operation. Does it for sure forbid the FMA substitution? I would understand “unsafe”, but then I ask again: why would clang do it by default if it’s unsafe and not allowed? I think it’s actually safer, i.e. more accurate in many (most?) cases, though yes, not bit-identical then, and sometimes less accurate in the context of other instructions, so I can see it as a controversial change.

I confirmed Julia doesn’t produce FMA, at least not even with:

$ julia -C haswell

but can you confidently state that Julia will not do it on some platform already, e.g. ARM (I’m not sure how I [cross-]compile for it), or that Julia will never do it after some upgrade to its LLVM dependency? Julia’s default is -O2, and I’ve never known exactly what that controls except directing the LLVM back-end.

A notable absence of FMA is on “WebAssembly clang (trunk)”, at all optimization levels (trunk is also the only “version” listed, so a newly supported target?).

Problem and existing solutions

In WASM there is no way to get a speed improvement from this widely supported instruction. There are two ways to implement a fallback when the hardware does not support it: a correctly rounded one that is slow, and the simple multiply-then-add combination that is fast but accumulates error.

For example, OpenCL has both. In C/C++ you have to implement an fma fallback yourself, but in Rust/Go/C#/Java/Julia fma is implemented with a correctly rounded fallback. It doesn’t make much difference how it is implemented, because you can always detect the fma feature at runtime initialisation with a special floating-point expression and implement a conditional fallback as you wish in your software.

If at least the first two basic instructions are implemented it will be a great step forward, because right now software focused on precision needs to implement a correct fma fallback with more than 23 FLOPs instead of one. […]

Implementation

Here is a draft of how to implement a software fallback, based on the relatively new Boldo-Melquiond paper.

Strangely, ARM64 needs -O2 for fmadd (the AArch64 spelling), while (32-bit) ARM needs only -O1 for vmla.f64 (also a “Floating-point multiply accumulate”) plus a vmov.f64 instruction…

Also strangely, x64 msvc generates worse code at -O3 than at -O2… neither uses FMA, and -O3 adds boilerplate code to what -O2 does…