x * y + z does not automatically use the FMA instruction

EDIT: That’s just not true in all cases. Clang does it by default depending on the arch, e.g. it uses fmadd.d for RISC-V rv64gc with clang 14.0.0 (a change from older versions; the same change applies to 32-bit). And not only at -O1: also at -O0, or with no options, which I assume is the same. [I had simply overlooked it in the boilerplate; the adds you see there are not the floating-point ones.]

That’s not entirely true, depending on what you mean by “asked for” it and “by default”. I see I can get it with only -O1 or -O2, depending on compiler/arch, e.g. for “power64 AT 13.0 (gcc3)”.

Nobody uses -O0 for production builds (as far as I know; it’s the fastest compile, thus the slowest code, used for development/debugging), and since -O0, or at least -O1, triggers FMA only in some cases, it seems clear clang wants to do it whenever it can get away with it (i.e. when it can assume FMA hardware).

What I wrote before was for that Godbolt result, and yes, it has -march=haswell (which isn’t a default, yet…) because:

Haswell CPUs are already over 10 years old, and I kind of expect that arch to become implicit at some point. I certainly don’t expect that asking for tuning for that arch means asking for an “unsafe non-IEEE compliant math mode”, which I see it does down to -O0 depending on the compiler (I only found that one combination generating FMA at -O0, and it went away if -march=haswell was taken out).

I think the writing is on the wall: FMA (Haswell) will be assumed available, as with e.g. the zig c++ compiler at -O0 for x86 (though it generates long boilerplate unless -O1 is used).

About “unsafe non-IEEE compliant”: you may be right about “non-IEEE compliant”, but actually IEEE may not prevent it; I’m not sure. I know IEEE 754 specifies FMA and allows it as a separate operation. Does it for sure forbid the FMA substitution? I would understand “unsafe”, but then I ask again: why would clang do it by default if it’s unsafe and not allowed? I think it’s actually safer, i.e. more accurate in many (most?) cases, though yes, not bit-identical then, and sometimes less accurate in the context of other instructions, so I can see it as a controversial change.

I confirmed Julia doesn’t produce FMA, at least not even with:

$ julia -C haswell

but can you confidently state that Julia will not do it on some platform already, e.g. ARM (I’m not sure how I [cross-]compile for it), or that Julia will never do it after some upgrade to its LLVM dependency? Julia’s default is -O2, and I’ve never known exactly what that controls except directing the LLVM back-end.

A notable absence of FMA is on “WebAssembly clang (trunk)”, at all optimization levels (trunk is also the only “version” listed, so a newly supported target?).

Problem and existing solutions

In WASM there is no way to get a speed improvement from this widely supported instruction. There are two ways to implement a fallback when the hardware does not support it: a correctly rounded one that is slow, and the simple multiply-then-add combination that is fast but accumulates error.

For example, OpenCL has both. In C/C++ you have to implement an fma fallback yourself, but in Rust/Go/C#/Java/Julia fma is implemented with a correctly rounded fallback. It doesn’t make much difference how it is implemented, because you can always detect the fma feature at runtime initialisation with a special floating-point expression and implement a conditional fallback as you wish in your software.

If at least the first two basic instructions are implemented it will be a great step forward, because right now software focused on precision needs to implement a correct fma fallback with more than 23 FLOPs instead of one. […]

Implementation

Here is a draft of how to implement a software fallback, based on the relatively new Boldo-Melquiond paper.

Strangely, ARM64 needs -O2 for fmadd (the AArch64 spelling), while (32-bit) ARM needs only -O1 for vmla.f64 (also a “Floating-point multiply accumulate”) plus a vmov.f64 instruction…

Also strangely, x64 msvc generates worse code at -O3 than at -O2… neither uses FMA, and -O3 adds boilerplate code to what -O2 does…