X * y + z does not automatically use FMA instruction

mikmoore · June 22, 2023, 3:55pm

Kahan summation is a great example where fastmath can wreak havok, but FMA alone is not the problem. Rather, the problem is the reassocation enabled using the other fastmath flags.

If you want a case where FMA alone is undesirable, look no further than complex arithmetic:

julia> x = Complex(0.1,0.2)
0.1 + 0.2im

julia> x*x'
0.05000000000000001 + 0.0im

julia> @fastmath x*x'
0.05000000000000001 - 1.6653345369377347e-18im

Yes, here I also just applied @fastmath because I am lazy, but the issue in this case arises purely from the application of FMA and not from additional reassociations.

We all know that imag(x*y) == real(x)*imag(y) + imag(x)*real(y). But when we permit this to be replaced by fma(real(x), imag(y), imag(x)*real(y), we lose the precise cancellation we were counting on to deliver a zero imaginary component when multiplying conjugate pairs.

FMA is probably the single weakest fastmath optimization. It can, at best, change 2 instructions into 1. It cannot eliminate entire subexpressions, branches, or other blocks of code.

A dot product on my x86-64 usually requires a load, multiply, and add (i.e., 3 instructions – the load is required because only one operand can be from memory) for each operand pair, plus loop overhead. FMA reduces this to just a load and an FMA. I usually see 3 instructions of loop overhead, so at 4x unrolling (with or without SIMD) it changes a 15 instruction loop to 11 for a 37% increase in throughput (assuming all instructions are equal, which isn’t so far off for the ones used in a dot product). With grotesque levels of unrolling, FMA could approach 50% additional throughput. However, this operation is memory-bound even before FMA so practical speedups are much smaller once the data is out of cache. Still, FMA can at least offer speedups in compute-bound situations. But these compute-bound calculations are usually not built purely of FMAs so you still can’t realize the full theoretical 2x speedup. EDIT: matrix-matrix multiplication is a ubiquitous example of a FMA-heavy, compute-bound calculation where this full speedup can be realized – see discussion in later posts.

While FMA is specified in IEEE754 standard, it is not required (or as far as I can tell, even suggested) to enable floating point contractions by default. This is a purely language-level decision. Do note, however, that FMA breaks associativity rules (usually left-to-right) within most languages.

mikmoore · June 22, 2023, 4:08pm

fma is literally required to return the result of a fused multiply-add with the rounding it entails. If a system does not have native FMA (e.g., the 2008 laptop I have limping around somewhere…) then it will painstakingly emulate the instruction (see @edit Base.fma_emulated(1.0,2.0,3.0)) to produce the equivalent result. Emulated FMA is more than 30x slower than unfused mul-add on my machine.

muladd is more flexible. It permits FMA, but does not require it. Whether it does is up to the compiler. I guess in your case it decided not to.

Use fma if you require the rounding semantics. Otherwise, use muladd if you want the compiler to have the option to try to speed things up.

Note that, in your example, you had already computed a*b when you compared a*b >= c. The compiler probably noticed this and recycled the a*b calculation in the muladd below. This means it only needed to do an + to that previous result to complete the muladd. It preferred this over starting from scratch to do a fma. Perhaps if you removed the assert line, so that it didn’t need to compute a*b earlier, it would have used an fma later? But I don’t know why it didn’t do the same with the @fastmath version, so clearly this explanation is incomplete.

GLOBEX_CORP · June 22, 2023, 5:11pm

Thank you for this post. Admittedly I am learning a lot reading all of the posts in this thread.

As you suggested, if i use muladd it does not force the use of FMA as it already performed the multiplication prior to the comparison.

...
	fmul	d0, d10, d9
; │└
	fcmp	d0, d8
	b.lt	LBB0_3
; %bb.1:                                ; %L4
; │ @ REPL[11]:3 within `f_muladd`
; │┌ @ float.jl:413 within `muladd`
	fsub	d0, d0, d8
; │└
; │┌ @ math.jl:677 within `sqrt`
	fcmp	d0, #0.0
...

This section looks identical to the simple function suggested by @arch.d.robison.

...
	fmul	d0, d10, d9
; │└
	fcmp	d0, d8
	b.lt	LBB0_3
; %bb.1:                                ; %L4
; │ @ REPL[2]:3 within `f`
; │┌ @ float.jl:409 within `-`
	fsub	d0, d0, d8
; │└
; │┌ @ math.jl:677 within `sqrt`
	fcmp	d0, #0.0
...

While the fma version uses a floating-point negated multiply-subtract which is I guess the issue.

...
	fmul	d0, d10, d9
; │└
	fcmp	d0, d8
	b.lt	LBB0_3
; %bb.1:                                ; %L12
; │ @ REPL[1]:3 within `f_fma`
; │┌ @ floatfuncs.jl:426 within `fma`
; ││┌ @ floatfuncs.jl:421 within `fma_llvm`
	fnmsub	d0, d10, d9, d8
; │└└
; │┌ @ math.jl:677 within `sqrt`
	fcmp	d0, #0.0
...

mstewart · June 22, 2023, 5:28pm

Nice example. This is discussed in Kahan’s Lecture Notes on the Status of IEEE Standard 754 for Binary Floating-Point Arithmetic. It’s not at all esoteric. It can happen in computing fma(-4*a, c, b^2) thereby causing problems with the quadratic formula.

Another one is the fact that fma(-a, b, a*b) is not in general equal to zero in the presence of rounding, even though with rounding a*b-a*b obviously is. A consequence is that if fma is used in complex multiplication, then conj(z)*z can have a nonzero imaginary part.

Elrod · June 22, 2023, 5:49pm

Matrix multiply is basically a 2x speedup.
It’s the building block of a ton of machine learning and linear algebra.

mikmoore · June 22, 2023, 5:50pm

I’ll add that there is more “art” to error checking than one might naively imagine. For example, the above function, re-written as

function f(a,b,c)
    d = muladd(a,b,-c)
    @assert d ≥ 0
    return sqrt(d)
end

achieves “consistency” (recall that checking a*b >= c works for checking that a*b-c >= 0 but not fma(a,b,-c) >= 0) and does use fma on my machine. Note that I would usually prefer fma in a calculation that looked like this – it’s faster and (more importantly) more accurate in most settings. Don’t shy away from it implicitly. But you should try to recognize when it may be inappropriate (eg, complex arithmetic). In general, you’ll be less surprised by raw operations like * and +, so I’d default to these and only add muladd after you take a moment to give it some thought.

I prefer Julia’s current stance, where this optimization (usually called “floating point contraction”) is opt-in rather than opt-out. Contraction messes with the normal arithmetic rules and is a compiler optimization that may or may not occur. If the compiler can fiddle with it, it’s hard to see from the code alone where it might be going wrong. For example, @fastmath sqrt(-1.0) returns NaN on my machine. It would be devilishly hard to recognize what had happened if this calculation started returning NaN after you’d explicitly ensured a*b >= c. Adding extra prints or checks might cause the issue to mysteriously vanish, turning this into a Heisenbug.

mikmoore · June 22, 2023, 6:59pm

As you show, unrolling is huge - it reduces loop overhead and buries latency. However, this is done by the default compiler (at least when @fastmath or @simd allows the reduction to be reorganized, or when no reduction is performed like in axpy). You also show that reducing memory access (even to data that is already in the cache) is significant in a memory-bound operation.

My comment was specifically a comparison of fma versus multiply-then-add in the common use case of dot (arguably the canonical case for basic discussions). All of your demonstrated dot products use fma (thanks to @turbo), so don’t address this point.

I thought my napkin math on a 37% speed increase to dot via instruction counting was reasonable. I didn’t bother to benchmark the difference because I would have needed to write a manually-unrolled implementation since @turbo and @simd (#49387) both enable floating point contract. And then I would have had to optimize it to avoid additional protest…

You are absolutely correct to point out that matrix-matrix multiplication (with a tile-based implementation or otherwise at small, known sizes) is a compute-bound operation that can successfully exploit the full 2x speedup of FMA. I should have included this in my mention. However, typical users do not write this kernel unless they fail to recognize what they’re computing… but I would anticipate that someone with the expertise to write a tiled matmul to fully benefit from fma would recognize what was happening and just use BLAS or StaticArrays.

My point is that “general” calculations often are either memory bound and/or require other operations, in which case mul-add will form only some fraction of the computation and the total benefit will be less than 2x. But I was too dismissive in my original reply – some workloads really are dominated by mul-add and fma can at least deliver the full 2x in those situations.

My other point was that contract is still the “most boring” fastmath optimization (and probably the safest as a result, at least in very broad terms). It can never provide more than a 2x performance benefit because at best it can replace two instructions with one instruction (that is the same speed or slower than either of the other two individually, otherwise the others would be implemented in terms of it). This isn’t to say 2x is small (it’s quite big and I’m grateful to have it). But other fastmath flags can eliminate entire calculations, resulting in unbounded speedup. For example, sum(z->0*z,x::Vector{Float64}) could be entirely eliminated and replaced by 0.0 when the no-signed-zeros and no-nan flags are enabled, no matter the length or content of x (although the compiler is not so clever, at present). As a less extreme example, reassoc can reduce Kahan summation to naive summation by eliminating the compensation calculations (but in this context this is usually an accident and someone specifically wanted Kahan summation).

Elrod · June 22, 2023, 7:42pm

Ha ha, I have no idea what I thought I read, or why I thought it had something to do with unrolling.

For a dot product, I actually don’t expect fma to be of much help at all.
Most CPUs can do 2 loads/cycle, and 2 arithmetic operations/cycle.
Zen3 and Zen4 CPUs, however, can actually do twice the arithmetic as the number of loads!
Meaning they’ll be severely bottlenecked by loads, rather than fma.

Intel AVX512 w/ a single fma/cycle are the only CPUs I would expect to get much benefit (this means some ancient Xeons – most have double fma/cycle – and client CPUs like icelake and tigerlakes found in laptops, or rocketlake).

Here are a couple examples:
fma
*+
This shows SkylakeX going from 4->4.5 cycles when going from fma → mul and add, while tigerlake actually gets the 2x penalty.
A real dot product suffers from needing to reduce vectors (overhead) and memory latency and bandwidth, so real world differences will be much smaller.

BTW, StaticArrays didn’t use muladd for years. A lot of people simply don’t think to use it. I certainly would not hold StaticArrays up as an example of an optimized library. It takes the approach of generating a ton of code, throwing it at the compiler, and crossing its fingers.

contract is boring, but it’s generally good, and applies something people generally avoid using because they don’t think the small benefit is worth the hit on readability.
This is why libraries like Muladdmacro exist.

Reassociation is nice, because it allows the compiler to SIMD reductions and unroll, to break the dependency chain of accumulators. This is great, and would be annoying to do manually.
But in a lot of other examples, like undoing a kahan summation, that’s an optimization that’d have been easier to do manually than to not do.

But I have also seen code_typed littered with massive amounts of multiplies by 0 and additions, thanks to ForwardDiff. These could have been cleared up by fastmath, so I agree that in practice, we really do have tons of non-trivial examples where the ability to delete code would be helpful, and likely to make a bigger difference than muladd.

FWIW, most Julia code I run spends most of its time on GC, waiting on memory, dynamic dispatches, etc. Low level optimizations are so far down the list of priorities, that @fastmath doesn’t matter.

StefanKarpinski · June 23, 2023, 2:16pm

One subtle issue here that is discussed on GitHub but hasn’t been bought up here is that there isn’t one obvious way to apply this optimization. If you write a*b + c*d there’s two ways to apply FMA:

fma(a, b, c*d)
fma(c, d, a*b)

Which one gets done by LLVM? Well, that’s hard to predict because LLVM knows that + is commutative, so it might have changed a*b + c*d into c*d + a*b by the time the FMA transformation gets done. So by enabling automatic FMAs, you’ve taken code that’s completely unambiguous and deterministic—the naive implementation of a*b + c*d as two multiplies followed by an add—and changed it to something where what you’re computing depends on whims of this particular version of the compiler and could change if we upgrade LLVM or choose a different set or ordering of optimization passes.

Note that there’s no such problem if you explicitly write fma in your code: then it’s completely unambiguous what you want to compute and will always be the same. It also seems fine to me if you’ve used @fast_math to explicitly give the compiler permission to compute something a bit different than what you wrote if it deems it faster. In such cases, we understand that the compiler is making a judgement and that this judgement might change. But we try very hard not to do that kind of thing by default in Julia—we compute what you asked for, not something else.

Palli · July 23, 2023, 11:38pm

Would you be ok with introducing FMA for the former case (like clang and gcc do), and when all numbers are real, and do it in 1.x? I.e. leave out the other case, and some other I could think of e.g. a*b*c + d*e or a*b + c*d*e, that I’m not sure clang optimizes.

It seems bad to not be competitive with clang and gcc producing potentially 2x faster (and more accurate usually) results by default, and thus basically all compiled code, also in e.g. Python, since it might rely on such compiled libraries. Yes, you can opt-in with fma in Julia, but programmers might not, and we would for some cases need a way to opt- out (or not, unclear how done with clang; on a non-global level)?

Are there some less restrictive cases where you would be opposed to FMA until 2.0? Or even just for case 1 or 2?

I’m trying to think of why what doesn’t apply to clang too… I see it uses FMA (i.e. vfmadd213sd) down to -O0, and -O1. The former though with a lot more instructions, but still not separate mul and add. (gcc still uses those separately, up to, not including -O2, then fma).

StefanKarpinski · July 27, 2023, 3:36pm

Clang and GCC do not do this transformation by default. It’s only done if you’ve asked for some unsafe non-IEEE compliant math mode because it changes the meaning of code. We do the exact same thing.

Palli · July 27, 2023, 11:14pm

EDIT: That’s just not true in all cases. Clang does it by default depending on the arch, e.g. uses fmadd.d for RISC-V rv64gc clang 14.0.0 (changed from older versions; same change for 32-bit). It’s not only for -O1, also with -O0, or no options which I assume is the same. [It’s just that I overlooked when looking at boilerplate, the adds you see then are not for the floating point.]

That’s not entirely true, depending on what you mean by you “asked for” it, and “by default”. I see I can get it with only -O1 or -O2, depending on compiler/arch, e.g. for “power64 AT 13.0 (gcc3)”.

Nobody uses -O0 for production compiles (as far as I know, it’s the fastest compile, thus slowest code, used for development/debug), and since -O0 or at least -O1 only in some cases triggers fma, it seems clear that’s what clang wants to do it if gets away with it (can assume fma hardware).

What I wrote before is for that Gotbot result and yes it has -march=haswell (which isn’t a default, yet…) because:

They are already over 10-year old CPUs, and I kind of expect it to be implicit at some point. I certainly don’t expect asking for tuning for that arch meaning asking for “unsafe non-IEEE compliant math mode” which I see it does down to -O0 depending on the compiler (~~I only found that one combination~~ generating FMA with -O0, and it went away if -march=haswell taken out).

I think the writing is on the wall, FMA (Haswell) will be assumed available as with e.g. zig c++ compiler on -O0 for x86 (though it generates long boilerplace unless -O1 used).

About “unsafe non-IEEE compliant”, you may be right about “non-IEEE compliant”, but actually IEEE may not prevent it, I’m not sure. I know IEEE specifies FMA and allows it as a separate operation. Does it for sure forbid the FMA substitution? I would understand if “unsafe”, but I ask again, why would clang then do it by default if unsafe and not allowed? I think it’s actually safer, i.e. more accurate in many (most?) cases, and while yes not bit-identical then, and sometimes less accurate in context of other instructions, I can see it as a controversial change.

I confirmed Julia doesn’t produce FMA at least even with:

$ julia -C haswell

but can you confidently state that Julia will not do it on some platform already e.g. ARM (I’m not sure how I [cross]-compile for it), nor that Julia will ever to it with some upgrade to its LLVM dependency? Since Juli’a default is -O2 and I’ve never know exactly what that controls except directing LLVM back-end.

Notable absence of FMA is on “WebAssembly clang (trunk)”, on all optimization levels (also trunk only “version” listed, so newly supported target?).

github.com/WebAssembly/design

[post-MVP proposal] FMA instruction

opened 08:31PM - 14 Dec 20 UTC

munrocket

floating point

**Motivation** Fused multiply–add (FMA) is a floating-point operation perform…ed in one step, with a single rounding. FMA can speed up and improve the accuracy of many computations: dot product, matrix multiplication, convolutions and artificial neural networks, polynomial evaluation, Newton's method, Kahan summation, Veltkamp-Dekker algorithm. This instruction exist in languages like: [C](https://en.cppreference.com/w/c/numeric/math/fma) / [C++](https://en.cppreference.com/w/cpp/numeric/math/fma), [Rust](https://users.rust-lang.org/t/why-does-the-mul-add-method-produce-a-more-accurate-result-with-better-performance/1626), [C#](https://docs.microsoft.com/en-us/dotnet/api/system.math.fusedmultiplyadd?view=net-5.0), [Go](https://golang.org/src/math/fma.go?s=2262:2295), [Java](https://www.geeksforgeeks.org/math-fma-method-in-java-with-examples/), [Julia](https://github.com/JuliaLang/julia/issues/6330), [Swift](https://developer.apple.com/documentation/simd/fma), [OpenGL 4+](https://www.khronos.org/registry/OpenGL-Refpages/gl4/html/fma.xhtml). **Problem and existing solutions** In WASM there is no way to get speed improvement from this widely supported instruction. There are two way how to implement fallback if hardware is not support it: correctly rounded but slow and simple combination that fast but accumulate error. For example OpenCL has both. In C/C++ you have to implement fma fallback by yourself but in Rust/Go/C#/Java/Julia fma implemented with correctly rounded fallback. It doesn't make much difference how it will be implemented because you always can detect fma feature in runtime initialisation with special floating point expression and implement conditional fallback as you wish in your software. If at least first two basic instructions will be implemented it will be great step forward, because right now software that focused on precision need to implement correct fma fallback with more than 23 FLOPs instead of one. This can be finance application or arbitrary precision libraries or space flight simulator with orbital dynamic. **Proposed instructions** ```lisp (f64.fma $x $y $z) ;; equivalent to fma(x, y, z) in C99 (f32.fma $x $y $z) ;; equivalent to fmaf(x, y, z) in C99 ``` Usually languages compiles fma(x,y,-z) into fused multiply-subtract under the hood. Since .wasm is compilation target looks like all instruction set can be implemented. ```lisp (f64.fms $x $y $z) ;; RN(x * y - z) = fma(x,y,-z) (f32.fms $x $y $z) ;; (f64.fnma $x $y $z) ;; RN(-x * y + z) (f32.fnma $x $y $z) ;; (f64.fnms $x $y $z) ;; RN(-x * y - z) (f32.fnms $x $y $z) ;; ``` But not everyone doing all of them because result with negation the same. **Implementation** Here a draft how to implement software fallback based on relatively new [Boldo-Melquiond paper](https://hal-ens-lyon.archives-ouvertes.fr/inria-00080427v2/document). ```c #include <math.h> //fma, FP_FAST_FMA static inline double two_prod(const double x, const double y, double &err) { double splitter = 0x8000001p0; //2^27+1 //float splitter = 0x2001p0; //2^13+1 double t = splitter * a; double ah = t + (a - t); double al = a - ah; t = splitter * b; double bh = t + (b - t); double bl = b - bh; t = a * b; err = ((ah * bh - t) + ah * bl + al * bh) + al * bl; return t; } static inline double two_sum(const double a, const double b, double &err){ double s = a + b; double a1 = s - b; err = (a - a1) + (b - (s - a1)); return s; } static inline double fma_correct_fallback(const double x, const double y, const double z) { // Check overflows // Veltkamp-Dekker multiplication: x * y -> (mul, err) // Moller-Knuth summation: mul + z -> (sum, err2) // Boldo-Melquiond ternary summation: sum + err + err2 -> fma } static inline double f64_fma(const double x, const double y, const double z) { #ifdef FP_FAST_FMA return fma(x, y, z); #else return fma_correct_fallback(x, y, z); #endif } ``` But [chromium](https://chromium.googlesource.com/external/github.com/WebAssembly/musl/+/landing-branch/src/math/fma.c) and [apple](https://opensource.apple.com/source/Libm/Libm-315/Source/Intel/xmm_fma.c) already had some own implementation.

Problem and existing solutions

In WASM there is no way to get speed improvement from this widely supported instruction. There are two way how to implement fallback if hardware is not support it: correctly rounded but slow and simple combination that fast but accumulate error.

For example OpenCL has both. In C/C++ you have to implement fma fallback by yourself but in Rust/Go/C#/Java/Julia fma implemented with correctly rounded fallback. It doesn’t make much difference how it will be implemented because you always can detect fma feature in runtime initialisation with special floating point expression and implement conditional fallback as you wish in your software.

If at least first two basic instructions will be implemented it will be great step forward, because right now software that focused on precision need to implement correct fma fallback with more than 23 FLOPs instead of one. […]

Implementation

Here a draft how to implement software fallback based on relatively new Boldo-Melquiond paper.

Strangely ARM64 needs -O2 for fmadd (used in AArch64), while (32-bit) ARM only needs -O1 for, vmla.f64 (it’s also “Floating-point multiply accumulate”) plus a vmov.f64 instruction…

Also strangely, x64 msvc generates worse code on -O3 than with -O2… neither fma, but -O3 adding boilerplate code to what -O2 does…

Topic		Replies	Views
Why doesn’t Julia fma automatically? General Usage	43	2397	October 18, 2023
Floating point optimizations General Usage question , float	5	474	November 1, 2022
`@fastmath` is not applied to macros Performance fast-math	5	407	July 20, 2023
Use different methods depending on presence of FMA Internals & Design	21	2149	July 1, 2017
Fast math in NASA benchmark General Usage	8	1516	July 23, 2020

X * y + z does not automatically use FMA instruction

Related topics