Kahan summation is a great example where fastmath can wreak havoc, but FMA alone is not the problem. Rather, the problem is the reassociation enabled by the other fastmath flags.
If you want a case where FMA alone is undesirable, look no further than complex arithmetic:
Yes, here I also just applied @fastmath because I am lazy, but the issue in this case arises purely from the application of FMA and not from additional reassociations.
We all know that imag(x*y) == real(x)*imag(y) + imag(x)*real(y). But when we permit this to be replaced by fma(real(x), imag(y), imag(x)*real(y)), we lose the precise cancellation we were counting on to deliver a zero imaginary component when multiplying conjugate pairs.
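A minimal sketch of that lost cancellation (the specific value of z is an arbitrary pick of mine, not from the original post):

z = 0.1 + 0.3im
zr, zi = real(z), imag(z)

# imag(z * conj(z)) computed without fusion: zr*(-zi) and zi*zr round to the same
# magnitude with opposite signs, so the sum cancels exactly to 0.0.
unfused = zr * (-zi) + zi * zr

# With one product fused, that cancellation is lost: what survives is the rounding
# error of zi*zr, i.e. a spurious (generally nonzero) imaginary component.
fused = fma(zr, -zi, zi * zr)

@show unfused fused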
FMA is probably the single weakest fastmath optimization. It can, at best, change 2 instructions into 1. It cannot eliminate entire subexpressions, branches, or other blocks of code.
A dot product on my x86-64 usually requires a load, multiply, and add (i.e., 3 instructions; the load is required because only one operand can come from memory) for each operand pair, plus loop overhead. FMA reduces this to just a load and an FMA. I usually see 3 instructions of loop overhead, so at 4x unrolling (with or without SIMD) it changes a 15-instruction loop into an 11-instruction one, for a 37% increase in throughput (assuming all instructions are equal, which isn't so far off for the ones used in a dot product). With grotesque levels of unrolling, FMA could approach 50% additional throughput. However, this operation is memory-bound even before FMA, so practical speedups are much smaller once the data is out of cache. Still, FMA can at least offer speedups in compute-bound situations. But these compute-bound calculations are usually not built purely of FMAs, so you still can't realize the full theoretical 2x speedup. EDIT: matrix-matrix multiplication is a ubiquitous example of an FMA-heavy, compute-bound calculation where this full speedup can be realized; see the discussion in later posts.
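For concreteness, a rough sketch of the kind of loop I'm counting instructions for (the 4x unrolling and the function name are mine; this is not a tuned kernel):

function dot4(x::Vector{Float64}, y::Vector{Float64})
    @assert length(x) == length(y) && length(x) % 4 == 0
    s1 = s2 = s3 = s4 = 0.0
    @inbounds for i in 1:4:length(x)
        # each muladd may be contracted to a single FMA, replacing a mul and an add
        s1 = muladd(x[i],   y[i],   s1)
        s2 = muladd(x[i+1], y[i+1], s2)
        s3 = muladd(x[i+2], y[i+2], s3)
        s4 = muladd(x[i+3], y[i+3], s4)
    end
    return (s1 + s2) + (s3 + s4)
end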
While FMA is specified in the IEEE 754 standard, the standard does not require (or, as far as I can tell, even suggest) enabling floating-point contraction by default. This is a purely language-level decision. Do note, however, that automatic contraction breaks the usual evaluation rules (each operation rounded separately, usually left-to-right) within most languages.
fma is literally required to return the result of a fused multiply-add with the rounding it entails. If a system does not have native FMA (e.g., the 2008 laptop I have limping around somewhere…) then it will painstakingly emulate the instruction (see @edit Base.fma_emulated(1.0,2.0,3.0)) to produce the equivalent result. Emulated FMA is more than 30x slower than unfused mul-add on my machine.
muladd is more flexible. It permits FMA, but does not require it. Whether it does is up to the compiler. I guess in your case it decided not to.
Use fma if you require the rounding semantics. Otherwise, use muladd if you want the compiler to have the option to try to speed things up.
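A small illustration of the difference (the inputs are arbitrary; the muladd result depends on what the compiler chooses for your target):

x, y = 0.1, 0.3
z = -(x * y)                  # the rounded product, negated

fma(x, y, z)                  # singly rounded: exposes the rounding error of x*y (generally nonzero)
muladd(x, y, z)               # either 0.0 or the fma result, at the compiler's discretion
x * y + z                     # exactly 0.0: two separately rounded operations
Base.fma_emulated(x, y, z)    # the software path used when no hardware FMA exists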
Note that, in your example, you had already computed a*b when you compared a*b >= c. The compiler probably noticed this and recycled the a*b calculation in the muladd below. This means it only needed to do an add to that previous result to complete the muladd. It preferred this over starting from scratch to do an fma. Perhaps if you removed the assert line, so that it didn't need to compute a*b earlier, it would have used an fma later? But I don't know why it didn't do the same with the @fastmath version, so clearly this explanation is incomplete.
Another one is the fact that fma(-a, b, a*b) is not in general equal to zero in the presence of rounding, even though with rounding a*b-a*b obviously is. A consequence is that if fma is used in complex multiplication, then conj(z)*z can have a nonzero imaginary part.
I'll add that there is more “art” to error checking than one might naively imagine. For example, the above function, re-written as
function f(a,b,c)
    d = muladd(a,b,-c)  # the compiler may or may not fuse this into a hardware fma
    @assert d ≥ 0       # check the same quantity we are about to take the sqrt of
    return sqrt(d)
end
achieves “consistency” (recall that checking a*b >= c works for checking that a*b - c >= 0 but not fma(a,b,-c) >= 0) and does use fma on my machine. Note that I would usually prefer fma in a calculation that looked like this: it's faster and (more importantly) more accurate in most settings. Don't shy away from it implicitly. But you should try to recognize when it may be inappropriate (e.g., complex arithmetic). In general, you'll be less surprised by raw operations like * and +, so I'd default to these and only add muladd after you take a moment to give it some thought.
I prefer Julia's current stance, where this optimization (usually called “floating-point contraction”) is opt-in rather than opt-out. Contraction messes with the normal arithmetic rules and is a compiler optimization that may or may not occur. If the compiler can fiddle with it, it's hard to see from the code alone where it might be going wrong. For example, @fastmath sqrt(-1.0) returns NaN on my machine. It would be devilishly hard to recognize what had happened if this calculation started returning NaN after you'd explicitly ensured a*b >= c. Adding extra prints or checks might cause the issue to mysteriously vanish, turning this into a Heisenbug.
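To make the scenario concrete, here is a sketch with values of my own choosing where the guard passes but the contracted expression goes negative:

a, b, c = 1.0 + 2.0^-52, 1.0 - 2.0^-52, 1.0

a*b >= c        # true: the product rounds up to exactly 1.0
a*b - c         # 0.0, so sqrt(a*b - c) is fine
fma(a, b, -c)   # -2.0^-104: slightly negative once the multiply is not rounded first
# sqrt(fma(a, b, -c)) throws a DomainError; with @fastmath-style contraction it would
# instead quietly produce NaN, long after you "proved" the argument was nonnegative.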
As you show, unrolling is huge: it reduces loop overhead and hides latency. However, the default compiler already does this (at least when @fastmath or @simd allows the reduction to be reorganized, or when no reduction is performed, as in axpy). You also show that reducing memory accesses (even to data that is already in cache) is significant in a memory-bound operation.
My comment was specifically a comparison of fma versus multiply-then-add in the common use case of dot (arguably the canonical case for basic discussions). All of your demonstrated dot products use fma (thanks to @turbo), so they don't address this point.
I thought my napkin math on a 37% speed increase to dot via instruction counting was reasonable. I didn't bother to benchmark the difference because I would have needed to write a manually-unrolled implementation, since @turbo and @simd (#49387) both enable floating-point contraction. And then I would have had to optimize it to avoid additional protest…
You are absolutely correct to point out that matrix-matrix multiplication (with a tile-based implementation, or otherwise at small, known sizes) is a compute-bound operation that can successfully exploit the full 2x speedup of FMA. I should have included this in my mention. However, typical users do not write this kernel unless they fail to recognize what they're computing… but I would anticipate that someone with the expertise to write a tiled matmul that fully benefits from fma would recognize what was happening and just use BLAS or StaticArrays.
My point is that “general” calculations are often memory-bound and/or require other operations, in which case mul-add forms only some fraction of the computation and the total benefit will be less than 2x. But I was too dismissive in my original reply: some workloads really are dominated by mul-add, and fma can at least deliver the full 2x in those situations.
My other point was that contract is still the “most boring” fastmath optimization (and probably the safest as a result, at least in very broad terms). It can never provide more than a 2x performance benefit, because at best it replaces two instructions with one instruction (which is the same speed or slower than either of the other two individually, otherwise the others would be implemented in terms of it). This isn't to say 2x is small (it's quite big and I'm grateful to have it). But other fastmath flags can eliminate entire calculations, resulting in unbounded speedup. For example, sum(z->0*z, x::Vector{Float64}) could be entirely eliminated and replaced by 0.0 when the no-signed-zeros and no-nan flags are enabled, no matter the length or content of x (although the compiler is not so clever at present). As a less extreme example, reassociation can reduce Kahan summation to naive summation by eliminating the compensation calculations (but in this context that is usually an accident, and someone specifically wanted Kahan summation).
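For reference, a textbook compensated (Kahan) summation; the comment marks the step that a reassociating compiler is allowed to simplify away:

function kahan_sum(x)
    s = 0.0
    c = 0.0                 # running compensation for lost low-order bits
    for xi in x
        y = xi - c
        t = s + y
        c = (t - s) - y     # algebraically zero, so reassociation may delete it
        s = t
    end
    return s
end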
Ha ha, I have no idea what I thought I read, or why I thought it had something to do with unrolling.
For a dot product, I actually don't expect fma to be of much help at all.
Most CPUs can do 2 loads/cycle, and 2 arithmetic operations/cycle.
Zen3 and Zen4 CPUs, however, can actually do twice as many arithmetic operations per cycle as loads!
Meaning they'll be severely bottlenecked by loads, rather than by fma.
Intel AVX512 chips with a single fma/cycle are the only CPUs I would expect to get much benefit (this means some ancient Xeons, since most Xeons have double fma/cycle, and client CPUs like the icelake and tigerlake parts found in laptops, or rocketlake).
Here are a couple of examples: fma, *+.
This shows SkylakeX going from 4 to 4.5 cycles when going from fma to separate mul and add, while tigerlake actually takes the full 2x penalty.
A real dot product suffers from needing to reduce vectors (overhead) and memory latency and bandwidth, so real world differences will be much smaller.
BTW, StaticArrays didn't use muladd for years. A lot of people simply don't think to use it. I certainly would not hold StaticArrays up as an example of an optimized library. It takes the approach of generating a ton of code, throwing it at the compiler, and crossing its fingers.
contract is boring, but it's generally good, and it applies something people generally avoid using because they don't think the small benefit is worth the hit to readability.
This is why libraries like MuladdMacro.jl exist.
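Assuming the MuladdMacro.jl package, usage looks roughly like this (a sketch; see the package README for the exact details):

using MuladdMacro

# @muladd rewrites a*b + c patterns into muladd(a, b, c) at macro-expansion time,
# keeping the source readable while still permitting contraction.
@muladd axpy(a, x, y) = a*x + y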
Reassociation is nice, because it allows the compiler to SIMD and unroll reductions, breaking the dependency chain of accumulators. This is great, and would be annoying to do manually.
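A minimal sketch of that helpful case (the function name is mine): @simd grants exactly the reassociation needed to vectorize the reduction, which strict left-to-right evaluation would otherwise serialize.

function sum_simd(x::Vector{Float64})
    s = 0.0
    @inbounds @simd for i in eachindex(x)
        s += x[i]   # @simd allows this accumulation to be reordered and vectorized
    end
    return s
end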
But in a lot of other examples, like undoing a Kahan summation, that's an “optimization” that would have been easier to do manually (just write the naive sum) than it is to stop the compiler from doing.
But I have also seen code_typed littered with massive numbers of multiplies by 0 and additions, thanks to ForwardDiff. These could have been cleaned up by fastmath, so I agree that, in practice, we really do have tons of non-trivial examples where the ability to delete code would be helpful, and likely to make a bigger difference than muladd.
FWIW, most Julia code I run spends most of its time on GC, waiting on memory, dynamic dispatches, etc. Low-level optimizations are so far down the list of priorities that @fastmath doesn't matter.
One subtle issue here that is discussed on GitHub but hasn't been brought up here is that there isn't one obvious way to apply this optimization. If you write a*b + c*d there are two ways to apply FMA:
fma(a, b, c*d)
fma(c, d, a*b)
Which one gets done by LLVM? Well, that's hard to predict because LLVM knows that + is commutative, so it might have changed a*b + c*d into c*d + a*b by the time the FMA transformation gets done. So by enabling automatic FMAs, you've taken code that's completely unambiguous and deterministic (the naive implementation of a*b + c*d as two multiplies followed by an add) and changed it into something where what you're computing depends on the whims of this particular version of the compiler, and could change if we upgrade LLVM or choose a different set or ordering of optimization passes.
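For example, the two legal contractions generally round differently (values arbitrary), so which result you get depends on the compiler's choice:

a, b, c, d = 0.1, 0.3, 0.7, 0.9
fma(a, b, c*d)   # rounds c*d first, then fuses a*b into the add
fma(c, d, a*b)   # rounds a*b first, then fuses c*d into the add
# these two can differ in the last bit, and the unfused a*b + c*d is a third possibility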
Note that there's no such problem if you explicitly write fma in your code: then it's completely unambiguous what you want to compute, and it will always be the same. It also seems fine to me if you've used @fastmath to explicitly give the compiler permission to compute something a bit different from what you wrote if it deems it faster. In such cases, we understand that the compiler is making a judgement and that this judgement might change. But we try very hard not to do that kind of thing by default in Julia: we compute what you asked for, not something else.
Would you be OK with introducing FMA for the former case (like clang and gcc do), and when all numbers are real, and doing it in 1.x? I.e., leave out the other case, and some others I could think of, e.g. a*b*c + d*e or a*b + c*d*e, which I'm not sure clang optimizes.
It seems bad not to be competitive with clang and gcc, which by default can produce potentially 2x faster (and usually more accurate) results, and thus with basically all compiled code, including e.g. Python code that relies on such compiled libraries. Yes, you can opt in with fma in Julia, but programmers might not, and for some cases we would need a way to opt out (or not; it's unclear how that is done with clang, at least on a non-global level)?
Are there some less restrictive cases where you would be opposed to FMA until 2.0? Or even just for case 1 or 2?
I'm trying to think of why that doesn't apply to clang too… I see it uses FMA (i.e. vfmadd213sd) down to -O0 and -O1. The former, though, comes with a lot more instructions, but still not a separate mul and add. (gcc still uses those separately up to, but not including, -O2; then fma.)
Clang and GCC do not do this transformation by default. It's only done if you've asked for some unsafe non-IEEE compliant math mode because it changes the meaning of code. We do the exact same thing.
EDIT: That's just not true in all cases. Clang does it by default depending on the arch, e.g. it uses fmadd.d for RISC-V rv64gc with clang 14.0.0 (changed from older versions; same change for 32-bit). It's not only at -O1, but also with -O0, or with no options, which I assume is the same. [It's just that I overlooked it when looking at the boilerplate; the adds you see then are not for the floating point.]
That's not entirely true, depending on what you mean by you “asked for” it, and “by default”. I see I can get it with only -O1 or -O2, depending on compiler/arch, e.g. for “power64 AT 13.0 (gcc3)”.
Nobody uses -O0 for production compiles (as far as I know; it's the fastest to compile, and thus gives the slowest code, used for development/debugging), and since -O0, or at least -O1, triggers fma only in some cases, it seems clear that's what clang wants to do if it can get away with it (i.e., can assume fma hardware).
What I wrote before is for that Godbolt result, and yes, it has -march=haswell (which isn't a default, yet…) because:
They are already over-10-year-old CPUs, and I kind of expect it to be implicit at some point. I certainly don't expect asking for tuning for that arch to mean asking for “unsafe non-IEEE compliant math mode”, which I see it does down to -O0 depending on the compiler (I only found that one combination generating FMA with -O0, and it went away if -march=haswell was taken out).
I think the writing is on the wall: FMA (Haswell) will be assumed available, as with e.g. the zig c++ compiler at -O0 for x86 (though it generates long boilerplate unless -O1 is used).
About “unsafe non-IEEE compliant”: you may be right about “non-IEEE compliant”, but actually IEEE may not prevent it; I'm not sure. I know IEEE specifies FMA and allows it as a separate operation. Does it for sure forbid the FMA substitution? I would understand if it were “unsafe”, but I ask again: why would clang then do it by default if it is unsafe and not allowed? I think it's actually safer, i.e. more accurate in many (most?) cases, and while the results are then not bit-identical, and sometimes less accurate in the context of other instructions, I can see it being a controversial change.
I confirmed Julia doesn't produce FMA, even with:
$ julia -C haswell
but can you confidently state that Julia will not do it on some platform already, e.g. ARM (I'm not sure how I would [cross-]compile for it), nor that Julia will ever do it with some upgrade to its LLVM dependency? Julia's default is -O2, and I've never known exactly what that controls except directing the LLVM back-end.
A notable absence of FMA is on “WebAssembly clang (trunk)”, at all optimization levels (also, trunk is the only “version” listed, so a newly supported target?).
Problem and existing solutions
In WASM there is no way to get a speed improvement from this widely supported instruction. There are two ways to implement a fallback if the hardware does not support it: one that is correctly rounded but slow, and a simple multiply-then-add combination that is fast but accumulates error.
For example, OpenCL has both. In C/C++ you have to implement the fma fallback yourself, but in Rust/Go/C#/Java/Julia, fma is implemented with a correctly rounded fallback. It doesn't make much difference how it is implemented, because you can always detect the fma feature at runtime initialisation with a special floating-point expression and implement a conditional fallback as you wish in your software.
If at least the first two basic instructions were implemented it would be a great step forward, because right now software focused on precision needs to implement a correct fma fallback with more than 23 FLOPs instead of one. […]
Implementation
Here is a draft of how to implement a software fallback, based on the relatively new Boldo-Melquiond paper.
Strangely, ARM64 needs -O2 for fmadd (used in AArch64), while (32-bit) ARM only needs -O1 for vmla.f64 (also a “Floating-point multiply accumulate”), plus a vmov.f64 instruction…
Also strangely, x64 msvc generates worse code at -O3 than at -O2… neither uses fma, but -O3 adds boilerplate code to what -O2 does…