Why doesn’t Julia fma automatically?

Is there any reason why the compiler doesn’t optimize these? I feel like having to do this is a detriment to high level programming, and especially the math like syntax that I can write in Julia (same expression I use in my paper).

fma changes results because IEEE floating point numbers are weird.


Yes, it changes them to be more accurate (only 1 roundoff instead of 2), so this still really doesn’t make sense.

Yes, it changes them to be more accurate (only 1 roundoff instead of 2), so this still really doesn’t make sense.

It changes them to be unpredictable

more accurate or not is mostly irrelevant here


I guess this was already discussed lots of times. One relevant thread:

The TLDR is because of correctness, it is assumed that the user uses the exact FP operations that are necessary, and changing the operations in such a way that the result is changed would be incorrect. This is consistent with other optimizations that compilers (not just Julia) do, in general an optimization is allowed if it doesn’t change any observable result.

The thread also gives specific examples of FP code that can break in such cases. Search for “error-free transforms” or “multi-word arithmetics” or “floating-point expansions” for more. An example Julia package that relies on the current behavior is MultiFloats. Many implementations of transcendental math functions would also be broken.


They’re perfectly predictable; whenever you see A x + b, it gives the same answer, consistently, so people familiar with FMA can predict exactly what it will do. People who aren’t familiar with FMA can also predict exactly what it will do: It will calculate A x + b, and anything beyond that doesn’t matter to them, and because if your code relies on whether A x + b rounds exactly once or exactly twice, it’s bad code.

The problem is a*b+c*d there are two ways to fuse this that will give different answers.


It’s consistent with some other compilers (like C or C++) but not with others (like Haskell or every Python DSL for machine learning).

Yeah, you’re right that this is an unfortunate case. That being said, I think it still makes sense to just pick a standard interpretation, since (a*b + (c*d)) or vice-versa will both usually be a bit more accurate than (a*b) + (c*d). Alternatively, we could choose to interpret anything more complex than a * b + c in strict numerical order, and just stick to the easy win of optimizing a * b + c.

f(a,b) = a * b
f(a,b) + c

would then depend on inlining? Are floating point results allowed to depend on inlining?


Not sure what you mean. If I make a program that prints “hello world”, in whichever language, I wouldn’t expect it to print “hello Earth”. Unless that was somehow explicitly allowed by some documented semantics, of course.

1 Like

Or rather would depend on inlining NOT happening, or at least if inlining done, not behaving as if written as one function. And that would be reasonable to do.

Because changing exact floating-point values isn’t (and can’t) be considered a breaking change, or it would imply that we could never change any numerical function. If you figure out an optimization that lets you double the speed of sum, too bad, can’t be merged, it’s a breaking change. Upgrading LLVM would be a breaking change too (because LLVM doesn’t guarantee exact floating-point results). Pretty much any kind of optimization would end up becoming illegal (e.g. broadcast fusion would have to be thrown out as well).

Whenever I write a * x + b, I’m already asking for an abstract thing, and accepting that what I get will only be approximately correct (that’s how floats work!); I can’t get hung up on the exact values of the last one or two bits.


Yes, they are (and already do in a few situations). The exact value of the last bit isn’t guaranteed.

pretty false equivalency between changing sum and changing +

they have different contracts

1 Like

Neither of them guarantees exact floating-point values in their contract (and they can’t, because LLVM doesn’t guarantee exact floating-point values). But if you want a more analogous case, A * B * C chooses the execution order that has the best performance, rather than exactly left-to-right.


A * B * C chooses the execution order that has the best performance, rather than exactly left-to-right.

and this is indeed already a controversial design choice itself. but in any case (A*B)*C does not. where do you draw the line? if you allow rewriting a*b + c, why not rewrite more complicated arithmetic? will parentheses save you? or only function barriers? or neither?

because LLVM doesn’t guarantee exact floating-point values

I am not an LLVM expert, but I’d venture a guess that * and + mean one thing and one thing only. without fast math flags, doesn’t it follow IEEE754?


It’s very unclear that LLVM does NOT fuse by default, or at least isn’t allowed to, or that IEEE disallows that. IEEE demands FMA available now; since 2008 revision? At least 2019 revision. LLVM is an implementation detail, Clang based on it does FMA by default:


The C standard permits intermediate floating-point results within an expression to be computed with more precision than their type would normally allow. This permits operation fusing, and Clang takes advantage of this by default. This behavior can be controlled with the FP_CONTRACT and clang fp contract pragmas. Please refer to the pragma documentation for a description of how the pragmas interact with this option.

Valid values are:

  • fast (fuse across statements disregarding pragmas, default for CUDA)
  • on (fuse in the same statement unless dictated by pragmas, default for languages other than CUDA/HIP)
  • off (never fuse)
  • fast-honor-pragmas (fuse across statements unless dictated by pragmas, default for HIP)

It’s not clear to me IEEE demands fused used, i.e. not just available. It’s very unlikely it demands it used, but neither clear it demands it NOT used.

So it’s up to Julia, and it can and would still claim IEEE compatibility. I think there might be exceptions to Julia wanting to do this, e.g. for complex values. But otherwise I think we should do the same “on” the default (and I guess “fast” for CUDA), as above. But that also means we need some pragma equivalent to opt out. I suppose that would be a macro in Julia. I think we should do this to compete with C/C++ (in recent compilers; and Mojo also, I suppose it defaults to fused too).


That is not how floats work. They have an exact arithmetic with a limited number of “decimal” places. They just snap to the nearest representable value if you land somewhere that’s in between the values they can represent — but that value is an exact number. It’s not a “fuzzy” thing. There are many algorithms that explicitly depend upon this.