x * y + z does not automatically use FMA instruction

I am new to Julia and was wondering why the code x * y + z does not use fmadd when all values are floats, but instead emits fmul followed by fadd. On the other hand, if I use the @fastmath macro, the @code_native output changes to use the fmadd instruction. My question is, if FMA is faster and more precise, why won’t it use fmadd to begin with on supported machines?

Machine: Apple Silicon M1 on macOS Ventura 13.4
Julia: 1.9.1
Optimization: -O3 and -O2

Because it gives a different result. You can explicitly tell the compiler that you are fine with that with @fastmath or by using either the muladd or fma function. See also the recent discussion How to enable vectorized fma instruction for multiply-add vectors?.
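
For example, a quick sketch of the explicit opt-ins (whether muladd and @fastmath actually end up emitting an fmadd here depends on your CPU and on what LLVM decides):

julia> x, y, z = 1.0, 2.0, 3.0;

julia> muladd(x, y, z)      # may be fused into a single fma instruction when profitable
5.0

julia> fma(x, y, z)         # always uses fused semantics (one rounding), emulated if necessary
5.0

julia> @fastmath x * y + z  # lets the compiler rewrite the expression, including contraction
5.0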

8 Likes

@GunnarFarneback Thank you for your reply and the link.

I was under the impression that because rounding only occurs at the end as opposed to at each step, fma is more precise. So is it that, because not all machines have native fma instructions, it is better to default to behavior that will be consistent across all machines?

No, I think the reasoning is that fma does not comply with the floating point standard IIUC

1 Like

@lrnv I’m not familiar with this standard, do you have a link?

This part of the Julia docs details the behavior: Integers and Floating-Point Numbers · The Julia Language

On this page, it is said that Julia floats comply with the IEEE 754 standard, with a link to this Wikipedia page: IEEE 754-2008 revision - Wikipedia

2 Likes

This is a misunderstanding. IEEE 754 specifies FMA.

Yeah, FMA is more accurate, but the added accuracy may not be what the user wants. C/C++ compilers have flags to control this behavior (-ffp-contract in GCC and Clang).
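
A small Float64 illustration of the kind of difference this is about (a classic textbook case, not values from this thread):

julia> 0.1 * 10.0 - 1.0        # the product rounds to exactly 1.0, so the difference is lost
0.0

julia> fma(0.1, 10.0, -1.0)    # a single rounding preserves the fact that 0.1 is not exactly 1/10
5.551115123125783e-17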

3 Likes

That corroborates what I had found while digging around a bit.

I understand, thanks.

2 Likes

More often than not, @fastmath will make code more accurate, e.g. by using fma instructions and multiple separate accumulators in reductions.

The problem is getting a result different from what you explicitly wrote, not accuracy.

On average, @fastmath also makes code much less accurate, because of a small number of catastrophic cases where code relies on being executed exactly as written.

4 Likes

Neither GCC nor Clang needs explicit flags to use FMA instructions.

I am personally in favor of automatic FMA, but enough people oppose it that I don’t expect it to happen.

4 Likes

Is that not because you specified -march=? On its own, -O3 is not enough to use FMA.

I think Clang used to require setting -ffp-contract for this; however, they changed the default a year or so ago. Currently -ffp-contract=on is the default.

2 Likes

We may want to do that in Julia if we add an easy way to back off from it.

1 Like

Haswell CPUs have FMA. Older ones don’t.
Code that can run on older CPUs thus can’t use it.

Julia starts with the equivalent of -march=native by default.
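
If you want to check what your own session targets and emits, something like this works (a sketch; g is just a throwaway helper, the CPU name and the instructions you see will differ per machine, and "apple-m1" is only what I’d expect on the OP’s hardware):

julia> Sys.CPU_NAME                      # the target CPU Julia/LLVM picked for this session
"apple-m1"

julia> g(x, y, z) = muladd(x, y, z)
g (generic function with 1 method)

julia> @code_native debuginfo=:none g(1.0, 2.0, 3.0)

On an FMA-capable CPU (AArch64, or x86 from Haswell on) the native code for g should contain a single fused instruction (fmadd on ARM, a vfmadd variant on x86); on older x86 it falls back to a separate multiply and add.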

1 Like

We already have --math-mode=ieee.
But I think we should add a local override, too.

2 Likes

FMA is probably not used for (outdated) historical reasons. What Julia currently does, and what the IEEE standard allows, i.e. no FMA, can result in catastrophic cancellation (loss of all accuracy) that FMA would avoid:

Let’s consider an example to illustrate how the FMA operation works using decimal arithmetic first for clarity. […] the correct mathematical result is [not zero, but close, and zero result would be very bad]

Rounding the multiply and add separately yields a result that is off by 0.00064. The corresponding FMA computation is wrong by only 0.00004, and its result is closest to the correct mathematical answer. […]

Figure 1 shows CUDA C++ code and output corresponding to inputs A and B and operations from the example above. The code is executed on two different hardware platforms: an x86-class CPU using SSE in single precision, and an NVIDIA GPU with compute capability 2.0. At the time this paper is written (Spring 2011) there are no commercially available x86 CPUs which offer hardware FMA. Because of this, the computed result in single precision in SSE would be 0.

That seems like a good reason, but that different FMA result is (usually) more accurate, sometimes much better, especially with Float32 (which you want to use for speed), and I guess Float16 too. Given this, I see why FMA is now the default in C compilers:

The C standard permits intermediate floating-point results within an expression to be computed with more precision than their type would normally allow. This permits operation fusing, and Clang takes advantage of this by default

Since it’s faster (and more accurate), and we want to compete with those other compilers, I support changing the default. On non-FMA hardware it’s going to be way slower, though (if we insist on the same result via emulation), but that seems ok, as such hardware is increasingly rare and, in my view, already obsolete.

I’m not sure FMA by default contradicts IEEE. I believe IEEE by now demands that FMA be available (it’s mentioned in that 2008 revision, Clause 5: Operations, as a new operation, though it might still be optional), but it doesn’t demand its use. If I’m wrong that using it is usually more accurate, i.e. if it’s sometimes less accurate (or catastrophically so?), then I also support the local opt-out suggested above.

You had the global --math-mode=fast option, but it’s dangerous (in Julia at least, which is why my PR to make it a no-op was merged; I think it’s effectively the default with -O3 in other languages, not Julia, and they get away with it). I’m actually not sure why that is; was it because it enabled more than just FMA? I assume FMA alone by default is mostly ok, which is why I support it as the default, likely together with that possible local opt-out. The standard library (and all of the current package ecosystem) could be made to work with the new default, fixed with an opt-out where needed. There’s already an issue about this on JuliaLang.

1 Like

No, Gunnar already gave the right reason.

1 Like

Consider things like Kahan summation algorithm - Wikipedia
Translating their algorithm from pseudo-code into Julia:

julia> using AccurateArithmetic, BenchmarkTools, StableRNGs   # BenchmarkTools provides @btime, used below

julia> function KahanSum(input)
           sum = zero(eltype(input))                    # Prepare the accumulator.
           c = zero(eltype(input))                      # A running compensation for lost low-order bits.
       
           for i = eachindex(input)
               y = input[i] - c         # c is zero the first time around.
               t = sum + y              # Alas, sum is big, y small, so low-order digits of y are lost.
               c = (t - sum) - y            # (t - sum) cancels the high-order part of y; subtracting y recovers negative (low part of y)
               sum = t                      # Algebraically, c should always be zero. Beware overly-aggressive optimizing compilers!
           end                           # Next time around, the lost low part will be added to y in a fresh attempt.
       
           return sum
       end
KahanSum (generic function with 1 method)

julia> function KahanSumFast(input)
           sum = zero(eltype(input))                    # Prepare the accumulator.
           c = zero(eltype(input))                      # A running compensation for lost low-order bits.
       
           @fastmath for i = eachindex(input)
               y = input[i] - c         # c is zero the first time around.
               t = sum + y              # Alas, sum is big, y small, so low-order digits of y are lost.
               c = (t - sum) - y            # (t - sum) cancels the high-order part of y; subtracting y recovers negative (low part of y)
               sum = t                      # Algebraically, c should always be zero. Beware overly-aggressive optimizing compilers!
           end                           # Next time around, the lost low part will be added to y in a fresh attempt.
       
           return sum
       end
KahanSumFast (generic function with 1 method)

julia> rng = StableRNG(1); x = 5000 |> N->randn(rng, N) .* exp.(10 .* randn(rng, N)) |> x->[x;-x;1.0] |> x->x[sortperm(rand(rng,length(x)))];

julia> foldl(+, x)
4.554186716906959

julia> sum(x)
1.875

julia> sum(big, x)
1.0

julia> KahanSum(x)
1.0893429687695757

julia> KahanSumFast(x)
0.985827341906959

julia> sum_kbn(x)
1.0

julia> sum_oro(x)
1.0

julia> function sum_fast(input)
           sum = zero(eltype(input))
           @fastmath for i = eachindex(input)
               @inbounds sum += input[i]
           end
           return sum
       end
sum_fast (generic function with 1 method)

julia> sum_fast(x)
0.985827341906959

If floating point arithmetic were associative, we could transform

y = input[i] - c
t = sum + y
c = (t - sum) - y
sum = t

into

y = input[i] - c
t =  sum + y
c = (sum + y - sum) - y = 0
sum = t 

into

sum += input[i] 

That is, fast math essentially lets the compiler delete the code that tries to track and compensate for accumulating floating point error.

However, because fast math allows the compiler to change the result, it also applies a transform – multiple separate accumulators – that massively decreases the floating point error. Hence KahanSumFast – while equivalent to the naive @fastmath loop – is way more accurate (and faster) than foldl(+, x), and not actually much worse than Kahan summation for the example above.
AccurateArithmetic.jl is much better than the naive Kahan summation, combining the compensation of error with separate parallel accumulators (which of course also makes it much faster than the naive Kahan summation).
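
To make “multiple separate accumulators” concrete, here is roughly the shape of the transform that @fastmath (or @simd) permits, written out by hand – a sketch of the idea, not the code the compiler actually generates:

# Four independent accumulators, combined at the end. Reassociating the sum like
# this exposes SIMD/instruction-level parallelism, and each accumulator performs
# only a quarter of the additions, so the worst-case rounding error also grows more slowly.
function sum_unrolled4(input)
    s1 = s2 = s3 = s4 = zero(eltype(input))
    i = firstindex(input)
    while i + 3 <= lastindex(input)
        @inbounds begin
            s1 += input[i]
            s2 += input[i+1]
            s3 += input[i+2]
            s4 += input[i+3]
        end
        i += 4
    end
    while i <= lastindex(input)      # remainder loop for leftover elements
        @inbounds s1 += input[i]
        i += 1
    end
    return (s1 + s2) + (s3 + s4)
end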

julia> @btime KahanSum($x)
  45.747 μs (0 allocations: 0 bytes)
1.0893429687695757

julia> @btime KahanSumFast($x) # not actually a Kahan sum at all!
  511.698 ns (0 allocations: 0 bytes)
0.985827341906959

julia> @btime sum_kbn($x) # kahan sum with parallel accumulators
  1.722 μs (0 allocations: 0 bytes)
1.0

julia> @btime sum_oro($x) # Ogita–Rump–Oishi with parallel accumulators
  1.505 μs (0 allocations: 0 bytes)
1.0

Thanks to the parallel accumulators, AccurateArithmetic is only around 3x slower than the straight @fastmath sum, instead of 90x slower like the naive Kahan summation described in the Wikipedia article.

Anyway, to answer your question:
There are error-free transforms like in this article. @fastmath can undo them, like I showed. In this particular example, it isn’t bad, because @fastmath also allows applying another transform that dramatically improves both accuracy and speed, which isn’t allowed without it (@simd also allows this).
But in other cases, such as implementing functions like exp, we lose a lot of accuracy with @fastmath, because it only undoes the compensation!

In the case of languages like C, C++, Fortran, etc, a global @fastmath isn’t as bad, because functions like exp are replaced with only slightly less accurate implementations that have already been compiled with their error compensations intact (they just have less of it than the more accurate variants). That is, the “fast” exp you’re calling from C with fast math was itself not compiled under fast math.

Similarly, in Julia, @fastmath exp(x) will call a faster and less accurate exp implementation, but that implementation wasn’t itself compiled under fast math.
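
You can see that substitution directly (the exact method name could differ between Julia versions, but on 1.9 I’d expect):

julia> @macroexpand @fastmath exp(x)
:(Base.FastMath.exp_fast(x))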

Implementing functions like these is a very delicate process. The authors of these libraries (like @Oscar_Smith) are very deliberate about floating point rounding and error, and would thus need strict IEEE adherence for their implementations.

17 Likes

Using FMA automatically can cause surprises. Consider the following:

function f(a,b,c)
    @assert a*b ≥ c
    return sqrt(a*b-c)
end

a = 1.0 + 0.5^27
b = 1.0 - 0.5^27
c = 1.0
f(a,b,c)

It works as expected. Now suppose the compiler automatically uses FMA inside f, like this:

function f(a,b,c)
    @assert a*b ≥ c
    return sqrt(fma(a,b,-c))
end

Now f(a,b,c) throws an exception, because fma(a,b,-c) is negative, despite the assertion a*b ≥ c.
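
For reference, this is roughly what those values do at the REPL (Float64; error message abbreviated):

julia> a = 1.0 + 0.5^27; b = 1.0 - 0.5^27; c = 1.0;

julia> a * b                 # the exact product 1 - 2^-54 rounds to 1.0
1.0

julia> a * b - c             # so the separately rounded version is exactly zero
0.0

julia> fma(a, b, -c)         # the fused version keeps the tiny negative remainder
-5.551115123125783e-17

julia> sqrt(fma(a, b, -c))
ERROR: DomainError with -5.551115123125783e-17:
[...]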

7 Likes

Thanks for the example.

I guess my question is why does replacing fma with muladd work?

function f(a,b,c)
    @assert a*b ≥ c
    return sqrt(muladd(a,b,-c))
end

What is the difference between the two functions?

Also, using @fastmath works:

function f(a,b,c)
    @assert a*b ≥ c
    return @fastmath sqrt(a*b-c)
end

1 Like