# @fastmath macro accuracy

As I understand it, `@fastmath` looks at whatever functions you write and replaces them with fast-but-rough versions. For instance, rather than the proper `exp` function, you get a faster and less accurate version of `exp` substituted into the code.

The speed benefits seem huge. In the trivial test case below I get about 5-10 times the speed (while the `@simd`, `@.` and `@inbounds` macros did not give any substantial speedup), and there is no substantial accuracy loss. It seems too good to be true, though. Has anyone seen large error accumulation in large codebases using `@fastmath`? When the performance tips say it may change numerical results, does this just mean it will fail regression tests with small tolerances, or can the errors be material? I guess materiality depends on what you are calculating, but for most cases I guess people are happy with 5 significant figures (and are calculating at a scale much larger than machine epsilon).

```julia
using Random

function do_complex_calculations(a::AbstractArray, b::AbstractArray, c::AbstractArray)
    len = length(a)
    result = zeros(len)
    for i in 1:len
        result[i] = (a[i]^7 + a[i]^6 - 100.3*b[i])/sqrt(tan(a[i]) + sin(a[i]) + c[i]^2 + 104.32)
    end
    return result
end

function do_complex_calculations_fastmath(a::AbstractArray, b::AbstractArray, c::AbstractArray)
    len = length(a)
    result = zeros(len)
    for i in 1:len
        @fastmath result[i] = (a[i]^7 + a[i]^6 - 100.3*b[i])/sqrt(tan(a[i]) + sin(a[i]) + c[i]^2 + 104.32)
    end
    return result
end

rows = Int(1e7)
a = rand(rows)
b = rand(rows)
c = rand(rows)

@time results1 = do_complex_calculations(a, b, c)
@time results2 = do_complex_calculations_fastmath(a, b, c)
sum(abs.(results1 .- results2))
# 2.823381455333576e-9
```

The main thing `@fastmath` does is allow breaking rules, like associativity, that floating-point numbers don’t quite satisfy. On a small scale the error introduced will be quite small, but it can be arbitrarily large in general. For example, `1/(-1+big-big+1)` (where `big` is some huge value) can be 1, -1, or infinity depending on the order in which you evaluate the denominator.
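To make that concrete, here is the example written out with explicit parenthesizations, using `1e300` as the "big" value (the name `huge` is just illustrative, chosen to avoid shadowing `Base.big`):

```julia
# The same four terms, three groupings, three different answers
# for the denominator of 1/(-1 + huge - huge + 1):
huge = 1e300                      # big enough that adding ±1 is absorbed

d1 = ((-1 + huge) - huge) + 1     # 1.0:  the -1 is absorbed, then huge cancels
d2 = (-1 + (huge - huge)) + 1     # 0.0:  huge cancels first, then -1 + 1 = 0
d3 = -1 + ((huge + 1) - huge)     # -1.0: the +1 is absorbed instead

(1/d1, 1/d2, 1/d3)                # (1.0, Inf, -1.0)
```

All three groupings are mathematically equal, so a fast-math compiler is free to pick any of them.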


Start Julia with `--math-mode=fast`:

```julia
julia> sinpi(1.2)
0.0

julia> @code_llvm sinpi(1.2)
;  @ special/trig.jl:751 within `sinpi'
define double @julia_sinpi_1678(double %0) {
top:
  %1 = bitcast double %0 to i64
  %2 = and i64 %1, -9223372036854775808
  %3 = bitcast i64 %2 to double
  ret double %3
}
```

That is indeed faster than normal!
(Un)fortunately, BenchmarkTools doesn’t seem to respect the flag:

```julia
julia> @btime sinpi(1.2)
  1.127 ns (0 allocations: 0 bytes)
-0.587785252292473
```

This isn’t the same as `@fastmath sinpi`, but it is the same as if you had `@fastmath`ed the entire definition of `sinpi`.

I think many Base versions are guaranteed to be within 1 ulp of the correctly rounded result.
Fast functions may be within 3 or so?
SLEEF provides functions with different error bounds. Their fast variants are off by at most 3.5 ulps (i.e. the result may land on any of a few neighbouring floating-point numbers).

You can see a Julia port of SLEEF here, and the tolerance tests it uses.

In loops, with slight modifications, these can be made to be very fast by taking advantage of SIMD.


Another effect is that some errors are avoided (and perhaps there are other changes to how `NaN` & `Inf` are treated):

```julia
julia> @fastmath (-1.5)^0.5
NaN

julia> @fastmath log(-1.5)
NaN
```
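For contrast, the ordinary versions of the same calls do not return `NaN` at all — they throw. A small sketch (the `try`/`catch` is just there to show the error type):

```julia
# Without @fastmath, a negative base with a non-integer exponent is a
# domain error rather than a NaN:
err = try
    (-1.5)^0.5
catch e
    e
end

err isa DomainError   # true; log(-1.5) throws a DomainError as well
```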

So basically, if you are sure you do not have any NaNs or Infs, then you can get a big speedup from `@fastmath` and will only be off by 3-4ish machine epsilons. So unless you are doing something extremely sensitive to rounding error (like matrix inversion), there is no big downside to the faster algorithms.

No, this is not true.


I am actually surprised at the speedup you get here. In the cases I’ve tested before, it usually doesn’t make a big difference.

Note that simply rearranging floating point arithmetic can be enough to yield any value whatsoever. Further, there isn’t a robust metric for what is acceptable for `@fastmath` (or C’s `-ffast-math`). The tradeoffs are rather arbitrary and dependent upon hardware capabilities.
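A minimal illustration of how reordering alone changes the result (explicit parentheses, no `@fastmath` involved — the values are chosen for illustration):

```julia
# Identical terms, different grouping, different answers:
a, b = 1.0, 1e-16          # b is smaller than eps(1.0)/2 ≈ 1.1e-16

left  = (a + b) - a        # 0.0  : b is absorbed when added to a first
right = (a - a) + b        # 1e-16: b survives if the large terms cancel first

left == right              # false
```

Here the relative error of `left` is 100%, from nothing more than reassociating an addition — exactly the kind of rewrite fast-math flags permit.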

I think the macro should be split up and killed — it’s one thing to sometimes return NaN, it’s another to reorder, and it’s yet another to arbitrarily change accuracy.


It is a pretty contrived example.

Is it true that you get within a few epsilons if you are sure that you never add small and huge numbers, like in @Oscar_Smith’s example, so there is no catastrophic cancellation? It would be handy to have a checklist or something so you can check that nothing will go wrong in your problem if you use @fastmath.

That is an interesting link. I was not aware reordering summation terms can change the answer that much.

Such a checklist is impossible to come up with, since anything you could come up with would be tied to a particular compiler (Julia + LLVM) version. Any minor update could totally break it.

It’s also in general impossible to tell whether you have any big or small numbers. If you aren’t calling anyone else’s code (including Base) it is possible, but then there’s also less need for `@fastmath`, since you can usually just do the transformation manually. If you are using someone else’s code, then your analysis is again at best tied to the version you looked at; an otherwise non-breaking change can totally break your analysis.


Cancellations happen as soon as you subtract two numbers that are close to one another. This includes, for example, any algorithm where you compute a form of residual or error by subtracting two like quantities computed by different means.

In short, you do not have to have very large numbers in a computation for such issues to arise.

What’s true is that it is safe to add two numbers having the same order of magnitude (and with the same sign!). But that does not get you very far in general…
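A textbook instance of this, independent of `@fastmath`: computing `1 - cos(x)` for small `x` subtracts two nearly equal numbers, while the algebraically identical `2*sin(x/2)^2` avoids the cancellation entirely:

```julia
x = 1e-8                   # true value of 1 - cos(x) is ≈ x^2/2 = 5e-17

naive  = 1 - cos(x)        # 0.0: cos(x) rounds to exactly 1.0, every digit lost
stable = 2 * sin(x/2)^2    # ≈ 5.0e-17: no subtraction of close numbers
```

Note there isn’t a single large number anywhere in this computation — the inputs are all of modest size, yet the naive form loses all significant digits.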


I get that catastrophic cancellation is a big problem in general. I think everyone already avoids that, though — for instance, you would never naively compute `exp(-50) * (exp(51) - exp(50))`.
However if @fastmath only gave erroneous answers in cases you would already be avoiding then that would not be so bad. It appears though that @fastmath can give bad answers anywhere and so it is probably not a great idea to use it too much.

Another way to put it: `@fastmath` in Julia is mainly useful when you think the compiler can figure out something better than you can yourself. That by itself means it will be hard to find a case where you can be confident in the safety of `@fastmath` and where it is still helpful.

This is fairly different from C, where the language gives you fewer tools to play with. (For example, I don’t think there’s a way to express `muladd` in pure C).
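For reference, `muladd(a, b, c)` in Julia means "compute `a*b + c`, fusing into a single fma instruction *if* that is faster" — an opt-in contraction with none of the other fast-math relaxations. The guaranteed-fused `fma` shows the rounding difference (the constants here are chosen just to make the lost bit visible):

```julia
a = 1 + 2.0^-27
b = 1 + 2.0^-27
c = -1.0

# a*b = 1 + 2^-26 + 2^-54 exactly; a separate multiply rounds off the 2^-54
separate = a * b + c        # 2^-26
fused    = fma(a, b, c)     # 2^-26 + 2^-54 (one rounding, at the very end)

separate == fused           # false
```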

The case where I have used `@fastmath` before is on math functions, to avoid branches and allow better vectorization (`@simd` on `sqrt`, for example). Even in that case, I wrote the code to only put `@fastmath` on each function separately, being aware that it might break. https://github.com/JuliaLang/julia/pull/31862 is a step in the right direction, but as I commented in that PR there are also a few more that might be needed, and the implementation itself may not be as clean as it should be… (That PR might be all you need to get the speedup you are seeing, though.)
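A sketch of that pattern — `@fastmath` applied only to the `sqrt` call inside a `@simd` loop (the function name and setup here are illustrative, not from the post):

```julia
function vsqrt!(out, x)
    @inbounds @simd for i in eachindex(out, x)
        # Plain sqrt carries a check that throws DomainError for negative
        # inputs; @fastmath on just this call drops that branch so the
        # loop can vectorize.  Only safe if x is known to be non-negative!
        out[i] = @fastmath sqrt(x[i])
    end
    return out
end

x = rand(10^4)
vsqrt!(similar(x), x) ≈ sqrt.(x)   # true for non-negative input
```

Keeping the macro on the single call, rather than the whole function, limits how much the compiler is allowed to rewrite.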


Here is a fun `@fastmath` example:

```julia
julia> function vdiv!(x, y, a)
           inva = inv(a)
           @inbounds for i ∈ eachindex(x, y)
               x[i] = y[i] * inva
           end
           x
       end
vdiv! (generic function with 1 method)

julia> function vdiv_fast!(x, y, a)
           @fastmath inva = inv(a)
           @inbounds for i ∈ eachindex(x, y)
               @fastmath x[i] = y[i] * inva
           end
           x
       end
vdiv_fast! (generic function with 1 method)

julia> using BenchmarkTools

julia> y = rand(200); x1 = similar(y); x2 = similar(y); a = 1.6;

julia> @benchmark vdiv!($x1, $y, $a)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     15.516 ns (0.00% GC)
  median time:      15.556 ns (0.00% GC)
  mean time:        16.261 ns (0.00% GC)
  maximum time:     46.570 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     998

julia> @benchmark vdiv_fast!($x2, $y, $a)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     111.777 ns (0.00% GC)
  median time:      112.296 ns (0.00% GC)
  mean time:        117.401 ns (0.00% GC)
  maximum time:     179.112 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     925

julia> x1 ≈ x2
true
```

The version without `@fastmath` does one (expensive) division (`inv(a)`, i.e. `1/a`) to get the inverse of `a`, and then multiplies each element of `y` by the result.
The `@fastmath` flag allows LLVM to “simplify” `y[i] * inva` back into dividing `y[i]` by `a` directly.

However, division is many times slower than multiplication. So dividing just once (before the loop) and then multiplying inside the loop is much faster.


Oh, funny! Sure seems like LLVM should know that division is so costly.


That’s something that should be reported to LLVM… Should be a fun bug report…


The `@fastmath` macro basically just wraps features offered by LLVM. LLVM offers flags called `nnan`, `ninf`, `nsz`, and `arcp`, plus a convenient flag `fast` that implies the former four: https://llvm.org/docs/LangRef.html#id1214.

Unfortunately, it’s a bit tedious to set these flags manually; currently, one has to write a C++ function that generates code for a machine instruction with this flag, make it visible to Julia, and write Julia wrappers for these. That’s why there is only `@fastmath`, and not the individual `@no_nan_math`, `@no_inf_math`, `@no_signed_zero_math`, and `@allow_reciprocals_math`. (Of course, flags can be combined, so there are 16 possibilities…)

With a bit of guidance from someone more familiar with handling intrinsic functions in Julia, we could easily do this. Maybe `@llvmcall` would work – I don’t know.


And much more I believe.

I think it is just e.g.

```julia
julia> function add_fast(x::Float64, y::Float64)
           Base.llvmcall("""
               %3 = fadd fast double %0, %1
               ret double %3
               """,
               Float64, Tuple{Float64, Float64}, x, y)
       end
add_fast (generic function with 1 method)
```