As I understand it, @fastmath looks at whatever functions you write and replaces them with fast but rough versions of those functions. For instance, rather than the proper exp function you get a faster and less accurate version of exp substituted into the code.
The speed benefits seem huge. In the trivial test case below I get about 5-10 times the speed (using the @simd, @. or @inbounds macros did not give any substantial speedup), and there is no substantial accuracy loss. It seems too good to be true, though. Has anyone seen large error accumulation in large codebases using @fastmath? When the performance tips say it may change numerical results, does that just mean it will fail regression tests with small tolerances, or can the errors be material? I guess materiality depends on what you are calculating, but in most cases I guess people are happy with 5 significant figures (and are calculating at a scale much larger than machine epsilon).
using Random

function do_complex_calculations(a::AbstractArray, b::AbstractArray, c::AbstractArray)
    len = length(a)
    result = zeros(len)
    for i in 1:len
        result[i] = (a[i]^7 + a[i]^6 - 100.3*b[i])/sqrt(tan(a[i]) + sin(a[i]) + c[i]^2 + 104.32)
    end
    return result
end

function do_complex_calculations_fastmath(a::AbstractArray, b::AbstractArray, c::AbstractArray)
    len = length(a)
    result = zeros(len)
    for i in 1:len
        @fastmath result[i] = (a[i]^7 + a[i]^6 - 100.3*b[i])/sqrt(tan(a[i]) + sin(a[i]) + c[i]^2 + 104.32)
    end
    return result
end
rows = Int(1e7)
a = rand(rows)
b = rand(rows)
c = rand(rows)
@time results1 = do_complex_calculations(a,b,c)
@time results2 = do_complex_calculations_fastmath(a,b,c)
sum(abs.(results1 .- results2))
# 2.823381455333576e-9
The main thing @fastmath does is allow breaking rules, like associativity, that floating-point numbers don't quite satisfy. On a small scale the error introduced will be quite small, but in general it can be arbitrarily large. For example, 1/(-1+big-big+1) can be either 1, -1, or infinity depending on the order in which you evaluate the denominator.
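To see that concretely, here is a small sketch (big = 1e300 is an arbitrary choice of a huge value, not from the post): the same four terms summed in three mathematically equivalent orders give three different denominators, because the small terms vanish next to big.

julia> big = 1e300;

julia> 1 / (((-1 + big) - big) + 1)   # left to right: the -1 is absorbed by big
1.0

julia> 1 / ((-1 + 1) + (big - big))   # small and huge terms cancel separately
Inf

julia> 1 / (((big + 1) - big) - 1)    # the +1 is absorbed by big instead
-1.0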
This isn't the same as @fastmath sinpi, but it is the same as if you applied @fastmath to the entire definition of sinpi.
I think many Base versions are guaranteed to be within 1 ulp (one floating-point number) of the correctly rounded result.
Fast functions may be within 3 or so? SLEEF provides functions with different error tolerances; their fast variants are off by at most 3.5 ulp (so the result can be a few floating-point numbers away from the correctly rounded one).
You can see a Julia port of SLEEF here, and the tolerance tests it uses.
In loops, with slight modifications, these can be made to be very fast by taking advantage of SIMD.
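If you want to check a claim like this for your own Julia version, one hedged way is to measure the @fastmath variant of a function against a BigFloat reference, in ulps. The choice of exp, the sample range, and the err_ulps helper below are just illustrative, not from the thread:

using Random

# Hypothetical helper: error of @fastmath exp relative to a BigFloat reference,
# in units of the spacing between Float64 values near the correctly rounded exp(x).
err_ulps(x) = abs(@fastmath(exp(x)) - exp(big(x))) / eps(exp(x))

xs = 10 .* randn(10^5)      # sample inputs over a moderate range
maximum(err_ulps.(xs))      # worst-case error in ulps (depends on Julia version and hardware)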
So basically, if you are sure you do not have any NaNs or Infs, then you can get a big speedup from @fastmath and will only be off by 3-4ish machine epsilons. So unless you were going to do something extremely sensitive to rounding error (like matrix inversion), there is no big downside to the faster algorithms.
Note that simply rearranging floating point arithmetic can be enough to yield any value whatsoever. Further, there isn’t a robust metric for what is acceptable for @fastmath (or C’s -ffast-math). The tradeoffs are rather arbitrary and dependent upon hardware capabilities.
I think the macro should be split up and killed — it’s one thing to sometimes return NaN, it’s another to reorder, and it’s yet another to arbitrarily change accuracy.
Is it true that you get within a few epsilons if you are sure that you never add small and huge numbers, like in @Oscar_Smith's example, so there is no catastrophic cancellation? It would be handy to have a checklist or something, so you can check that nothing will go wrong in your problem if you use @fastmath.
Such a checklist is impossible to come up with, since anything you could come up with would be tied to a particular compiler (Julia + LLVM) version. Any minor update can totally break it.
It's also in general impossible to tell whether you have any big or small numbers. If you aren't calling anyone else's code (including Base) it may be possible, but then there's also less need for @fastmath, since you can usually just do the transformation manually. If you are using someone else's code, then your analysis is again at best tied to the version you looked at; an otherwise non-breaking change can totally break your analysis.
Cancellation happens as soon as you subtract two numbers that are close to one another. This includes, for example, any algorithm where you compute some form of residual or error by subtracting two like quantities computed by different means.
In short, you do not have to have very large numbers in a computation for such issues to arise.
What’s true is that it is safe to add two numbers having the same order of magnitude (and with the same sign!). But that does not get you very far in general…
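As a hedged illustration (the formula and the input 1e-8 are a standard textbook example, not from this thread): computing (1 - cos(x)) / x^2 near zero subtracts two nearly equal numbers, and the significant digits are wiped out even though nothing here is "very large".

x = 1e-8

# Naive form: cos(x) rounds to exactly 1.0, so the subtraction cancels
# catastrophically and the result is 0.0 instead of ≈ 0.5.
(1 - cos(x)) / x^2

# Algebraically identical rewrite that avoids subtracting nearby numbers:
2 * sin(x / 2)^2 / x^2      # ≈ 0.5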
I get that catastrophic cancellation is a big problem in general. I think everyone already avoids that, though; for instance you would never naively compute exp(-50) * (exp(51) - exp(50)).
However if @fastmath only gave erroneous answers in cases you would already be avoiding then that would not be so bad. It appears though that @fastmath can give bad answers anywhere and so it is probably not a great idea to use it too much.
Another way to put it: @fastmath in Julia is mainly useful when you think the compiler can figure out something better than you can yourself. That by itself means it will be hard to find a case where you can be confident about the safety of @fastmath and where it is still helpful.
This is fairly different from C, where the language gives you fewer tools to play with. (For example, I don't think there's a way to express muladd in pure C.)
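As a hedged sketch of "doing the transformation manually" (the function names here are mine, not from the thread): instead of turning on all the fast-math relaxations just so LLVM may contract a*b + c into a fused multiply-add, Julia lets you state that intent directly with muladd.

# Relying on @fastmath to let LLVM contract the multiply-add:
f_fast(a, b, c) = @fastmath a * b + c

# Writing the intent explicitly: muladd allows (but does not require) an FMA,
# without enabling any of the other fast-math relaxations.
f_manual(a, b, c) = muladd(a, b, c)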
The case where I have used @fastmath before is on individual math functions, to avoid branches and allow better vectorization (@simd on sqrt, for example). Even in that case, I wrote the code to put @fastmath only on each function separately, being aware that it might break. https://github.com/JuliaLang/julia/pull/31862 is a step in the right direction, but as I commented in that PR there are a few more things that might be needed, and the implementation itself may not be as clean as it should be… (That PR might be all you need to get the speedup you are seeing, though.)
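A minimal sketch of that pattern (the function name and loop are illustrative, not from the PR): only the sqrt call is marked fast, so its domain check goes away and the loop can vectorize, while the surrounding arithmetic keeps IEEE semantics.

function sum_sqrt(x::AbstractVector{Float64})
    s = 0.0
    @inbounds @simd for i in eachindex(x)
        # Only sqrt is fast: a negative input gives NaN instead of a DomainError,
        # which removes the branch that otherwise blocks SIMD.
        s += @fastmath sqrt(x[i])
    end
    return s
end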
julia> function vdiv!(x, y, a)
           inva = inv(a)
           @inbounds for i ∈ eachindex(x, y)
               x[i] = y[i] * inva
           end
           x
       end
vdiv! (generic function with 1 method)

julia> function vdiv_fast!(x, y, a)
           @fastmath inva = inv(a)
           @inbounds for i ∈ eachindex(x, y)
               @fastmath x[i] = y[i] * inva
           end
           x
       end
vdiv_fast! (generic function with 1 method)
julia> using BenchmarkTools
julia> y = rand(200); x1 = similar(y); x2 = similar(y); a = 1.6;
julia> @benchmark vdiv!($x1, $y, $a)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     15.516 ns (0.00% GC)
  median time:      15.556 ns (0.00% GC)
  mean time:        16.261 ns (0.00% GC)
  maximum time:     46.570 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     998
julia> @benchmark vdiv_fast!($x2, $y, $a)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     111.777 ns (0.00% GC)
  median time:      112.296 ns (0.00% GC)
  mean time:        117.401 ns (0.00% GC)
  maximum time:     179.112 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     925
julia> x1 ≈ x2
true
The version without @fastmath does one division up front (inv(a), i.e. 1/a) and then multiplies each element of y by the result.
The @fastmath flags allow LLVM to "simplify" that multiply-by-reciprocal into dividing each y[i] by a directly inside the loop.
However, division is many times slower than multiplication, so dividing just once (before the loop) and then multiplying inside the loop is much faster. That is why the @fastmath version benchmarks slower here.
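In other words, the @fastmath loop above ends up behaving roughly like the following hand-written version (a sketch for illustration; vdiv_direct! is not from the original post):

# Roughly what LLVM turns the @fastmath loop into: one (slow) division per
# element, instead of one division up front and a cheap multiply per element.
function vdiv_direct!(x, y, a)
    @inbounds for i ∈ eachindex(x, y)
        x[i] = y[i] / a
    end
    x
end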
The @fastmath macro basically just wraps features offered by LLVM. LLVM offers flags called nnan, ninf, nsz, and arcp, and a convenient flag fast that implies the former four (https://llvm.org/docs/LangRef.html#id1214).
Unfortunately, it’s a bit tedious to set these flags manually; currently, one has to write a C++ function that generates code for a machine instruction with this flag, make it visible to Julia, and write Julia wrappers for these. That’s why there is only @fastmath, and not the individual @no_nan_math, @no_inf_math, @no_signed_zero_math, and @allow_reciprocals_math. (Of course, flags can be combined, so there are 16 possibilities…)
With a bit of guidance from someone more familiar with handling intrinsic functions in Julia, we could easily do this. Maybe @llvmcall would work – I don’t know.
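For reference, a rough sketch of what setting a single flag via Base.llvmcall might look like (this is my assumption of the approach, not something confirmed in the thread): a Float64 addition carrying only the nnan flag.

# Hypothetical example: an addition marked only with LLVM's `nnan` fast-math flag,
# written directly in LLVM IR via llvmcall.
add_nnan(x::Float64, y::Float64) = Base.llvmcall(
    """
    %z = fadd nnan double %0, %1
    ret double %z
    """,
    Float64, Tuple{Float64,Float64}, x, y)

One would then need a definition like this (or a generated function) for every combination of operation and flags, which is exactly the tedium described above.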