@fastmath switched off in Julia 0.7-Dev, deliberately?

Sorry, I meant to be Socratic, not rude.

If you’d like to read a long discussion, there’s loads of good stuff here:

Here are the LLVM docs on fastmath:
https://llvm.org/docs/LangRef.html#fast-math-flags

Basically, it turns on a lot of possible optimizations. A few highlights:

  • Approximate functions - Allow substitution of approximate calculations for functions (sin, log, sqrt, etc). See floating-point intrinsic definitions for places where this can apply to LLVM’s intrinsic math functions.
  • Allow Reciprocal - Allow optimizations to use the reciprocal of an argument rather than perform division.
  • Allow floating-point contraction (e.g. fusing a multiply followed by an addition into a fused multiply-and-add).
  • Allow reassociation transformations for floating-point instructions. This may dramatically change results in floating-point.

Doing things like this can change the results of floating-point calculations, so they're not allowed by default.
For many of these (the approximate functions being an exception), they are transformations you could have applied by hand yourself.
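
For instance, reassociation alone is enough to change an answer. A tiny, contrived example (not from the discussion above, just to make the point concrete):

julia> (1e20 - 1e20) + 1.0
1.0

julia> 1e20 + (-1e20 + 1.0)
0.0

Mathematically these are the same sum; once the compiler is free to regroup it, the answer you get depends on the grouping.
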
If you wrote

out1 += a/w + a/x + a/y + a/z
out2 += b/x + b/y + b/z + b/w
out3 += c/y + c/z + c/w + c/x
out4 += d/z + d/w + d/x + d/y

You could have written this instead:

rw = 1/w
rx = 1/x
ry = 1/y
rz = 1/z
out1 = (((out1 + rw*a) + rx*a) + ry*a) + rz*a
out2 = (((out2 + rw*b) + rx*b) + ry*b) + rz*b
out3 = (((out3 + rw*c) + rx*c) + ry*c) + rz*c
out4 = (((out4 + rw*d) + rx*d) + ry*d) + rz*d

The latter is going to be faster. Looking only at the arithmetic operations, on a Haswell processor:

“Reciprocal throughput” (RT) is basically the average number of clock cycles per instruction when you’re executing a long stream of independent instructions of the same kind.

fdiv has an RT of 8-18 clock cycles. This is a single division.
vdivpd takes 16-28. This is 4 divisions at once.

Most data movement and shuffling instructions have an RT of about 1.

A fused multiply-add instruction has an RT of only 0.5.

Both versions are mathematically equivalent, but not numerically: they will generally give you slightly different answers.
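
Back-of-the-envelope, counting only the divisions and fused multiply-adds with the RTs above (a deliberately crude estimate; see the caveat below):

Version 1: 16 scalar divisions, roughly 16 * 8 = 128 cycles of divider throughput.
Version 2: 4 divisions + 16 fused multiply-adds, roughly 4 * 8 + 16 * 0.5 = 40 cycles, or less if the divisions get packed into a single 4-wide division.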

Also worth noting: many instructions don’t conflict with one another (they can run on different execution ports), so you can’t just add RTs together to guesstimate how long something will take. Processors pipeline, execute out of order, etc…
I only started reading about this recently, while working on optimizing matrix multiplication. Many here will know this much better than I do.

You can see how much I know when actually trying to benchmark the above lol (these tests are on a Ryzen processor, which is fairly similar to Haswell, with the notable difference that it has an RT of 1 for fma instead of 0.5):

julia> function test1(out, A, W)
           out1, out2, out3, out4 = out
           a, b, c, d = A
           w, x, y, z = W
           out1 += a/w + a /x  + a/y + a/z 
           out2 += b/x + b/y + b/z + b/w
           out3 += c/y + c/z + c/w + c/x
           out4 += d/z + d/w + d/x + d/y
           (out1, out2, out3, out4)
       end
test1 (generic function with 1 method)

julia> function test1_fm(out, A, W)
           @fastmath begin
               out1, out2, out3, out4 = out
               a, b, c, d = A
               w, x, y, z = W
               out1 += a/w + a /x  + a/y + a/z 
               out2 += b/x + b/y + b/z + b/w
               out3 += c/y + c/z + c/w + c/x
               out4 += d/z + d/w + d/x + d/y
           end
           (out1, out2, out3, out4)
       end
test1_fm (generic function with 1 method)

julia> function test2(out, A, W)
           out1, out2, out3, out4 = out
           a, b, c, d = A
           w, x, y, z = W
           rw = 1/w
           rx = 1/x
           ry = 1/y
           rz = 1/z
           out1 = (((out1 + rw*a) + rx*a) + ry*a) + rz*a
           out2 = (((out2 + rw*b) + rx*b) + ry*b) + rz*b
           out3 = (((out3 + rw*c) + rx*c) + ry*c) + rz*c
           out4 = (((out4 + rw*d) + rx*d) + ry*d) + rz*d
           (out1, out2, out3, out4)
       end
test2 (generic function with 1 method)

julia> function test2_fm(out, A, W)
           @fastmath begin
               out1, out2, out3, out4 = out
               a, b, c, d = A
               w, x, y, z = W
               rw = 1/w
               rx = 1/x
               ry = 1/y
               rz = 1/z
               out1 = (((out1 + rw*a) + rx*a) + ry*a) + rz*a
               out2 = (((out2 + rw*b) + rx*b) + ry*b) + rz*b
               out3 = (((out3 + rw*c) + rx*c) + ry*c) + rz*c
               out4 = (((out4 + rw*d) + rx*d) + ry*d) + rz*d
           end
           (out1, out2, out3, out4)
       end
test2_fm (generic function with 1 method)

julia> out = ntuple(i -> randn(), 4)
(0.36465241072570437, -0.8177038358601129, -0.6176666333370491, 0.6337676331752973)

julia> A = ntuple(i -> randn(), 4)
(-3.3341832981886057, 1.239717550334394, 0.005370551858030382, -0.05803825677944311)

julia> W = ntuple(i -> randn(), 4)
(-1.0521219836695515, -0.7005568413173827, -0.7599365767331884, -1.05623323338091)

julia> using BenchmarkTools

julia> @btime test1($out, $A, $W)
  8.765 ns (0 allocations: 0 bytes)
(15.837116493227779, -6.5706824679638975, -0.6425889796194907, 0.9030974099511657)

julia> @btime test1_fm($out, $A, $W)
  8.705 ns (0 allocations: 0 bytes)
(15.837116493227779, -6.5706824679638975, -0.6425889796194907, 0.9030974099511657)

julia> @btime test2($out, $A, $W)
  6.642 ns (0 allocations: 0 bytes)
(15.83711649322778, -6.5706824679638975, -0.6425889796194907, 0.9030974099511657)

julia> @btime test2_fm($out, $A, $W)
  4.418 ns (0 allocations: 0 bytes)
(15.83711649322778, -6.5706824679638975, -0.6425889796194908, 0.9030974099511657)

julia> versioninfo()
Julia Version 0.7.0-alpha.94
Commit 4420717* (2018-06-12 20:26 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD Ryzen Threadripper 1950X 16-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.0 (ORCJIT, znver1)

I expected @fastmath not to help version 2 much, as I thought I did a decent job optimizing it, but that it would drastically improve version 1.
Instead, it didn’t help version 1 at all. Version 2 was faster, and improved even further via @fastmath.

Too bad version 1 didn’t get any faster, since your typical scientist or statistician isn’t going to want to look at tables of instruction throughputs.

I like to use @fastmath in generated functions. Whenever I didn’t choose the order of my operations for some good reason, but either arbitrarily or because I thought “this order would be fastest”, I like to try @fastmath and let the compiler make its much more educated decisions about what order is actually fastest.

That is when it pays off.
In your case, there was nothing @fastmath could do.
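
For illustration, here is a minimal sketch of that pattern (the function and its names are hypothetical, made up for this post): build the expression in whatever order is convenient, and let @fastmath pick the grouping.

# Hypothetical example: a naive left-to-right dot product over NTuples.
# The generated expression is written in the "obvious" order; @fastmath
# gives LLVM license to reassociate and contract it however it likes.
@generated function fastdot(x::NTuple{N,T}, y::NTuple{N,T}) where {N,T}
    ex = :(x[1] * y[1])
    for i in 2:N
        ex = :($ex + x[$i] * y[$i])
    end
    return :(@fastmath $ex)
end

Calling fastdot((1.0, 2.0, 3.0, 4.0), (4.0, 3.0, 2.0, 1.0)) then compiles to whatever ordering (and, ideally, FMA instructions) the backend prefers.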

EDIT: Looking at @code_native test2(out, A, W), I needed this optimization:

  • Allow floating-point contraction (e.g. fusing a multiply followed by an addition into a fused multiply-and-add).

for my code. Without it, test2 was (of course!) not actually using the fused multiply-add instructions. test2_fm was.
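
As an aside (not something the code above used): if contraction is the only transformation you want, Julia’s muladd lets you ask for it explicitly, without the rest of @fastmath. For the first accumulator in test2 that would look like:

out1 = muladd(rz, a, muladd(ry, a, muladd(rx, a, muladd(rw, a, out1))))

muladd(x, y, z) computes x*y + z and is allowed (but not required) to fuse, so this keeps the same association as the hand-written version while still permitting FMA instructions.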
