@fastmath switched off in Julia 0.7-Dev, deliberately?


#1

I noticed that the @fastmath macro is no longer effective in version 0.7-Dev. Was it switched off deliberately, or is something wrong with my installation preventing it from kicking in? I tested on both Windows 10 and Ubuntu Linux.

The following randmatstat, for example, runs about 2x faster on Julia 0.6 when @fastmath is enabled. I know the consequences of enabling @fastmath and use it judiciously, but in this specific example we're dealing with random matrices anyway, so no harm done.

function randmatstat(t)
    n = 256
    v = zeros(t)
    w = zeros(t)
    for i = 1:t
        a = randn(n,n)
        b = randn(n,n)
        c = randn(n,n)
        d = randn(n,n)
        P = [a b c d]
        Q = [a b; c d]
        @fastmath v[i] = trace((P'*P)^4)
        @fastmath w[i] = trace((Q'*Q)^4)
    end
    return (std(v)/mean(v), std(w)/mean(w))
end

@time (s1, s2) = randmatstat(100)
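For reference, one way to see what @fastmath actually does to an expression (on 0.7 and later) is to expand the macro. It is a purely syntactic rewrite to the `Base.FastMath` counterparts of the operators; whether anything gets faster then depends on which methods those counterparts dispatch to. A minimal sketch:

```julia
# Sketch (assumes Julia 0.7+): @fastmath swaps operators for their
# Base.FastMath counterparts (add_fast, mul_fast, pow_fast, ...).
# x, y, z need not be defined; macro expansion is purely syntactic.
ex = @macroexpand @fastmath x^4 + y * z
println(ex)  # the expansion mentions pow_fast, add_fast, and mul_fast
```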

#2
julia> x = rand(2,2)
2×2 Array{Float64,2}:
 0.97157   0.255659
 0.365014  0.149847

julia> x ^ 2
2×2 Array{Float64,2}:
 1.03727   0.286701
 0.409333  0.115773

julia> x * x
2×2 Array{Float64,2}:
 1.03727   0.286701
 0.409333  0.115773

julia> @fastmath x ^ 2
WARNING: pow_fast(x::AbstractArray{T1}, y::Number) where T1 <: Number is deprecated, use pow_fast.(x, y) instead.
Stacktrace:
 [1] depwarn(::String, ::Symbol) at ./deprecated.jl:70
 [2] pow_fast(::Array{Float64,2}, ::Int64) at ./deprecated.jl:57
 [3] eval(::Module, ::Any) at ./boot.jl:235
 [4] eval_user_input(::Any, ::Base.REPL.REPLBackend) at ./REPL.jl:66
 [5] macro expansion at /home/chris/.julia/v0.6/Revise/src/Revise.jl:775 [inlined]
 [6] (::Revise.##17#18{Base.REPL.REPLBackend})() at ./event.jl:73
while loading no file, in expression starting on line 0
2×2 Array{Float64,2}:
 0.943948  0.0653616
 0.133235  0.0224541

julia> x .^ 2
2×2 Array{Float64,2}:
 0.943948  0.0653616
 0.133235  0.0224541

This bug has been fixed in Julia 0.7.
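To spell out what the deprecated method was doing wrong, here is a small sketch (using 0.7+/1.x names, where matrix functionality lives in LinearAlgebra): for a matrix, `^` means matrix power, not elementwise power, so silently substituting `pow_fast.(x, 2)` changed the answer.

```julia
using LinearAlgebra  # stdlib on 0.7+; matrix ^ dispatches through it

x = [0.97 0.25; 0.36 0.15]
@assert x^2 == x * x                 # matrix power: repeated matrix multiplication
@assert x .^ 2 == map(t -> t^2, x)   # broadcast: elementwise squaring
@assert x^2 != x .^ 2                # the 0.6 @fastmath bug swapped one for the other
```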


#3

Yes, I know; it affected only some 0.6.x point releases. My question is why this happens on 0.7-Dev:

julia> @btime $x^2;
  47.239 ns (1 allocation: 112 bytes)

julia> @btime @fastmath $x^2;
  47.469 ns (1 allocation: 112 bytes)

#4

Why should it get faster?

It got faster on Julia 0.6 because @fastmath was computing element-wise exponentiation instead of matrix multiplication, which is much less work (and a different answer).


#5

Thanks, @Elrod, for the kind response. So the question still stands: what does @fastmath do now?


#6

Sorry, I meant to be Socratic, not rude.

If you’d like to read a long discussion, there’s loads of good stuff here:

Here are the LLVM docs on fastmath:
https://llvm.org/docs/LangRef.html#fast-math-flags

Basically, it turns on a lot of possible optimizations. A few highlights:

  • Approximate functions - Allow substitution of approximate calculations for functions (sin, log, sqrt, etc). See floating-point intrinsic definitions for places where this can apply to LLVM’s intrinsic math functions.
  • Allow Reciprocal - Allow optimizations to use the reciprocal of an argument rather than perform division.
  • Allow floating-point contraction (e.g. fusing a multiply followed by an addition into a fused multiply-and-add).
  • Allow reassociation transformations for floating-point instructions. This may dramatically change results in floating-point.

Transformations like these can change the results of floating-point calculations, so they're not allowed by default.
For many of them (the approximate functions being an exception), they are rewrites you could have done yourself.
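For example, reassociation alone can change a result. A minimal illustration, with values I picked to make the effect obvious:

```julia
# Reassociation is not harmless in floating point: these two groupings are
# mathematically equal but round differently.
a, b, c = 1e16, -1e16, 1.0
@assert (a + b) + c == 1.0   # cancel the big terms first, then add 1.0
@assert a + (b + c) == 0.0   # 1.0 is absorbed into -1e16 before the cancellation
```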
If you wrote

out1 += a/w + a/x + a/y + a/z
out2 += b/x + b/y + b/z + b/w
out3 += c/y + c/z + c/w + c/x
out4 += d/z + d/w + d/x + d/y

You could have written this instead:

rw = 1/w
rx = 1/x
ry = 1/y
rz = 1/z
out1 = (((out1 + rw*a) + rx*a) + ry*a) + rz*a
out2 = (((out2 + rw*b) + rx*b) + ry*b) + rz*b
out3 = (((out3 + rw*c) + rx*c) + ry*c) + rz*c
out4 = (((out4 + rw*d) + rx*d) + ry*d) + rz*d

The latter is going to be faster. Looking only at the arithmetic operations, on a Haswell processor (see the instruction tables at http://www.agner.org/optimize/):
"Reciprocal throughput" (RT) is roughly the average number of clock cycles per instruction when you're executing many of them.

A scalar double-precision divide (vdivsd) has an RT of 8-18 clock cycles; that is a single division.
The 256-bit packed divide (vdivpd) takes 16-28, and that is 4 divisions.

Most data movement and shuffling instructions take about RT 1.

A fused multiply-add instruction takes only 0.5.

Both versions are mathematically equivalent, but not numerically. They will give you a different answer.
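You can check this yourself: `a/w` and `a*(1/w)` are mathematically identical, but the reciprocal version rounds twice (once forming `1/w`, once in the multiply), so for some numerators the last bit differs. A small sketch:

```julia
# Count how many numerators a in 1..1000 give a different last bit when
# dividing by w versus multiplying by the rounded reciprocal of w.
w = 49.0
differs = count(a -> a / w != a * (1 / w), 1.0:1000.0)
@assert differs > 0   # some numerators hit the extra rounding step
println(differs, " of 1000 numerators differ")
```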

Also worth noting: many instructions don't conflict with one another, so you can't just add RTs together to guesstimate how long code will take. Processors pipeline, execute out of order, etc.
I only started reading about this recently, while working on optimizing matrix multiplication. Many here will know this much better than I do.

You can see how much I know when actually trying to benchmark the above, lol (these tests are on a Ryzen processor, which is fairly similar, with the notable difference that fma has an RT of 1 there instead of 0.5):

julia> function test1(out, A, W)
           out1, out2, out3, out4 = out
           a, b, c, d = A
           w, x, y, z = W
           out1 += a/w + a /x  + a/y + a/z 
           out2 += b/x + b/y + b/z + b/w
           out3 += c/y + c/z + c/w + c/x
           out4 += d/z + d/w + d/x + d/y
           (out1, out2, out3, out4)
       end
test1 (generic function with 1 method)

julia> function test1_fm(out, A, W)
           @fastmath begin
               out1, out2, out3, out4 = out
               a, b, c, d = A
               w, x, y, z = W
               out1 += a/w + a /x  + a/y + a/z 
               out2 += b/x + b/y + b/z + b/w
               out3 += c/y + c/z + c/w + c/x
               out4 += d/z + d/w + d/x + d/y
           end
           (out1, out2, out3, out4)
       end
test1_fm (generic function with 1 method)

julia> function test2(out, A, W)
           out1, out2, out3, out4 = out
           a, b, c, d = A
           w, x, y, z = W
           rw = 1/w
           rx = 1/x
           ry = 1/y
           rz = 1/z
           out1 = (((out1 + rw*a) + rx*a) + ry*a) + rz*a
           out2 = (((out2 + rw*b) + rx*b) + ry*b) + rz*b
           out3 = (((out3 + rw*c) + rx*c) + ry*c) + rz*c
           out4 = (((out4 + rw*d) + rx*d) + ry*d) + rz*d
           (out1, out2, out3, out4)
       end
test2 (generic function with 1 method)

julia> function test2_fm(out, A, W)
           @fastmath begin
               out1, out2, out3, out4 = out
               a, b, c, d = A
               w, x, y, z = W
               rw = 1/w
               rx = 1/x
               ry = 1/y
               rz = 1/z
               out1 = (((out1 + rw*a) + rx*a) + ry*a) + rz*a
               out2 = (((out2 + rw*b) + rx*b) + ry*b) + rz*b
               out3 = (((out3 + rw*c) + rx*c) + ry*c) + rz*c
               out4 = (((out4 + rw*d) + rx*d) + ry*d) + rz*d
           end
           (out1, out2, out3, out4)
       end
test2_fm (generic function with 1 method)

julia> out = ntuple(i -> randn(), 4)
(0.36465241072570437, -0.8177038358601129, -0.6176666333370491, 0.6337676331752973)

julia> A = ntuple(i -> randn(), 4)
(-3.3341832981886057, 1.239717550334394, 0.005370551858030382, -0.05803825677944311)

julia> W = ntuple(i -> randn(), 4)
(-1.0521219836695515, -0.7005568413173827, -0.7599365767331884, -1.05623323338091)

julia> using BenchmarkTools

julia> @btime test1($out, $A, $W)
  8.765 ns (0 allocations: 0 bytes)
(15.837116493227779, -6.5706824679638975, -0.6425889796194907, 0.9030974099511657)

julia> @btime test1_fm($out, $A, $W)
  8.705 ns (0 allocations: 0 bytes)
(15.837116493227779, -6.5706824679638975, -0.6425889796194907, 0.9030974099511657)

julia> @btime test2($out, $A, $W)
  6.642 ns (0 allocations: 0 bytes)
(15.83711649322778, -6.5706824679638975, -0.6425889796194907, 0.9030974099511657)

julia> @btime test2_fm($out, $A, $W)
  4.418 ns (0 allocations: 0 bytes)
(15.83711649322778, -6.5706824679638975, -0.6425889796194908, 0.9030974099511657)

julia> versioninfo()
Julia Version 0.7.0-alpha.94
Commit 4420717* (2018-06-12 20:26 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD Ryzen Threadripper 1950X 16-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.0 (ORCJIT, znver1)

I expected @fastmath not to help version 2 much, as I thought I did a decent job optimizing it, but that it would drastically improve version 1.
Instead, it didn’t help version 1 at all. Version 2 was faster, and improved even further via @fastmath.

Too bad version 1 didn't get any faster, since your typical scientist or statistician isn't going to want to pore over tables of instruction throughputs.

I like to use @fastmath in generated functions. Whenever I didn't choose the order of my operations for some good reason, but either arbitrarily or because I thought "this order would be fastest", I like to try @fastmath and let the compiler make its much more educated decision about which order is actually fastest.

That is when it pays off.
In your case, there was nothing @fastmath could do.

EDIT: Looking at @code_native test2(out, A, W), I needed this optimization:

  • Allow floating-point contraction (e.g. fusing a multiply followed by an addition into a fused multiply-and-add).

for my code. Without it, test2 was (of course!) not actually using fused multiply-add instructions; test2_fm was.
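Contraction changes results too: `fma` rounds once, where the separate multiply and add round twice. A self-contained illustration, with a value chosen so the low bits survive only in the fused version:

```julia
# a^2 = 1 + 2^-26 + 2^-54 exactly; the 2^-54 term is below half an ulp of 1.0,
# so the plain multiply rounds it away before the subtraction can recover it.
a = 1 + 2.0^-27
@assert a * a - 1 == 2.0^-26                   # two roundings: low term lost
@assert fma(a, a, -1.0) == 2.0^-26 + 2.0^-54   # one rounding: low term kept
```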


#7

There is something strange going on with 0.7-beta:

x=1.1
julia> @btime $x^2;
  0.015 ns (0 allocations: 0 bytes)
julia> @btime @fastmath $x^2;
  1.128 ns (0 allocations: 0 bytes)
julia> versioninfo()
Julia Version 0.7.0-beta.0
Commit f41b1ec* (2018-06-24 01:32 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.0 (ORCJIT, skylake)
Environment:
  JULIA_PKG3_PRECOMPILE = enable
  JULIA_NUM_THREADS = 4

Any idea?


#8

Constant propagation is folding the computation out of the benchmark loop (hence the impossible 0.015 ns timing). Not really strange, just proof that microbenchmarking with optimizing compilers can be difficult.
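The usual workaround, per the BenchmarkTools manual, is to hide the value behind a `Ref` and interpolate that instead, e.g. `@btime $(Ref(x))[] ^ 2`, so the compiler can no longer see a constant. A minimal sketch of why that helps, without BenchmarkTools:

```julia
# A literal can be folded at compile time; a load from a mutable Ref cell
# has to happen at run time, so the work being timed actually runs.
square_lit() = 1.1^2       # candidate for constant folding to 1.21
square_ref(r) = r[]^2      # r[] is loaded at run time

x = Ref(1.1)
@assert square_lit() == square_ref(x)   # same value either way
```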