I noticed that the @fastmath macro is no longer effective in version 0.7-dev. Has it been switched off deliberately, or is something wrong with my installation preventing it from taking effect? I tested on both Windows 10 and Ubuntu Linux.
The following randmatstat, for example, gets about 2x faster in Julia 0.6 when @fastmath is activated. I know the consequences of enabling @fastmath and use it judiciously, but in this specific example we're dealing with random matrices anyway, so no harm done.
function randmatstat(t)
    n = 256
    v = zeros(t)
    w = zeros(t)
    for i = 1:t
        a = randn(n,n)
        b = randn(n,n)
        c = randn(n,n)
        d = randn(n,n)
        P = [a b c d]
        Q = [a b; c d]
        @fastmath v[i] = trace((P'*P)^4)
        @fastmath w[i] = trace((Q'*Q)^4)
    end
    return (std(v)/mean(v), std(w)/mean(w))
end
@time (s1, s2) = randmatstat(100)
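One quick way to see what @fastmath is actually doing to those lines is to expand the macro. This check is not part of the original post; the Base.FastMath names in the comment are the real ones, but the exact expansion varies by Julia version:

# @fastmath rewrites the operators inside the expression to their
# Base.FastMath counterparts, e.g. ^ -> pow_fast and * -> mul_fast.
# @macroexpand only expands, so P does not need to be defined:
@macroexpand @fastmath trace((P'*P)^4)

That pow_fast rewrite is exactly what shows up in the REPL session below.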
julia> x = rand(2,2)
2×2 Array{Float64,2}:
0.97157 0.255659
0.365014 0.149847
julia> x ^ 2
2×2 Array{Float64,2}:
1.03727 0.286701
0.409333 0.115773
julia> x * x
2×2 Array{Float64,2}:
1.03727 0.286701
0.409333 0.115773
julia> @fastmath x ^ 2
WARNING: pow_fast(x::AbstractArray{T1}, y::Number) where T1 <: Number is deprecated, use pow_fast.(x, y) instead.
Stacktrace:
[1] depwarn(::String, ::Symbol) at ./deprecated.jl:70
[2] pow_fast(::Array{Float64,2}, ::Int64) at ./deprecated.jl:57
[3] eval(::Module, ::Any) at ./boot.jl:235
[4] eval_user_input(::Any, ::Base.REPL.REPLBackend) at ./REPL.jl:66
[5] macro expansion at /home/chris/.julia/v0.6/Revise/src/Revise.jl:775 [inlined]
[6] (::Revise.##17#18{Base.REPL.REPLBackend})() at ./event.jl:73
while loading no file, in expression starting on line 0
2×2 Array{Float64,2}:
0.943948 0.0653616
0.133235 0.0224541
julia> x .^ 2
2×2 Array{Float64,2}:
0.943948 0.0653616
0.133235 0.0224541
So on 0.6, @fastmath turned the matrix power x ^ 2 into an element-wise pow_fast (note it matches x .^ 2, not x * x), silently changing the result. That semantic change is likely a big part of why randmatstat looked 2x faster with @fastmath.
Basically, it turns on a lot of possible optimizations. A few highlights:
Approximate functions - Allow substitution of approximate calculations for functions (sin, log, sqrt, etc). See floating-point intrinsic definitions for places where this can apply to LLVM’s intrinsic math functions.
Allow Reciprocal - Allow optimizations to use the reciprocal of an argument rather than perform division.
Allow floating-point contraction (e.g. fusing a multiply followed by an addition into a fused multiply-and-add).
Allow reassociation transformations for floating-point instructions. This may dramatically change results in floating-point.
Doing things like this can change the results of floating point calculations, so they’re not allowed by default.
For many of these (the approximate functions being an exception), they're transformations you could have applied yourself.
If you wrote, say, one version that divides by the same value repeatedly and another that multiplies by a precomputed reciprocal:
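Something like this sketch (illustrative code, not the exact snippet; the function names are made up):

# Version 1: a division on every iteration
function sum_div(xs, y)
    s = 0.0
    for x in xs
        s += x / y          # one division instruction per element
    end
    return s
end

# Version 2: hoist the reciprocal out of the loop and multiply instead
function sum_recip(xs, y)
    r = 1 / y               # a single division up front
    s = 0.0
    for x in xs
        s += x * r          # multiply + add, eligible for fma contraction
    end
    return s
end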
The latter is going to be faster. Looking only at the arithmetic operations and a Haswell processor:
“Reciprocal throughput” (RT) is basically the average number of clock cycles per instruction when you're executing a lot of them back to back.
fdiv has an RT of 8-18 clock cycles. This is a single division. fdivpd takes 16-28. This is 4 divisions.
Most data movement and shuffling instructions take about RT 1.
A fused multiply-add instruction takes only 0.5.
Both versions are mathematically equivalent, but not numerically. They will give you a different answer.
Also worth noting that many instructions don't conflict with each other, so you can't just add RTs together to guesstimate how long something will take. Processors multitask, can do things out of order, etc…
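If you want to see which of these instructions your own code compiles down to, @code_native on a toy function is enough. An illustrative check (not from the original post; the exact assembly depends on your CPU and Julia version):

f_div(x, y)       = x / y              # expect a scalar divide (e.g. vdivsd)
f_muladd(a, b, c) = muladd(a, b, c)    # may fuse into a single vfmadd on FMA-capable CPUs

@code_native f_div(1.0, 2.0)
@code_native f_muladd(1.0, 2.0, 3.0)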
I only started reading about this recently, while working on optimizing matrix multiplication. Many here will know this much better than I.
You can see how much I actually know when trying to benchmark the above, lol (these tests are on a Ryzen processor, which is fairly similar, with the notable difference that fma has an RT of 1 instead of 0.5):
julia> function test1(out, A, W)
           out1, out2, out3, out4 = out
           a, b, c, d = A
           w, x, y, z = W
           out1 += a/w + a/x + a/y + a/z
           out2 += b/x + b/y + b/z + b/w
           out3 += c/y + c/z + c/w + c/x
           out4 += d/z + d/w + d/x + d/y
           (out1, out2, out3, out4)
       end
test1 (generic function with 1 method)
julia> function test1_fm(out, A, W)
           @fastmath begin
               out1, out2, out3, out4 = out
               a, b, c, d = A
               w, x, y, z = W
               out1 += a/w + a/x + a/y + a/z
               out2 += b/x + b/y + b/z + b/w
               out3 += c/y + c/z + c/w + c/x
               out4 += d/z + d/w + d/x + d/y
           end
           (out1, out2, out3, out4)
       end
test1_fm (generic function with 1 method)
julia> function test2(out, A, W)
           out1, out2, out3, out4 = out
           a, b, c, d = A
           w, x, y, z = W
           rw = 1/w
           rx = 1/x
           ry = 1/y
           rz = 1/z
           out1 = (((out1 + rw*a) + rx*a) + ry*a) + rz*a
           out2 = (((out2 + rw*b) + rx*b) + ry*b) + rz*b
           out3 = (((out3 + rw*c) + rx*c) + ry*c) + rz*c
           out4 = (((out4 + rw*d) + rx*d) + ry*d) + rz*d
           (out1, out2, out3, out4)
       end
test2 (generic function with 1 method)
julia> function test2_fm(out, A, W)
           @fastmath begin
               out1, out2, out3, out4 = out
               a, b, c, d = A
               w, x, y, z = W
               rw = 1/w
               rx = 1/x
               ry = 1/y
               rz = 1/z
               out1 = (((out1 + rw*a) + rx*a) + ry*a) + rz*a
               out2 = (((out2 + rw*b) + rx*b) + ry*b) + rz*b
               out3 = (((out3 + rw*c) + rx*c) + ry*c) + rz*c
               out4 = (((out4 + rw*d) + rx*d) + ry*d) + rz*d
           end
           (out1, out2, out3, out4)
       end
test2_fm (generic function with 1 method)
julia> out = ntuple(i -> randn(), 4)
(0.36465241072570437, -0.8177038358601129, -0.6176666333370491, 0.6337676331752973)
julia> A = ntuple(i -> randn(), 4)
(-3.3341832981886057, 1.239717550334394, 0.005370551858030382, -0.05803825677944311)
julia> W = ntuple(i -> randn(), 4)
(-1.0521219836695515, -0.7005568413173827, -0.7599365767331884, -1.05623323338091)
julia> using BenchmarkTools
julia> @btime test1($out, $A, $W)
8.765 ns (0 allocations: 0 bytes)
(15.837116493227779, -6.5706824679638975, -0.6425889796194907, 0.9030974099511657)
julia> @btime test1_fm($out, $A, $W)
8.705 ns (0 allocations: 0 bytes)
(15.837116493227779, -6.5706824679638975, -0.6425889796194907, 0.9030974099511657)
julia> @btime test2($out, $A, $W)
6.642 ns (0 allocations: 0 bytes)
(15.83711649322778, -6.5706824679638975, -0.6425889796194907, 0.9030974099511657)
julia> @btime test2_fm($out, $A, $W)
4.418 ns (0 allocations: 0 bytes)
(15.83711649322778, -6.5706824679638975, -0.6425889796194908, 0.9030974099511657)
julia> versioninfo()
Julia Version 0.7.0-alpha.94
Commit 4420717* (2018-06-12 20:26 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: AMD Ryzen Threadripper 1950X 16-Core Processor
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-6.0.0 (ORCJIT, znver1)
I expected @fastmath not to help version 2 much, as I thought I did a decent job optimizing it, but that it would drastically improve version 1.
Instead, it didn’t help version 1 at all. Version 2 was faster, and improved even further via @fastmath.
Too bad version 1 didn't get any faster, since your typical scientist or statistician isn't going to want to pore over tables of instruction throughputs.
I like to use @fastmath in generated functions. Whenever I didn't choose the order of my operations for some good reason, but either arbitrarily or because I thought “this order would be fastest”, I like to try @fastmath and let the compiler make its much more educated decision about which order is actually fastest.
That is when it pays off.
In your case, there was nothing @fastmath could do.
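As a sketched example (not from the post) of where letting the compiler reorder pays off: a plain sequential sum has a fixed left-to-right association, so the reduction can't be vectorized; allowing reassociation with @fastmath lets LLVM reorder it and potentially use SIMD.

function mysum(xs)
    s = zero(eltype(xs))
    for x in xs
        s += x              # fixed association: summed strictly left to right
    end
    return s
end

function mysum_fast(xs)
    s = zero(eltype(xs))
    @fastmath for x in xs
        s += x              # reassociation allowed: the reduction may be vectorized
    end
    return s
end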
EDIT: Looking at @code_native test2(out, A, W), I needed this optimization:
Allow floating-point contraction (e.g. fusing a multiply followed by an addition into a fused multiply-and-add).
for my code. Without it, test2 was – of course! – not actually using the fused multiply-add instructions. test2_fm was.
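If contraction is the only relaxation you want, Base's muladd asks for exactly that without opting into the other fast-math flags. A sketch (not from the original post) of how one of the test2 lines could be written:

# muladd(a, b, c) computes a*b + c and is allowed to fuse into a single fma:
out1 = muladd(rz, a, muladd(ry, a, muladd(rx, a, muladd(rw, a, out1))))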
Constant propagation lifting the computation out of the benchmark loop. Not really weird, just proof that microbenchmarking with optimizing compilers can be difficult.
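A common way to keep constant propagation from folding the benchmarked call away is the Ref trick from the BenchmarkTools manual (a standard idiom, not something from this thread):

using BenchmarkTools
x = 1.5
@btime sin($x)              # the interpolated value can still be constant-folded
@btime sin($(Ref(x))[])     # reading through a Ref defeats constant propagation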