Ah, I see that there are three more flags offered by LLVM (all included by fast): contract, afn, and reassoc.
Yes, that would work. I would use a @generated function so that the fast token can be generated programmatically. With 7 flags there are 2^7 = 128 versions of these functions to be generated.
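For concreteness, here is a minimal sketch of that idea (names and details are my own, not taken from any package): a hand-written llvmcall with a fixed flag subset, and a @generated function that splices an arbitrary flag tuple into the IR string at compile time.
using Base: llvmcall
# Fixed flag subset, written out by hand (sketch only):
add_contract_reassoc(x::Float64, y::Float64) =
    llvmcall("""
        %z = fadd contract reassoc double %0, %1
        ret double %z
        """, Float64, Tuple{Float64,Float64}, x, y)
# Generated version: the flag tuple (e.g. (:contract, :reassoc)) is joined
# into the IR string while the method is being generated, so each flag
# combination gets its own compiled method.
@generated function fadd_flags(::Val{flags}, x::Float64, y::Float64) where {flags}
    ir = """
        %z = fadd $(join(flags, ' ')) double %0, %1
        ret double %z
        """
    return :(llvmcall($ir, Float64, Tuple{Float64,Float64}, x, y))
end
# usage: fadd_flags(Val((:contract, :reassoc)), 1.0, 2.0)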
Is llvmcall fast enough, or will it slow down compiling things too much?
-erik
Yeah, that’s what happens in your package now.
I think it is pretty OK. It recently got made faster: speed up llvmcall unique name generation by JeffBezanson · Pull Request #35144 · JuliaLang/julia · GitHub.
Oh, does it? How embarrassing. Thank you.
Someone should take this and split it out into a FastMath.jl, and advertise it…
Okay, I’ll file a report.
I get it with LLVM 9 (+ Julia 1.5), but not with LLVM 8 (+ Julia 1.4).
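The definitions of vdiv! and vdiv_fast! aren’t shown in this excerpt; a plausible pair (my assumption, not necessarily the code actually benchmarked) is a scalar-division loop with and without @fastmath:
# Hypothetical definitions, for context only:
function vdiv!(x, y, a)
    @inbounds @simd for i in eachindex(x, y)
        x[i] = y[i] / a
    end
    return x
end
function vdiv_fast!(x, y, a)
    @inbounds @simd for i in eachindex(x, y)
        @fastmath x[i] = y[i] / a
    end
    return x
end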
julia> @benchmark vdiv!($x1, $y, $a)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 20.792 ns (0.00% GC)
median time: 20.940 ns (0.00% GC)
mean time: 22.101 ns (0.00% GC)
maximum time: 50.008 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 997
julia> @benchmark vdiv_fast!($x2, $y, $a)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 21.394 ns (0.00% GC)
median time: 21.447 ns (0.00% GC)
mean time: 22.808 ns (0.00% GC)
maximum time: 49.948 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 997
I think this shows that if you have already thought about the order of operations – whether because your algorithm depends on that order for accuracy or for performance – it’s less likely that things go wrong without @fastmath.
EDIT:
Just tested LLVM 10 (+ Julia 1.5), and I still see the issue.
Also, SIMD.jl now works with @fastmath, e.g.
julia> f1(a, b, c) = a * b - c * 2.0
f1 (generic function with 1 method)
julia> f2(a, b, c) = @fastmath a * b - c * 2.0
f2 (generic function with 1 method)
julia> V = Vec{4, Float64}
Vec{4,Float64}
julia> code_native(f1, Tuple{V, V, V}, debuginfo=:none)
.section __TEXT,__text,regular,pure_instructions
vmovupd (%rsi), %ymm0
vmulpd (%rdx), %ymm0, %ymm0
movq %rdi, %rax
vmovupd (%rcx), %ymm1
vaddpd %ymm1, %ymm1, %ymm1
vsubpd %ymm1, %ymm0, %ymm0
vmovapd %ymm0, (%rdi)
vzeroupper
retq
nop
julia> code_native(f2, Tuple{V, V, V}, debuginfo=:none)
.section __TEXT,__text,regular,pure_instructions
movq %rdi, %rax
vmovupd (%rdx), %ymm0
vmovupd (%rcx), %ymm1
vaddpd %ymm1, %ymm1, %ymm1
# vvvv fused in fast version vvvvvv
vfmsub231pd (%rsi), %ymm0, %ymm1 ## ymm1 = (ymm0 * mem) - ymm1
vmovapd %ymm1, (%rdi)
vzeroupper
retq
nopl (%rax)
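For completeness, a small usage sketch (my own example, assuming SIMD.jl is loaded as in the snippets above) showing what the two functions compute on concrete Vec values:
using SIMD  # provides Vec
a = Vec{4,Float64}((1.0, 2.0, 3.0, 4.0))
b = Vec{4,Float64}((5.0, 6.0, 7.0, 8.0))
c = Vec{4,Float64}((0.5, 0.5, 0.5, 0.5))
f1(a, b, c)  # elementwise a*b - 2c -> (4.0, 11.0, 20.0, 31.0)
f2(a, b, c)  # same values here (all exact); @fastmath just lets LLVM fuse the multiply-subtract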