@fastmath macro accuracy

Ah, I see that there are three more flags offered by LLVM (all included by fast): contract, afn, and reassoc.
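For concreteness, a minimal sketch of applying one of these flags to a single operation via `llvmcall` (the function name `add_contract` is hypothetical):

```julia
# Hypothetical sketch: applying a single LLVM fast-math flag (`contract`,
# which permits fusing mul+add into an FMA) to one fadd instruction.
add_contract(x::Float64, y::Float64) = Base.llvmcall(
    """
    %z = fadd contract double %0, %1
    ret double %z
    """,
    Float64, Tuple{Float64,Float64}, x, y)
```

Any subset of the flags (`contract`, `afn`, `reassoc`, `nnan`, `ninf`, `nsz`, `arcp`, or the umbrella `fast`) can be spliced into the instruction the same way.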


Yes, that would work.

I would use a @generated function so that the fast-math token can be spliced in programmatically. With 7 flags there are 2^7 = 128 versions of these functions to be generated.
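A sketch of that idea, assuming the flag combination is passed as a `Val`-wrapped `Symbol` (names here are hypothetical, not from any package):

```julia
# Hypothetical sketch: a @generated function that interpolates the chosen
# fast-math flag string into the llvmcall IR at compile time, so each flag
# combination gets its own specialized method.
@generated function fadd_flags(::Val{flags}, x::Float64, y::Float64) where {flags}
    ir = """
    %z = fadd $(String(flags)) double %0, %1
    ret double %z
    """
    return :(Base.llvmcall($ir, Float64, Tuple{Float64,Float64}, x, y))
end
```

Usage would look like `fadd_flags(Val(:fast), a, b)` or `fadd_flags(Val(Symbol("reassoc contract")), a, b)`; since `flags` is a type parameter, each combination compiles to its own specialization.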

Is llvmcall fast enough, or will it slow down compiling things too much?

-erik

Yeah, that’s what happens in your package now :wink:

I think it is pretty OK. It recently got made faster: https://github.com/JuliaLang/julia/pull/35144.

Oh, does it? How embarrassing. Thank you.

Someone should take this and split it out into a FastMath.jl, and advertise it…


Okay, I’ll file a report.
I get it with LLVM 9 (+ Julia 1.5), but not with LLVM 8 (+ Julia 1.4).

julia> @benchmark vdiv!($x1, $y, $a)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     20.792 ns (0.00% GC)
  median time:      20.940 ns (0.00% GC)
  mean time:        22.101 ns (0.00% GC)
  maximum time:     50.008 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     997

julia> @benchmark vdiv_fast!($x2, $y, $a)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     21.394 ns (0.00% GC)
  median time:      21.447 ns (0.00% GC)
  mean time:        22.808 ns (0.00% GC)
  maximum time:     49.948 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     997

I think this shows that if you deliberately chose the order of operations, whether because your algorithm depends on that order for accuracy or for performance, things are less likely to go wrong without @fastmath, since the compiler then preserves your ordering.
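The classic example of an order-dependent algorithm is compensated summation, a minimal sketch:

```julia
# Sketch: Kahan (compensated) summation relies on the exact evaluation order.
# Under @fastmath, the `reassoc` flag lets the compiler simplify the
# compensation term (t - s) - y algebraically to 0.0, silently turning this
# back into naive summation.
function kahan_sum(xs)
    s = 0.0
    c = 0.0  # running compensation for lost low-order bits
    for x in xs
        y = x - c
        t = s + y
        c = (t - s) - y  # reassociation folds this to zero
        s = t
    end
    return s
end
```

Wrapping the loop body in @fastmath would therefore be exactly the kind of silent accuracy loss being discussed here.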

EDIT:
Just tested LLVM 10 (+ Julia 1.5), and I still see the issue.


Also SIMD.jl now works with @fastmath e.g.

julia> f1(a, b, c) = a * b - c * 2.0
f1 (generic function with 1 method)

julia> f2(a, b, c) = @fastmath a * b - c * 2.0
f2 (generic function with 1 method)

julia> V = Vec{4, Float64}
Vec{4,Float64}

julia> code_native(f1, Tuple{V, V, V}, debuginfo=:none)
        .section        __TEXT,__text,regular,pure_instructions
        vmovupd (%rsi), %ymm0
        vmulpd  (%rdx), %ymm0, %ymm0
        movq    %rdi, %rax
        vmovupd (%rcx), %ymm1
        vaddpd  %ymm1, %ymm1, %ymm1
        vsubpd  %ymm1, %ymm0, %ymm0
        vmovapd %ymm0, (%rdi)
        vzeroupper
        retq
        nop

julia> code_native(f2, Tuple{V, V, V}, debuginfo=:none)
        .section        __TEXT,__text,regular,pure_instructions
        movq    %rdi, %rax
        vmovupd (%rdx), %ymm0
        vmovupd (%rcx), %ymm1
        vaddpd  %ymm1, %ymm1, %ymm1
        # vvvv fused in fast version vvvvvv
        vfmsub231pd     (%rsi), %ymm0, %ymm1 ## ymm1 = (ymm0 * mem) - ymm1 
        vmovapd %ymm1, (%rdi)
        vzeroupper
        retq
        nopl    (%rax)
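If the fused multiply-subtract is all you want, `muladd` requests just the contraction without enabling the other fast-math flags; a small sketch (the name `f3` is hypothetical):

```julia
# Sketch: muladd(a, b, c) computes a*b + c and permits (but does not require)
# fusion into an FMA, leaving the remaining fast-math semantics untouched.
f3(a, b, c) = muladd(a, b, -(c * 2.0))  # a*b - 2c, same as f1/f2 above
```

This is often the safer choice when you only care about the FMA and not about reassociation or finite-math assumptions.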