Investigating numerical change in function return value between v1.4 vs v1.5

jmert · June 13, 2020, 5:12pm

I’ve got a function which is exhibiting a change to the numerical return value when comparing Julia versions 1.2 through 1.4 versus the 1.5beta (and master). I think this is actually an LLVM “error” (or a misunderstanding on my part, but one which has an impact at the LLVM optimization stage), but it may be something that only shows up from Julia where single-instruction fast-math annotations are easy to make.

So my problem case is demonstrated with the following Julia function:

function coeff_α(l::Integer, m::Integer)
    lT = convert(Float64, l)
    mT = convert(Float64, m)
    fac1 = (2lT + 1) / ((2lT - 3) * (lT^2 - mT^2))
    fac2 = 4*(lT - 1)^2 - 1
    @fastmath return sqrt(fac1 * fac2)
end

The unoptimized LLVM IR shows that just the final multiplication of fac1 * fac2 and then subsequent sqrt are marked as fast math instrunctions:

julia> @code_llvm debuginfo=:none optimize=false coeff_α(1,1)

define double @"julia_coeff_\CE\B1_1317"(i64, i64) {
top:
  %2 = call %jl_value_t*** @julia.ptls_states()
  %3 = bitcast %jl_value_t*** %2 to %jl_value_t**
  %4 = getelementptr inbounds %jl_value_t*, %jl_value_t** %3, i64 4
  %5 = bitcast %jl_value_t** %4 to i64**
  %6 = load i64*, i64** %5
  %7 = sitofp i64 %0 to double
  %8 = sitofp i64 %1 to double
  %9 = fmul double 2.000000e+00, %7
  %10 = fadd double %9, 1.000000e+00
  %11 = fmul double 2.000000e+00, %7
  %12 = fsub double %11, 3.000000e+00
  %13 = fmul double %7, %7
  %14 = fmul double %8, %8
  %15 = fsub double %13, %14
  %16 = fmul double %12, %15
  %17 = fdiv double %10, %16
  %18 = fsub double %7, 1.000000e+00
  %19 = fmul double %18, %18
  %20 = fmul double 4.000000e+00, %19
  %21 = fsub double %20, 1.000000e+00
  %22 = fmul fast double %17, %21
  %23 = call fast double @llvm.sqrt.f64(double %22)
  ret double %23
}

For Julia versions 1.2 through 1.4, the optimized LLVM IR continues to only mark those two instructions with the fast annotation, but starting with Julia 1.5, the earlier division is reordered to later in the function and made a fast instruction as well:

julia> @code_llvm debuginfo=:none optimize=false coeff_α(1,1)

define double @"julia_coeff_\CE\B1_1335"(i64, i64) {
top:
  %2 = sitofp i64 %0 to double
  %3 = sitofp i64 %1 to double
  %4 = fmul double %2, 2.000000e+00
  %5 = fadd double %4, 1.000000e+00
  %6 = fadd double %4, -3.000000e+00
  %7 = fmul double %2, %2
  %8 = fmul double %3, %3
  %9 = fsub double %7, %8
  %10 = fmul double %6, %9
  %11 = fadd double %2, -1.000000e+00
  %12 = fmul double %11, %11
  %13 = fmul double %12, 4.000000e+00
  %14 = fadd double %13, -1.000000e+00
  %15 = fmul fast double %14, %5
  %16 = fdiv fast double %15, %10
  %17 = call fast double @llvm.sqrt.f64(double %16)
  ret double %17
}

I’ve entered the unoptimized LLVM IR into the godbolt online compiler tool and found that the change corresponds to the fact that Julia 1.4 uses LLVM 8 while Julia 1.5 uses LLVM 9, and the change is reproducible with LLVM’s built-in optimization pipeline.

(If you edit the input IR and remove the fast from the fmul instruction, the two compiler versions also agree, so it seems to be critically linked to the fmul fast specifically.)

So my questions are:

Is this just a misunderstanding on my part about how the @fastmath instructions are supposed to work? I expect the two initially-marked instructions to remain as the only fast instructions in the LLVM IR, but maybe that assumption is wrong and the fast annotation is allowed to “grow” under certain circumstances?
If this is unexpected behavior, how should I go about reporting?

P.S. I already have a workaround — change @fastmath sqrt(fac1 * fac2) to just @fastmath(sqrt)(fac1 * fac2) since my purpose is to avoid the checks for negative square roots (and subsequent throw) which I know mathematically won’t happen given how the function is called.

jmert · June 13, 2020, 5:41pm

(I also just realized the “Domains > Numerics” category might not be the best filing of this question… Would the “Internals” category be better?)

Elrod · June 13, 2020, 6:17pm

Alternative options include Base.sqrt_llvm and Base.FastMath.sqrt_fast:

julia> f(x) = Base.sqrt_llvm(x)
f (generic function with 2 methods)

; julia> @code_llvm f(2.3)
;  @ REPL[30]:1 within `f'
define double @julia_f_1895(double %0) {
top:
  %1 = call double @llvm.sqrt.f64(double %0)
  ret double %1
}

; julia> @code_llvm Base.FastMath.sqrt_fast(2.3)
;  @ fastmath.jl:283 within `sqrt_fast'
define double @julia_sqrt_fast_1898(double %0) {
top:
  %1 = call fast double @llvm.sqrt.f64(double %0)
  ret double %1
}

Fastmath may turn into a no-op soon, so Base.sqrt_llvm seems more likely to keep doing what you want (avoiding the check) across Julia versions. However, Base.FastMath.sqrt_fast is more generic, so if you’d like the code to work as intended for anything other than primitive floats

julia> reinterpret(Int, sqrt(reinterpret(Float64, 2)))
2190614870947216333

julia> Base.sqrt_llvm(2)
2190614870947216333

julia> Base.FastMath.sqrt_fast(2)
1.4142135623730951

then the fast version has an advantage. Of course, you could define your own unchecked_sqrt function and use multiple dispatch to control its behavior more precisely.

jmert · June 13, 2020, 7:51pm

Ah, thanks for pointing me to that. I hadn’t seen that yet. So yes, for my purpose just calling directly to the underlying functions is maybe the better way forward.

I’ve kind of long wanted maybe something akin to @inbounds which asserts on mathematical bounds checking rather than array bounds checking.

At any rate, I’d still be interested in knowing if my mental model of the LLVM fast annotations is wrong or not, and thankfully my function is simple enough that digging into how the optimization passes transform the code is probably feasible (currently, for someone else, but I’d be happy to learn how to dig into the internals more).

Elrod · June 13, 2020, 9:53pm

I’m curious about this as well.
FWIW, you don’t need -O3, all you need is -instcombine to get the fdiv fast. So it seems to have been a change with the instcombine pass between versions.

I also recently encountered an issue with fdiv and instcombine that started with LLVM 9 where the instcombine moves an fdiv inside a loop, dramatically worsening performance.
My issue on LLVM was closed because that was intended behavior; you’re supposed to place a licm at some point after the last instcombine to move the division back out of the loop.

Maybe you could file this as an instcombine issue with LLVM. It seems reasonably likely they’re connected (instcombine getting more aggressive with divisions from LLVM 8 to 9), but your example seems harder to close as a Julia issue / shows up with the default -O3 optimization pipeline.

jmert · June 13, 2020, 10:13pm

Ooops, I meant to send the godbolt link that used -O1 since it shows up even there (but not -O0). Narrowing down to just a specific optimization pass, though, is very helpful!

I’ll look into it — will have to register on LLVMs bug tracker!

jmert · June 15, 2020, 2:14pm

LLVM bug report at 46326 – InstCombine changes sibling non-fast instruction to fast

Elrod · June 20, 2020, 6:29pm

That was an insightful report. In particular:

This is an open question about the semantics of fast-math-flags in IR and codegen. In most cases so far, we have assumed that if the last value in a calculation is relaxed in some way (“fast” means all possible FP shortcuts are allowed – LLVM Language Reference Manual — LLVM 18.0.0git documentation ), then we are free to apply those semantics to intermediate calcs leading to the final value.

I hope that they change their mind on this, and instead require the appropriate fast flag on each of the operations being transformed.
A dangerous example:

mask = a == a # false if a is `NaN`
b = ifelse(mask, a, zero(a)) # replace zero with `NaN`
@fastmath b * c + d # use `b` for something, and apply fast flags

While this currently works as desired – that is, the NaN check doesn’t get eliminated – it sounds like it is allowed to apply the same semantics to intermediate calculations, meaning:

fast assumes no-NaNs. Therefore, b, c, and d are not NaN.
Apply no-NaNs to intermediate calculations: a must not be NaN either.
Therefore, a == a must return true, and we can optimize the code into muladd(a, c, d).

Of course, we may have assumed that there is a very real possibility that a is NaN, and added that check with the express purpose of guaranteeing that b isn’t NaN, and therefore wrongly thought it safe to provide that guarantee when using b.

I’m really not a fan of this. =/