Zygote performance for a simple function

Hi all,
I have the following example in Zygote for computing a single gradient, and I don’t understand why I get this performance regression across the following implementations:

using SpecialFunctions, BenchmarkTools, Zygote, IrrationalConstants

function normcdf_wrong(x)
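    # not the standard normal CDF: the argument is missing a sign flip and a 1/√2 scaling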
    return erfc(x) / 2
end

function blsdelta_wrong(S0, K, r, T, σ, d)
    rt = r * T
    dt = -d * T
    sigma_sqrtT = σ * sqrt(T)
    S0_K = S0 / K
    d_1 = (log(S0_K) + rt + dt) / sigma_sqrtT + sigma_sqrtT / 2
    Δ = exp(dt) * normcdf_wrong(d_1)
    return Δ
end

function normcdf_wrong_2(x)
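    # sign fixed relative to normcdf_wrong, but still missing the 1/√2 scaling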
    return erfc(-x) / 2
end

function blsdelta_wrong_2(S0, K, r, T, σ, d)
    rt = r * T
    dt = -d * T
    sigma_sqrtT = σ * sqrt(T)
    S0_K = S0 / K
    d_1 = (log(S0_K) + rt + dt) / sigma_sqrtT + sigma_sqrtT / 2
    Δ = exp(dt) * normcdf_wrong_2(d_1)
    return Δ
end

function normcdf_correct(x)
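    # standard normal CDF: Φ(x) = erfc(-x/√2) / 2, with invsqrt2 from IrrationalConstants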
    return erfc(-x * invsqrt2) / 2
end

function blsdelta_correct(S0, K, r, T, σ, d)
    rt = r * T
    dt = -d * T
    sigma_sqrtT = σ * sqrt(T)
    S0_K = S0 / K
    d_1 = (log(S0_K) + rt + dt) / sigma_sqrtT + sigma_sqrtT / 2
    Δ = exp(dt) * normcdf_correct(d_1)
    return Δ
end

S0 = 100.0; K = 100.0; r = 0.02; T = 1.2; σ = 0.2; d = 0.01;

@btime Zygote.gradient(blsdelta_wrong, $S0, $K, $r, $T, $σ, $d);   # performance is fine
@btime Zygote.gradient(blsdelta_wrong_2, $S0, $K, $r, $T, $σ, $d); # allocating?
@btime Zygote.gradient(blsdelta_correct, $S0, $K, $r, $T, $σ, $d); # allocating?

The first call performs fine, but unfortunately it isn’t correct. The second isn’t correct either, but it allocates (the only difference is a minus sign). The third one is correct from a numerical point of view, but it allocates for some reason.
Has anybody experienced anything like this?
(I think this is the first post I write; apologies in advance to the moderators.)

I see very similar benchmark stats for the correct and “wrong” implementations. Can you post your results?

using Flux, SpecialFunctions, BenchmarkTools

function normcdf(x)
    return erfc(-x * sqrt(0.5)) * 0.5
end

function bsdelta(S, K, r, T, σ, d)
    v = σ * sqrt(T)
    d1 = (log(S/K) + (r-d)*T) / v + 0.5 * v
    return exp(-d*T) * normcdf(d1)
end

pars = [1.0, 1.0, 0.02, 1.2, 0.2, 0.01]
@btime Flux.gradient(p -> bsdelta(p...), $(pars))
 1.297 μs (42 allocations: 5.73 KiB)
([1.7750677670750958, -1.7750677670750958, 2.130081320490115, 0.021040851149368833, 0.10650406602450575, -2.8003011633010275],)

I see the following:

@btime Zygote.gradient(blsdelta_wrong, $S0, $K, $r, $T, $σ, $d);   # performance is fine
  #65.988 ns (1 allocation: 16 bytes)

@btime Zygote.gradient(blsdelta_wrong_2, $S0, $K, $r, $T, $σ, $d); # allocating?
  #2.567 μs (45 allocations: 2.86 KiB)

@btime Zygote.gradient(blsdelta_correct, $S0, $K, $r, $T, $σ, $d); # allocating?
  #2.522 μs (45 allocations: 2.92 KiB)

The problem is almost certainly with -x * invsqrt2. IrrationalConstants is likely doing a bunch of computation behind the scenes which is not Zygote-friendly. If you substitute -x * inv(sqrt(2)), the allocations should disappear.
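
That is, a sketch like the following (normcdf_plain is just an illustrative name for the substitution):

function normcdf_plain(x)
    # plain floating-point constant instead of the irrational invsqrt2
    return erfc(-x * inv(sqrt(2))) / 2
end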

There is no irrational in the second example.

I am surprised at the first result (1 allocation). I see the same allocation for all three cases (except that I use sqrt(0.5) for the correct case).

The second example shows that it isn’t really related to the irrationals.
Anyway, I don’t use sqrt(0.5) because in the case of Float32 it adds precision that I don’t need, and in the case of BigFloat it loses precision.

You’re right, mea culpa. The next thing to do would be to use the usual tools for detecting type instabilities (@code_warntype, Cthulhu.jl, etc.) to dig further.
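
For instance, something along these lines (the 1.0 passed to the pullback is just the seed for the scalar output):

y, back = Zygote.pullback(blsdelta_correct, S0, K, r, T, σ, d)
@code_warntype back(1.0)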

That’s because you’re splatting, I think. Splatting will always incur some allocations.
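
For example, reusing bsdelta and pars from the earlier post, a quick comparison would look something like this (exact numbers will vary):

@btime Flux.gradient(p -> bsdelta(p...), $pars)                # splatting a Vector allocates
@btime Flux.gradient(bsdelta, 1.0, 1.0, 0.02, 1.2, 0.2, 0.01)  # passing scalars directly should avoid most of it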

That’s why we have generic functions, which let you work with multiple different number types; e.g. sqrt(one(x) / 2) should neither add nor lose precision for Float32 and BigFloat.
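
A sketch of that pattern applied to the function above (normcdf_generic is just an illustrative name):

function normcdf_generic(x)
    # the constant takes its precision from the type of x
    return erfc(-x * sqrt(one(x) / 2)) / 2
end

normcdf_generic(1.0f0)     # all-Float32 arithmetic
normcdf_generic(big"1.0")  # full BigFloat precision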

Just as a comment, one(x) wouldn’t work for other number types, like complex or dual numbers.

At least for those two examples it works, but perhaps you have other constraints that weren’t expressed in the original post.

Back to the allocations, I had a look at both the generated Julia and LLVM IR with Cthulhu. It appears that replacing -x with a different operation (e.g. sin(x)) has the same outcome. Yet I’m only able to see one allocation at either the Julia or LLVM level. It feels like this is somehow hitting some compiler heuristic, but I can’t figure out what.
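
For anyone who wants to reproduce the inspection, the entry point is something like:

using Cthulhu
@descend Zygote.gradient(blsdelta_correct, S0, K, r, T, σ, d)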

I saw the application (calculating second-order greeks) and thus was not worried about precision beyond Float32 (it would be meaningless there).

[Not related to the post] Since when is the precision meaningless?
Greeks aren’t used just for hedging, where maybe you need precision up to cents.

If I delete exp(dt) from the functions, the allocations go back to normal. Do you have any idea what is wrong with it?
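
Concretely, something like this (blsdelta_noexp is just a name for this post):

function blsdelta_noexp(S0, K, r, T, σ, d)
    rt = r * T
    dt = -d * T
    sigma_sqrtT = σ * sqrt(T)
    S0_K = S0 / K
    d_1 = (log(S0_K) + rt + dt) / sigma_sqrtT + sigma_sqrtT / 2
    return normcdf_correct(d_1)  # same as blsdelta_correct, minus the exp(dt) factor
end

@btime Zygote.gradient(blsdelta_noexp, $S0, $K, $r, $T, $σ, $d);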

The amount of precision you need is a function of the application. From my experience in financial engineering, I have never encountered a case requiring precision beyond Float32 for delta, vega, gamma, vanna, theta, or rho. I am keen to hear about your use case which requires it.

[Again, not related to the post]
It is a library function and I don’t know the potential final use.
I can even imagine that somebody wants to do a Taylor expansion on top of it, in which case the precision is fundamental. Another example could be an implied-volatility calculator.

I don’t know exactly where it will sit in a generic pipeline, and forcing Float32 instead of letting the client choose is not a good idea, especially in a language like Julia where there is no need for that.