Wrong gradient from Zygote?

I have the following example where I get different gradient values for forward and backward:

using Random, ChainRulesCore, Zygote, Statistics

function simulate(N, drift, sigma, rng, i)
    # Refill N in place with standard normals; the RNG call itself is hidden from AD.
    ChainRulesCore.@ignore_derivatives randn!(rng, N)
    return @. sigma * N + drift
end

function simulate_bs(S0, r, T, sigma, d, nsim::Integer, nsteps::Integer, rng)
    mu = r - d
    # A zero with the element type of the inputs, so X can hold dual numbers in forward mode.
    zero_dual = ChainRulesCore.@ignore_derivatives 0 * mu * sigma * T
    dt = T / nsteps
    mu_adj = (mu - sigma^2 / 2) * dt
    sigma_adj = sigma * sqrt(dt)
    X = zeros(typeof(zero_dual), nsim)
    N = Array{Float64}(undef, nsim)
    ChainRulesCore.@ignore_derivatives Random.seed!(rng, 1)
    for i = 1:nsteps
        X += simulate(N, mu_adj, sigma_adj, rng, i)
    end
    @. S0 * exp(X)
end

function pricer_bs(S0, r, T, sigma, d, nsim::Integer, nsteps::Integer, rng)
    X_ = simulate_bs(S0, r, T, sigma, d, nsim, nsteps, rng)
    return mean(X_) * exp(-r * T)
end

function pricer_zygote_f(S0, r, T, sigma, d)
    Nsim1 = 100_000
    Nstep1 = 3
    rng1 = MersenneTwister()
    return pricer_bs(S0, r, T, sigma, d, Nsim1, Nstep1, rng1)
end

const S0 = 100.0;
const r = 0.01;
const T = 1.0;
const d = 0.01;
const sigma = 0.2;


fwd_grad = collect(Zygote.forward_jacobian(x -> pricer_zygote_f(x...), [S0, r, T, sigma, d])[2])
rev_grad = collect(Zygote.gradient(pricer_zygote_f, S0, r, T, sigma, d))
@show fwd_grad .- rev_grad

and I get a non-negligible mismatch in the derivatives with respect to T and sigma.
Do you have any idea what is going on?

This is the difference I get between forward and backward:

 -1.1102230246251565e-16
  2.842170943040401e-14
  0.014689082933301956
  0.1468908293330241
 -2.842170943040401e-14

Can you try with ForwardDiff? My guess is that Zygote.forward_jacobian is the issue. That function isn’t in the documentation for a reason.

I tried with DualNumbers and I get the same result as forward_jacobian.

Even with ForwardDiff (the difference is against fwd_grad):

ForwardDiff.gradient(x -> pricer_zygote_f(x...), [S0, r, T, sigma, d]) .- fwd_grad

I get:

 0.0
 0.0
 0.0
 0.0
 0.0

I suspect differences in how the two approaches operate are leading to the RNG being called (and possibly seeded) a different number of times; e.g., the number of pushforward calls far exceeds the number of pullback ones. If you try (just for testing) generating N once and fixing it for the duration of the simulation, you should be able to isolate whether this discrepancy in the number of RNG calls is the culprit.
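For instance, here is a minimal sketch of that test (my own illustration, with hypothetical names pricer_fixed_noise and Z, not code from this thread): draw all nsim × nsteps normals once, outside the differentiated function, so the number of RNG calls no longer depends on how many times each AD mode re-runs the primal.

using Random, Statistics, Zygote

const Z = randn(MersenneTwister(1), 100_000, 3)    # nsim × nsteps draws, fixed once

function pricer_fixed_noise(S0, r, T, sigma, d)
    nsim, nsteps = size(Z)
    dt = T / nsteps
    mu_adj = (r - d - sigma^2 / 2) * dt
    sigma_adj = sigma * sqrt(dt)
    X = sum(sigma_adj .* Z .+ mu_adj; dims = 2)    # accumulated log-return per path
    return mean(S0 .* exp.(X)) * exp(-r * T)
end

# Both modes now see exactly the same noise, so their results should agree
# up to floating-point error.
Zygote.gradient(pricer_fixed_noise, 100.0, 0.01, 1.0, 0.2, 0.01)
Zygote.forward_jacobian(x -> pricer_fixed_noise(x...), [100.0, 0.01, 1.0, 0.2, 0.01])[2]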


Yep, that appears to be the culprit:

julia> const N = rand(100_000);

julia> function simulate_bs(S0, r, T, sigma, d, nsim::Integer, nsteps::Integer, rng)
           mu = r - d
           zero_dual = ChainRulesCore.@ignore_derivatives 0 * mu * sigma * T
           dt = T / nsteps
           mu_adj = (mu - sigma^2 / 2) * dt
           sigma_adj = sigma * sqrt(dt)
           X = zeros(typeof(zero_dual), nsim)
           # N = Array{Float64}(undef, nsim)
           ChainRulesCore.@ignore_derivatives Random.seed!(rng, 1)
           for i = 1:nsteps
               X += simulate(N, mu_adj, sigma_adj, rng, i)
           end
           @. S0 * exp(X)
       end;

julia> function simulate(N, drift, sigma, rng, i)
           # ChainRulesCore.@ignore_derivatives randn!(rng, N)
           return @. sigma * N + drift
       end;

julia> fwd_grad=collect(Zygote.forward_jacobian(x->pricer_zygote_f(x...), [S0,r, T, sigma, d])[2]);

julia> rev_grad=collect(Zygote.gradient(pricer_zygote_f, S0,r, T, sigma, d));
julia> @show fwd_grad.-rev_grad
fwd_grad .- rev_grad = [2.220446049250313e-16; 1.4210854715202004e-14; -3.552713678800501e-15; -2.842170943040401e-14; -2.842170943040401e-14;;]
5×1 Matrix{Float64}:
  2.220446049250313e-16
  1.4210854715202004e-14
 -3.552713678800501e-15
 -2.842170943040401e-14
 -2.842170943040401e-14

What this means is that taking gradients of a function that includes randomness is tricky business. This article helped me understand things a bit better:


See also this lecture and the notes from our matrix-calculus class. For stochastic gradient descent (SGD) and similar algorithms (e.g. Adam), you only need a function whose expected value is the gradient of the expected value of your function.
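As a minimal toy sketch of that statement (my own example, not from the lecture or notes): once the noise is drawn outside the differentiated code, the gradient of a single random sample is an unbiased estimator of the gradient of the expectation, which is all SGD/Adam need.

using Zygote, Statistics, Random

f(θ, z) = (θ * z + 1)^2        # one noisy sample; E_z[f(θ, z)] = θ^2 + 1 for z ~ N(0, 1)

function stochastic_grad(θ, rng)
    z = randn(rng)             # draw the noise outside the differentiated closure
    return Zygote.gradient(θ -> f(θ, z), θ)[1]
end

rng = MersenneTwister(0)
θ = 0.7
# Each call is an unbiased (but noisy) estimate of d/dθ E_z[f(θ, z)] = 2θ;
# averaging many of them converges to 2 * 0.7 = 1.4.
mean(stochastic_grad(θ, rng) for _ in 1:100_000)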


It seems to me more like a Zygote-related issue. The number returned is simply wrong.

No, this is just a difference between forwards mode autodiff and reverse mode autodiff. The code path that calls rand is hit a different number of times when you run in reverse mode versus when you run in forwards mode.

This would happen with any autodiff system with both modes.

This is kinda the whole reason reverse mode exists in the first place. For a function with many inputs and few outputs, reverse mode can calculate the gradient with fewer function evaluations than forwards mode.


The randn function should be ignored in reverse mode, no?

In the derivative part, yes, but the primal (the non-gradient version) still gets called during the construction of the gradient, and the number of times it gets called differs between forward and reverse mode.

Here’s a demonstration with a simple counter in an @ignore_derivatives block:

julia> using Zygote, ChainRulesCore

julia> const counter = Ref(0);

julia> function foo(x::Vector)
           ChainRulesCore.@ignore_derivatives counter[] += 1
           sum(x)
       end;

julia> Zygote.gradient(foo, rand(100));

julia> counter[]
1

So when we calculate the gradient of foo here, we increment the counter only once.

Now let's reset the counter and look at forwards mode:

julia> counter[] = 0;

julia> Zygote.forward_jacobian(foo, rand(100));

julia> counter[]
9

So you see foo has been called 9 times!


Thank you!
Shouldn’t my example be more similar to the following?

using Zygote, ChainRulesCore

const counter = Ref(0);

function incr_counter()
    ChainRulesCore.@ignore_derivatives counter[] += 1
end

function foo(x::Vector)
    ChainRulesCore.@ignore_derivatives counter[] = 10  # as a seed
    for i in x
        incr_counter()
    end
    sum(x)
end