Hi I am trying to reproduce an algorithm based on Natural gradient computations (Natural Gradients in Practice - Salimbeni et al 2018).
The key computation is that you can get rid of the inverse Fisher Information matrix by replacing it with transformation derivatives, here is the relevant extract ;

The whole talk about reverse-mode relevant, as forward mode is available in Julia. However I was wondering if there was a way to compute this quantity in one pass instead of computing first \frac{\partial \xi}{\partial \theta} and then \frac{\partial \mathcal{L}}{\partial \eta}.

For precision here \xi are an arbitrary representation of the variational parameters, \theta are the natural parameters and \eta are the expectation parameters (first and second moment)

However I was wondering if there was a way to compute this quantity in one pass instead of computing first ∂ξ∂θ and then ∂L∂η .

I’m a little confused by this statement, since at no point do you ever actually instantiate ∂ξ∂θ. So could you elaborate a little on what you mean by this?

Well if we take a concrete example where \xi = (\mu, L) where q = \mathcal{N}(\mu, LL^\top), you still need to compute \frac{\partial (\mu, L)}{\partial \theta} right ?

Right, so the point here is that you don’t ever compute the jacobian(η) explicitly – instead, to compute the natural gradient w.r.t. the parameters of your preferred parametrisation ξ you do something like

Ah i think I get it better, you don’t compute the jacobian explicitly, instead you pass dL_dexpec as your final value and compute the result directly from there (as a jacobian vector product).
I cannot find such a function in ForwardDiff.jl, do you know what to use?

I cannot find such a function in ForwardDiff.jl, do you know what to use?

I’m actually not entirely sure either – I agree that I can’t immediately see it in the public API. Probably best to ask about ForwardDiff + jvps in #autodiff on Slack for a quick answer.

Forward-mode automatic differentiation libraries are perhaps less common than reverse-mode, but fortunately there is an elegant way to achieve forward-mode automatic differentiation using reverse-mode differentiation twice

Seriously?! Isn’t forward-mode AD orders of magnitude simpler to implement than reverse-mode?!

For jvp f’(x) * h, can’t you just differentiate f(x+t h) wrt t?

Yeah, it’s more of a comment on AD tools that the ML community tend to use, rather than AD generally – JAX is probably the first major bit of ML tooling that has forwards mode.

@willtebbutt to answer your concern with the problems of such a method, here is the kind of optimiser you need to use to make sure everything is okay :

struct IncreasingRate
α::Float64 # Maximum learning rate
γ::Float64 # Convergence rate to the maximum
state
end
IncreasingRate(α=1.0, γ=1e-8) = IncreasingRate(α, γ, IdDict())
function Optimise.apply!(opt::IncreasingRate, x, g)
t = get!(()->0, opt.state, x)
opt.state[x] += 1
return g .* opt.α * (1 - exp(-opt.γ * t))
end

Forward mode differentiation of f(x + t h) with respect to t only computes the product (the directional derivative). It does not compute the Jacobian matrix f'(x) first and then multiply it by h (which would be vastly less efficient).