Help with Jacobian vector product to get natural gradient

theogf · December 2, 2020, 2:03pm

Hi I am trying to reproduce an algorithm based on Natural gradient computations (Natural Gradients in Practice - Salimbeni et al 2018).
The key computation is that you can get rid of the inverse Fisher Information matrix by replacing it with transformation derivatives, here is the relevant extract ;

The whole talk about reverse-mode relevant, as forward mode is available in Julia. However I was wondering if there was a way to compute this quantity in one pass instead of computing first \frac{\partial \xi}{\partial \theta} and then \frac{\partial \mathcal{L}}{\partial \eta}.

For precision here \xi are an arbitrary representation of the variational parameters, \theta are the natural parameters and \eta are the expectation parameters (first and second moment)

willtebbutt · December 2, 2020, 2:54pm

However I was wondering if there was a way to compute this quantity in one pass instead of computing first ∂ξ∂θ and then ∂L∂η .

I’m a little confused by this statement, since at no point do you ever actually instantiate ∂ξ∂θ. So could you elaborate a little on what you mean by this?

theogf · December 2, 2020, 2:58pm

Well if we take a concrete example where \xi = (\mu, L) where q = \mathcal{N}(\mu, LL^\top), you still need to compute \frac{\partial (\mu, L)}{\partial \theta} right ?

theogf · December 2, 2020, 3:02pm

I actually wrote a basic script for it to make more clear what I am doing

    using Flux: destructure
    using LinearAlgebra
    using ForwardDiff: gradient, jacobian
    E, to_expec = destructure(meanvar_to_expec(μ, L))
    dL_dexpec = gradient(E) do E
        μ, L = expec_to_meanvar(to_expec(E)...)
        θ = L * randn(length(μ), nSamples) .+ μ
        sum(logπ, eachcol(θ)) / nSamples + logdet(L)
    end

    η, to_nat = destructure(meanvar_to_nat(μ, L))
    dξ_dη = jacobian(η) do η
        vec(nat_to_meanvar(to_nat(η)...))
    end

    ξ, to_meanvar = destructure((μ, L))
    nat_grad = dξ_dη * dL_dexpec
    Δμ, ΔL = to_meanvar(nat_grad)

I used destructure out of laziness

willtebbutt · December 2, 2020, 3:18pm

Right, so the point here is that you don’t ever compute the jacobian(η) explicitly – instead, to compute the natural gradient w.r.t. the parameters of your preferred parametrisation ξ you do something like

foward_mode_AD(natural_to_meanvar, theta, dL_dexpec)

which is equivalent to the jacobian-vector product in your code.

Does this clear things up, or am I missing the point?

theogf · December 2, 2020, 3:25pm

Ah i think I get it better, you don’t compute the jacobian explicitly, instead you pass dL_dexpec as your final value and compute the result directly from there (as a jacobian vector product).
I cannot find such a function in ForwardDiff.jl, do you know what to use?

willtebbutt · December 2, 2020, 3:27pm

Exactly.

I cannot find such a function in ForwardDiff.jl, do you know what to use?

I’m actually not entirely sure either – I agree that I can’t immediately see it in the public API. Probably best to ask about ForwardDiff + jvps in autodiff on Slack for a quick answer.

willtebbutt · December 2, 2020, 3:39pm

Note that if that doesn’t work you can presumably just use the reverse-mode trick from the paper.

antoine-levitt · December 2, 2020, 3:48pm

Forward-mode automatic differentiation libraries are perhaps less common than reverse-mode, but fortunately there is an elegant way to achieve forward-mode automatic differentiation using reverse-mode differentiation twice

Seriously?! Isn’t forward-mode AD orders of magnitude simpler to implement than reverse-mode?!

For jvp f’(x) * h, can’t you just differentiate f(x+t h) wrt t?

willtebbutt · December 2, 2020, 4:09pm

Yeah, it’s more of a comment on AD tools that the ML community tend to use, rather than AD generally – JAX is probably the first major bit of ML tooling that has forwards mode.

But yes, I agree that it’s a bit strange.

willtebbutt · December 2, 2020, 4:11pm

From @oxinabox on slack:

SHould be easy enough if just construct the dual numbers youself right?

duals = map(Dual, x, dx)

So the idea would be to write something like

out_duals = natural_to_meanvar(map(Dual, theta, dL_dexpec))

and then the desired natural gradient should be contained within the dual bits of out_duals.

YingboMa · December 2, 2020, 4:36pm

The derivative of f(t * a + c) wrt t at 0 is the Jacobian vector product J(f, c) * a.

julia> using ForwardDiff

julia> foo(x) = [x[1], x[1]*x[3], x[2]^2]
foo (generic function with 1 method)

julia> ForwardDiff.derivative(t->foo([1,2,3] * t + [3, 4, 5]), 0)
3-element Vector{Int64}:
  1
 14
 16

julia> ForwardDiff.jacobian(foo, [3,4,5]) * [1,2,3]
3-element Vector{Int64}:
  1
 14
 16

willtebbutt · December 2, 2020, 4:37pm

For jvp f’(x) * h, can’t you just differentiate f(x+t h) wrt t?

Sorry, didn’t appreciated this properly when you first wrote it!

theogf · December 2, 2020, 4:39pm

@willtebbutt to answer your concern with the problems of such a method, here is the kind of optimiser you need to use to make sure everything is okay :

struct IncreasingRate
    α::Float64 # Maximum learning rate
    γ::Float64 # Convergence rate to the maximum
    state
end

IncreasingRate(α=1.0, γ=1e-8) = IncreasingRate(α, γ, IdDict())

function Optimise.apply!(opt::IncreasingRate, x, g)
    t = get!(()->0, opt.state, x)
    opt.state[x] += 1
    return g .* opt.α * (1 - exp(-opt.γ * t))
end

MilkshakeForReal · February 16, 2022, 1:20am

Is it only computing the product? Or under the hood it computes the Jacobian first, and then computes the product

stevengj · February 16, 2022, 2:06am

Forward mode differentiation of f(x + t h) with respect to t only computes the product (the directional derivative). It does not compute the Jacobian matrix f'(x) first and then multiply it by h (which would be vastly less efficient).

MilkshakeForReal · February 18, 2022, 12:15am

Thanks for the explaination!

Topic		Replies	Views
Using ForwardDiff to compute the jacobian and gradient of a function that includes another function Numerics	0	545	November 19, 2019
How to optimize Jacobian ForwardDiff.jl Modelling & Simulations forwarddiff	2	417	March 20, 2023
ForwardDiff.jl: Compute J_f(x) * f(x) General Usage question	2	236	June 1, 2023
Nested AD with Lux etc Machine Learning ad	26	1276	May 1, 2024
Why does forward diff yield totally different gradients compared to my manual perturbation General Usage question , package , forwarddiff , autodiff , uncertainty-quantifi	1	80	April 10, 2025

Help with Jacobian vector product to get natural gradient

Related topics