I’m trying to find a way to efficiently compute the product of a transposed vector with a Jacobian matrix. Given a vector function f and two large vectors v and x of the same length, I want to compute v^T * jacobian(f)(x) without explicitly building the Jacobian matrix. Does anyone know how this can be done? Apparently it can be done in a way similar to the Jacobian-vector product jacobian(f)(x)*v, but I’m unable to make the connection.
PS: for the Jacobian-vector product, the trick is shown in the paper “Fast Exact Multiplication by the Hessian”, which observes that jacobian(f)(x)*v = d/dr f(x + r*v)|_{r=0}. Below is my simple implementation of it, but as mentioned I need the other way around, v^T * jacobian(f)(x).
using ForwardDiff
# Compute jacobian(f)(x)*v without building a jacobian
function jvp(f::Function, x::AbstractVector, v::AbstractVector)
    # jacobian(f)(x)*v = d/dr f(x + r*v) at r = 0 (see the paper referenced above)
    g = r -> f(x + r * v)
    return ForwardDiff.derivative(g, 0.0)
end
# I actually want v'*jacobian(f)(x)
vjp(f::Function, x::AbstractVector, v::AbstractVector) = ??
This is exactly what reverse-mode automatic differentiation does (e.g. if f maps R^n to R, you’re asking for the gradient at the cost of O(1) function evaluations). There are a few packages that do it, but it’s trickier than forward-mode diff.
using LinearAlgebra: dot

# The gradient of y -> dot(v, f(y)) at x is jacobian(f)(x)' * v, i.e. v' * jacobian(f)(x) as a column vector
function vjp(f::Function, x::AbstractVector, v::AbstractVector)
    g(y) = dot(v, f(y))
    ForwardDiff.gradient(g, x)
end
Though reverse mode autodiff would be much more efficient unless your vector x is short.
[Edit: Although it looks nice, I think this formulation amounts to computing the full Jacobian internally (possibly in batches when x is long… but nevertheless).]
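For completeness, here is a minimal sketch of an actual reverse-mode VJP using Zygote (the toy function f and the vector sizes below are just placeholders): Zygote.pullback returns the value of f together with a closure that maps a cotangent v to jacobian(f)(x)' * v without ever materializing the Jacobian.
using Zygote

# Reverse-mode VJP: back(v)[1] == jacobian(f)(x)' * v, the column form of v' * jacobian(f)(x)
function vjp_zygote(f, x::AbstractVector, v::AbstractVector)
    _, back = Zygote.pullback(f, x)
    return back(v)[1]
end

# Toy usage (placeholder function and sizes)
f(x) = x .^ 2 .+ sin.(x)
x, v = rand(10_000), rand(10_000)
vjp_zygote(f, x, v)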
Hello, thanks for the tips! So as you posted, the trick to use is that the gradient d/dx (v' * f(x)) equals jacobian(f, x)' * v, i.e. the transpose of v' * jacobian(f, x). The ForwardDiff approach already gives a speedup for large vectors for some reason, and using AutoGrad for reverse-mode diff, the real speedup comes into play. So if someone else needs this:
using AutoGrad
using LinearAlgebra: dot

# Compute v' * jacobian(f, x), returned as a column vector, i.e. jacobian(f, x)' * v
function vjp(f::Function, x::AbstractVector, v::AbstractVector)
    x = Param(x)                 # mark x as differentiable
    temp = @diff dot(v, f(x))    # record the tape for v' * f(x)
    return grad(temp, x)         # gradient w.r.t. x = jacobian(f, x)' * v
end
I also benchmarked the vjp and jvp functions above against the “create the full Jacobian and multiply” version for vectors of length 10000.
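In case anyone wants to reproduce the comparison, a rough sketch of the naive baseline plus a sanity check might look like this (the test function f is a placeholder, and it assumes the vjp from the AutoGrad snippet above is already in scope):
using ForwardDiff, LinearAlgebra, BenchmarkTools

# Naive baseline: build the full Jacobian, then multiply
vjp_naive(f, x, v) = ForwardDiff.jacobian(f, x)' * v

f(x) = x .^ 2 .+ sin.(x)   # placeholder test function
n = 1_000                  # the full Jacobian is n×n, so keep n moderate for the baseline
x, v = rand(n), rand(n)

# Sanity check: both should agree up to floating-point error
@assert vjp_naive(f, x, v) ≈ vjp(f, x, v)

@btime vjp_naive($f, $x, $v)
@btime vjp($f, $x, $v)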
I think it might just be that the Jacobian doesn’t need to be stored all at once: with ForwardDiff the columns are computed in batches (chunks), and each column can be reduced against v as it is generated. So improved cache efficiency, etc.
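If you want to experiment with that batching, ForwardDiff lets you set the chunk size explicitly; here is a small sketch (the chunk size 8 and the test function are arbitrary choices):
using ForwardDiff
using LinearAlgebra: dot

f(x) = x .^ 2 .+ sin.(x)   # placeholder test function
x, v = rand(10_000), rand(10_000)
g(y) = dot(v, f(y))

# A chunk of size N differentiates with respect to N components of x per pass,
# i.e. N "columns" of the Jacobian are formed and immediately reduced against v.
cfg = ForwardDiff.GradientConfig(g, x, ForwardDiff.Chunk{8}())
ForwardDiff.gradient(g, x, cfg)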
Thanks for your answer, but in general it’s best not to resuscitate threads from 6 years ago, because everyone gets pinged in the process. If you have specific questions, you can create a new topic for that.
Also do check out DifferentiationInterface.jl with its pullback function. Nowadays that’s probably the most generic solution.
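For what it’s worth, a minimal sketch with DifferentiationInterface might look like the following; this assumes a recent version where pullback takes a backend plus a tuple of cotangents (the exact signature has changed between releases, so check the DI docs), and AutoZygote() is just one possible reverse-mode backend:
using DifferentiationInterface
using ADTypes: AutoZygote
import Zygote                  # the chosen backend package must be loaded

f(x) = x .^ 2 .+ sin.(x)       # placeholder test function
x, v = rand(10_000), rand(10_000)

# Computes jacobian(f)(x)' * v without building the Jacobian;
# cotangents go in (and come back) as a 1-tuple in recent DI versions.
tx = pullback(f, AutoZygote(), x, (v,))
tx[1]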