Autodiff : Custom derivatives of vector functions for linear algebra

linearalgebra
#1

Is there a solution for automatic differentiation through code that contains a number of linear algebra operations including, for example, lyap and the matrix exponential (exp)? This type of linear algebra occurs in probabilistic programming of a likelihood based on a Kalman-filter type of computation.

The issue with these linear algebra functions is that they rely on Fortran libraries (BLAS, LAPACK), which autodiff cannot trace through. There are some options for pure-Julia linear algebra, but they are quite limited in scope. It seems that the way forward for these functions is a custom derivative. For one, this eliminates the issue of autodiff through Fortran code. Furthermore, the result could run faster, and potentially also be more accurate.

One solution is to define a custom Jacobian. This would be great and would help me to move forward. However, potentially there is a better solution based on the Fréchet derivative.

Some examples can illustrate these points using the matrix inverse inv. For a 40×40 matrix, I can compute the Jacobian in 80 milliseconds using:

using LinearAlgebra
using Random
using ForwardDiff

rng = Random.MersenneTwister(23445)
n = 40
A = randn(rng,n,n)
ForwardDiff.jacobian(inv,A)
#  79.573 ms (2015 allocations: 84.73 MiB)

Using the custom Jacobian:

function jacobian_inv(A)
  Ainv = inv(A)
  -kron(Ainv',Ainv)  # d vec(inv(A)) / d vec(A)
end

The same Jacobian is computed in 7 milliseconds, a 10x speed-up:

jacobian_inv(A)
# 7.074 ms (10 allocations: 39.10 MiB)
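As a quick sanity check (my own addition, not part of the original timing comparison), the closed-form Jacobian can be compared against a finite-difference approximation; the matrix sizes and helper names below are illustrative:

```julia
using LinearAlgebra
using Random

# Closed-form Jacobian of the matrix inverse, as above
function jacobian_inv(A)
  Ainv = inv(A)
  -kron(Ainv', Ainv)   # d vec(inv(A)) / d vec(A)
end

rng = Random.MersenneTwister(1)
n = 5
A = randn(rng, n, n) + n*I   # shift the spectrum to keep A well conditioned
J = jacobian_inv(A)

# Finite-difference approximation of the Jacobian column for entry A[i,j]
i, j = 2, 3
h = 1e-6
E = zeros(n, n); E[i, j] = h
fd_col = vec(inv(A + E) - inv(A)) / h

# vec(A)[i + (j-1)*n] corresponds to A[i,j]
@assert maximum(abs.(J[:, i + (j-1)*n] - fd_col)) < 1e-4
```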

However, the approach based on the Fréchet derivative can be more computationally efficient, as well as easier to formulate. In this approach, the derivative is not a number or a matrix but a linear function, and the chain rule is not a chain of multiplications but a composition of linear functions. The benefit is that in practice we often only need to evaluate this linear function at a single input, which reduces the computational load. In this sense, derivatives of vector-valued functions are quite different from scalar derivatives. As an illustration, consider the following function ftest:

n = 40
A = randn(rng,n,n)
B = randn(rng,n,n)
C = randn(rng,n,n)
ftest(x) = (C/(A+x[1]*B))[end,end] # = (C*inv(A+x[1]*B))[end,end]

The Fréchet derivative for the matrix inverse is given by:

function frechet_inv(A,Δ)
  F = lu(A)
  -(F\Δ/F)  # = -inv(A)*Δ*inv(A), without forming inv(A)
end

We can now perform differentiation of ftest using:

xtest = 1
(C*frechet_inv(A+xtest*B,B))[end,end]
# 68.750 μs (11 allocations: 101.52 KiB)

which runs in 69 μs, 100x faster than the custom Jacobian! (ForwardDiff takes 141 μs for this function).
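To double-check this result (a verification sketch I added, not part of the original post), the Fréchet-based derivative of ftest can be compared with a central finite difference in x:

```julia
using LinearAlgebra
using Random

# Fréchet derivative of the matrix inverse, as above
function frechet_inv(A, Δ)
  F = lu(A)
  -(F \ Δ / F)   # = -inv(A)*Δ*inv(A)
end

rng = Random.MersenneTwister(2)
n = 10
A = randn(rng, n, n) + n*I   # well-conditioned test matrices
B = randn(rng, n, n)
C = randn(rng, n, n)
ftest(x) = (C / (A + x*B))[end, end]

xtest = 1.0
deriv = (C * frechet_inv(A + xtest*B, B))[end, end]

# Central finite difference in x
h = 1e-6
deriv_fd = (ftest(xtest + h) - ftest(xtest - h)) / (2h)
@assert abs(deriv - deriv_fd) < 1e-5
```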

To summarize: Is there an existing package with which I can formulate a custom Jacobian, or, better yet, a custom Fréchet derivative? Besides supporting autodiff for linear algebra, this would also enable autodiff more generally for black-box vector-valued functions. In the absence of such a package, is there a developing package where I can bring up this request? Perhaps Capstan.jl is an option, although it is very early in its development cycle.

0 Likes

#2

What you are calling a Fréchet derivative appears to be very close to how backward-mode AD works. Most packages let you define custom gradients; with @grad in Flux you provide exactly the backward function (which gets sewn into the chain).

0 Likes

#3

Thanks for your comments! It is my understanding that Flux’s @grad only allows definition of custom gradients, i.e. derivatives for functions mapping vectors to scalars (ℝⁿ → ℝ). In contrast, the linear algebra operations require custom derivatives (Jacobian/Fréchet) for mappings from vectors to vectors (ℝⁿ → ℝᵐ).

0 Likes

#4

My understanding is that the next generation of Flux.Tracker, Zygote.jl, has recently become stable enough to start using. It is somewhat bleeding edge but certainly the future here (i.e., Capstan.jl was a proof of concept for the techniques used in Zygote.jl and may never exist beyond that stage).

You can look at the docs (http://fluxml.ai/Zygote.jl/dev/adjoints/); at first glance it looks like it could handle what you are trying to do. If you end up using it, I bet lots of people here would love to hear your thoughts.

0 Likes

#5

Custom gradients only require that the last function in a chain is ℝⁿ → ℝ; intermediate functions can be any ℝⁿ → ℝᵐ. Usually in backpropagation you don’t compute Jacobians explicitly; instead, you use functions similar to your frechet_inv to push derivatives back one step at a time.
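To make “pushing derivatives back one step at a time” concrete, here is a hand-rolled reverse pass for the ftest example from post #1 (a sketch with illustrative variable names, checked against a finite difference):

```julia
using LinearAlgebra
using Random

rng = Random.MersenneTwister(3)
n = 8
A = randn(rng, n, n) + n*I
B = randn(rng, n, n)
C = randn(rng, n, n)
x = 0.7

# Forward pass for ftest(x) = (C*inv(A + x*B))[end,end], keeping intermediates
M = A + x*B
Y = inv(M)
out = (C*Y)[end, end]

# Reverse pass: each step maps the incoming cotangent to the previous one
Zbar = zeros(n, n); Zbar[end, end] = 1.0   # seed at the scalar output
Ybar = C' * Zbar                           # back through Z = C*Y
Mbar = -Y' * Ybar * Y'                     # back through Y = inv(M)
xbar = sum(Mbar .* B)                      # back through M = A + x*B

# Check against a central finite difference
ftest(t) = (C / (A + t*B))[end, end]
h = 1e-6
@assert abs(xbar - (ftest(x + h) - ftest(x - h)) / (2h)) < 1e-5
```

Note that no Jacobian matrix is ever formed; each step only applies a linear map to a single cotangent.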

2 Likes

#6

@dfdx, I think what we need for linear algebra is a (Fréchet) derivative for a vector-valued function, so many of the existing custom gradient options don’t seem to fit the bill. @jlperla, thanks for the pointer to Zygote. This is tantalizingly close to what I am looking for, and I believe it is very well suited to include this functionality. The pullbacks have exactly the form of the Fréchet derivatives.

I have implemented my test case using Zygote, see below. Unfortunately, the custom adjoint that I define for the matrix inverse (desired method) does not give the correct result at this point.

Any further thoughts?

using LinearAlgebra
using Random
using BenchmarkTools
using Test
using Zygote
using Zygote: @adjoint

function frechet_inv(A,Δ)
  F = lu(A)
  -(F\Δ/F)  # = -inv(A)*Δ*inv(A), without forming inv(A)
end

ftest_inv(x) = (C*inv(A+x*B))[end,end]

# Desired method: define custom adjoint for the inverse
myinv(A) = inv(A)
@adjoint myinv(A) = myinv(A),cb -> (frechet_inv(A,cb),)
ftest_myinv(x) = (C*myinv(A+x*B))[end,end]

# Impractical method - just to show the correctness of the Frechet derivative
ftest_inv_custom(x) = (C*inv(A+x*B))[end,end]
@adjoint ftest_inv_custom(x) = ftest_inv_custom(x), cb -> ((C*frechet_inv(A+x*B,B))[end,end]*cb,)

Zygote.refresh()

rng = Random.MersenneTwister(23445)
n = 40
A = randn(rng,n,n)
B = randn(rng,n,n)
C = randn(rng,n,n)

@show size(A)
# size(A) = (40, 40)
xtest = 2.3

@show ftest_inv(xtest)

# Zygote
println("Zygote inv")
@show Zygote.gradient(ftest_inv,xtest)
y,back = Zygote.forward(ftest_inv,xtest)
@btime Zygote.forward($ftest_inv,$xtest)
# 71.220 μs (85 allocations: 73.50 KiB)

println("Zygote ftest_inv_custom - impractical")
@show Zygote.gradient(ftest_inv_custom,xtest)
y2,back_custom = Zygote.forward(ftest_inv_custom,xtest)
display(@test back_custom(1)[1]≈back(1)[1])
# Test passed

@btime Zygote.forward($ftest_inv_custom,$xtest)
# 61.345 μs (11 allocations: 71.14 KiB)

println("Zygote myinv - desired method")
@show Zygote.gradient(ftest_myinv,xtest)
y2,myback = Zygote.forward(ftest_myinv,xtest)

@test back(1)[1]≈myback(1)[1]
# Test failed
# Evaluated: -2302.6939568625844 ≈ 58.122801062141754

1 Like

#7

Note that your ftest_inv() is scalar-valued. The form of the other functions that ftest_inv() depends on doesn’t matter for custom gradients.

0 Likes

#8

Flux and other autodiff packages most definitely let you define gradients of vector-valued functions. You wouldn’t get very far without them; many of the functions you want to differentiate are vector- or higher-order-tensor-valued.

Take a look at the examples in https://github.com/FluxML/Flux.jl/blob/master/docs/src/internals/tracker.md#custom-gradients. The minus and multiply functions whose gradients are defined there return general tensors/arrays. In particular, the Δ in those definitions is not restricted to being a scalar.

1 Like

#9

Thanks! You are right. After your pointer I looked further into @grad and actually found its code for the custom gradient of the matrix inverse here. Fortunately, it is Julia, so we can easily read the source code 🙂. I should have done that earlier!

What confused me is (a) the name “gradient” (definition) and (b) the fact that what @grad expects is almost, but not quite, the Fréchet derivative. This is also what derailed my Zygote example above. For easy reference, here is the correct custom gradient for the matrix inverse:

inv(A::TrackedArray) = Tracker.track(inv, A)
@grad function inv(A)
    return inv(Tracker.data(A)), function (Δ)
        Ainv = inv(Tracker.data(A))  # untracked inverse inside the pullback
        ∇A = -Ainv' * Δ * Ainv'
        return (∇A, )
    end
end

0 Likes

#10

It is the transpose of the Fréchet derivative, because of the way reverse-mode differentiation works. The Fréchet derivative would be used directly in, e.g., ForwardDiff (although I don’t see where in the ForwardDiff docs you’re supposed to define custom derivatives).
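In forward mode, the Fréchet derivative does appear directly: differentiating t ↦ inv(A + t·Δ) at t = 0 recovers frechet_inv(A, Δ). A quick numerical check of this (my own sketch, using a finite difference in place of an AD tool):

```julia
using LinearAlgebra
using Random

# Fréchet derivative of the matrix inverse, from earlier in the thread
frechet_inv(A, Δ) = (F = lu(A); -(F \ Δ / F))

rng = Random.MersenneTwister(4)
n = 6
A = randn(rng, n, n) + n*I
Δ = randn(rng, n, n)

# Forward-mode (directional) derivative of inv along Δ, via central difference
h = 1e-6
D_fd = (inv(A + h*Δ) - inv(A - h*Δ)) / (2h)

@assert maximum(abs.(D_fd - frechet_inv(A, Δ))) < 1e-4
```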

1 Like

#11

Thanks. To make it fully explicit, this is the relationship, where F is the Fréchet derivative and G is the function used by @grad:

G(Δ) = (F(Δ'))'

Okay, I can work with that, but it is not very intuitive to me… It also leads to more transposing in G, as the example of the inverse shows, as well as the other functions in Tracker.jl/array.jl.
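For the matrix inverse this relationship is easy to verify numerically (a small check I added; F and G follow the notation above):

```julia
using LinearAlgebra
using Random

rng = Random.MersenneTwister(5)
n = 6
A = randn(rng, n, n) + n*I
Δ = randn(rng, n, n)
Ainv = inv(A)

F(X) = -Ainv * X * Ainv    # Fréchet derivative of inv at A
G(X) = (F(X'))'            # the relationship G(Δ) = (F(Δ'))'

# G matches the pullback form used by Tracker's @grad for inv
@assert G(Δ) ≈ -Ainv' * Δ * Ainv'
```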

0 Likes

#12

Again, that’s because Flux is made for reverse-mode autodiff (computing gradients); that’s the important primitive in that setting. The Fréchet derivative is the important primitive in forward-mode autodiff, so if you want that you should look into ForwardDiff.

2 Likes