Gradient of a gradient of a FastChain

What is the best way to differentiate an artificial neural network built by FastChain?

The objective is to compute the gradient of a neural network’s output with respect to its input, and the result should itself be differentiable.

Here is a dummy example.

using DiffEqFlux, Zygote
nn = FastChain(FastDense(1,32,tanh), FastDense(32,32,tanh), FastDense(32,1))
θ  = initial_params(nn)
Zygote.gradient(nn[1],0.1,θ)

It results in the following error

MethodError: no method matching getindex(::FastChain{Tuple{FastDense{typeof(tanh), DiffEqFlux.var"#initial_params#92"{Vector{Float32}}}, FastDense{typeof(tanh), DiffEqFlux.var"#initial_params#92"{Vector{Float32}}}, FastDense{typeof(identity), DiffEqFlux.var"#initial_params#92"{Vector{Float32}}}}}, ::Int64)

Stacktrace:
 [1] top-level scope
   @ In[3]:5
 [2] eval
   @ ./boot.jl:373 [inlined]
 [3] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)
   @ Base ./loading.jl:1196

Gradient of what? You’re not taking the gradient of a function.

Zygote.gradient(x -> nn(x,θ)[1],[0.1])
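
For reference, Zygote.gradient returns a 1-tuple holding the gradient with respect to the (vector) argument, so the scalar derivative at 0.1 can be pulled out like this (a minimal illustrative snippet, reusing the same model definition):

using DiffEqFlux, Zygote
nn = FastChain(FastDense(1,32,tanh), FastDense(32,32,tanh), FastDense(32,1))
θ  = initial_params(nn)
g  = Zygote.gradient(x -> nn(x,θ)[1], [0.1])  # a 1-tuple containing a 1-element vector
dnn_dx = g[1][1]                              # scalar derivative ∂nn/∂x at x = 0.1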

Thanks for providing the correct syntax. There is still a problem with DiffEqFlux.sciml_train; it might be related to differentiability, or it may again be a mistake in the syntax.

using DiffEqFlux, Zygote, LinearAlgebra
nn = FastChain(FastDense(1,32,tanh), FastDense(32,32,tanh), FastDense(32,1))
θ  = initial_params(nn)
cost(nn, θ) = reduce(vcat,first.(Zygote.gradient(x -> nn(x,θ)[1],[0.1])))
loss(θ) = cost(nn,θ)
DiffEqFlux.sciml_train(loss, ADAM, θ, maxiters=2)

Any idea how to fix this?

Btw, the input of the loss() function is the optimization parameter vector θ. Is there a way to use the cost() function directly in DiffEqFlux.sciml_train?

The loss should be a scalar.


Isn’t it a scalar in the previous example?

Are you trying to take a gradient of a gradient of a FastChain?

using DiffEqFlux, Zygote, LinearAlgebra
nn = FastChain(FastDense(1,32,tanh), FastDense(32,32,tanh), FastDense(32,1))
θ  = initial_params(nn)
cost(nn, θ) = Zygote.gradient(x -> nn(x,θ)[1],[0.1])[1]
loss(θ) = cost(nn,θ)
DiffEqFlux.sciml_train(loss, θ, ADAM(0.1), maxiters=2)

That’s very different from a gradient of a FastChain.


To be precise, I want to use a gradient of a FastChain in a loss function. Many times the AD was failing at DiffEqFlux.sciml_train(), which basically means that the gradient of a gradient is failing somewhere. Any idea how to fix the simple example?

Okay yes, this is a gradient of a gradient example, not a gradient example. The gradient of a gradient of a FastChain won’t work because the FastChain adjoint uses mutation. It could be specialized to handle this case, but because this is almost never an efficient way to calculate the second derivative (forward-over-adjoint is just better in almost all respects) I’m not sure it’s a high priority to specialize this.
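
As a rough sketch of what forward-over-adjoint looks like on a plain scalar function (the test function and the hvp helper below are just illustrative, not the FastChain case): the inner gradient is reverse mode via Zygote, and the outer derivative is a single forward-mode sweep via ForwardDiff along a direction v, which gives a Hessian-vector product without ever doing reverse-over-reverse.

using ForwardDiff, Zygote

f(x) = sum(sin, x) + sum(abs2, x)   # arbitrary scalar test function; its Hessian is Diagonal(2 .- sin.(x))

# Forward-over-adjoint Hessian-vector product: push a dual number through the
# reverse-mode gradient in the direction v.
hvp(f, x, v) = ForwardDiff.derivative(ε -> Zygote.gradient(f, x .+ ε .* v)[1], 0.0)

x = [0.3, 0.7, 1.1]
v = [1.0, 0.0, 0.0]
hvp(f, x, v)  # ≈ (2 .- sin.(x)) .* v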

(And BTW, taking a gradient of a gradient is a TensorFlow misnomer. It’s actually the Jacobian of a gradient, unless it’s a scalar function and thus the second derivative; otherwise the sizes don’t align. TensorFlow silently makes gradient = sum of the Jacobian, as I describe here: Gradient of Gradient in Zygote - #3 by ChrisRackauckas. You should really double check whether that summation is the interpretation you wanted.)
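
A small illustration of that summation point, using nested ForwardDiff so it sidesteps the Zygote-over-Zygote problem entirely (the function here is just a toy): the “gradient of the gradient” of a scalar function is really the gradient of the summed gradient, i.e. the Hessian times a vector of ones, not the Hessian itself.

using ForwardDiff

f(x) = x[1]^2 * x[2] + sin(x[2])
x = [1.0, 2.0]

H = ForwardDiff.hessian(f, x)
g_of_g = ForwardDiff.gradient(y -> sum(ForwardDiff.gradient(f, y)), x)
g_of_g ≈ H * ones(2)   # true: the implicit sum collapses the Jacobian of the gradient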


reduce(vcat, ...) won’t return a scalar in general.
But requiring a gradient of a gradient with the fast chains is a secondary issue, as Chris R pointed out.


Interesting. Can you point to a source where I can understand this? I’ve written rrules for rrules with great success (I thought …), with performance reasonably close to the first rrule, which indicated to me that any other approach couldn’t really be more performant… But maybe I have a special case, or maybe I misunderstood something?

EDIT: never mind, I see your link above has some discussion … I will peruse it.

Is this topic relevant? The model in their example is based on Chain instead of FastChain.

Thanks for the additional information on TensorFlow. Maybe this is one of the reasons why people in PINNs use multiple neural networks, one for each variable? A sum over the Jacobian is clearly not the best approach.

Understood. But in this example there is only one element in first.(Zygote.gradient(x -> nn(x,θ)[1],[0.1])), so the output of reduce(vcat, ...) is a scalar. Is this still a problem here? Removing reduce(vcat, ...) changes the error.

It’s the same issue: in both cases what’s happening is that the adjoint definition itself is not Zygote-differentiable, which is where the double differentiation problem comes into play. The solution is also the same in both cases: you probably want to use forward-over-adjoint anyway (replying to @cortner, IIRC there’s a lengthy discussion in Griewank’s AD book about how double reverse mode is almost never optimal), and that thread shows how to do FoA second derivatives by mixing ForwardDiff.jl into Zygote.jl. So my suggestion is the same. In this case it’s actually not too hard to fix the double adjoint, but you still wouldn’t want to use it even if I fixed it, so… ehh… maybe later.

Yes, we discuss this a bit in the weird paper-y write-up thingy we wrote on NeuralPDE.jl (not quite a paper, not quite a review, but full of relevant information). Splitting into separate neural networks decreases the asymptotic cost of differentiation for systems of PDEs. You could still use a few tricks to make the neural networks share weights/layers in this form BTW (I should write an example for how to do that), but putting the networks together both has a higher cost and a tendency to compute things that aren’t needed (for example, if you want AD to compute the 2nd derivative it would compute the second derivative w.r.t. every output, even if only one dependent variable was undergoing diffusion).
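
One possible sketch of that weight-sharing trick (all names and the particular split of θ below are just illustrative): build two heads on top of a shared FastChain trunk and have both read the trunk’s slice of the same parameter vector.

using DiffEqFlux

trunk  = FastChain(FastDense(1, 32, tanh), FastDense(32, 32, tanh))
head_u = FastDense(32, 1)
head_v = FastDense(32, 1)

n_trunk = length(initial_params(trunk))
n_head  = length(initial_params(head_u))
θ = vcat(initial_params(trunk), initial_params(head_u), initial_params(head_v))

# Both outputs evaluate the trunk with θ[1:n_trunk], so those weights are shared.
u(x, θ) = head_u(trunk(x, θ[1:n_trunk]), θ[n_trunk+1:n_trunk+n_head])
v(x, θ) = head_v(trunk(x, θ[1:n_trunk]), θ[n_trunk+n_head+1:end])

u([0.1], θ), v([0.1], θ)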

So yes, there’s a lot going on in AD space to improve PINNs. Mixing the networks is not asymptotically good. Reverse over reverse is not good, so you want to mix forward with reverse. Standard AD mixing without extra tricks is not good to higher order (see the stuff on Taylor-mode AD: you need to use things like that to make 3rd derivatives scale much better, and even the heat equation has 3rd derivatives once you count the extra derivative from the loss function; this is why Diffractor.jl exists). The Julia Lab will be spending a good part of next year demonstrating how handling all of these more effectively together improves PINN training performance, but it’s not all ready right now.


How is it possible to use FoA with sciml_train()? Based on the simple example provided earlier, the cost function is using Zygote.gradient(). Is there an option in sciml_train() to use ForwardDiff in the optimization process?

It’s the same thing. Just use ForwardDiff.jl in the cost function.

Do you mean something like this:

using DiffEqFlux, ForwardDiff, LinearAlgebra
nn = FastChain(FastDense(1,32,tanh), FastDense(32,32,tanh), FastDense(32,1))
θ  = initial_params(nn)
cost(θ_) = ForwardDiff.gradient(x -> nn(x,θ_)[1],[0.1])[1]
DiffEqFlux.sciml_train(cost, θ, ADAM(0.1), maxiters=2)

It results in the following error.

MethodError: Cannot `convert` an object of type Nothing to an object of type Float32

using DiffEqFlux, ForwardDiff, Zygote, LinearAlgebra
nn = FastChain(FastDense(1,32,tanh), FastDense(32,32,tanh), FastDense(32,1))
θ  = initial_params(nn)
function cost(θ_)
  f = x -> nn(x,θ_)[1]
  x = [0f-1]
  sum(ForwardDiff.gradient(f, x))
end
Zygote.gradient(cost,θ) # nothing

That shouldn’t be a nothing. It looks like it’s an issue with Zygote’s overload of ForwardDiff.gradient. MWE:

using Zygote, ForwardDiff
θ  = rand(2,2)
function cost(θ_)
  f = x -> sum(θ_*x)
  x = [1.0,2.0]
  sum(ForwardDiff.gradient(f, x))
end
Zygote.gradient(cost,θ) # nothing

@mcabbott it looks like this would’ve been covered by the same thing as https://github.com/FluxML/Zygote.jl/issues/953#issuecomment-841882071 . Maybe you have an idea what’s going on here?


Yes, ForwardDiff.gradient(f, x) does not keep a gradient with respect to f.

Maybe that could be done better; there was a thread of attempts somewhere.

Well interesting. @simeonschaub is this a good Diffractor case or still too early?

Is there any update for this thread? Do we have a friendly way to use the derivative of a neural network in the cost function?
