To be precise, I want to use a gradient of a FastChain in a loss function. Many of the times AD was failing at the DiffEqFlux.sciml_train(), which basically means that the gradient of a gradient is failing somewhere. Any idea how to fix the simple example?
Okay yes, this is a gradient of a gradient example, not a gradient example. The gradient of a gradient of a FastChain won’t work because the FastChain adjoint uses mutation. It could be specialized to handle this case, but because this is almost never an efficient way to calculate the second derivative (forward-over-adjoint is just better in almost all respects) I’m not sure it’s a high priority to specialize this.
(And BTW, taking a gradient of gradient is a TensorFlow misnomer. It’s actually Jacobian of a gradient unless it’s a scalar function and thus the second derivative. Otherwise the sizes don’t align. TensorFlow silently makes gradient = sum of Jacobian as I describe here Gradient of Gradient in Zygote - #3 by ChrisRackauckas You should really double check whether that summation is the interpretation you wanted)
Interesting. Can you point to a source where I can understand this? I’ve written rrules for rrules with great success (I thought …) with performance reasonably close to the first rrule which indicated to me that any other approach couldn’t really be more performant… But maybe I have a special case, or maybe I misunderstood something?
EDIT: never mind, I see your link above has some discussion … I will peruse it.
Is this topic relevant? The model in their example is based on Chain instead of FastChaint.
Thanks for the additional information on TensorFlow. Maybe this is one of the reasons why people in PINNs use multiple neural networks for estimating each variable? Sum of Jacobian is clearly not the best approach.
Understood. But, in this example, there is only one element in first.(Zygote.gradient(x -> nn(x,θ),[0.1])), so the output of reduce(vcat, ...) is scalar. Is this still a problem here? Removing reduce(vcat, ...) changes the error.
It’s the same issue as in both cases what’s happening is that the adjoint definition itself is not Zygote differentiable, which is where the double differentiation issue comes into play. But then the solution is the same in both cases that you probably want to use Forward-over-Adjoint anyways (replying to @cortner, IIRC there’s a lengthy discussion in Griewank’s AD book about how double reverse mode is almost never optimal), and that thread shows how to do FoA second derivatives via mixing ForwardDiff.jl into Zygote.jl. So my suggestion is the same. In this case, it’s actually not too hard to fix the double adjoint, but you still wouldn’t want to use it if I fixed that so… ehh… maybe later.
Yes, we discuss this a bit in the weird paper-y write-up thingy we wrote on NeuralPDE.jl (not quite a paper, not quite a review, but full of relevant information). Splitting into separate neural networks decreases the asymptotic cost of differentiation for systems of PDEs. You could still use a few tricks to make the neural networks share weights/layers in this form BTW (I should write an example for how to do that), but putting the networks together both has a higher cost and a tendency to compute things that aren’t needed (for example, if you want AD to compute the 2nd derivative it would compute the second derivative w.r.t. every output, even if only one dependent variable was undergoing diffusion).
So yes, there’s a lot going on in AD space to improve PINNs. Mixing the networks it not asymptotically good. Reverse over reverse is not good, so you want to mix forward with reverse. Standard AD mixing without extra tricks is not good to higher order (see the stuff on Taylor-mode AD: you need to use things like that in order to make 3rd derivatives scale much better, and even the heat equation has 3rd derivatives when you consider the one derivative for the loss function, this is why Diffractor.jl exists). The Julia Lab will be spending a good part of next year to demonstrate how handling all of these more effectively together improves PINN training performance, but it’s not all ready right now.
How is it possible to use FoA with sciml_train()? Based on the simple example provided earlier, the cost function is using Zygote.gradient(). Is there an option in sciml_train() to use ForwardDiff in the optimization process?
using DiffEqFlux, ForwardDiff, LinearAlgebra
nn = FastChain(FastDense(1,32,tanh), FastDense(32,32,tanh), FastDense(32,1))
θ = initial_params(nn)
f = x -> nn(x,θ_)
x = [0f-1]
Zygote.gradient(cost,θ) # nothing
That shouldn’t be a nothing. It looks like it’s an issue with Zygote’s overload of ForwardDiff.gradient. MWE:
θ = rand(2,2)
f = x -> sum(θ_*x)
x = [1.0,2.0]
Zygote.gradient(cost,θ) # nothing