Gradient of a gradient of a FastChain

mahdiar · December 31, 2021, 7:56pm

What is the best way to differentiate an artificial neural network built by FastChain?

The objective is to take the gradient of the neural network output with respect to its input. The result should itself be differentiable.

Here is a dummy example.

using DiffEqFlux, Zygote
nn = FastChain(FastDense(1,32,tanh), FastDense(32,32,tanh), FastDense(32,1))
θ  = initial_params(nn)
Zygote.gradient(nn[1],0.1,θ)

It results in the following error

MethodError: no method matching getindex(::FastChain{Tuple{FastDense{typeof(tanh), DiffEqFlux.var"#initial_params#92"{Vector{Float32}}}, FastDense{typeof(tanh), DiffEqFlux.var"#initial_params#92"{Vector{Float32}}}, FastDense{typeof(identity), DiffEqFlux.var"#initial_params#92"{Vector{Float32}}}}}, ::Int64)

Stacktrace:
 [1] top-level scope
   @ In[3]:5
 [2] eval
   @ ./boot.jl:373 [inlined]
 [3] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)
   @ Base ./loading.jl:1196

ChrisRackauckas · January 1, 2022, 12:34am

Gradient of what? You’re not taking the gradient of a function.

Zygote.gradient(x -> nn(x,θ)[1],[0.1])

mahdiar · January 1, 2022, 4:29pm

Thanks for providing the correct syntax. There is still a problem with DiffEqFlux.sciml_train; it might be related to differentiability, or it might again be a syntax mistake.

using DiffEqFlux, Zygote, LinearAlgebra
nn = FastChain(FastDense(1,32,tanh), FastDense(32,32,tanh), FastDense(32,1))
θ  = initial_params(nn)
cost(nn, θ) = reduce(vcat,first.(Zygote.gradient(x -> nn(x,θ)[1],[0.1])))
loss(θ) = cost(nn,θ)
DiffEqFlux.sciml_train(loss, ADAM, θ, maxiters=2)

Any idea how to fix this?

By the way, the input of the loss() function is the optimization parameter vector θ. Is there a way to use the cost() function directly in DiffEqFlux.sciml_train?

Elrod · January 1, 2022, 4:33pm

The loss should be a scalar.

mahdiar · January 1, 2022, 8:08pm

Isn’t it a scalar in the previous example?

ChrisRackauckas · January 1, 2022, 9:00pm

Are you trying to take a gradient of a gradient of a FastChain?

using DiffEqFlux, Zygote, LinearAlgebra
nn = FastChain(FastDense(1,32,tanh), FastDense(32,32,tanh), FastDense(32,1))
θ  = initial_params(nn)
cost(nn, θ) = Zygote.gradient(x -> nn(x,θ)[1],[0.1])[1]
loss(θ) = cost(nn,θ)
DiffEqFlux.sciml_train(loss, θ, ADAM(0.1), maxiters=2)

That’s very different from a gradient of a FastChain.

mahdiar · January 1, 2022, 9:10pm

To be precise, I want to use the gradient of a FastChain inside a loss function. In many cases AD fails inside DiffEqFlux.sciml_train(), which basically means that the gradient of a gradient is failing somewhere. Any idea how to fix the simple example?

ChrisRackauckas · January 1, 2022, 9:23pm

Okay yes, this is a gradient of a gradient example, not a gradient example. The gradient of a gradient of a FastChain won’t work because the FastChain adjoint uses mutation. It could be specialized to handle this case, but because this is almost never an efficient way to calculate the second derivative (forward-over-adjoint is just better in almost all respects) I’m not sure it’s a high priority to specialize this.

(And BTW, taking a gradient of gradient is a TensorFlow misnomer. It’s actually Jacobian of a gradient unless it’s a scalar function and thus the second derivative. Otherwise the sizes don’t align. TensorFlow silently makes gradient = sum of Jacobian, as I describe here: "Gradient of Gradient in Zygote" (post #3 by ChrisRackauckas). You should really double check whether that summation is the interpretation you wanted.)
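
To spell out the size argument in the parenthetical above, here is a short worked version. The symbols $f$, $u$, $J$, $H$ and $\mathbf{1}$ are notation introduced only for this illustration and do not appear elsewhere in the thread.

$$
f:\mathbb{R}^n \to \mathbb{R}^m, \qquad J_{ij} = \frac{\partial f_i}{\partial x_j}, \qquad J \in \mathbb{R}^{m \times n}
$$

A gradient is only defined for $m = 1$. A "gradient" that silently sums over the outputs instead returns

$$
\nabla_x \Big( \sum_{i=1}^{m} f_i(x) \Big) = J^\top \mathbf{1} = \sum_{i=1}^{m} \nabla f_i(x) \in \mathbb{R}^n .
$$

Applying this to $f = \nabla u$ for a scalar function $u:\mathbb{R}^n \to \mathbb{R}$, the Jacobian is the Hessian $H$, so the "gradient of the gradient" becomes

$$
(H^\top \mathbf{1})_j = \sum_{i=1}^{n} \frac{\partial^2 u}{\partial x_i \, \partial x_j},
$$

which is a vector of Hessian column sums. It coincides with the second derivative $u''(x)$ only when $n = 1$.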

Elrod · January 2, 2022, 4:32am

reduce(vcat, ...)

Won’t return a scalar in general.
But requiring a gradient of a gradient with the fast chains is a secondary issue, as Chris R pointed out.

cortner · January 2, 2022, 5:53am

Interesting. Can you point to a source where I can understand this? I’ve written rrules for rrules with great success (I thought …) with performance reasonably close to the first rrule which indicated to me that any other approach couldn’t really be more performant… But maybe I have a special case, or maybe I misunderstood something?

EDIT: never mind, I see your link above has some discussion … I will peruse it.

mahdiar · January 2, 2022, 4:13pm

ChrisRackauckas:

Okay yes, this is a gradient of a gradient example, not a gradient example. The gradient of a gradient of a FastChain won’t work because the FastChain adjoint uses mutation. It could be specialized to handle this case, but because this is almost never an efficient way to calculate the second derivative (forward-over-adjoint is just better in almost all respects) I’m not sure it’s a high priority to specialize this.

(And BTW, taking a gradient of gradient is a TensorFlow misnomer. It’s actually Jacobian of a gradient unless it’s a scalar function and thus the second derivative. Otherwise the sizes don’t align. TensorFlow silently makes gradient = sum of Jacobian, as I describe here: "Gradient of Gradient in Zygote" (post #3 by ChrisRackauckas). You should really double check whether that summation is the interpretation you wanted.)

Is that linked topic relevant here? The model in its example is built with Chain rather than FastChain.

Thanks for the additional information on TensorFlow. Maybe this is one of the reasons why people working on PINNs use a separate neural network to estimate each variable? Sum of Jacobian is clearly not the best approach.

mahdiar · January 2, 2022, 4:21pm

Understood. But, in this example, there is only one element in first.(Zygote.gradient(x -> nn(x,θ)[1],[0.1])), so the output of reduce(vcat, ...) is scalar. Is this still a problem here? Removing reduce(vcat, ...) changes the error.

ChrisRackauckas · January 2, 2022, 4:51pm

It’s the same issue: in both cases the adjoint definition itself is not Zygote-differentiable, which is where the double differentiation problem comes in. The solution is also the same in both cases: you probably want to use forward-over-adjoint (FoA) anyway (replying to @cortner: IIRC there’s a lengthy discussion in Griewank’s AD book about how double reverse mode is almost never optimal), and that thread shows how to do FoA second derivatives by mixing ForwardDiff.jl into Zygote.jl. So my suggestion is the same. In this case it’s actually not too hard to fix the double adjoint, but you still wouldn’t want to use it even if I fixed that, so… ehh… maybe later.
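
For readers who have not seen the pattern, here is a minimal sketch of forward-over-adjoint on a plain Julia function, not on a FastChain. The function `u` and the helper names `grad_u` and `hess_u` are hypothetical and chosen only for illustration. The inner pass uses reverse mode (Zygote.gradient) and the outer pass uses forward mode (ForwardDiff.jacobian).

```julia
using ForwardDiff, Zygote

# Hypothetical scalar test function (not from the thread).
# Analytically, its Hessian is diagonal with entries 6 .* x.
u(x) = sum(x .^ 3)

# Inner pass: reverse mode (Zygote) computes the gradient of u.
grad_u(x) = Zygote.gradient(u, x)[1]

# Outer pass: forward mode (ForwardDiff) differentiates that gradient,
# which yields the full Hessian of u.
hess_u(x) = ForwardDiff.jacobian(grad_u, x)

x = [1.0, 2.0]
H = hess_u(x)
```

Zygote.hessian(u, x) should give the same matrix and can serve as a cross-check. Note that this sketch differentiates with respect to the input x only. It does not address the separate problem discussed later in the thread, where a gradient with respect to parameters captured in a closure is lost.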

Yes, we discuss this a bit in the weird paper-y write-up thingy we wrote on NeuralPDE.jl (not quite a paper, not quite a review, but full of relevant information). Splitting into separate neural networks decreases the asymptotic cost of differentiation for systems of PDEs. You could still use a few tricks to make the neural networks share weights/layers in this form BTW (I should write an example for how to do that), but putting the networks together both has a higher cost and a tendency to compute things that aren’t needed (for example, if you want AD to compute the 2nd derivative it would compute the second derivative w.r.t. every output, even if only one dependent variable was undergoing diffusion).

So yes, there’s a lot going on in the AD space to improve PINNs. Mixing the networks is not asymptotically good. Reverse-over-reverse is not good, so you want to mix forward with reverse. Standard AD mixing without extra tricks does not scale well to higher order (see the work on Taylor-mode AD: you need things like that to make 3rd derivatives scale much better, and even the heat equation involves 3rd derivatives once you count the one derivative for the loss function; this is why Diffractor.jl exists). The Julia Lab will be spending a good part of next year demonstrating how handling all of these more effectively together improves PINN training performance, but it’s not all ready right now.

mahdiar · January 2, 2022, 6:07pm

How is it possible to use FoA with sciml_train()? Based on the simple example provided earlier, the cost function is using Zygote.gradient(). Is there an option in sciml_train() to use ForwardDiff in the optimization process?

ChrisRackauckas · January 3, 2022, 2:58am

It’s the same thing. Just use ForwardDiff.jl in the cost function.

mahdiar · January 3, 2022, 4:54pm

Do you mean something like this:

using DiffEqFlux, ForwardDiff, LinearAlgebra
nn = FastChain(FastDense(1,32,tanh), FastDense(32,32,tanh), FastDense(32,1))
θ  = initial_params(nn)
cost(θ_) = ForwardDiff.gradient(x -> nn(x,θ_)[1],[0.1])[1]
DiffEqFlux.sciml_train(cost, θ, ADAM(0.1), maxiters=2)

It results in the following error.

MethodError: Cannot `convert` an object of type Nothing to an object of type Float32

ChrisRackauckas · January 5, 2022, 2:52am

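using DiffEqFlux, ForwardDiff, Zygote, LinearAlgebra
nn = FastChain(FastDense(1,32,tanh), FastDense(32,32,tanh), FastDense(32,1))
θ  = initial_params(nn)
function cost(θ_)
  f = x -> nn(x,θ_)[1]             # scalar network output as a function of the input
  x = [0.1f0]                      # Float32 input, same value as in the earlier examples
  sum(ForwardDiff.gradient(f, x))  # forward-mode derivative with respect to the input
end
Zygote.gradient(cost,θ) # nothing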

That shouldn’t be a nothing. It looks like it’s an issue with Zygote’s overload of ForwardDiff.gradient. MWE:

using Zygote, ForwardDiff
θ  = rand(2,2)
function cost(θ_)
  f = x -> sum(θ_*x)
  x = [1.0,2.0]
  sum(ForwardDiff.gradient(f, x))
end
Zygote.gradient(cost,θ) # nothing

@mcabbott it looks like this would’ve been covered by the same thing as https://github.com/FluxML/Zygote.jl/issues/953#issuecomment-841882071 . Maybe you have an idea what’s going on here?

mcabbott · January 5, 2022, 3:16am

Yes, ForwardDiff.gradient(f, x) does not keep a gradient with respect to f.

Maybe that could be done better, there was a thread of attempts somewhere.

ChrisRackauckas · January 5, 2022, 10:57am

Well interesting. @simeonschaub is this a good Diffractor case or still too early?

mahdiar · February 8, 2022, 3:58pm

Is there any update for this thread? Do we have a friendly way to use the derivative of a neural network in the cost function?
