Gradient of gradient

jlmaccal · November 6, 2020, 1:36am

I’m implementing a model that requires using the gradients of a feedforward network with respect to its inputs as part of the loss function. I then need to train the network by differentiating the loss with respect to the parameters of the network. I can’t seem to get this working due to ERROR: Mutating arrays is not supported.

Here is a minimal example. In my actual model I need to do something more complicated than simply summing the gradients, but this captures the error.

using Flux, Zygote

net = Dense(10, 1)
x = randn(10, 128)  # dims, batch

function pred(x, net)
    y, pullback = Zygote.pullback(net, x)
    grads = pullback(fill!(similar(y), 1))[1]
    return grads
end

gradient(() -> sum(pred(x, net)), params(net))

I’m quite comfortable with Python/PyTorch, but I’m feeling totally lost with Julia/Flux. What is the right way to do this? This is superficially similar to the gradient penalty in WGAN-GP, but I can’t seem to find a Flux implementation.

ChrisRackauckas · November 6, 2020, 2:11am

You can just stack different ADs: ReverseDiff over Zygote is a good combination. The new AD has some nice extra compiler optimizations for this, but it’s not quite ready yet.

xiaodai · November 6, 2020, 2:11am

This is quite odd.

jlmaccal · November 6, 2020, 2:23am

Could you expand on this? I’m new to the ecosystem and a bit confused about what all of the pieces are and how they fit together.

jlmaccal · November 6, 2020, 2:24am

My network takes in a batch of inputs and produces a single scalar. I’m trying to get dout/din batch-wise. Is there a better / more idiomatic way to do this?

xiaodai · November 6, 2020, 3:56am

The code is odd because it’s filling an array with 1 and then calling the pullback, which computes the gradient. So the gradient has nothing to do with the inputs.

Perhaps it’s easier if you show your original code in PyTorch. In general, I found PyTorch to be more robust, so I have moved to PyTorch.

jlmaccal · November 6, 2020, 4:29am

This is a new project, so I don’t have any PyTorch code.

Here is what I am trying to calculate. This is for a single sample, but I would like to do this over a batch.

y = x_1 \cdot \nabla F(x_2) + x_1^T\xi(x_2)x_1,

where x_1 and x_2 are input vectors, F is a neural network that outputs a scalar, and \xi is a network that outputs a PSD matrix. The loss is the mean squared error between the prediction y and the observation \hat y. I want to minimize the loss through gradient descent on the parameters of F and \xi.
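To make the formula concrete, here is a minimal single-sample sketch of the forward computation. The functions `F`, `gradF`, `A`, and `xi` are hypothetical stand-ins for the two networks (they are not from this thread), chosen so that the gradient of `F` has a closed form and no AD package is needed. It only illustrates how the terms fit together; it does not address differentiating the loss with respect to network parameters.

```julia
using LinearAlgebra

# Hypothetical stand-in for the scalar network F:
# F(x) = 0.5 * ||W x||^2, whose gradient is W' * W * x.
W = [1.0 2.0; 0.0 1.0]
F(x) = 0.5 * sum(abs2, W * x)
gradF(x) = W' * (W * x)

# Hypothetical stand-in for the matrix-valued network xi:
# A(x) * A(x)' is positive semidefinite by construction.
A(x) = [x[1] 0.0; x[2] 1.0]
xi(x) = A(x) * A(x)'

# y = x1 . grad F(x2) + x1' * xi(x2) * x1 for a single sample
predict(x1, x2) = dot(x1, gradF(x2)) + dot(x1, xi(x2) * x1)

x1 = [1.0, -1.0]
x2 = [2.0, 3.0]
y = predict(x1, x2)

# Squared error against a made-up observation
yhat = 0.5
loss = (y - yhat)^2
```

Building `xi` as `A * A'` is one common way to guarantee a PSD output; whether that parameterization suits the actual model is an assumption here.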

martenlienen · November 6, 2020, 8:42am

PyTorch cannot backpropagate through mutations and neither can Zygote. The expression fill!(similar(y), 1) depends on x through y and mutates its argument (note the exclamation mark). You know that there is no real dependency on the value of y because the outcome is constant, but Zygote will still try to differentiate through it. So you should rewrite it without mutations, for example

function pred(x, net)
    y, pullback = Zygote.pullback(net, x)
    grads = pullback(ones(size(y)))[1]
    return grads
end

jlmaccal · November 6, 2020, 6:29pm

Thank you, that clears it up. I didn’t realize similar(x) would create a dependency, as it’s just creating an uninitialized array.

Another question: Python has functions like ones_like and zeros_like. What is the idiomatic Julia equivalent? ones(size(y)) will always create an array of Float64, regardless of the type of y.

martenlienen · November 6, 2020, 9:44pm

The equivalent for zeros_like would be zero while for ones_like(x) I only know the uglier ones(eltype(x), size(x)). It would be really nice if it was just ones(x) though.

FYI this is the definition of zero for arrays.

zero(x::AbstractArray{T}) where {T} = fill!(similar(x), zero(T))
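For reference, a short sketch of a few non-mutating ways to get a ones_like-style array that keeps the element type of its input, using only Base functions. The array `y` here is just an illustrative Float32 input.

```julia
y = rand(Float32, 3, 2)

a = ones(eltype(y), size(y))        # explicit element type and shape
b = fill(one(eltype(y)), size(y))   # non-mutating fill with the multiplicative identity
c = one.(y)                         # broadcast `one` over every element
z = zero(y)                         # zeros_like equivalent
```

Note that one(y) without the broadcasting dot is not the same thing: for a square matrix it returns the identity matrix, and for a non-square one it throws an error.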
