I’m implementing a model whose loss function uses the gradients of a feedforward network with respect to its inputs. I then need to train the network by differentiating that loss with respect to the network’s parameters. I can’t get this working; I keep hitting

```
ERROR: Mutating arrays is not supported.
```
Here is a minimal example. In my actual model I need to do something more complicated than simply summing the gradients, but this reproduces the error.
```julia
using Flux, Zygote

net = Dense(10, 1)
x = randn(Float32, 10, 128)  # (dims, batch)

function pred(x, net)
    # Gradient of the network output with respect to its input x
    y, back = Zygote.pullback(net, x)
    grads = back(fill!(similar(y), 1))[1]  # seed of ones; fill! is the mutation Zygote complains about
    return grads
end

gradient(() -> sum(pred(x, net)), params(net))
```
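For reference, here is the variant I’d expect to avoid the mutation, simply swapping the `fill!(similar(y), 1)` seed for a freshly allocated `ones` array. I’m not sure this is the idiomatic Flux approach for second-order gradients, so treat it as a sketch rather than a working solution:

```julia
using Flux, Zygote

net = Dense(10, 1)
x = randn(Float32, 10, 128)  # (dims, batch)

function pred(x, net)
    y, back = Zygote.pullback(net, x)
    # ones(...) allocates the seed without mutating an existing array,
    # so Zygote should be able to differentiate through this a second time
    return back(ones(Float32, size(y)))[1]
end

gradient(() -> sum(pred(x, net)), params(net))
```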
I’m quite comfortable with Python/PyTorch, but I’m feeling totally lost with Julia/Flux. What is the right way to do this? It’s superficially similar to the gradient penalty in WGAN-GP, but I can’t find a Flux implementation of that either.