I need to optimize a function that depends on the gradient of another function. When I run the following code:

```
using Zygote

data = rand(1, 10)
f(p, x) = [x * p[1]; x * p[2] .+ p[3]]  # parameterised model: scalar in, 2-vector out
truef(x) = [x .+ 1; 2 * x]              # target values

function loss(p)
    dndts = [truef(x[1]) for x in data]
    # `p` is closed over by the inner function passed to forwarddiff
    dfdt(x) = Zygote.forwarddiff(x -> f(p, x), x)
    ℓ = sum(abs2, reduce(hcat, [dfdt(data[i]) .- dndts[i] for i in 1:10]))
    return ℓ
end

println(Zygote.gradient(loss, rand(3)))
```

I get `(nothing,)`.
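
If I understand the `forwarddiff` docstring correctly, it drops gradients for values that are closed over by the inner function (here `p`), which would explain the `nothing`. A stripped-down sketch of the same pattern (the helper `g` is hypothetical, just for illustration):

```
using Zygote

# `p` is closed over by the inner function, just like in `loss` above
g(p) = Zygote.forwarddiff(x -> p * x, 2.0)

# I'd expect this to print (nothing,) as well
println(Zygote.gradient(g, 3.0))
```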

If I take the `forwarddiff` call out of the inner function, like this:

```
using Zygote

data = rand(1, 10)
f(p, x) = [x * p[1]; x * p[2] .+ p[3]]
truef(x) = [x .+ 1; 2 * x]

function loss(p)
    dndts = [truef(x[1]) for x in data]
    dfdt(x) = f(p, x)
    ℓ = sum(abs2, reduce(hcat, [dfdt(data[i]) .- dndts[i] for i in 1:10]))
    return ℓ
end

println(Zygote.gradient(loss, rand(3)))
```

it works correctly. I know second derivatives are complicated, so I was wondering what an appropriate solution would be (I've put a possible workaround sketch after the list below). If it's of any use:

- I'm trying to do this where my function `f` is a neural network, using either the `Lux` or `Flux` framework.
- My function, just like in this MWE, is a single-input function that returns several outputs.
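
One possible workaround, based on my reading of the `forwarddiff` docstring (so treat this as an untested sketch rather than a known-good solution), is to make `p` an explicit part of the `forwarddiff` input instead of closing over it:

```
using Zygote

data = rand(1, 10)
f(p, x) = [x * p[1]; x * p[2] .+ p[3]]
truef(x) = [x .+ 1; 2 * x]

function loss(p)
    dndts = [truef(x[1]) for x in data]
    # pack `p` and `x` into a single vector so forwarddiff
    # differentiates with respect to `p` as well
    dfdt(x) = Zygote.forwarddiff(vcat(p, x)) do px
        f(px[1:3], px[4])  # first three entries are p, the last is x
    end
    ℓ = sum(abs2, reduce(hcat, [dfdt(data[i]) .- dndts[i] for i in 1:10]))
    return ℓ
end

println(Zygote.gradient(loss, rand(3)))
```

For a real `Flux` network I'd presumably flatten the parameters into a single vector with `Flux.destructure` (and rebuild the model inside the closure) so the same packing trick applies, but I haven't verified that this is the idiomatic approach.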