I am currently working with Flux and due to a high amount of stochasticity in my data, I often receive
“Loss is NaN” and “Loss is infinite” errors during training. Reducing the step size of my gradient updates of course helps, but it slows down training unnecessarily.
I would like to avoid this issue by clipping gradients, but I have not found a good way of doing it. Basically, I want to clip the gradient by its L^2 norm: if the gradient's norm is greater than 1 (or some other constant), divide the gradient by its norm.
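Concretely, if g denotes a gradient and norm is the L2 norm from LinearAlgebra, I would like the update to use something like

g_clipped = norm(g) > 1 ? g ./ norm(g) : g    # equivalently: g ./ max(1, norm(g))

instead of g itself.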
After searching around for a bit, I found the hook function in the Zygote package, which theoretically should be able to do this.
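If I understand the docs correctly, hook(f, x) returns x unchanged in the forward pass and applies f to the gradient of x during the backward pass, so a toy example would be something like:

using Zygote

# the hook doubles the gradient flowing back into x
g = Zygote.gradient(3.0) do x
    y = Zygote.hook(x̄ -> 2 * x̄, x)
    return y^2
end
# g == (12.0,) rather than (6.0,)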
Here is the gradient descent function I currently have:
function pupdate!(S, A, δ, model, α, γ, t)
    # loss: log of the A-th output of the model for input x
    loss(x) = log(model(x)[A])
    ps = Flux.params(model)
    gs = Zygote.gradient(() -> loss(S), ps)
    #@info "neural network before: $(model(S)[A])"
    for p in ps
        # apply the gradient, scaled by α * γ^t * δ, to the parameter
        Flux.Tracker.update!(p, α * (γ^t) * δ .* gs[p])
    end
    #@info "neural network after: $(model(S)[A])"
end
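What I am effectively after is something like this manual version of the update loop (just a sketch that clips each parameter's gradient by its own norm, rather than the norm of the full gradient), but I was hoping hook would give a cleaner way:

using LinearAlgebra: norm

for p in ps
    g = gs[p]
    if norm(g) > 1
        g = g ./ norm(g)    # rescale this parameter's gradient to unit L2 norm
    end
    Flux.Tracker.update!(p, α * (γ^t) * δ .* g)
end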
I thought that replacing the gradient line with:

gs = Zygote.gradient(() -> Zygote.hook(clipper, loss(S)), ps)
where
using LinearAlgebra: norm

function clipper(x)
    n = norm(x)
    if n > 1
        return x ./ n    # rescale to unit L2 norm
    else
        return x
    end
end
should do the trick, but unfortunately this does not work.
Any help would be appreciated!