How come Flux.jl's network parameters go to NaN?



I am trying to train a neural network. The code is quite large and I can't produce an MWE at the moment, but basically `Flux.train!(loss, res_vec, opt)` throws a "Loss is NaN" error, and when I inspect `params(policy)` I can see that all the parameters are now NaN. This doesn't always happen, and setting `Random.seed!(0)` doesn't help with reproducibility. The policy network is defined like so:

policy = Flux.Chain(
  Conv((2,2), 1=>128, relu),
  Conv((2,2), 128=>128, relu),
  x -> reshape(x, :, size(x,4)),
  IdentitySkip(Dense(512, 512), relu),
  Dense(512, 4)
) |> gpu

where `IdentitySkip` is a residual network block. I am quite new to neural networks, and this may not be an issue with Flux itself, but I want to understand: under what conditions do the parameters go to NaN, and how can I prevent it? How do I go about diagnosing it? My input training data is fine; I checked, and none of it contains NaN. Any tips welcome.


A lot of things can cause this.

  • Initialization in a deep network: if some weight matrices have eigenvalues much greater than 1, gradients can explode as they propagate back through the layers.
  • A learning rate that is too large, so a single update step overshoots into a divergent region.
  • Division by zero in some normalization.

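If exploding gradients turn out to be the problem, clipping them before the update step is a common mitigation (Flux ships optimiser wrappers for this; the function below is just a minimal sketch of the idea in plain Julia):

```julia
# Clip each gradient entry to [-threshold, threshold] so that one huge
# gradient cannot blow the parameters up to Inf/NaN in a single step.
clip_gradient(g, threshold=1.0) = clamp.(g, -threshold, threshold)

g = [0.5, -3.0, 10.0]
clip_gradient(g)  # → [0.5, -1.0, 1.0]
```

In practice you would compose this into the optimiser rather than call it by hand, and you would also try simply lowering the learning rate first, since clipping can mask a deeper problem.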
I would take a test input and evaluate each layer in the chain, step by step, to see whether the outputs stay in a reasonable range, and then look at the gradients the same way.
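The layer-by-layer check can be sketched like this. A `Flux.Chain` iterates over its layers, so you can apply them one at a time; here the "layers" are plain functions to keep the sketch self-contained, and the middle one deliberately produces NaN:

```julia
# Walk a chain of layers, applying each in turn, and report the index of
# the first layer whose output contains a NaN or Inf.
function find_bad_layer(layers, x)
    for (i, layer) in enumerate(layers)
        x = layer(x)
        if any(!isfinite, x)
            return i   # first offending layer
        end
    end
    return nothing     # every intermediate output was finite
end

layers = (x -> 2 .* x,        # fine
          x -> x ./ 0 .* 0,   # Inf * 0 = NaN
          x -> x .+ 1)        # never reached in the check
find_bad_layer(layers, [1.0, 2.0])  # → 2
```

With a real model you would call `find_bad_layer(policy, input)` on the CPU copy of the network; once you know which layer first goes non-finite, you can inspect its weights and its gradient in isolation.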