How come Flux.jl's network parameters go to NaN?

I am trying to train a neural network. The code is quite large, and I can't produce an MWE at the moment. Basically, Flux.train!(loss, res_vec, opt) fails with the error "Loss is NaN", and when I check params(policy) I can see that all the params are now NaN. This doesn't always happen, and setting Random.seed!(0) doesn't help with reproducibility. The policy network is defined like so:

policy = Flux.Chain(
    Conv((2,2), 1=>128, relu),
    Conv((2,2), 128=>128, relu),
    x -> reshape(x, :, size(x,4)),
    IdentitySkip(Dense(512, 512), relu),
    Dense(512, 4),
) |> gpu

where IdentitySkip is a residual network block. I am quite new to neural networks, and this may not be an issue with Flux, but I want to understand under what conditions the params go to NaN and how I can prevent it. How do I go about diagnosing it? My input training data is fine; I checked, and none of it contains NaN. Any tips welcome.

A lot of things can cause this.

  • initialization in a deep network: if some weight matrices have eigenvalues much greater than 1, gradients might explode
  • too large learning rate
  • division by zero in some normalization

I would take a test input and evaluate each layer in the chain, step by step, to see if each output is in a reasonable range, and then look at the gradients.
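The layer-by-layer check could look roughly like the sketch below. It uses only the standard library, and the `layers` vector is a hypothetical stand-in (plain functions, not Flux layers); with a real `Chain` you would loop over its layers in the same way. The second stand-in layer divides by the standard deviation, so a constant input triggers the division-by-zero case from the list above.

```julia
using Statistics  # standard library: mean, std

# Return the index of the first layer whose output contains NaN or Inf,
# or `nothing` if every intermediate output is finite.
function first_bad_layer(layers, x)
    for (i, layer) in enumerate(layers)
        x = layer(x)
        println("layer ", i, ": extrema = ", extrema(x))
        all(isfinite, x) || return i
    end
    return nothing
end

# Hypothetical stand-ins for the layers of a network.
layers = [
    x -> 2.0 .* x,                      # linear scaling
    x -> (x .- mean(x)) ./ std(x),      # normalization; std(x) == 0 gives 0/0
    x -> max.(x, 0.0),                  # ReLU
]

first_bad_layer(layers, [1.0, 2.0, 3.0])  # every layer is finite
first_bad_layer(layers, [1.0, 1.0, 1.0])  # the normalization layer produces NaN
```

Printing the extrema per layer also shows whether values are growing from layer to layer before they actually overflow.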


I have run into a similar problem and have a very simple, reproducible example:

This works:

using Distributions, Random
using Flux
using Flux: @epochs
using Flux: throttle
using PyPlot

#Generate random-ish data
rawX = rand(100)
rawY = 50 .* rawX .+ rand(Normal(0, 2), 100) .+ 50

# and show it
plot(rawX, rawY, "r.")

# Put into format [([x values], [y values]), (...), ...]
regX = []
regY = []
regData = []
for i in 1:length(rawX)
    push!(regX, [rawX[i]])
    push!(regY, [rawY[i]])
    push!(regData, ([rawX[i]], [rawY[i]]))
end

# Create model
model = Chain(Dense(1, 1, identity)) # Works fine
# model = Chain(Dense(1, 1, identity), Dense(1, 1, identity)) # Loss is NaN
function loss(x, y)
    ŷ = model(x)
    val = mean((ŷ .- y).^2)
    # `val == NaN` is always false, so test with isnan/isinf instead
    if isinf(val) || isnan(val)
        println("Here is the problem: ŷ = ", ŷ)
        val = sum(0.0 .* ŷ)
    end
    return val
end

# opt = SGD(Flux.params(model), 0.1)
opt = Momentum()
ps = Flux.params(model)
evalcb() = Flux.throttle(20) do
    # loss already calls model(x), so pass the raw inputs
    @show(mean(loss.(regX, regY)))
end

# Train the model
@epochs 100 Flux.train!(loss, ps, regData, opt, cb=evalcb())

# Now run the model on test data and convert back from Tracked to Float64 to plot
testX = 0:0.02:1
testY = []
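for i = 1:length(testX)
    xx = testX[i]
    yy = model([xx])
    # Assumes a Tracker-based Flux version, where model outputs are tracked arrays
    push!(testY, Flux.Tracker.data(yy)[1])
end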

plot(testX, testY, "b-")

If, however, I change the model to two layers (the commented-out option for model = …), I get "Loss is NaN" errors. (Yes, I realize two linear layers are equivalent to a single linear layer. This is just to demonstrate.) I changed the loss function to print the model output when this happens, and it is indeed the model output that produces the NaN.
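One plausible reading, which I have not verified against the code above: stacking two linear layers makes the loss depend on a product of weights, and gradient descent on a product can diverge at a step size that is harmless for a single layer, especially with targets as large as 50 to 100. The sketch below is a scalar stand-in (no Flux, no bias, a single target) and all names in it are made up for illustration.

```julia
# Plain gradient descent on L(w1, w2) = (w1*w2 - y)^2, a scalar stand-in
# for two stacked linear layers without bias.
function train_product(y, lr; steps = 200)
    w1, w2 = 1.0, 1.0
    for _ in 1:steps
        err = w1 * w2 - y
        g1, g2 = 2 * err * w2, 2 * err * w1   # dL/dw1 and dL/dw2
        w1, w2 = w1 - lr * g1, w2 - lr * g2   # simultaneous update
    end
    return w1, w2
end

train_product(100.0, 0.001)             # stays finite; w1*w2 approaches 100
train_product(100.0, 0.1; steps = 20)   # overshoots, overflows to Inf, then NaN
```

With the larger step each update overshoots further than the last, the weights overflow to Inf, and the next update computes Inf - Inf, which is NaN. Rescaling the targets or lowering the learning rate are the usual things to try.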

I had the same problem, which occurred when I trained with a small amount of training data. I did not analyse it very deeply, but tuning (decreasing) the learning rate helped. I was using the ADAM optimiser.

I usually adjust the learning rate and use some normalization between layers. I observed that using identity or ReLU (or another unbounded function) as the activation function increases the chances of encountering this issue, because the output values can blow up after each layer. Adding batchnorm layers or using an activation function that clamps the outputs (like tanh) solved it for me.
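To make the blow-up concrete, here is a small sketch using plain matrices instead of Flux layers. The weight matrix `W` is a made-up example chosen so that each layer doubles the activations; real networks will not be this regular.

```julia
# Apply the same weight matrix `depth` times with a given activation function.
function forward(W, x, act, depth)
    for _ in 1:depth
        x = act.(W * x)
    end
    return x
end

W = fill(0.5, 4, 4)   # W * ones(4) == 2 .* ones(4), so each layer doubles the input
x = ones(4)

maximum(abs, forward(W, x, identity, 50))  # grows like 2^50
maximum(abs, forward(W, x, tanh, 50))      # bounded by 1, since tanh saturates
```

With an unbounded activation the magnitude compounds with depth, and once the activations are large the loss and gradients are large too. A saturating activation or a normalization layer keeps the scale in check, at the cost of possibly vanishing gradients.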

This is a little old, but I ran into a similar problem and used the following function:

function check_NaN(model,loss,X,Y)
    ps = Flux.params(model)
    # Gradient of the loss on (X, Y) with respect to all model parameters
    gs = Flux.gradient(ps) do
        loss(X,Y)
    end
    search_NaN = []
    for elements in gs
        push!(search_NaN,1 ∈ isnan.(elements))
    end
    return search_NaN
end

and then added

if true ∈ check_NaN(model,loss,X_train[k],Y_train[k])
    break
end

to my training loop. That way the training stops before the NaNs start appearing.

I think using numerically stable versions of the loss functions etc. can help.
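As one illustration of what "numerically stable" means here, compare a naive log-softmax with the usual max-shifted version. This is a standard-library sketch with made-up function names, not Flux code; Flux does provide `logitcrossentropy`, which works on raw logits for the same reason.

```julia
# Naive version: exponentiates the raw logits, so large values overflow.
naive_logsoftmax(z) = log.(exp.(z) ./ sum(exp.(z)))

# Stable version: shift by the maximum before exponentiating, which leaves
# the result mathematically unchanged but keeps exp() from overflowing.
function stable_logsoftmax(z)
    shifted = z .- maximum(z)
    return shifted .- log(sum(exp.(shifted)))
end

z = [1000.0, 0.0]
naive_logsoftmax(z)    # contains NaN, because exp(1000.0) overflows to Inf
stable_logsoftmax(z)   # finite
```

A single NaN in the loss is enough to turn every gradient, and then every parameter, into NaN on the next update.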


Just a short comment, you can basically write:

julia> any(isnan.([1.0, NaN, Inf, 2.0]))
true

julia> any(isnan.([1.0, 0.0, Inf, 2.0]))
false

and in your code:

function check_NaN(model,loss,X,Y)
    ps = Flux.params(model)
    gs = Flux.gradient(ps) do
        loss(X,Y)
    end
    return any(isnan.(gs))
end

This doesn’t quite work: since gs is an array of matrices, it would have to be something like

for elements in gs
    if any(isnan.(elements))
        return true
    end
end

Thank you for the suggestion!

Sorry, maybe this could work?

f(x) = any(isnan.(x))
return f.(gs)

I think it does. Thanks!
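If a single Bool is all that is needed, the nested check can also be written without allocating intermediate arrays. In the sketch below `gs` is a plain vector of arrays standing in for the gradients (an assumption, since the exact gradient container depends on the Flux version), and `has_nan` is a made-up name.

```julia
# true if any array in the collection contains at least one NaN
has_nan(gs) = any(g -> any(isnan, g), gs)

# Stand-in for the gradients: a plain vector of arrays.
gs = [[1.0 2.0; 3.0 4.0], [0.5, NaN]]

has_nan(gs)                    # the second array contains a NaN
has_nan([[1.0], [2.0 3.0]])    # no NaN anywhere
```

Swapping `isnan` for `!isfinite` would also catch Inf, which usually shows up one step before the NaNs do.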