I am trying to train a neural network; the code is quite large and I can't put together an MWE at the moment. But basically Flux.train!(loss, res_vec, opt) throws a "Loss is NaN" error, and when I check params(policy) I can see that all the parameters are now NaN. This doesn't always happen, and setting Random.seed!(0) doesn't help with reproducibility. The policy network is defined like so
where IdentifySkip is a residual network block. I am quite new to neural networks and this may not be an issue with Flux, but I want to understand under what conditions the params will go to NaN and how I can prevent it. How do I go about diagnosing it? My input training data is fine; I checked, and none of it contains NaN. Any tips welcome.
A few things that can drive the params to NaN:
- bad initialization in a deep network: if some weight matrices have eigenvalues much greater than 1, the gradients might explode
- a learning rate that is too large
- division by zero in some normalization
I would take a test input and evaluate each layer in the chain, step by step, to see whether the outputs stay in a reasonable range, and then look at the gradients.
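If it helps, here is a rough sketch of that layer-by-layer check (assuming your network is a Flux Chain; model and x are placeholders for your policy network and one test input):

using Flux

function inspect_layers(model::Chain, x)
    h = x
    for (i, layer) in enumerate(model.layers)
        h = layer(h)
        # print the output range of each layer and whether it already contains NaN
        println("layer ", i, ": extrema = ", extrema(h), ", any NaN = ", any(isnan, h))
    end
    return h
end

inspect_layers(model, x)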
I have run into a similar problem and have a very simple, reproducible example:
This works:
using Distributions, Random
using Flux
using Flux: @epochs
using Flux: throttle
using PyPlot
pygui(true)
#Generate random-ish data
Random.seed!(1234)
rawX = rand(100)
rawY = 50 .* rawX .+ rand(Normal(0, 2), 100) .+ 50
# and show it
plot(rawX, rawY, "r.")
# Put into format [([x values], [y values]), (...), ...]
regX = []
regY = []
regData = []
for i in 1:length(rawX)
    push!(regX, [rawX[i]])
    push!(regY, [rawY[i]])
    push!(regData, ([rawX[i]], [rawY[i]]))
end
# Create model
model = Chain(Dense(1, 1, identity)) # Works fine
# model = Chain(Dense(1, 1, identity), Dense(1, 1, identity)) # Loss is NaN
function loss(x, y)
    ŷ = model(x)
    val = mean((ŷ .- y).^2)
    # note: val == NaN is always false, so use isnan/isinf to catch bad values
    if isinf(val) || isnan(val)
        println("Here is the problem: ŷ = ", ŷ)
        val = sum(0.0 .* ŷ)
    end
    return val
end
# opt = SGD(Flux.params(model), 0.1)
opt = Momentum()
ps = Flux.params(model)
evalcb() = Flux.throttle(20) do
    # loss already applies the model to x, so pass the raw inputs here
    @show(mean(loss.(regX, regY)))
end
# Train the model
@epochs 100 Flux.train!(loss, ps, regData, opt, cb=evalcb())
# Now run the model on test data and convert back from Tracked to Float64 to plot
testX = 0:0.02:1
testY = []
for i = 1:length(testX)
    xx = testX[i]
    yy = model([xx])
    push!(testY, Flux.Tracker.data(yy)[1])
end
plot(testX, testY, "b-")
If, however, I change the model to two layers (the commented-out option for model = … above; yes, I realize two linear layers are equivalent to a single linear layer, this is just to demonstrate), I get "Loss is NaN" errors. I changed the loss function to print the model output when this happens, and it is indeed the model output that produces the NaN.
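One thing worth noting about this example (an untested guess on my part): the targets are in the 50–100 range while the inputs live in [0, 1], so the MSE gradients are large and the default Momentum step can overshoot once two layers are stacked. Two things to try:

using Statistics

# try a smaller learning rate (I think the default for Momentum is 0.01)
opt = Momentum(0.001)

# or rescale the targets so the loss and gradients stay around O(1);
# remember to undo the scaling when plotting predictions
μy, σy = mean(rawY), std(rawY)
regDataScaled = [([rawX[i]], [(rawY[i] - μy) / σy]) for i in 1:length(rawX)]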
I had the same problem; it occurred when I trained with a small amount of training data. I did not analyse this very deeply, but tuning (decreasing) the learning rate helped. I was using the ADAM optimiser.
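For reference, the learning rate is the first argument of the optimiser constructor in Flux, so that tuning is just something like this (1e-4 is only an example value):

opt = ADAM(1e-4)   # the default is 1e-3; a smaller step often avoids the blow-up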
I usually adjust the learning rate and use some normalization between layers. I have observed that using identity or ReLU (or another unbounded function) as the activation increases the chances of hitting this issue, because the output values can blow up layer after layer. Adding BatchNorm layers or using an activation function that bounds the outputs (like tanh) solved it for me.
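For concreteness, that would look roughly like this (the layer sizes are made up):

using Flux

# variant 1: keep relu but normalise the activations between layers
model = Chain(Dense(10, 32, relu), BatchNorm(32), Dense(32, 32, relu), BatchNorm(32), Dense(32, 1))

# variant 2: use a bounded activation so intermediate values cannot grow without limit
model = Chain(Dense(10, 32, tanh), Dense(32, 32, tanh), Dense(32, 1))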
This is a little old, but I ran into a similar problem and used the following function
function check_NaN(model, loss, X, Y)
    # take one gradient of the loss and look for NaNs in it
    ps = Flux.params(model)
    gs = Flux.gradient(ps) do
        loss(X, Y)
    end
    search_NaN = []
    for elements in gs
        # a gradient can be nothing if a parameter does not affect the loss
        push!(search_NaN, elements === nothing ? false : any(isnan, elements))
    end
    return search_NaN
end
and then added
if true ∈ check_NaN(model, loss, X_train[k], Y_train[k])
    break
end
to my training loop. That way the training stops before the NaNs start appearing.
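A small variation on the same idea, in case computing the gradients twice is a concern: take the gradient once inside a custom loop, check it, and only update the parameters when it is clean. This is only a sketch; X_train, Y_train and opt are assumed from the snippet above:

ps = Flux.params(model)
for k in 1:length(X_train)
    gs = Flux.gradient(ps) do
        loss(X_train[k], Y_train[k])
    end
    # skip the update and stop as soon as any gradient contains a NaN
    if any(p -> gs[p] !== nothing && any(isnan, gs[p]), ps)
        @warn "NaN gradient at sample $k, stopping"
        break
    end
    Flux.Optimise.update!(opt, ps, gs)
end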