Getting NaNs in the hello world example of Flux

I don’t know exactly what the problem is, but it has to do with the x_train values getting too big (in absolute value) while using the classic gradient descent optimizer. Someone who knows more will surely be able to explain what’s happening, but when I keep the values generated for x_train small, gradient descent works fine. When they are larger, I have to switch to another optimizer (e.g., ADAGrad). Below is the code copied/pasted from the Flux docs, with some notes added around the x_train definition.

using Flux

W_truth = [1 2 3 4 5;
           5 4 3 2 1]
b_truth = [-1.0; -2.0]
ground_truth(x) = W_truth*x .+ b_truth

x_train = [ 5 .* rand(5) for _ in 1:10_000 ]
# x_train = [ 8.25 .* rand(5) for _ in 1:10_000 ]  # results in NaN
# x_train = [ 8 .* rand(5) for _ in 1:10_000 ]     # this works
y_train = [ ground_truth(x) + 0.2 .* randn(2) for x in x_train ]

model(x) = W*x .+ b
W = rand(2, 5)
b = rand(2)

function loss(x, y)
  ŷ = model(x)
  sum((y .- ŷ).^2)
end
opt = Descent(0.01)
train_data = zip(x_train, y_train)
ps = params(W, b)

for (x,y) in train_data
  gs = gradient(ps) do
    loss(x,y)
  end
  Flux.Optimise.update!(opt, ps, gs)
end

@show W
@show maximum(abs, W .- W_truth)

Basically, any coefficient roughly in the range -8.1 to 8.1 works, but using a coefficient outside that range for x_train results in NaN. Switching to the ADAGrad optimizer solves the problem. I’m guessing it’s an overflow issue, but I really don’t know enough to say with any degree of certainty…
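
For reference, this is the swap I mean: everything else stays the same and only the optimizer changes. I’m just using ADAGrad’s default learning rate here (I haven’t tuned it); my understanding is that it adapts the step size per parameter based on accumulated squared gradients, which may be why it copes better with the larger inputs, but I’m not sure.

opt = ADAGrad()  # default learning rate; per-parameter adaptive steps

for (x, y) in zip(x_train, y_train)
  gs = gradient(ps) do
    loss(x, y)
  end
  Flux.Optimise.update!(opt, ps, gs)
end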
