Getting NaNs in the hello world example of Flux

I am just starting with Flux.jl, and I tried their Hello World example. However, slightly changing the input parameters used to generate the training data makes the code fail completely and produce NaNs. Here is the complete script:

using Flux

W_truth = [1 2 3.1 4 5;
           5 4 300 2.9 1]
b_truth = [-100.78; -2.3]
ground_truth(x) = W_truth*x .+ b_truth

x_train = [((rand(5).*5) .+ 5) for _ in 1:10_000]
y_train = [ ground_truth(x) + 0.2 .* randn(2) for x in x_train ]

model(x) = W*x .+ b

W = rand(2, 5)
b = rand(2)

function loss(x, y)
  pred = model(x)
  # sum(sqrt.((y .- pred).^2))
  sum((y .- pred).^2)
end

opt = Descent(0.01)

train_data = zip(x_train, y_train)
ps = Flux.params(W, b)

for (x,y) in train_data
  gs = Flux.gradient(ps) do
    loss(x,y)
  end
  Flux.Optimise.update!(opt, ps, gs)
end

println(ps[1] - W_truth)
println(ps[2] - b_truth)
nothing

This prints:

[NaN NaN NaN NaN NaN; NaN NaN NaN NaN NaN]
[NaN, NaN]

I don’t know exactly what the problem is, but it has to do with the x_train values getting too big (in absolute value) while using the classic gradient descent algorithm. Someone who knows more can surely explain what’s happening, but when I keep the values generated in x_train small, I can use gradient descent. When they are larger, I have to switch to another optimizer (e.g., ADAGrad). Below is the code copied from the Flux docs, with some notes added around the x_train definition.

using Flux

W_truth = [1 2 3 4 5;
           5 4 3 2 1]
b_truth = [-1.0; -2.0]
ground_truth(x) = W_truth*x .+ b_truth

x_train = [ 5 .* rand(5) for _ in 1:10_000 ]
# x_train = [ 8.25 .* rand(5) for _ in 1:10_000 ]  # results in NaN
# x_train = [ 8 .* rand(5) for _ in 1:10_000 ]     # this works
y_train = [ ground_truth(x) + 0.2 .* randn(2) for x in x_train ]

model(x) = W*x .+ b
W = rand(2, 5)
b = rand(2)

function loss(x, y)
  ŷ = model(x)
  sum((y .- ŷ).^2)
end
opt = Descent(0.01)
train_data = zip(x_train, y_train)
ps = params(W, b)

for (x,y) in train_data
  gs = gradient(ps) do
    loss(x,y)
  end
  Flux.Optimise.update!(opt, ps, gs)
end

@show W
@show maximum(abs, W .- W_truth)

Basically, any coefficient on rand(5) between roughly -8.1 and 8.1 works, but a coefficient outside that range in x_train results in NaN. Switching to the ADAGrad optimizer solves the problem. I’m guessing it’s an overflow issue, but I really don’t know enough to say with any degree of certainty…
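
For reference, the only change needed for that switch is the optimizer line; the update loop stays the same. A minimal sketch, here just with ADAGrad's default parameters:

opt = ADAGrad()          # instead of opt = Descent(0.01)

for (x, y) in train_data
  gs = gradient(ps) do
    loss(x, y)
  end
  Flux.Optimise.update!(opt, ps, gs)   # update! works the same for any Flux optimizer
end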


Indeed, I rewrote the optimizer step manually to

using LinearAlgebra   # for normalize and norm

# inside the training loop, in place of Flux.Optimise.update!(opt, ps, gs):
W_step = 0.001 * normalize(gs[W]) * min(1000, norm(gs[W]))   # cap the gradient norm at 1000
b_step = 0.001 * normalize(gs[b]) * min(1000, norm(gs[b]))

W .-= W_step
b .-= b_step

And it works wonderfully now.
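
As an aside, this is basically gradient-norm clipping, and I think the same idea can be written with Flux's built-in ClipNorm composed with plain descent, if your Flux version has it (the 1000 cap and 0.001 step below mirror the manual version above):

opt = Flux.Optimise.Optimiser(Flux.Optimise.ClipNorm(1000), Descent(0.001))

for (x, y) in train_data
  gs = Flux.gradient(ps) do
    loss(x, y)
  end
  Flux.Optimise.update!(opt, ps, gs)   # clips each gradient's norm, then takes a descent step
end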
