# Getting NaNs in the hello world example of Flux

I am just starting with Flux.jl, and well, I tried their Hello World example. However, by slightly changing the input parameters used to generate the training data, the code completely fails and produces NaNs. Here is the complete script:

```julia
using Flux

W_truth = [1 2 3.1 4 5;
           5 4 300 2.9 1]
b_truth = [-100.78; -2.3]
ground_truth(x) = W_truth*x .+ b_truth

x_train = [((rand(5) .* 5) .+ 5) for _ in 1:10_000]
y_train = [ground_truth(x) + 0.2 .* randn(2) for x in x_train]

model(x) = W*x .+ b

W = rand(2, 5)
b = rand(2)

function loss(x, y)
    pred = model(x)
    # sum(sqrt.((y .- pred).^2))
    sum((y .- pred).^2)
end

opt = Descent(0.01)

train_data = zip(x_train, y_train)
ps = Flux.params(W, b)

for (x, y) in train_data
    gs = gradient(ps) do
        loss(x, y)
    end
    Flux.Optimise.update!(opt, ps, gs)
end

println(W - W_truth)
println(b - b_truth)
nothing
```
```
[NaN NaN NaN NaN NaN; NaN NaN NaN NaN NaN]
[NaN, NaN]
```

I don’t know exactly what the problem is, but it has to do with the `x_train` values getting too large (in absolute value) combined with the classic gradient descent algorithm. Someone who knows more will surely be able to explain what’s happening, but when I keep the values generated in `x_train` small, gradient descent works fine. When they are larger, I have to switch to another optimizer (e.g., `ADAGrad`). Below is the code copied/pasted from the Flux docs, with some notes added around the `x_train` definition.
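For intuition, here's a minimal sketch (a hypothetical 1-D least-squares toy, not the Flux example itself) of why plain gradient descent diverges once inputs get large: each update multiplies the error by `1 - 2*eta*x^2`, which shrinks it only while `eta*x^2 < 1`.

```julia
# Toy problem: fit w in  y ≈ w*x  with squared error (w*x - y)^2.
# The gradient is 2x*(w*x - y), so a Descent step with rate eta maps the
# error e = w - y/x to e*(1 - 2*eta*x^2): stable only if eta*x^2 < 1.
function descend(x, y, eta; steps = 50)
    w = 0.0
    for _ in 1:steps
        w -= eta * 2x * (w * x - y)   # plain gradient-descent update
    end
    return w
end

println(descend(5.0, 10.0, 0.01))   # eta*x^2 = 0.25: converges to w ≈ 2
println(descend(15.0, 30.0, 0.01))  # eta*x^2 = 2.25: the error explodes
```

With five input components summed into the loss, the full example sits roughly around this stability boundary as the coefficient grows, which would be consistent with the cutoff observed below.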

```julia
using Flux

W_truth = [1 2 3 4 5;
           5 4 3 2 1]
b_truth = [-1.0; -2.0]
ground_truth(x) = W_truth*x .+ b_truth

x_train = [5 .* rand(5) for _ in 1:10_000]
# x_train = [8.25 .* rand(5) for _ in 1:10_000]  # results in NaN
# x_train = [8 .* rand(5) for _ in 1:10_000]     # this works
y_train = [ground_truth(x) + 0.2 .* randn(2) for x in x_train]

model(x) = W*x .+ b
W = rand(2, 5)
b = rand(2)

function loss(x, y)
    ŷ = model(x)
    sum((y .- ŷ).^2)
end

opt = Descent(0.01)
train_data = zip(x_train, y_train)
ps = Flux.params(W, b)

for (x, y) in train_data
    gs = gradient(ps) do
        loss(x, y)
    end
    Flux.Optimise.update!(opt, ps, gs)
end

@show W
@show maximum(abs, W .- W_truth)
```

Basically, any coefficient in roughly `[-8.1, 8.1]` works, but using a coefficient outside that range for `x_train` results in `NaN`. Switching to the `ADAGrad` optimizer solves the problem. I’m guessing it’s an overflow issue, but I really don’t know enough to be able to say with any degree of certainty…
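To see why an adaptive method survives where `Descent` doesn't, here's a hand-rolled sketch of the ADAGrad update rule on a hypothetical 1-D least-squares toy (the textbook rule, not Flux's implementation): accumulated squared gradients divide the step, so large gradients automatically get a small effective learning rate.

```julia
# Fit w in  y ≈ w*x  with an input (x = 15) large enough that a plain
# Descent(0.01) step would diverge on this problem.
# ADAGrad rule:  G += g^2;  w -= eta * g / (sqrt(G) + eps).
function adagrad_fit(x, y, eta; steps = 5_000, eps = 1e-8)
    w, G = 0.0, 0.0
    for _ in 1:steps
        g = 2x * (w * x - y)            # same gradient as plain descent
        G += g^2                        # running sum of squared gradients
        w -= eta * g / (sqrt(G) + eps)  # step shrinks as G grows
    end
    return w
end

println(adagrad_fit(15.0, 30.0, 0.1))  # converges to w ≈ 2 despite the large input
```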


Indeed, I rewrote the optimizer manually to

```julia
using LinearAlgebra  # for normalize and norm

# Clip the gradient norm at 1000, then take a small step.
W_step = 0.001 * normalize(gs[W]) * min(1000, norm(gs[W]))
b_step = 0.001 * normalize(gs[b]) * min(1000, norm(gs[b]))

W .-= W_step
b .-= b_step
```

And it works wonderfully now.
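For reference, those `normalize`/`min` lines implement gradient-norm clipping: rescale the gradient so its norm never exceeds a threshold, then take a plain step. A self-contained sketch of just the rescaling (the helper name `clip_step` is mine, not from Flux):

```julia
using LinearAlgebra  # normalize, norm

# eta .* normalize(g) .* min(thresh, norm(g)) leaves small gradients
# untouched and rescales large ones back down to norm `thresh`.
clip_step(g, eta, thresh) = eta .* normalize(g) .* min(thresh, norm(g))

g_small = [3.0, 4.0]        # norm 5 ≤ 1000: step is just eta .* g
g_huge  = [3.0e6, 4.0e6]    # norm 5e6: clipped back to norm 1000
println(clip_step(g_small, 0.001, 1000))       # [0.003, 0.004]
println(norm(clip_step(g_huge, 0.001, 1000)))  # 1.0 = 0.001 * 1000
```

Recent Flux versions also ship this pattern as a composable optimiser (`ClipNorm`), though I haven't checked the exact API against the version used here.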
