Getting NaNs in the hello world example of Flux

I am just starting with Flux.jl, and I tried their Hello World example. However, slightly changing the input parameters used to generate the training data makes the code fail completely and produce NaNs. Here is the complete script:

using Flux

W_truth = [1 2 3.1 4 5;
           5 4 300 2.9 1]
b_truth = [-100.78; -2.3]
ground_truth(x) = W_truth*x .+ b_truth

x_train = [((rand(5).*5) .+ 5) for _ in 1:10_000]
y_train = [ ground_truth(x) + 0.2 .* randn(2) for x in x_train ]

model(x) = W*x .+ b

W = rand(2, 5)
b = rand(2)

function loss(x, y)
  pred = model(x)
  # sum(sqrt.((y .- pred).^2))
  sum((y .- pred).^2)
end

opt = Descent(0.01)

train_data = zip(x_train, y_train)
ps = Flux.params(W, b)

for (x,y) in train_data
  gs = Flux.gradient(ps) do
    loss(x,y)
  end
  Flux.Optimise.update!(opt, ps, gs)
end

println(ps[1] - W_truth)
println(ps[2] - b_truth)
nothing

This prints:

[NaN NaN NaN NaN NaN; NaN NaN NaN NaN NaN]
[NaN, NaN]

I don’t know exactly what the problem is, but it has to do with the x_train values getting too big (in absolute value) while using the classic gradient descent algorithm. Someone who knows more can surely explain what’s happening, but when I keep the values generated in x_train small, I can use gradient descent. When they are larger, I have to switch to another optimizer (e.g., ADAGrad). Below is the code copied from the Flux docs, with some notes added around the x_train definition.

using Flux

W_truth = [1 2 3 4 5;
           5 4 3 2 1]
b_truth = [-1.0; -2.0]
ground_truth(x) = W_truth*x .+ b_truth

x_train = [ 5 .* rand(5) for _ in 1:10_000 ]
# x_train = [ 8.25 .* rand(5) for _ in 1:10_000 ]  # results in NaN
# x_train = [ 8 .* rand(5) for _ in 1:10_000 ]     # this works
y_train = [ ground_truth(x) + 0.2 .* randn(2) for x in x_train ]

model(x) = W*x .+ b
W = rand(2, 5)
b = rand(2)

function loss(x, y)
  ŷ = model(x)
  sum((y .- ŷ).^2)
end
opt = Descent(0.01)
train_data = zip(x_train, y_train)
ps = params(W, b)

for (x,y) in train_data
  gs = gradient(ps) do
    loss(x,y)
  end
  Flux.Optimise.update!(opt, ps, gs)
end

@show W
@show maximum(abs, W .- W_truth)

Basically, any coefficient on rand(5) between roughly -8.1 and 8.1 works, but a coefficient outside that range in x_train results in NaN. Switching to the ADAGrad optimizer solves the problem. I’m guessing it’s an overflow issue, but I really don’t know enough to say with any degree of certainty…
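
For reference, the only change needed for that switch is the optimizer line; the update loop stays the same. A minimal sketch, here just with ADAGrad's default parameters:

opt = ADAGrad()          # instead of opt = Descent(0.01)

for (x, y) in train_data
  gs = gradient(ps) do
    loss(x, y)
  end
  Flux.Optimise.update!(opt, ps, gs)   # update! works the same for any Flux optimizer
end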


Indeed, I rewrote the optimizer step manually to

using LinearAlgebra   # for normalize and norm

# inside the training loop, in place of Flux.Optimise.update!(opt, ps, gs):
W_step = 0.001 * normalize(gs[W]) * min(1000, norm(gs[W]))   # cap the gradient norm at 1000
b_step = 0.001 * normalize(gs[b]) * min(1000, norm(gs[b]))

W .-= W_step
b .-= b_step

And it works wonderfully now.
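
As an aside, this is basically gradient-norm clipping, and I think the same idea can be written with Flux's built-in ClipNorm composed with plain descent, if your Flux version has it (the 1000 cap and 0.001 step below mirror the manual version above):

opt = Flux.Optimise.Optimiser(Flux.Optimise.ClipNorm(1000), Descent(0.001))

for (x, y) in train_data
  gs = Flux.gradient(ps) do
    loss(x, y)
  end
  Flux.Optimise.update!(opt, ps, gs)   # clips each gradient's norm, then takes a descent step
end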
