I am just starting with Flux.jl, and well, I tried their Hello World example. However, by slightly changing the input parameters used to generate the training data, the code completely fails and produces NaNs. Here is the complete script:
using Flux

# Same structure as the docs example, but with a large entry in W_truth,
# a large offset in b_truth, and x_train shifted into the interval [5, 10).
W_truth = [1 2 3.1 4 5;
           5 4 300 2.9 1]
b_truth = [-100.78; -2.3]
ground_truth(x) = W_truth*x .+ b_truth

x_train = [ (rand(5) .* 5) .+ 5 for _ in 1:10_000 ]
y_train = [ ground_truth(x) + 0.2 .* randn(2) for x in x_train ]

model(x) = W*x .+ b
W = rand(2, 5)
b = rand(2)

function loss(x, y)
    pred = model(x)
    # sum(sqrt.((y .- pred).^2))
    sum((y .- pred).^2)
end

opt = Descent(0.01)
train_data = zip(x_train, y_train)
ps = Flux.params(W, b)

for (x, y) in train_data
    gs = Flux.gradient(ps) do
        loss(x, y)
    end
    Flux.Optimise.update!(opt, ps, gs)
end
println(ps[1] - W_truth)
println(ps[2] - b_truth)
nothing

The two println calls show that the differences from the true parameters come out as all NaN:

[NaN NaN NaN NaN NaN; NaN NaN NaN NaN NaN]
[NaN, NaN]
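For reference, here is the same loop with a loss printout added (just a diagnostic sketch, not part of the original example; the print interval of 100 is arbitrary). With the inputs above, the loss keeps growing from one update to the next until it overflows, and the parameters end up as NaN:

# Re-initialise the parameters and re-run the loop, printing the loss
# every 100 iterations to watch it diverge.
W = rand(2, 5)
b = rand(2)
ps = Flux.params(W, b)

for (i, (x, y)) in enumerate(zip(x_train, y_train))
    gs = Flux.gradient(ps) do
        loss(x, y)
    end
    Flux.Optimise.update!(opt, ps, gs)
    if i % 100 == 0
        println("iteration $i: loss = ", loss(x, y))
    end
end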
I don’t know exactly what the problem is, but it has to do with the x_train values getting too big (in absolute value) while still using the classic gradient-descent algorithm. Someone who knows more will surely be able to explain what’s happening, but when I keep the values generated in x_train small, gradient descent works. When they are larger, I have to switch to another optimizer (e.g. ADAGrad). Below is the code copied/pasted from the Flux docs, with some notes added around the x_train definition.
using Flux

W_truth = [1 2 3 4 5;
           5 4 3 2 1]
b_truth = [-1.0; -2.0]
ground_truth(x) = W_truth*x .+ b_truth

x_train = [ 5 .* rand(5) for _ in 1:10_000 ]
# x_train = [ 8.25 .* rand(5) for _ in 1:10_000 ]  # results in NaN
# x_train = [ 8 .* rand(5) for _ in 1:10_000 ]     # this works
y_train = [ ground_truth(x) + 0.2 .* randn(2) for x in x_train ]

model(x) = W*x .+ b
W = rand(2, 5)
b = rand(2)

function loss(x, y)
    ŷ = model(x)
    sum((y .- ŷ).^2)
end

opt = Descent(0.01)
train_data = zip(x_train, y_train)
ps = params(W, b)

for (x, y) in train_data
    gs = gradient(ps) do
        loss(x, y)
    end
    Flux.Optimise.update!(opt, ps, gs)
end

@show W
@show maximum(abs, W .- W_truth)
Basically, any coefficient roughly in the range -8.1:8.1 works, but a coefficient outside that range for x_train results in NaN. Switching to the ADAGrad optimizer solves the problem (see the sketch below). I’m guessing it’s an overflow issue, where the fixed 0.01 step overshoots on the larger inputs so the loss keeps growing until it overflows, but I really don’t know enough to be able to say with any degree of certainty…
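For completeness, a minimal sketch of the swap: everything above stays the same, only the optimizer line changes (I’m assuming ADAGrad’s default learning rate here):

# Replace the fixed-step gradient descent with ADAGrad, which adapts the
# step size per parameter based on the accumulated squared gradients.
opt = ADAGrad()          # instead of Descent(0.01)

for (x, y) in zip(x_train, y_train)
    gs = gradient(ps) do
        loss(x, y)
    end
    Flux.Optimise.update!(opt, ps, gs)
end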
Indeed, I rewrote the update step manually so that the gradient norm is clipped before taking a (smaller) descent step:

using LinearAlgebra: norm, normalize

# Inside the training loop, in place of Flux.Optimise.update!(opt, ps, gs):
W_step = 0.001 * normalize(gs[W]) * min(1000, norm(gs[W]))
b_step = 0.001 * normalize(gs[b]) * min(1000, norm(gs[b]))

W .-= W_step
b .-= b_step

And it works wonderfully now.
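For what it’s worth, the same idea can be expressed with Flux’s built-in gradient clipping composed with a plain descent step (a sketch, assuming a Flux version that provides ClipNorm and Optimiser; the 1000 threshold and 0.001 step size mirror the manual version above):

using Flux.Optimise: Optimiser, ClipNorm, Descent

# Clip each gradient's norm to at most 1000, then take a 0.001-sized step.
opt = Optimiser(ClipNorm(1000), Descent(0.001))

for (x, y) in zip(x_train, y_train)
    gs = gradient(ps) do
        loss(x, y)
    end
    Flux.Optimise.update!(opt, ps, gs)
end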