I am just starting with Flux.jl and tried their Hello World example. However, after slightly changing the parameters used to generate the training data, the code fails completely and produces NaNs. Here is the complete script:

```
using Flux

# Ground-truth parameters the model should recover
W_truth = [1  2  3.1    4  5;
           5  4  300  2.9  1]
b_truth = [-100.78; -2.3]
ground_truth(x) = W_truth*x .+ b_truth

# Training inputs lie in [5, 10); targets get a little Gaussian noise
x_train = [ (rand(5) .* 5) .+ 5 for _ in 1:10_000 ]
y_train = [ ground_truth(x) + 0.2 .* randn(2) for x in x_train ]

# Linear model to be trained
model(x) = W*x .+ b
W = rand(2, 5)
b = rand(2)

function loss(x, y)
    pred = model(x)
    # sum(sqrt.((y .- pred).^2))
    sum((y .- pred).^2)
end

opt = Descent(0.01)
train_data = zip(x_train, y_train)
ps = Flux.params(W, b)

for (x, y) in train_data
    gs = Flux.gradient(ps) do
        loss(x, y)
    end
    Flux.Optimise.update!(opt, ps, gs)
end

println(ps[1] - W_truth)
println(ps[2] - b_truth)
nothing
```

```
[NaN NaN NaN NaN NaN; NaN NaN NaN NaN NaN]
[NaN, NaN]
```

I don’t know exactly what the problem is, but it has to do with the `x_train` values getting too large (in absolute value) while using the classic gradient descent algorithm. Someone who knows more will surely be able to explain what’s happening, but when I keep the values generated in `x_train` small, gradient descent works fine. When they are larger, I have to switch to another optimizer (e.g., `ADAGrad`). Below is the code copied/pasted from the Flux docs, with some notes added around the `x_train` definition.

```
using Flux

W_truth = [1 2 3 4 5;
           5 4 3 2 1]
b_truth = [-1.0; -2.0]
ground_truth(x) = W_truth*x .+ b_truth

x_train = [ 5 .* rand(5) for _ in 1:10_000 ]
# x_train = [ 8.25 .* rand(5) for _ in 1:10_000 ]  # results in NaN
# x_train = [ 8 .* rand(5) for _ in 1:10_000 ]     # this works
y_train = [ ground_truth(x) + 0.2 .* randn(2) for x in x_train ]

model(x) = W*x .+ b
W = rand(2, 5)
b = rand(2)

function loss(x, y)
    ŷ = model(x)
    sum((y .- ŷ).^2)
end

opt = Descent(0.01)
train_data = zip(x_train, y_train)
ps = params(W, b)

for (x, y) in train_data
    gs = gradient(ps) do
        loss(x, y)
    end
    Flux.Optimise.update!(opt, ps, gs)
end

@show W
@show maximum(abs, W .- W_truth)
```

Basically, any coefficient between `-8.1` and `8.1` works, but using a coefficient outside of that range for `x_train` results in `NaN`. Switching to the `ADAGrad` optimizer solves the problem. I’m guessing it’s an overflow issue, but I really don’t know enough to be able to say with any degree of certainty…
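
To be concrete, the only change from the docs loop above is the optimizer line; here is a rough sketch of the swap (the `0.1` step size is just `ADAGrad`’s default, not something I tuned):

```
# Same setup as above, but with an adaptive optimizer instead of plain Descent.
# ADAGrad scales each parameter's step by its accumulated gradient history,
# which keeps the updates from running off to Inf/NaN.
opt = ADAGrad(0.1)   # default learning rate, not tuned

for (x, y) in train_data
    gs = gradient(ps) do
        loss(x, y)
    end
    Flux.Optimise.update!(opt, ps, gs)
end

@show maximum(abs, W .- W_truth)
```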

Indeed, I rewrote the optimizer step manually as:

```
using LinearAlgebra   # for normalize and norm

# Gradient step with the norm capped at 1000, so a single huge
# gradient can't throw the parameters off to Inf/NaN.
W_step = 0.001 * normalize(gs[W]) * min(1000, norm(gs[W]))
b_step = 0.001 * normalize(gs[b]) * min(1000, norm(gs[b]))

W .-= W_step
b .-= b_step
```

And it works wonderfully now.
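
For completeness, here is roughly how that step slots into the training loop from the docs example, replacing `Flux.Optimise.update!` (the `0.001` step size and the `1000` norm cap are just the values I used above, nothing principled):

```
using Flux, LinearAlgebra

# Same data, model, loss, and ps as in the docs example; only the update changes.
for (x, y) in train_data
    gs = gradient(ps) do
        loss(x, y)
    end
    # Plain descent step, but with the gradient norm capped at 1000
    # so one huge gradient can't send W and b off to Inf/NaN.
    W .-= 0.001 * normalize(gs[W]) * min(1000, norm(gs[W]))
    b .-= 0.001 * normalize(gs[b]) * min(1000, norm(gs[b]))
end
```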
