Why is the loss function increasing when fitting a line?

Hello everyone,

I’m following the “Fitting a Line” example from the Flux.jl guide. It works as expected with the given data, and also when I change the target function (e.g. actual(x) = 3x - 1).

I then tried to change the length of the datasets (from 6 to 11 points for the *_train variables, and from 5 to 10 points for the *_test variables). When I do so, however, the loss (MSE) increases after every call to train! instead of decreasing.

Can you please help me to understand what I’m doing wrong? Here is the script I’m using:

using Flux
using Flux: train!
using Statistics

actual(x) = 4x + 2

x_train, x_test = hcat(0:10...), hcat(11:20...)
y_train, y_test = actual.(x_train), actual.(x_test)

loss(model, x, y) = mean(abs2.(model(x) .- y))
predict = Dense(1 => 1)
data = [(x_train, y_train)]
opt = Descent()  # plain gradient descent; the default learning rate is 0.1

# Checking the initial parameters and the value of the loss function
predict.weight, predict.bias
## (Float32[-0.6373584;;], Float32[0.0])
loss(predict, x_train, y_train)
## 849.4254f0

# Training the neuron
train!(loss, predict, data, opt)
# Checking the parameters and the loss again
predict.weight, predict.bias
## (Float32[33.82415;;], Float32[5.0373588])
loss(predict, x_train, y_train)
## 32046.893f0

# Another round?
train!(loss, predict, data, opt)
predict.weight, predict.bias
## (Float32[-177.98224;;], Float32[-25.394264])
loss(predict, x_train, y_train)
## 1.2097169f6

# One last time
train!(loss, predict, data, opt)
predict.weight, predict.bias
## (Float32[1123.2878;;], Float32[162.06683])
loss(predict, x_train, y_train)
## 4.5665416f7

It seems to work with opt = Descent(0.01) instead of the default 0.1, so it is probably overshooting: each step finds the downhill direction in parameter space, but the step is so large that it lands on the other side of the valley, and so far up the opposite slope that the gradient there is even steeper, which makes the next step bigger still.
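
For reference, here is a minimal sketch of the same setup with the smaller step size (the training loop and the printout are my addition, not part of the guide):

using Flux
using Flux: train!
using Statistics

actual(x) = 4x + 2
x_train = hcat(0:10...)
y_train = actual.(x_train)

loss(model, x, y) = mean(abs2.(model(x) .- y))
predict = Dense(1 => 1)
data = [(x_train, y_train)]
opt = Descent(0.01)  # learning rate 0.01 instead of the default 0.1

for epoch in 1:10
    train!(loss, predict, data, opt)
    println(loss(predict, x_train, y_train))  # should shrink every epoch
end

Roughly speaking, the largest curvature of this quadratic loss scales with mean(x.^2), so extending x_train from 0:5 to 0:10 shrinks the range of step sizes for which plain gradient descent converges; the default 0.1 ends up above that threshold, while 0.01 stays safely below it.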


As already mentioned, the step size seems to be too large.

Mathematically, a gradient descent method is only guaranteed to converge if you control the step size, e.g. with an Armijo line search. In machine learning that is usually too expensive, so the default is a constant step size.

If you picture optimisation as walking in the mountains (here the landscape has just two parameters, weight and bias), every gradient step points downhill from where you stand, but if you take too large a step (like a giant) you end up on the hill on the other side of the valley.
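
To make the line-search idea concrete, here is a rough sketch of one gradient step with Armijo backtracking for the same Dense(1 => 1) model. This is my own illustration; the armijo_step helper is made up for this post, and Flux itself does not ship a line search:

using Flux
using Statistics

actual(x) = 4x + 2
x_train = hcat(0:10...)
y_train = actual.(x_train)
loss(model, x, y) = mean(abs2.(model(x) .- y))

# One descent step: halve the step size η until the Armijo
# sufficient-decrease condition is satisfied.
function armijo_step(model, x, y; η = 1.0, c = 1e-4)
    l0 = loss(model, x, y)
    g = Flux.gradient(m -> loss(m, x, y), model)[1]
    gnorm2 = sum(abs2, g.weight) + sum(abs2, g.bias)
    while true
        trial = Dense(model.weight .- η .* g.weight,
                      model.bias  .- η .* g.bias)
        if loss(trial, x, y) <= l0 - c * η * gnorm2
            return trial
        end
        η /= 2
    end
end

predict = Dense(1 => 1)
for epoch in 1:10
    global predict = armijo_step(predict, x_train, y_train)
    println(loss(predict, x_train, y_train))  # cannot increase, by construction
end

Each step is only accepted once it decreases the loss by at least c * η * ‖gradient‖², so the loss can no longer blow up, at the price of a few extra loss evaluations per step.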
