Why is the loss function increasing when fitting a line?

Hello everyone,

I’m following the “Fitting a Line” example from the Flux.jl guide. It works as expected with the given data, and also when I change the target function (e.g. actual(x) = 3x - 1).

I then tried to change the length of the datasets (from 6 to 11 points for the *_train variables, and from 5 to 10 points for the *_test variables). When I do so, however, the loss (MSE) increases after every call to train! instead of decreasing.

Can you please help me to understand what I’m doing wrong? Here is the script I’m using:

using Flux
using Flux: train!
using Statistics

actual(x) = 4x + 2

x_train, x_test = hcat(0:10...), hcat(11:20...)
y_train, y_test = actual.(x_train), actual.(x_test)

loss(model, x, y) = mean(abs2.(model(x) .- y))
predict = Dense(1 => 1)
data = [(x_train, y_train)]
opt = Descent()  # plain gradient descent; the default learning rate is 0.1

# Checking the initial parameters and the value of the loss function
predict.weight, predict.bias
## (Float32[-0.6373584;;], Float32[0.0])
loss(predict, x_train, y_train)
## 849.4254f0

# Training the neuron
train!(loss, predict, data, opt)
# Checking the parameters and the loss again
predict.weight, predict.bias
## (Float32[33.82415;;], Float32[5.0373588])
loss(predict, x_train, y_train)
## 32046.893f0

# Another round?
train!(loss, predict, data, opt)
predict.weight, predict.bias
## (Float32[-177.98224;;], Float32[-25.394264])
loss(predict, x_train, y_train)
## 1.2097169f6

# One last time
train!(loss, predict, data, opt)
predict.weight, predict.bias
## (Float32[1123.2878;;], Float32[162.06683])
loss(predict, x_train, y_train)
## 4.5665416f7

It seems to work with opt = Descent(0.01) instead of the default 0.1, so it is probably overshooting: each step finds the downhill direction in parameter space, but the step is so large that it lands on the other side of the valley, and so far up the opposite slope that the gradient there is even steeper, which makes the next step bigger still.
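
For reference, here is a minimal sketch of the same setup with the smaller step size (the training loop and the printout are my addition, not part of the guide):

using Flux
using Flux: train!
using Statistics

actual(x) = 4x + 2
x_train = hcat(0:10...)
y_train = actual.(x_train)

loss(model, x, y) = mean(abs2.(model(x) .- y))
predict = Dense(1 => 1)
data = [(x_train, y_train)]
opt = Descent(0.01)  # learning rate 0.01 instead of the default 0.1

for epoch in 1:10
    train!(loss, predict, data, opt)
    println(loss(predict, x_train, y_train))  # should shrink every epoch
end

Roughly speaking, the largest curvature of this quadratic loss scales with mean(x.^2), so extending x_train from 0:5 to 0:10 shrinks the range of step sizes for which plain gradient descent converges; the default 0.1 ends up above that threshold, while 0.01 stays safely below it.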


As already mentioned, the step size seems to be too large.

Mathematically, a gradient descent method is only guaranteed to converge if you control the step size, e.g. with an Armijo line search. In machine learning that is usually too expensive, so the default is a constant step size.

If you picture optimisation as walking in the mountains (here the landscape has just two parameters, weight and bias), every gradient step points downhill from where you stand, but if you take too large a step (like a giant) you end up on the hill on the other side of the valley.
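
To make the line-search idea concrete, here is a rough sketch of one gradient step with Armijo backtracking for the same Dense(1 => 1) model. This is my own illustration; the armijo_step helper is made up for this post, and Flux itself does not ship a line search:

using Flux
using Statistics

actual(x) = 4x + 2
x_train = hcat(0:10...)
y_train = actual.(x_train)
loss(model, x, y) = mean(abs2.(model(x) .- y))

# One descent step: halve the step size η until the Armijo
# sufficient-decrease condition is satisfied.
function armijo_step(model, x, y; η = 1.0, c = 1e-4)
    l0 = loss(model, x, y)
    g = Flux.gradient(m -> loss(m, x, y), model)[1]
    gnorm2 = sum(abs2, g.weight) + sum(abs2, g.bias)
    while true
        trial = Dense(model.weight .- η .* g.weight,
                      model.bias  .- η .* g.bias)
        if loss(trial, x, y) <= l0 - c * η * gnorm2
            return trial
        end
        η /= 2
    end
end

predict = Dense(1 => 1)
for epoch in 1:10
    global predict = armijo_step(predict, x_train, y_train)
    println(loss(predict, x_train, y_train))  # cannot increase, by construction
end

Each step is only accepted once it decreases the loss by at least c * η * ‖gradient‖², so the loss can no longer blow up, at the price of a few extra loss evaluations per step.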
