Gradient in simple regression not working

I tried to replicate the simple regression model from Flux’s documentation with a slightly larger dataset, but gradient descent always seems to diverge. A manual implementation of the same algorithm works well.

I would appreciate it if anyone could point out the problem.

Here is my code using Flux:

using Flux
using Flux.Tracker
using Flux.Tracker: update!
using RDatasets

trees = dataset("datasets", "trees");

X = Matrix(trees[!, [:Girth,:Height]])
y = trees[!, :Volume]
n = length(y)
nfeatures = size(X, 2)

W = rand(nfeatures)
b = rand()

predict(X, W, b) = X*W .+ b

function loss(X, y, W, b)
  ŷ = predict(X, W, b)
  sum((y .- ŷ).^2)
end

loss(X, y, W, b)
# ~ 27139.27213905282

W = param(W)
b = param(b)
gs = Tracker.gradient(() -> loss(X, y, W, b), params(W, b))
# Update the parameter and reset the gradient
update!(W, -0.0003gs[W])
update!(b, -0.0003gs[b])
loss(X, y, W, b)
# ~ 2.6542565871928763e8 (tracked)

Are you sure your step size is good and has the correct sign?

It has the correct sign, following the example in the docs. This is the same step size I use in a manual implementation of the model, where it works. Nevertheless, I have tried a step size an order of magnitude smaller, and it still diverges.
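For reference, a manual implementation of this kind of model might look like the sketch below. This is an assumption — the actual manual version isn't shown in the thread — and small made-up data stands in for the trees dataset so the example is self-contained. It minimises the averaged loss sum((y .- ŷ).^2) / (2n) with plain gradient descent:

```julia
# Plain gradient descent for linear regression, minimising
#   loss(W, b) = sum((y .- (X*W .+ b)).^2) / (2n)
# whose gradients are
#   ∂loss/∂W = -X' * (y .- ŷ) / n
#   ∂loss/∂b = -sum(y .- ŷ) / n

# Made-up stand-in data (Girth-like and Height-like columns).
X = [8.3 70.0; 11.0 75.0; 12.9 74.0; 16.0 72.0; 20.6 87.0]
y = [10.3, 18.8, 22.2, 38.3, 77.0]

loss(X, y, W, b) = sum((y .- (X * W .+ b)).^2) / (2 * length(y))

function fit(X, y; η = 0.0003, iters = 5_000)
    n = length(y)
    W = zeros(size(X, 2))
    b = 0.0
    for _ in 1:iters
        ŷ = X * W .+ b
        r = y .- ŷ                # residuals
        W += η .* (X' * r) ./ n   # gradient step on W
        b += η * sum(r) / n       # gradient step on b
    end
    W, b
end

W, b = fit(X, y)
```

With the averaged loss, a step size of 0.0003 stays small enough to be stable for data of this scale, which is consistent with the manual version converging.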

The best way forward is probably to gradually make the Flux implementation look more like the manual version, or vice versa, until you figure out what the difference between the two is. You’re already calculating gradients directly, so perhaps check that they line up with your other version?


Thanks for your reply.
I changed the loss function to the following:

function loss(X, y, W, b)
  ŷ = predict(X, W, b)
  sum((y .- ŷ).^2) / (2*length(y))
end

And the gradients are correct now. However, I am not sure why averaging the squared errors is important for gradient descent to converge.
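One way to see it: dividing by 2n only rescales the loss, so it rescales the gradient by the same constant — the gradients are "correct" in both cases, and only their magnitude changes. With n = 31, the sum-based gradient is 2n = 62 times larger than the averaged one, so the same step size of 0.0003 takes steps 62 times bigger, which is enough to push gradient descent past its stability limit on this data. A small sketch with made-up data, to keep it self-contained:

```julia
# The gradients of the two losses differ only by the constant factor 2n.
X = [8.3 70.0; 11.0 75.0; 12.9 74.0]
y = [10.3, 18.8, 22.2]
n = length(y)

W = [0.5, 0.5]
b = 0.1
r = y .- (X * W .+ b)            # residuals y .- ŷ

g_sum  = -2 .* (X' * r)          # ∂/∂W of sum((y .- ŷ).^2)
g_mean = -(X' * r) ./ n          # ∂/∂W of sum((y .- ŷ).^2) / (2n)

ratio = g_sum ./ g_mean          # → [6.0, 6.0] here (2n with n = 3)
```

Since update! applies the gradient exactly as given, rescaling the loss by 1/(2n) has the same effect as shrinking the learning rate on the original sum-of-squares loss from 0.0003 to 0.0003/62 ≈ 4.8e-6.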