I think the main thing you’re missing is a non-linear activation function. You can’t reproduce the square of a number with a linear combination of the number itself — without a non-linearity, a stack of linear layers collapses into a single linear map.
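To see this concretely, here is a small NumPy check (my own illustration, not from your code): the best possible linear fit to y = x² over a symmetric range is just a constant, with a large residual error.

```python
import numpy as np

# Best linear fit y = w*x + b to y = x^2 over [-10, 10].
x = np.linspace(-10, 10, 201)
y = x ** 2

A = np.stack([x, np.ones_like(x)], axis=1)      # design matrix [x, 1]
(w, b), *_ = np.linalg.lstsq(A, y, rcond=None)  # least-squares solution

pred = w * x + b
rmse = np.sqrt(np.mean((pred - y) ** 2))
# Over a symmetric range the slope w comes out ~0, so the "fit" is
# just the constant mean of y, and the RMSE stays large.
print(f"w={w:.4f}, b={b:.2f}, RMSE={rmse:.1f}")
```

No amount of training fixes this: the model family simply cannot represent the target.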

Using ReLU, I got decent convergence after about 100 epochs with your example.

I also tried using batch gradients, but convergence was very slow — on the order of 10,000 epochs. Some tuning of the optimizer would probably speed it up.

EDIT: Yep, `Adam(0.1)` with batch gradients gets very good convergence by 1,000 epochs and runs far faster than iterating through each data point.

As already stated, you’ll need a non-linear activation such as ReLU. Additionally, I suspect your input range of [-10, 10] is too large for stable training, and successive samples are too correlated. I would suggest something along the lines of:
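Here is a minimal NumPy sketch of that setup, since I don’t know which framework you’re using. The network shape, seed, and random-uniform sampling are my own choices, and I’m using a more conservative learning rate (0.01) than the 0.1 in the edit above: inputs and targets are rescaled into a small range, samples are drawn in shuffled order rather than as a sorted grid, the hidden layer uses ReLU, and training uses full-batch Adam.

```python
import numpy as np

rng = np.random.default_rng(0)

# Raw data: y = x^2 on [-10, 10]. Scale x into [-1, 1] and y into [0, 1],
# and draw samples in random order so successive inputs are uncorrelated.
x_raw = rng.uniform(-10, 10, size=(256, 1))
x = x_raw / 10.0
y = (x_raw ** 2) / 100.0

# Tiny MLP: 1 -> 16 -> 1 with ReLU (the non-linearity that makes x^2 learnable).
W1 = rng.normal(0, 0.5, (1, 16)); b1 = np.zeros(16)
W2 = rng.normal(0, 0.5, (16, 1)); b2 = np.zeros(1)
params = [W1, b1, W2, b2]

# Minimal hand-rolled Adam, full-batch gradients.
m = [np.zeros_like(p) for p in params]
v = [np.zeros_like(p) for p in params]
lr, beta1, beta2, eps = 0.01, 0.9, 0.999, 1e-8

for t in range(1, 2001):
    # Forward pass.
    h_pre = x @ W1 + b1
    h = np.maximum(h_pre, 0.0)          # ReLU
    out = h @ W2 + b2
    err = out - y
    loss = np.mean(err ** 2)

    # Backward pass (MSE loss).
    g_out = 2 * err / len(x)
    gW2 = h.T @ g_out; gb2 = g_out.sum(0)
    g_h = g_out @ W2.T
    g_pre = g_h * (h_pre > 0)           # ReLU gradient mask
    gW1 = x.T @ g_pre; gb1 = g_pre.sum(0)

    # Adam update with bias correction.
    for i, (p, g) in enumerate(zip(params, [gW1, gb1, gW2, gb2])):
        m[i] = beta1 * m[i] + (1 - beta1) * g
        v[i] = beta2 * v[i] + (1 - beta2) * g ** 2
        mhat = m[i] / (1 - beta1 ** t)
        vhat = v[i] / (1 - beta2 ** t)
        p -= lr * mhat / (np.sqrt(vhat) + eps)

print(f"final training MSE: {loss:.5f}")
```

The same idea carries over directly to any framework: rescale the inputs, shuffle the data, keep the ReLU, and use Adam instead of plain SGD.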