I think the main thing you're missing is a non-linear activation function. You can't replicate the square of a number by simply taking a linear combination of it; without a non-linearity, stacked linear layers collapse into a single linear map, which can never fit y = x².
By using ReLU, I got decent convergence after 100 epochs with your example.
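A minimal sketch of that kind of setup, assuming a Keras model; the data generation, layer sizes, and optimizer below are assumptions, since the original example isn't shown here:

```python
import numpy as np
from tensorflow import keras

# Assumed reconstruction of the task: learn y = x^2 on [-10, 10]
# (the original data generation wasn't shown).
x = np.random.uniform(-10, 10, size=(1000, 1))
y = x ** 2

# Small MLP; the key point is the ReLU non-linearity in the hidden layers.
model = keras.Sequential([
    keras.Input(shape=(1,)),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(1),  # linear output for regression
])
model.compile(optimizer="sgd", loss="mse")  # optimizer choice is an assumption

# batch_size=1 corresponds to updating after every individual data point.
model.fit(x, y, epochs=100, batch_size=1, verbose=0)
```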
I also tried full-batch gradient descent, but convergence was very slow… about 10,000 epochs. There's probably some tuning required to make it faster.
EDIT: Yep, Adam with a learning rate of 0.1 and full-batch gradients gets very good convergence by 1,000 epochs and runs far faster than iterating through each data point.
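A sketch of that edit against the same (assumed) model: batch_size is set to the full dataset so each epoch is a single batch gradient step, and the optimizer is switched to Keras's Adam at a 0.1 learning rate:

```python
# Same model and data as above, but trained with Adam(0.1) on full-batch gradients:
# batch_size=len(x) makes each epoch a single gradient update over all samples.
model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.1), loss="mse")
model.fit(x, y, epochs=1000, batch_size=len(x), verbose=0)
```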
As already stated, you'll need a non-linear activation such as ReLU. Additionally, I expect that the range of your input data, [-10, 10], is too large for training, and that successive samples are too correlated. I would suggest something along the lines of:
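(A rough sketch of that kind of setup, again assuming Keras; the scaling factors, shuffling, layer sizes, and training settings are illustrative, not taken verbatim from any original code.)

```python
import numpy as np
from tensorflow import keras

# Illustrative data: y = x^2 sampled on [-10, 10].
x = np.linspace(-10, 10, 1000).reshape(-1, 1)
y = x ** 2

# Scale inputs and targets down to a small range so the optimizer behaves well.
x_scaled = x / 10.0      # now in [-1, 1]
y_scaled = y / 100.0     # now in [0, 1]

# Shuffle so successive training samples aren't correlated.
idx = np.random.permutation(len(x_scaled))
x_scaled, y_scaled = x_scaled[idx], y_scaled[idx]

model = keras.Sequential([
    keras.Input(shape=(1,)),
    keras.layers.Dense(32, activation="relu"),  # the non-linearity
    keras.layers.Dense(1),
])
model.compile(optimizer=keras.optimizers.Adam(1e-3), loss="mse")
model.fit(x_scaled, y_scaled, epochs=200, batch_size=32, verbose=0)

# To predict on raw x values, apply the same input scaling and undo the target scaling:
# y_pred = model.predict(x_new / 10.0) * 100.0
```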