The biggest thing that jumps out to me is that the SGD loop is inside out. Currently you have:
for batch in randsamples(data):
    for i in range(nepoch):
        sgd_step(model, batch, opt)
Whereas the correct ordering should be:
for i in range(nepoch):
    for batch in randsamples(data):
        sgd_step(model, batch, opt)
Conceptually, you can think of the first approach as beating the model over the head with a single batch, then swapping to a completely different one and repeating. Some of that weight adaptation may happen to be conducive to generalization, but most likely you will have overfit on the last minibatch, since the model is fed that one batch repeatedly for 250 consecutive optimization steps right before evaluation. Instead of the semi-directed random walk on the loss landscape you would expect from SGD, the trajectory will look like a series of dramatic jerks without any real sense of direction.
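For concreteness, here is a minimal runnable sketch of the corrected nesting. It assumes a PyTorch-style setup; the toy data, Linear model, and MSE loss are placeholders standing in for your own model, data, and whatever sgd_step does internally. The only point being illustrated is that the epoch loop sits on the outside and the shuffled-batch loop on the inside.

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    # Toy stand-ins so the loop runs end to end; swap in your own model/data.
    X, y = torch.randn(1000, 16), torch.randn(1000, 1)
    loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)  # reshuffled every epoch
    model = torch.nn.Linear(16, 1)
    opt = torch.optim.SGD(model.parameters(), lr=1e-2)
    loss_fn = torch.nn.MSELoss()

    nepoch = 250
    for epoch in range(nepoch):    # outer loop: one full pass over the data per epoch
        for xb, yb in loader:      # inner loop: one optimization step per minibatch
            opt.zero_grad()
            loss = loss_fn(model(xb), yb)
            loss.backward()
            opt.step()

With this ordering every parameter update sees a different minibatch, so the gradient noise averages out over an epoch instead of compounding on a single batch right before evaluation.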