How to train Flux to learn a sequence conditional on some initial "seeds"?

It is strange that the training MSE rises for the first 7 epochs straight (which covers several hundred gradient steps). Does decreasing the learning rate help?
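
To show why I ask about the learning rate, here is a toy sketch. It is plain gradient descent on f(w) = w^2, not your model and not Flux code, and the helper name `losses` and the step sizes are made up for illustration. With a step size that is too large, every update overshoots the minimum and the loss goes up step after step; a smaller step size makes it go down.

```julia
# Toy illustration only: plain gradient descent on f(w) = w^2.
# This is NOT the model from this thread; `losses` is a hypothetical helper.
f(w) = w^2
grad(w) = 2w

function losses(lr; w0 = 1.0, steps = 5)
    w = w0
    history = [f(w)]            # loss before any update
    for _ in 1:steps
        w -= lr * grad(w)       # one gradient step
        push!(history, f(w))    # loss after the update
    end
    return history
end

too_large = losses(1.1)   # each step overshoots, so the loss grows
smaller   = losses(0.1)   # each step moves toward the minimum, so the loss shrinks
```

If your training curve looks like `too_large`, a smaller learning rate is the first thing I would try. A real network is of course not a quadratic, so this only shows the mechanism and is not a diagnosis.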

By the way, you wrote that the validation error is always smaller than the training error, but according to your plot that is not the case.