I was trying to move from a Python ML/DL stack to Julia (roughly from sklearn + PyTorch/TensorFlow to MLJ + Flux).
So I decided to rewrite the Kaggle courses in Julia, switching from the Python libraries to their Julia counterparts. In particular, I was implementing the notebook on underfitting and overfitting, which also uses early stopping (you can find it by opening the corresponding exercise here: Learn Intro to Deep Learning Tutorials | Kaggle).
I reached the second neural network, the first “deep” one, trained without early stopping, and stumbled upon a huge difference in predictive performance: in just 50 epochs, TensorFlow starts from a loss of about 0.29 and gets down to 0.1992, while the same architecture, with the same loss function and the same optimizer, in Flux starts from something like 90.000 and ends up anywhere between 3.43 and 0.736 (depending on the RNG, I suppose).
What could be the problem? I followed the same steps as closely as possible and based my implementation on the models in Flux’s model zoo.
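
For concreteness, here is a minimal sketch of the kind of Flux setup I mean. The input dimension, layer widths, batch size, and random data below are illustrative placeholders, not the exact values from the Kaggle exercise:

```julia
using Flux, Random

# Placeholder data: the real exercise uses a tabular dataset, so the
# feature count (11) and sample count here are assumptions.
Random.seed!(42)
X = rand(Float32, 11, 1_000)   # features × observations (Flux batches are column-major)
y = rand(Float32, 1, 1_000)    # regression targets

# A plain fully connected network, mirroring a Keras Sequential model.
model = Chain(
    Dense(11 => 512, relu),
    Dense(512 => 512, relu),
    Dense(512 => 512, relu),
    Dense(512 => 1),
)

loss(m, x, y) = Flux.mae(m(x), y)      # MAE loss, as in the tutorial
opt_state = Flux.setup(Adam(), model)  # Adam with default hyperparameters

loader = Flux.DataLoader((X, y); batchsize = 256, shuffle = true)
for epoch in 1:50
    Flux.train!(loss, model, loader, opt_state)
    @info "epoch $epoch" train_loss = loss(model, X, y)
end
```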