I'm trying to move from a Python ML/DL stack to Julia (i.e., from something like sklearn + PyTorch/TensorFlow to MLJ + Flux).
As an exercise, I decided to rewrite the Kaggle courses in Julia, switching from the Python libraries to the Julia ones as well. In particular, I was implementing the notebook on underfitting and overfitting, which also uses early stopping (you can find it by opening the related exercise here: Learn Intro to Deep Learning Tutorials).
I reached the second neural network, the first “deep” one without early stopping, and stumbled upon a huge difference in predictive performance: in just 50 epochs, TensorFlow starts from a loss of about 0.29 and reaches 0.1992, while the same architecture, with the same loss and the same optimizer, in Flux starts from something like 90.000 and ends anywhere between 3.43 and 0.736 (depending on the RNG seed, I suppose).
What could the problem be? I followed the same steps as closely as possible and based my implementations on the models in Flux's model zoo.
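For reference, my Flux version looks roughly like this (a sketch from memory: the data, layer widths, and MAE loss are placeholders standing in for whatever the exercise actually uses):

```julia
using Flux

# Hypothetical stand-ins for the real preprocessed data: 11 features,
# 1000 observations (Flux expects features × observations).
X = rand(Float32, 11, 1000)
y = rand(Float32, 1, 1000)
loader = Flux.DataLoader((X, y); batchsize=256, shuffle=true)

# The "deep" network without early stopping; widths are placeholders.
model = Chain(
    Dense(11 => 128, relu),
    Dense(128 => 64, relu),
    Dense(64 => 1),
)

opt_state = Flux.setup(Adam(), model)
loss(m, x, y) = Flux.mae(m(x), y)   # assuming MAE; adjust to the notebook's loss

for epoch in 1:50
    for (xb, yb) in loader
        grads = Flux.gradient(m -> loss(m, xb, yb), model)
        Flux.update!(opt_state, model, grads[1])
    end
end
```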
I agree: if they fundamentally differ in such a way (even when trying to use the same methods), there's likely some discrepancy left between your TensorFlow code and your Flux code. I wouldn't primarily blame the RNG here; such a big difference for the “same architecture” shouldn't happen from the RNG alone.
Do you mind posting your two versions (TensorFlow/Flux) so we can take a look?
I (partially) unveiled the mystery: it was the MLJ preprocessing. If I use a MinMaxScaler I implemented myself for MLJModels together with MLJ's OneHotEncoder, I get the behavior described above.
When I instead implemented both steps by hand (using onehotbatch from Flux and scaling the features manually in place of the MinMaxScaler), I achieved performance comparable to TensorFlow.
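Concretely, the by-hand version looks roughly like this (a sketch; `Xnum` and `labels` are hypothetical names for the raw numeric features and the categorical column):

```julia
using Flux: onehotbatch

# Hand-rolled min-max scaling, one row per feature
# (no guard for constant columns in this sketch).
function minmax_scale(X::AbstractMatrix)
    lo = minimum(X; dims=2)
    hi = maximum(X; dims=2)
    return (X .- lo) ./ (hi .- lo)
end

# Dummy data standing in for the real dataset.
Xnum = rand(10, 1000)                  # numeric features × observations
labels = rand(["a", "b", "c"], 1000)   # raw categorical values

Xscaled = minmax_scale(Xnum)
Xcat = onehotbatch(labels, unique(labels))  # one row per category level

# Final design matrix: numeric and one-hot features stacked.
X = vcat(Xscaled, Xcat)
```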
It seems strange to me that MLJ cooperates that badly with Flux, so I'll investigate further into why this actually happens.
Meanwhile, I'd consider the problem closed, since it's now clear that Flux is not to blame.
Rather than MLJ and Flux not cooperating (if they didn't, https://github.com/FluxML/MLJFlux.jl wouldn't exist!), it seems more likely that your by-hand implementation differs from what MLJ is doing. It's worth comparing the two to verify that's the case.
That's why I said it looks quite strange. I'm now comparing the output of the preprocessing done with MLJ against the one done by hand, to understand what's different.
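The comparison itself is straightforward (`X_mlj` and `X_hand` are hypothetical names for the two preprocessed matrices):

```julia
# X_mlj: the matrix coming out of the MLJ transformers
# X_hand: the matrix from the manual pipeline
@show size(X_mlj) size(X_hand)        # orientation/width mismatches show up here
@show extrema(X_mlj) extrema(X_hand)  # both should lie in [0, 1] after scaling
@show maximum(abs.(X_mlj .- X_hand))  # elementwise discrepancy, if shapes agree
```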