Yes, use Flux.gradient or pullback instead of train! so that you can analyze the gradients and model parameters before each update.
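For concreteness, here's a minimal sketch of that pattern (assuming a reasonably recent Flux with the explicit-parameter API; the model, sizes, and loss below are made up for illustration):

```julia
using Flux

# Toy MLP and data, purely for illustration.
model = Chain(Dense(4 => 16, relu), Dense(16 => 1))
x, y  = rand(Float32, 4, 8), rand(Float32, 1, 8)

loss(m, x, y) = sum(abs2, m(x) .- y) / length(y)   # plain MSE

opt_state = Flux.setup(Adam(), model)

# Explicit gradient: inspect (or log) the gradients and the current
# parameters before deciding to apply the update.
grads = Flux.gradient(m -> loss(m, x, y), model)[1]
@show extrema(grads.layers[1].weight)    # e.g. check for NaNs or blow-ups
@show extrema(model.layers[1].weight)    # parameters before the step

# pullback gives you the loss value and the gradient from one forward pass.
l, back = Flux.pullback(m -> loss(m, x, y), model)
grads_pb = back(one(l))[1]

Flux.update!(opt_state, model, grads)    # apply the step once you're satisfied
```

train! essentially runs this same loop for you; writing it out by hand is what lets you pause between the gradient call and update!.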
Another point to consider is that Flux and PyTorch initialize Dense layers differently by default. See Initializing Flux weights the same as PyTorch? - #4 by DevJac. If you can verify that

a) the initializations are similar,
b) the outputs from each intermediate step of the forward pass are similar, and
c) the gradients are similar,

then I think the behaviour shouldn’t be much different between PyTorch and Flux. If there is a bug (which seems somewhat unlikely since you’re using a plain MLP on CPU), I would imagine it’s somewhere in the backwards pass (and thus will show up in the gradients).
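If it helps, here's one way to check a) and b). This is a rough sketch, assuming Flux ≥ 0.13 and that PyTorch's nn.Linear default init works out to roughly U(-k, k) with k = 1/sqrt(fan_in) for both weight and bias (worth double-checking against the linked thread); Flux's Dense instead defaults to glorot_uniform weights and zero bias:

```julia
using Flux, Statistics

# PyTorch-like init for a Dense layer: uniform in (-k, k), k = 1/sqrt(fan_in).
torch_weight(out, in) = (2f0 .* rand(Float32, out, in) .- 1f0) ./ sqrt(Float32(in))
torch_bias(out, in)   = (2f0 .* rand(Float32, out)     .- 1f0) ./ sqrt(Float32(in))

model = Chain(
    Dense(4 => 16, relu; init = torch_weight, bias = torch_bias(16, 4)),
    Dense(16 => 1;       init = torch_weight, bias = torch_bias(1, 16)),
)

x = rand(Float32, 4, 8)

# Flux.activations returns the output of every layer in the Chain, which you
# can compare against the corresponding PyTorch activations layer by layer.
for (i, a) in enumerate(Flux.activations(model, x))
    println("layer $i: size = $(size(a)), mean = $(mean(a)), std = $(std(a))")
end
```

For c), the gradient/pullback snippet above gives you the per-layer gradients to compare in the same way.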