A similar question was posted here a little while ago. In that case, Keras was using its default batch size of 32 and Flux wasn’t, and that turned out to be the source of the difference in behavior. I wonder if you’re seeing the same thing.
Here’s a link to that thread; the post just below this one has a version of the original author’s code that matches the Keras/TF behavior.
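To show why batch size alone can make two otherwise identical training runs look different, here’s a rough sketch in plain Python (not Keras or Flux code). The `train` function, the toy data, and the learning rate are all made up for illustration. The only library fact it relies on is that Keras `fit` uses `batch_size=32` when you don’t set one. I’m assuming the other run gets the whole dataset as a single batch, which may or may not match your setup.

```python
import random

def train(xs, ys, batch_size, epochs, lr=0.1):
    """Fit y = w * x by gradient descent on mean squared error.

    Returns the learned weight and the number of parameter updates made.
    (Hypothetical helper for illustration; not a Keras or Flux API.)
    """
    w = 0.0
    updates = 0
    for _ in range(epochs):
        # One parameter update per batch, so smaller batches mean more updates per epoch.
        for start in range(0, len(xs), batch_size):
            xb = xs[start:start + batch_size]
            yb = ys[start:start + batch_size]
            grad = sum(2 * (w * x - y) * x for x, y in zip(xb, yb)) / len(xb)
            w -= lr * grad
            updates += 1
    return w, updates

random.seed(0)
xs = [random.uniform(-1, 1) for _ in range(320)]
ys = [3.0 * x for x in xs]  # the true weight is 3.0

# Same data, same epochs, same learning rate; only the batch size differs.
w_full, n_full = train(xs, ys, batch_size=len(xs), epochs=5)  # whole dataset as one batch
w_mini, n_mini = train(xs, ys, batch_size=32, epochs=5)       # Keras-style default of 32

print("updates:", n_full, "vs", n_mini)
print("error:  ", abs(3.0 - w_full), "vs", abs(3.0 - w_mini))
```

With 320 samples, the full-batch run makes one update per epoch and the batch-size-32 run makes ten, so after the same number of epochs the second one ends up much closer to the true weight. If that’s what is going on in your case, batching the data on the Flux side the same way should bring the two results in line.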
Hope that helps!