That’s to be expected since Flux is still doing the work of applying a null bias and no-op activation function.
To your original question, it may also be worth comparing backward pass timings, unless you only plan to run inference with pre-trained weights loaded from somewhere else. It would help if you could share a little more about your real use case (e.g. the full networks you’re trying to run, whether you need GPU support, and how important training speed is, if at all).
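In case it's useful, here is a rough sketch of how forward and backward timings could be compared with BenchmarkTools. The layer sizes, batch size, and the `sum` loss are placeholders I made up for illustration, not anything from your code, and the `Dense(in => out)` syntax assumes a reasonably recent Flux version.

```julia
using Flux, BenchmarkTools

# Hypothetical layer and input; substitute your real network and data.
model = Dense(128 => 64)           # identity activation by default
x = rand(Float32, 128, 32)         # 128 features, batch of 32

# Run once first so compilation time is not part of the measurement.
y = model(x)
grads = Flux.gradient(m -> sum(m(x)), model)

# Forward pass only.
@btime $model($x);

# Forward + backward pass, using a placeholder scalar loss (sum of outputs).
@btime Flux.gradient(m -> sum(m($x)), $model);
```

The `$` interpolation keeps global-variable lookup out of the measurement. Note that the second timing covers the forward pass as well, so the backward cost alone is roughly the difference between the two.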