That’s to be expected since Flux is still doing the work of applying a null bias and no-op activation function.
The faster Julia implementation is the one without the dense() layers; it uses plain abstract matrices and uses Flux only to initialise the parameters. This means it performs no null-bias operation, and the only activation function applied in both Python and Julia is the softmax.
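A minimal sketch of the two forward paths being compared, in plain Julia with no packages. The `DenseLike` struct and `softmax_cols` function are hypothetical stand-ins written for illustration, not Flux's actual implementation, and the sizes are arbitrary assumptions rather than the ones from my benchmark.

```julia
# Column-wise softmax, written out so the sketch needs no packages.
function softmax_cols(z::AbstractMatrix)
    e = exp.(z .- maximum(z; dims=1))  # subtract the column max for numerical stability
    return e ./ sum(e; dims=1)
end

# Hypothetical stand-in for a dense-style layer: weights, bias, activation.
struct DenseLike{M<:AbstractMatrix,V<:AbstractVector,F}
    W::M
    b::V
    σ::F
end

# Forward pass: matrix multiply, bias add, then the activation broadcast.
(l::DenseLike)(x) = l.σ.(l.W * x .+ l.b)

# Arbitrary sizes chosen for illustration only.
n_in, n_out, batch = 64, 10, 128
W = randn(Float32, n_out, n_in)
x = randn(Float32, n_in, batch)

# Layer with a zero ("null") bias and a no-op activation.
layer = DenseLike(W, zeros(Float32, n_out), identity)

y_layer = softmax_cols(layer(x))  # zero bias add + identity broadcast, then softmax
y_plain = softmax_cols(W * x)     # matrix multiply only, then softmax

# Both paths compute the same values; only the amount of work differs.
@assert y_layer ≈ y_plain
```

Timing `softmax_cols(layer(x))` against `softmax_cols(W * x)`, for example with `@btime` from BenchmarkTools.jl, should show whether the extra bias add and broadcast matter at a given layer size.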
To your original question, it may be worth also comparing backward pass timings unless you’re planning on loading pre-trained weights from somewhere else
I can compare the backward pass as well, since I am interested in both forward and forward + backward timings. Any optimisation suggestions would have to account for the backward pass, but I am interested in forward execution time as well.
the full networks you’re trying to run
I’m not going to share the full networks I want to run. I want to know how one would speed up this particular layer. I understand there are different ways to speed up the overall network by taking advantage of parallelism based on its overall structure. But in this question I am only interested in how one would speed up this layer (if you can think of a very similar layer that naturally has an easy way to speed it up, I would be interested in that as well).
whether you need GPU support
I want to know CPU speeds. I’m not familiar with the differences between parallelism on GPU vs CPU, but my interest is in what benchmark performance I will get on a CPU.
how important training speed is if at all
How important training speed is varies: it depends on how large the relative inference speed-up is. Roughly, I would say inference speed is 10x more important. Having said that, I would prefer to see all approaches to speeding up inference, as long as the backward pass still works.