Speed Comparison Python v Julia for custom layers

I think it would be fairer to measure without the random number generation, although it is probably negligible.

In this use case I think you’re okay with just the forward pass, at least if you work with a fully trained network from the start.

If you want to get peak performance from your Julia code, you may want to think about allocations. Maybe there is a way to reuse some storage in this computation?
Of course this will also mean that the backward pass with Zygote will fail due to mutation, but if you don’t need it who cares