The problem with Flux.jl was that it was too simple. “Oh look how simple the code is” is cute for parts, but libraries exist to handle the hard parts. Library code should be easy to understand, but how the internals look is not a selling point of the library itself. I think this is what Flux had previously gotten wrong: it was too much about aesthetics and not enough about actually solving the problem.
It’s moving in the right direction now, thanks in large part to @dhairyagandhi96 and @Elrod, who are making the standard tools of Flux as optimal as possible. The new versions of the layers are adding @avx and all sorts of things to ensure that everything SIMDs. Dense is now beating FastDense, the DiffEqFlux version that used to be 20x faster than using Dense in a Neural ODE, and which was really a hack to get around the fact that we couldn’t add those performance enhancements to the Flux library itself. It would be nice to have a SimpleDense just for show, but Dense should be as “not simple” as it needs to be, even calling inline assembly if that’s what it takes to have the fastest kernel. Flux is now moving in this direction, and I think actual users will enjoy the extra performance.
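To make the “layers should SIMD” point concrete, here’s a minimal sketch of what that looks like. This is not Flux’s actual Dense implementation; the SimpleDense name, field layout, and sizes are purely illustrative. It just shows LoopVectorization’s @avx driving the matrix-vector loop so it vectorizes on the CPU:

```julia
# A minimal, illustrative dense layer (not Flux's real code) whose forward
# pass uses LoopVectorization's @avx so the inner loops get SIMD.
using LoopVectorization

struct SimpleDense{M,V,F}
    W::M   # weight matrix
    b::V   # bias vector
    σ::F   # activation function
end

function (d::SimpleDense)(x::AbstractVector)
    W, b = d.W, d.b
    y = similar(b, promote_type(eltype(W), eltype(x)), size(W, 1))
    @avx for i in axes(W, 1)           # vectorized matrix-vector product
        acc = zero(eltype(y))
        for j in axes(W, 2)
            acc += W[i, j] * x[j]
        end
        y[i] = acc + b[i]
    end
    return d.σ.(y)                     # activation applied elementwise
end

# Usage: a 32 -> 16 layer with tanh activation
layer = SimpleDense(randn(Float32, 16, 32), zeros(Float32, 16), tanh)
y = layer(randn(Float32, 32))
```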
There are some things here that can be optimized even beyond PyTorch thanks to the generation of fused kernels (though not necessarily beyond TensorFlow, since it also fuses kernels), and this is something the Torch.jl library isn’t able to capture.
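As a toy picture of what fusion buys you (again, just an illustration, not a claim about how Flux lowers any particular model), Julia’s dot-broadcast syntax already fuses chained elementwise operations into a single loop, so no intermediate arrays get materialized for the pointwise part:

```julia
# Broadcast fusion in plain Julia: the `.+` and the `tanh.` below compile into
# one fused loop over the output, instead of one pass (and one temporary
# array) per operation. The matrix-vector product itself is a separate call.
W = randn(Float32, 16, 32)
x = randn(Float32, 32)
b = randn(Float32, 16)
y = tanh.(W * x .+ b)
```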