Multithreaded Julia matrix multiply (specifically, Octavian.jl) “dominates” MKL at sizes below 100x100, and remains competitive in performance at larger sizes.
Simply applying LoopVectorization.@tturbo to three nested for loops does well until we run out of L2 cache. Results will of course differ by CPU; that particular CPU has much more L2 cache than most (18 cores, 1 MiB/core).
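As a minimal sketch of the kernel described above, here is a plain triple-loop matrix multiply with `@tturbo` on the outer loops; the function name `matmul!` and the accumulator variable are illustrative, not taken from Octavian.jl itself:

```julia
using LoopVectorization

# Naive triple-loop matmul: @tturbo vectorizes and multithreads
# the loop nest. C is overwritten with A * B.
function matmul!(C, A, B)
    @tturbo for m in axes(A, 1), n in axes(B, 2)
        Cmn = zero(eltype(C))
        for k in axes(A, 2)
            Cmn += A[m, k] * B[k, n]
        end
        C[m, n] = Cmn
    end
    return C
end
```

Calling `matmul!(C, A, B)` on small matrices exercises the regime discussed here; without the cache-level blocking that Octavian.jl adds, performance falls off once the operands no longer fit in L2.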
MKL starts to run away from the competition around 3000x3000 on that computer.
On AMD systems, the small-size performance of Julia matrix multiply is unmatched:
This was on a Ryzen 4900HS.
The DifferentialEquations ecosystem doesn’t really do any hardware-specific optimizations itself yet, but it (and ModelingToolkit in particular) is another great example of how code generation can be leveraged for better problem solving in Julia.
A long-term goal of mine is to work on an SLP vectorizer the ecosystem can use, as well as an SPMD compiler like ISPC. DifferentialEquations would be the target for both of these, but they should be usable by interested Julia projects more generally.
But for now, I still have a lot of loop work ahead of me (in particular, modeling much more complicated loops so they can be optimized and still produce correct results).

