LoopVec, Tullio losing to Matrix multiplication

I would also add that optimizing matrix multiplication involves “nonlocal” changes to the code: it’s not simply a matter of taking the naive 3-nested-loop algorithm and vectorizing, multithreading, or fine-tuning the loops. The whole structure of the code changes, e.g. to “block” the algorithm to improve cache performance. Optimized implementations use multiple nested levels of “blocking” for multiple levels of the cache, and at the lowest level even treat the registers as a kind of ideal cache. This is not the sort of transformation that something like LoopVectorization.jl does.
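To make the "nonlocal" point concrete, here is a rough sketch of the naive 3-nested-loop version next to a version with a single level of cache blocking. The function names and the tile size `bs` are hypothetical, chosen only for illustration. Real optimized libraries go much further (several blocking levels, packing of tiles, register-level micro-kernels), so treat this as a toy, not as how any particular BLAS is implemented.

```julia
# Naive triple loop computing C = A * B (column-major friendly loop order).
function matmul_naive!(C, A, B)
    m, k = size(A)
    n = size(B, 2)
    fill!(C, zero(eltype(C)))
    for j in 1:n, p in 1:k, i in 1:m
        C[i, j] += A[i, p] * B[p, j]
    end
    return C
end

# Same arithmetic, restructured into tiles so that each tile of A, B and C
# is reused while it is (hopefully) still in cache.
# `bs` is a hypothetical tile size; a good value depends on the machine.
function matmul_blocked!(C, A, B; bs::Int = 64)
    m, k = size(A)
    n = size(B, 2)
    fill!(C, zero(eltype(C)))
    for jj in 1:bs:n, pp in 1:bs:k, ii in 1:bs:m
        # Clamp tile bounds so sizes need not be multiples of bs.
        jmax = min(jj + bs - 1, n)
        pmax = min(pp + bs - 1, k)
        imax = min(ii + bs - 1, m)
        for j in jj:jmax, p in pp:pmax
            b = B[p, j]
            for i in ii:imax
                C[i, j] += A[i, p] * b
            end
        end
    end
    return C
end

# Small sanity check: both versions should agree with the built-in product.
A = reshape(collect(1.0:12.0), 3, 4)
B = reshape(collect(1.0:20.0), 4, 5)
C_naive = matmul_naive!(zeros(3, 5), A, B)
C_blocked = matmul_blocked!(zeros(3, 5), A, B; bs = 2)
```

Note that the blocked version is not obtained by annotating the naive loops: the loop nest itself is different (six loops instead of three), which is exactly the kind of restructuring a loop vectorizer will not do for you.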

See also this thread: Julia matrix-multiplication performance
