LoopVec, Tullio losing to Matrix multiplication

stevengj · June 18, 2024, 2:54pm

I would also add that optimizing matrix multiplication involves “nonlocal” changes to the code — it’s not simply a matter of taking the naive 3-nested-loop algorithm and vectorizing/multithreading/fine-tuning the loops. The whole structure of the code is changed, e.g. to “block” the algorithm to improve cache performance (with multiple nested levels of “blocking” for multiple levels of the cache, even at the lowest level treating the registers as a kind of ideal cache). This is not the sort of transformation that something like LoopVectorization.jl does.
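To make the "nonlocal" point concrete, here is a minimal sketch contrasting the naive three-loop kernel with a version that adds a single level of cache blocking. The function names and the tile size `bs` are hypothetical choices for illustration, not taken from any library, and the tile size is not tuned. A real BLAS goes much further (several nested blocking levels, packing of tiles into contiguous buffers, hand-written register-level microkernels), so this only shows how the loop structure itself changes.

```julia
# Naive 3-nested-loop algorithm: C = A * B.
function matmul_naive!(C, A, B)
    m, k = size(A)
    k2, n = size(B)
    @assert k == k2 && size(C) == (m, n)
    fill!(C, zero(eltype(C)))
    @inbounds for j in 1:n, p in 1:k, i in 1:m
        C[i, j] += A[i, p] * B[p, j]
    end
    return C
end

# Same arithmetic, restructured with ONE level of blocking.
# `bs` is a hypothetical, untuned tile size (assumption for illustration).
function matmul_blocked!(C, A, B; bs::Int = 64)
    m, k = size(A)
    k2, n = size(B)
    @assert k == k2 && size(C) == (m, n)
    fill!(C, zero(eltype(C)))
    # Outer loops walk over tiles; each tile of A, B, C is small enough
    # that it can plausibly stay in cache while it is being reused.
    @inbounds for jj in 1:bs:n, pp in 1:bs:k, ii in 1:bs:m
        jmax = min(jj + bs - 1, n)
        pmax = min(pp + bs - 1, k)
        imax = min(ii + bs - 1, m)
        # Inner loops are the naive kernel restricted to one tile.
        for j in jj:jmax, p in pp:pmax
            b = B[p, j]
            for i in ii:imax
                C[i, j] += A[i, p] * b
            end
        end
    end
    return C
end

# Usage: both versions should agree up to floating-point rounding.
A = rand(100, 70)
B = rand(70, 90)
C1 = matmul_naive!(zeros(100, 90), A, B)
C2 = matmul_blocked!(zeros(100, 90), A, B; bs = 16)
println(maximum(abs.(C1 .- C2)) < 1e-10)
```

Note that the blocked version is not the naive loop with annotations added: the iteration space itself is reorganized into six loops, which is the kind of restructuring the post describes as outside what a loop vectorizer does.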

See also this thread: Julia matrix-multiplication performance

