Matrix multiplication is probably the most heavily optimized routine in BLAS, which is itself one of the most heavily optimized libraries in existence. It has been tuned continuously over decades, and implementations often go as far as hand-written assembly and architecture-specific kernels to maximize performance.
(I will note that there are multiple BLAS libraries in existence, but all of the common ones are “very good.” Intel’s MKL is sometimes regarded as the most performant, at least on x86 processors. But for this post I will ignore the fact that there are several competing libraries and speak as if they are one.)
This is to say that if anything decisively beat BLAS at its own game, it would not be long before the BLAS devs caught wind and matched its performance. A common exception is small matrices (smaller than 10x10, for example), where specialized implementations can often win (see StaticArrays.jl).
LoopVectorization and Tullio fill a different niche. While BLAS addresses a relatively small number of ubiquitous operations, these packages are designed to generate performant kernels for other functions that aren’t necessarily on that short list. They usually generate “best practice” code (or something close to it) – EDIT: thanks to the poster below! I mistakenly thought that LoopVectorization used blocking. It does not, so it leaves a bit on the table when blocking is profitable. Tullio claims to do some basic cache-agnostic tiling, so it will do better but still fall a little short – but they don’t get the extra profiling, analysis, and individual attention that goes into the few operations performed by BLAS. They are quite good, especially for machine-generated code, but don’t claim to be 100% optimal.
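To illustrate the kind of operation I mean, here is a hedged sketch: a pairwise squared-distance kernel, which is not one of the operations BLAS provides but is exactly the sort of einsum-like loop Tullio can generate code for (the function name `pairwise_sq!` is my own invention, not from any package):

```julia
using Tullio, LoopVectorization  # loading LoopVectorization lets Tullio use @turbo internally

# Hypothetical non-BLAS kernel: squared Euclidean distance between all
# column pairs of X. The repeated index k on the right-hand side is summed.
function pairwise_sq!(D, X)
    @tullio D[i, j] = (X[k, i] - X[k, j])^2
    return D
end

X = rand(3, 100)
D = zeros(100, 100)
pairwise_sq!(D, X)
```

Tullio fuses the subtraction, square, and reduction into one loop nest, which is the kind of custom kernel you would otherwise have to write by hand.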
In summary: I would consider it more of an issue if LV/Tullio did beat BLAS. For matrix multiplication, use LinearAlgebra.mul! for maximum performance. Use LV/Tullio for other operations that aren’t covered by BLAS. In the special case of small matrices of fixed size (e.g., 3D modeling and simulation), use StaticArrays.
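As a quick sketch of those recommendations side by side (the sizes here are arbitrary, just for illustration):

```julia
using LinearAlgebra, StaticArrays

# Large matrices: in-place BLAS gemm via mul!, no intermediate allocation.
A = rand(200, 200); B = rand(200, 200)
C = similar(A)
mul!(C, A, B)  # C = A * B, dispatched to BLAS

# Small fixed-size matrices: StaticArrays unrolls the multiply at compile time.
a = @SMatrix rand(3, 3)
b = @SMatrix rand(3, 3)
c = a * b  # stack-allocated, no BLAS call
```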
If you plan to multithread at a higher level in the code, then (as suggested above) you should definitely use BLAS.set_num_threads(1), for the same reasons you don’t plan to use @tturbo.
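A minimal sketch of that setup, assuming you are threading over independent problems yourself (the problem sizes and loop body here are placeholders):

```julia
using LinearAlgebra

# Pin BLAS to one thread so it doesn't fight your outer-level threads
# for the same cores (oversubscription usually hurts, not helps).
BLAS.set_num_threads(1)

results = Vector{Matrix{Float64}}(undef, 8)
Threads.@threads for i in 1:8
    A = rand(64, 64)
    results[i] = A * A  # each iteration runs single-threaded BLAS
end
```

Start Julia with e.g. `julia -t auto` so the outer `Threads.@threads` loop actually has threads to use.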