Hi,
I am continuing my exploration of Julia, and I am very impressed by the performance and expressiveness of the language. The generated binary code looks very good, and SIMD instructions are used efficiently.

Has anyone tried to write a native Julia implementation of dense matrix-matrix multiplication with the usual blocking techniques, statically sized blocks for SIMD, and multi-core parallelism?
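To make the question concrete, here is a minimal sketch of the kind of blocked kernel I mean (plain Julia, no threading; `blocked_mul!` and the block size `blk` are names I made up for illustration):

```julia
# Minimal blocked matmul sketch: iterate over tiles so the working sets of
# A, B and C stay cache-resident; @simd vectorizes the innermost loop.
function blocked_mul!(C, A, B; blk = 64)
    n, m = size(C)
    p = size(A, 2)
    fill!(C, zero(eltype(C)))
    @inbounds for jj in 1:blk:m, kk in 1:blk:p, ii in 1:blk:n
        for j in jj:min(jj + blk - 1, m)
            for k in kk:min(kk + blk - 1, p)
                b = B[k, j]
                @simd for i in ii:min(ii + blk - 1, n)
                    C[i, j] += A[i, k] * b
                end
            end
        end
    end
    return C
end
```

This is only a starting point; a competitive version would also pack tiles into contiguous buffers and tune `blk` per cache level.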

Why do you want a native Julia implementation if an efficient Fortran library is already built into the language (given a correctly configured build, of course)? AFAIK, BLAS gains most of its speed not because it’s written in a fast language, but due to a number of very low-level, very specific optimizations.

Optimizing matmul is a lot of work. Even setting aside parallelization, you have to divide things into correctly sized blocks (which are generally non-square) and then have a huge library of CPU-specific optimized building blocks at the lowest level, which are not easily auto-generated.

Mostly for educational purposes. As an emblematic compute-bound problem, the implementation of gemm is well suited to introducing arithmetic intensity, blocking, SIMD, data alignment, and nested parallelism…
When one wants to learn how to write efficient code in Julia, it can be interesting to see how far you can go. For example, I wonder what is achievable with the StaticArrays package as a tool to control blocking.
MKL or CUDA libraries are the ones to use for large enough problems, but it is interesting to compare your own implementation against rock-solid references.

Surprisingly enough, for small problems and for certain non-square matrix sizes, reference libraries can also be inefficient and surpassed by native implementations.

AFAIK there’s also some interest in having pure Julia linear algebra routines to enable support of “exotic” number types (e.g. BigFloat), so it might not be completely academic.

However, for many such types (especially BigFloats), the performance is limited by the speed of the scalar operations. So all the tricks for optimizing matmuls are mostly irrelevant.

Float16 is also very interesting. I guess that blocking is still worthwhile for this type, although SIMD instructions do not (yet) exist for it on all architectures.

Thank you for the link!
Wow, I took a quick look at Yingbo Ma’s implementation and it is indeed rather advanced. I wonder if it can be considered native Julia (I did not know about SIMD.jl and ccall to malloc…).
My first idea was to play with StaticArrays.jl for blocking and let LLVM vectorize the result, and to start with square matrices.
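As a dependency-free sketch of that idea (not using StaticArrays.jl itself, but the same principle of a compile-time block size, here passed through `Val` so LLVM can fully unroll and vectorize the microkernel; `matmul_val!` and `microkernel!` are made-up names):

```julia
# N×N microkernel with N known at compile time, so the loops unroll.
function microkernel!(C, A, B, i0, j0, k0, ::Val{N}) where {N}
    @inbounds for j in 0:N-1
        for k in 0:N-1
            b = B[k0 + k, j0 + j]
            @simd for i in 0:N-1
                C[i0 + i, j0 + j] += A[i0 + i, k0 + k] * b
            end
        end
    end
end

# Square matrices with size divisible by N assumed, for simplicity.
function matmul_val!(C, A, B, ::Val{N}) where {N}
    n = size(C, 1)
    fill!(C, zero(eltype(C)))
    for j0 in 1:N:n, k0 in 1:N:n, i0 in 1:N:n
        microkernel!(C, A, B, i0, j0, k0, Val(N))
    end
    return C
end
```

With an `SMatrix` from StaticArrays.jl the blocks themselves would carry the static size in their type, which is essentially the same trick.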

I guess that JuliaBlas.jl may be considered a strong performance reference. I will benchmark it!
Thank you again.

I thought GenericSVD.jl only implemented the SVD? I don’t see any eigenvalue routines in that module.

(You can easily get the eigenvalues/vectors from the SVD for normal matrices, but I don’t know of an easy way for non-normal matrices, i.e. non-orthogonal eigenvectors.)

Not just big floats. I was thinking of complex numbers in 16 bits total. By saving memory, I think matrix multiply should be that much faster: 128/16 = 8 times faster. If only gemm supported it (I believe it only supports single or double precision).

I think there are many good reasons to want a BLAS replacement due to the limitations it imposes, even for the “standard” numeric types, e.g.:

Hermitian conjugation or plain transpose for complex matrices, but no plain complex conjugation, in gemm

restriction to stride 1 along the first dimension for matrices in gemm

only axpy and no axpby (scalar a times vector x plus scalar b times vector y) (though this was changed recently, I believe?)

axp(b)y for matrices?

…

In a legacy language like Fortran, where every method had to be implemented independently, one was probably restricted to the most common cases. But in a powerful language like Julia, you could hopefully have much more generic functionality with about the same performance (and it’s always easy to add further specializations for specific cases).

Matrix multiplication seems hard, though. Still, it does not seem impossible to beat these tuned assembly kernels with higher-level languages, as this example shows:

Mir is an LLVM-Accelerated Generic Numerical Library for Science and Machine Learning. It requires LDC (LLVM D Compiler) for compilation. Mir GLAS (Generic Linear Algebra Subprograms) has a single generic kernel for all CPU targets, all floating point types, and all complex types. It is written completely in D, without any assembler blocks. In addition, Mir GLAS Level 3 kernels are not unrolled and produce tiny binary code, so they put less pressure on the instruction cache in large applications.

I am far from an expert in these matters, so take my opinions with a grain of salt, but I think that BLAS is now an outdated paradigm:

Some of the simpler BLAS operations can be handled fairly effectively by compilers: e.g. Level 1 BLAS operations are mostly covered by Julia’s broadcasting machinery (but with much more generality).

By bundling multiple bits of functionality into 1 function (e.g. xGEMM handles various adjoint and transpose multiplications), the code has to spend a fair bit of time checking options and whatnot. As a result, BLAS is woefully inefficient for small matrices.

There are some common operations which aren’t easy to express efficiently with BLAS, e.g.

vec^T * mat * vec → scalar,

mat^T * diag mat * mat → mat.

Support for higher-order tensors

Of course, there are other attempts at replacing BLAS functionality, e.g. BLIS and Taco.
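For instance, the operations listed above are already expressible without BLAS in plain Julia. A sketch using only the LinearAlgebra stdlib (3-argument `dot` requires Julia ≥ 1.4; the variable names are mine):

```julia
using LinearAlgebra

x = rand(4); y = rand(4); A = rand(4, 4); d = rand(4)
a, b = 2.0, 3.0

# axpby as a single fused broadcast loop, no dedicated BLAS call needed
y2 = copy(y)
y2 .= a .* x .+ b .* y2

# vec^T * mat * vec -> scalar: 3-argument dot avoids the temporary A*y
s = dot(x, A, y)

# mat^T * diag mat * mat -> mat: Diagonal scales rows lazily, so no dense
# intermediate is built for the diagonal factor
M = A' * Diagonal(d) * A
```

Each of these dispatches to generic code that works for any element type, which is exactly the flexibility BLAS lacks.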

Mir looks very interesting, although the source code is not trivial to read: I am not fluent in D, and I get a shiver each time I read about prefetching: way beyond my skills! The Eigen bashing on the Mir GitHub page is a bit annoying because, to me, Eigen traced the route to follow: a native implementation of linear algebra freed from the low-level interface of BLAS. I did not find any documentation about Mir’s architecture and design principles.

Anyway, I think there might be room for a trade-off between generality and expressiveness, the complexity of the code (and maintainability), and performance. You often see GEMM experts (like the Mir author) arguing about a few percent of performance difference. A small, not too complex, but general and elegant library written in native Julia, achieving 50% of MKL for BLAS3 ops, would already be highly desirable and convincing. According to my preliminary experiments, BLAS2 and BLAS1 ops written in standard Julia are already very good (apart from FP16).

Totally agree. My comment was specifically about re-implementing existing gemm, which, as you mentioned, is hard to beat and doesn’t provide much advantage from a practical point of view. The educational argument makes perfect sense to me, as does optimizing BLAS operations for specific data types / sizes / architectures.

Also, thanks for pointing to other numeric libraries; there are a few I wasn’t aware of!