[ANN]: PaddedMatrices.jl, Julia BLAS and partially sized arrays

giaf · July 4, 2020, 10:19am

Hi Elrod,

first of all, thanks for trying BLASFEO out, and for reporting the erratic behavior you observed.
Since I hadn't observed that before, I investigated the matter, and it turned out to be caused by denormals present in the memory around the matrices in your benchmark routines.
When the matrix size is not a multiple of the SIMD width, the kernel loads and performs FMAs on the full SIMD width, and discards the extra elements when storing back to the result matrix.
However, if there are denormals in the surrounding memory, this triggers the super-slow computation path for denormals, even though those results are never stored.
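
To illustrate the mechanism, here is a small Python sketch (not BLASFEO code). It emulates a 4-lane load on a plain list: the full-width load picks up whatever sits behind the last matrix element, while a masked load zero-fills the unused lanes. The names `SIMD_WIDTH`, `full_width_load` and `masked_load` are made up for this example, and masking is only one possible remedy, not necessarily the fix that went into BLASFEO.

```python
import sys

SIMD_WIDTH = 4  # assumption: 4 doubles per 256-bit AVX register


def is_subnormal(x):
    """True if x is nonzero but smaller in magnitude than the smallest normal double."""
    return x != 0.0 and abs(x) < sys.float_info.min


def full_width_load(memory, start):
    """Emulate a full-width load: take SIMD_WIDTH values, regardless of where the matrix ends."""
    return memory[start:start + SIMD_WIDTH]


def masked_load(memory, start, valid):
    """Emulate a masked load: take only the first `valid` lanes and zero-fill the rest."""
    lanes = memory[start:start + valid]
    return lanes + [0.0] * (SIMD_WIDTH - valid)


# A column of 3 doubles, followed by unrelated memory that happens to hold a denormal.
memory = [1.0, 2.0, 3.0, 5e-324, 7.0]

unmasked = full_width_load(memory, 0)  # fourth lane contains the denormal
masked = masked_load(memory, 0, 3)     # fourth lane is forced to zero
```

In the unmasked case the fourth lane takes part in every FMA even though its result is thrown away, which is exactly where the slowdown comes from.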
I have now fixed the issue, and the code in the current BLASFEO master looks like this (for matrix sizes 2:24) on my machine (Intel Core i7 4810MQ):

Then, thanks for the interesting introduction to your work!
About the packing of A, I agree that it is not necessary in the NN variant of dgemm as long as the data fits in L1 or L2 cache, but you already start seeing some improvement once the data only fits in L3.
It gets more beneficial when you implement other dgemm variants, especially the ones with A transposed. How do you handle such cases in your framework?
And how does the framework handle linear algebra routines with less regular access patterns, such as those where one matrix is triangular, or factorizations?

In my work on BLASFEO, I also saw that packing A and/or B is much more beneficial on less powerful architectures than Intel Haswell/Skylake.
There, lower cache associativity, smaller TLBs and simpler hardware prefetchers mean that a packed matrix format such as panel-major can give sizeably better performance even for small matrices.
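
For readers unfamiliar with the panel-major format, here is a minimal Python sketch of the idea, assuming a panel height of 4 and a column-major source matrix. The rows are split into panels of `PS` rows, and within each panel the `PS` elements of each column are stored contiguously. The function names and the zero padding of the last panel are choices made for this illustration, not the BLASFEO API.

```python
PS = 4  # assumption: panel height of 4 rows (e.g. 4 doubles per AVX register)


def panel_major_index(i, j, ncols, ps=PS):
    """Offset of element (i, j): panel number, then column inside the panel, then row inside the panel."""
    return (i // ps) * ps * ncols + j * ps + (i % ps)


def pack_panel_major(a, nrows, ncols, ps=PS):
    """Pack a column-major flat list `a` (leading dimension nrows) into panel-major order.

    The last panel is padded with zeros when nrows is not a multiple of ps.
    """
    npanels = (nrows + ps - 1) // ps
    packed = [0.0] * (npanels * ps * ncols)
    for j in range(ncols):
        for i in range(nrows):
            packed[panel_major_index(i, j, ncols, ps)] = a[i + j * nrows]
    return packed


# 5x2 example matrix with entries 10*i + j, stored column-major.
nrows, ncols = 5, 2
a = [float(10 * i + j) for j in range(ncols) for i in range(nrows)]
packed = pack_panel_major(a, nrows, ncols)
```

With this layout a kernel working on one panel reads memory strictly sequentially, so it touches few cache lines and TLB entries, and the zero padding means a full-width load on the last panel never sees arbitrary memory contents.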

About the chicken-and-egg issue with AVX512, I think there is more to it than this.
It is also due to Intel’s extreme market segmentation strategy: as an example, contemporary Celerons and Pentiums even have AVX disabled, and Atoms do not physically have it.
AVX512 is just considered an additional tier on top, with further market segmentation between parts with one and two 512-bit FMA units.
By the way, you seem to have a nice “desktop” development machine
