[ANN]: PaddedMatrices.jl, Julia BLAS and partially sized arrays

They have benchmarks:


They don’t have AVX512 but they do have AVX2.
You may see that I compared them to Multi Threaded MKL (MATLAB) and got factor of ~2 on 6 Cores machine for the case up to N = K = M = 400 . I think the results are impressive for a single thread case. Anyhow, they have BLAS API so you can try on yourself. I’d be interested to see a comparison of the Single Thread case made by you.

They also have an interesting packing of data for small matrices which could be useful. Give it a deeper look. It worth it.

Okay, yeah, may be worth taking a look at. Their dgemm_nt performance using the blasfeoapi is very good compared to their theoretical peak, and almost unbelievably constant as a function of matrix size.

Did they count packing time when measuring blasfeoapi?
I strongly suspect that they did not.

Unfortunately, I can’t really benchmark it on my computer, because of course it’ll look bad when running Haswell kernels. On my computer, theoretical peak with a Haswell kernel is 68.8 GFLOPS (4.3 GHz * 2 fma/cycle * 4 numbers / vector * 2 ops/vector/fma) while the theoretical peak is 131.2 GFLOPS (4.1 * 2 * 8 * 2).
I guess I could compare vs theoretical peak, instead of number of GFLOPS, but that also isn’t quite fair, because memory stalls aren’t as much of a problem with slower kernels.

I think that’s Eigen’s problem as well, even though they claim to support avx-512, but that they don’t have proper kernels anyway. OpenBLAS finally seems to have proper support with release 0.3.9 (which I’ve been using, and will ship with Julia 1.5).
One of the problems with these C and assembly libraries is that it makes adding support harder.

Looking briefly at their source, one of the things they do that OpenBLAS does not is adjust their packing strategy based on the input matrix sizes.
PaddedMatrices already does this, as does MKL, starting in 2017. But MKL obviously does it better than PaddedMatrices.

Currently, what it’s doing is:

if K * N  > n_c * k_c
   # Pack A and B into memory-access order
elseif M <= 72
  # do not pack
elseif M <= 2.5m_c && iszero(M % VECTOR_WIDTH)
  # do not pack
  # pack A into column-major order, aligning the columns


M, N = size(C)
K = size(B,1)

and m_c, k_c, and n_c are the tiling parameters (corresponding to the size of the packed parts of the arrays).

In the link I shared about performance you may see the performance of the BLAS flavor which means the API is classic BLAS and the performance takes into account the overhead of packing and unpacking. It is called blasapi while blasfeoapi means it is assumed the data is already in the special packing.

While Skylake is wider with more ports than Haswell, something tuned for Haswell should run very very well on Skylake. Moreover, For them Haswell is a code for AVX2 more than specifically architecture.

You have a system with AVX512 but most of users doesn’t. Moreover currently all modern compilers in their latest versions yield 256 Bit ops even for AVX512 (Utilizing what’s really important in AVX512, its flexibility and not only the doubling of the width).

So I still think you should compile and test its single thread performance against other implementations. In order to make it fair you can disable AVX512 code path in MKL.

1 Like

One additional OSS BLAS implementation that might have useful ideas (although it’s focused on mobile, it’s also pretty small and self-contained) is Google’s ruy.


Performance below a few hundred in each dimension should be better now.
Benchmarks vs SMatrix and MMatrix from StaticArrays (note that MMatrix switches to BLAS when MKN >= 14^3; a Haswell laptop where that BLAS is MKL (AVX2 system):

Skylake laptop where that BLAS is OpenBLAS (AVX2 system):

Skylake/Cascadelake-X where that BLAS is OpenBLAS (AVX512 system):

The blue line is PaddedMatrices.jmul! applied to base Array{Float64,2} arrays.

With AVX2, it was faster than SMatrix starting from 7x7 and beyond, while with AVX512 SMatrix had one last hurrah at 8x8. The statically sized arrays from PaddedMatrices are at least as fast in the sizes I tested, especially with AVX512.
Unfortunately, I don’t support an equivalent of SMatrix yet, meaning escaping FixedSizeArrays will cause heap-allocation.

Most compilers suck at generating AVX512 code, which is why it makes sense not to bother. LLVM is much worse at this than GCC or ICC. For example, dot product benchmarks:

LLVM is very slow at cleaning up non-vectorized remainders, therefore the length of these remainders is the primary driver of performance. If you pick a bunch of sizes at random, cutting the width of vectors in half will also cut the average remainder length in half, probably boosting LLVM’s average performance.
A far better solution to that problem would be not being so slow at handling non-vectorized remainders in the first place.
GCC fails above because it doesn’t break up the dependency chain, even with -funroll-loop. Only the Intel compilers and LoopVectorization actually generate good code for something as simple as a sum or dot product.

A justification for this may be that lots of applications mostly aren’t AVX – lots of logic is hard to vectorize/very branchy – and if you have a few random short loops sprinkled in, you’d be better off not encouraging transitions. This is just speculation on my behalf. I don’t know the reasoning.

LLVM does a bad job of taking advantage of AVX512’s flexibility, regardless of vector width.
Note that on the AVX512 machine that SMatrix{4,4} multiplication was actually slower than on the other two, while FixedSizeMatrix{4,4} multiplication was much faster.

Don’t mistake the compiler’s limitations for hardware limitations.
While Itanium turned out to be the Itanic because of how hard it was to write good compilers for it, I don’t really think that’s the case here.

ruy being optimized for ARM would also make it interesting to look at eventually, given that ARM optimization will become important in Julia once Apple starts shipping ARM MacBooks.


Since there were quite a few posts about BLASFEO (and I’m its main developer), I would like to clarify a little what its intended contribution is.

As the name states, the intended application domain is embedded optimization, hence the focus on performance for small matrices and single threaded applications.
In designing BLASFEO, I took the freedom to depart from the standard BLAS API conventions, and explored how alternative APIs could help in improving performance.
In embedded optimization applications, parallelization is generally better handled at the algorithmic level (instead of the linear algebra level), and it is generally possible to pack all matrices into any arbitrary format before hand, outside the critical loop.
In that sense, BLASFEO defines a matrix format (named panel-major) which is used for all input and output arguments to linear algebra routines. Therefore, all packing of data is performed offline, and there is no need to perform any packing/unpacking between different linear algebra routines calls.
Hence, in such framework it makes no sense to consider the packing time in computing the linear algebra performance, as this is not their practical usage anyway.
If you like, you can find more details in this paper https://arxiv.org/abs/1704.02457

By leveraging my experience in implementing the BLASFEO’s own API, I tried to apply it to the implementation of a standard BLAS API with column-major matrix format for the routines arguments, still focusing on performance for small matrices and single-threaded routines. The main ideas and some results are on another paper https://arxiv.org/abs/1902.08115

The approach used in the BLAS API of BLASFEO (as well as the BLASFEO API itself) turned out to give very good performance on a wide range of computer architectures and ISAs.
The code is in no way optimized to a specific machine (i.e. my computer), and (by focusing on small matrix performance) is also irrespective of e.g. cache size.
The currently “fastest” target is named Haswell, for the simple reason that this was the first computer architecture implementing AVX2 and FMA, and the code would have been optimized in the exact same way on more recent computer architectures such as Intel Skylake (client) or AMD Zen.
About AVX512 support, the availability of this ISA is still very limited (especially in embedded devices), and compared to AVX2+FMA you would get any speed up only on a processor with two 512-bit FMA units, which as of now is limited to some high-end Xeon processors (and no laptop or desktop processor).
Therefore (and unfortunately), after 7 years from its introduction the Haswell target is still the best target on the vast majority of computers.

As already mentioned by RoyiAvital, BLASFEO is open source on a github repo, so I would be very happy if you were trying out yourself, I’m fully confident that the results will be as expected. :wink:


What’s the difference between Gaius.jl, PaddedMatrices.jl and OpenBlas?
When should we use one or the other?

Which one uses less memory for the same operation?

I’m the author of Gaius.jl. As it says on the tin, don’t use it for anything important. It was really just a little experiment I ran where I wanted to understand what it would take to get multi-threading working with LoopVectorization.jl matmul kernels. Once things got sufficiently complicated and I got sufficiently busy, I backed off and stopped working on the project.

@Elrod’s work on PaddedMatrices.jl is strictly more promising than Gaius.jl, even if it’s multi-threading isn’t as good as Gaius.jl (last I checked, maybe this changed?). I don’t expect that there are any fundamental problems with it’s approach to multi-threading and with some elbow grease, it should be great across many cores.

OpenBLAS is an external project to Julia, but Julia farms out it’s linear algebra work to OpenBLAS when advantageous. There is at least some interest in replacing OpenBLAS with something pure julia, mostly out of ideological convictions that Julia solves the two language problem and so we shouldn’t need something like OpenBLAS. That said, there are practical reasons to want to get rid of OpenBLAS despite it’s fantastic performance. Linking to OpenBLAS contributes at least somewhat to the time it takes to start up a Julia session, plus it does not handle arbitrary matrix eltypes, only hardware floating point numbers and complex numbers.

For the time being, I would not expect any of these pure julia implementations to replace OpenBLAS at least for a couple of years. These projects are more precursors of things to come, as well as existance proofs that LoopVectorization.jl is scalable to rather hard problems.


Hi, the desktop I use to run most of my benchmarks has 2 AVX512 FMA units/core. For this reason, single threaded dgemm results over the size range 2-300 were about as expected:

For reference, this computer runs AVX512 code at 4.1 GHz and AVX2 code at 4.3 GHz, making the respective theoretical peaks 131.2 and 68.8 double precision GFLOPS.

I used default compilation options (except that I set MACRO_LEVEL=2 in the Makefile.local). I compiled with gcc 10.1.1.
The wrapper is here. I was using the BLAS API (i.e., using column major arguments rather than panel major).
I’m sure performance would have improved a bit via using BLASFEO API.

Maybe you could create a BLASFEO_jll library for distributing binaries, and then write a wrapper library providing a convenient Julia API ;).

I am a fan of the idea of using performance considerations for choosing the default data layout of our data structures. It would be nice to have broader support for panel-major arrays.

I also have a Haswell laptop:

Same compilation options, except I used gcc 9.3.0.
Here, BLASFEO did much better relative to the competition than on my desktop, but performance was highly erratic for 55x55 matrices and below.

I can only tune the performance on computers I have access to, and appreciate it when folks point out that they’re seeing something different than I am, so that I can investigate it.
That said, because everyone compiles the code locally, I can make use of hardware features queried by CpuId.jl to guide optimization decisions in the library. E.g., the M_c, K_c, and N_c blocking parameters are a function of the cache sizes and vector width:

julia> PaddedMatrices.matmul_params(Float64)
(240, 328, 4941)

Are these optimal? Far from it. Even on my own computers, I’ve only tuned haphazardly.
LoopVectorization.jl chooses M_r and N_r.

AVX512 doubles the size and number of registers, which allows you to use a larger microkernel. This means a wider micro-tile from B, thus reducing the number of times you have to pass over the elements of the A-tile, reducing bandwidth requirements.

This is kind of a self-fulfilling prophecy. AVX512 isn’t common because it doesn’t provide much performance benefit for most workloads, and most workloads don’t support AVX512 because it isn’t common.
MacBooks have recently begun shipping with it (until their switch to ARM).

I don’t know much about embedded, but I’m going to guess x86 isn’t a popular choice in that area.

The main contribution of PaddedMatrices.jl is that this is what the basic macrokernel looks like:

function loopmul!(C, A, B, ::Val{1}, ::Val{0}, (M, K, N) = matmul_sizes(C, A, B))
    @avx for m ∈ 1:M, n ∈ 1:N
        Cₘₙ = zero(eltype(C))
        for k ∈ 1:K
            Cₘₙ += A[m,k] * B[k,n]
        C[m,n] = Cₘₙ

While I’m not convinced it’s a good idea, I also have macro-kernels that pack A:

function packaloopmul!(
    ::Val{1}, ::Val{0}, (M, K, N) = matmul_sizes(C, A, B)
) where {Mc,Kc,Nc}
    Nᵣrange = VectorizationBase.StaticUnitRange{1,nᵣ}()
    @avx for m ∈ 1:M, n ∈ Nᵣrange
        Cₘₙ = zero(eltype(C))
        for k ∈ 1:K
            Aₘₖ = A[m,k]
            Cₘₙ += Aₘₖ * B[k,n]
            Ãₚ[m,k] = Aₘₖ 
        C[m,n] = Cₘₙ
    Nrange = VectorizationBase.StaticLowerUnitRange{1+nᵣ}(N)
    @avx for m ∈ 1:M, n ∈ Nrange
        Cₘₙ = zero(eltype(C))
        for k ∈ 1:K
            Cₘₙ += Ãₚ[m,k] * B[k,n]
        C[m,n] = Cₘₙ

Depending on the size of the arrays, it chooses between not packing, packing A, or packing both A and B.

The primary point of interest here is as a proof of concept for LoopVectorization.jl’s ability to generate reasonably macro-kernels, before applying it to other problems. It is (however) still a little limited on the kinds of loops it supports, but it is a work in progress.

It’s also not too difficult to achieve better performance than competing BLAS libraries at small sizes, e.g. via not packing at all. Most of my own work involves matrices in that size range, so optimizing for it and still getting reasonable performance elsewhere is another part of the motivation.

I experimented with tile-packing, e.g:

function packaloopmul!(
    α, β
) where {Mᵣ,Mᵢ,K,Nᵣ,Nᵢ}
    Mᵣrange = VectorizationBase.StaticUnitRange{1,Mᵣ}()
    Nᵣrange = VectorizationBase.StaticUnitRange{1,Nᵣ}()
    @avx for mᵢ ∈ axes(Ãₚ,3), mᵣ ∈ Mᵣrange, nᵣ ∈ Nᵣrange
        Cₘₙ = zero(eltype(C))
        for k ∈ axes(Ãₚ,2)
            Aᵣₖᵢ = A[mᵣ,mᵢ,k]
            Cₘₙ += Aᵣₖᵢ * B[nᵣ,k]
            Ãₚ[mᵣ,k,mᵢ] = Aᵣₖᵢ
        C[mᵣ,mᵢ,nᵣ] = α * Cₘₙ + β * C[mᵣ,mᵢ,nᵣ]
    Nᵢrange = VectorizationBase.StaticLowerUnitRange{2}(size(C,4))
    @avx for nᵢ ∈ Nᵢrange, mᵢ ∈ axes(Ãₚ,3), mᵣ ∈ Mᵣrange, nᵣ ∈ Nᵣrange
        Cₘₙ = zero(eltype(C))
        for k ∈ axes(Ãₚ,2)
            Cₘₙ += Ãₚ[mᵣ,k,mᵢ] * B[nᵣ,k,nᵢ]
        C[mᵣ,mᵢ,nᵣ,nᵢ] = α * Cₘₙ + β * C[mᵣ,mᵢ,nᵣ,nᵢ]

(I’m treating panel-major matrices as 3-d arrays. MxKpanel_widthxK*number_panels.)

However, I’d need to make changes somewhere for the above to work as intended, because around a month ago LoopVectorization began checking for large strides between memory accesses, and inserting prefetches if it found them. It also incorporates the cost of those prefetches into choosing M_r and N_r. If memory accesses are contiguous, it implicitly assumes the hardware prefetcher should handle it, and ends up with very square-microkernels, because it models vloadp(s/d) as equally costly as vbroadcasts(s/d), and therefore chooses parameters to yield a roughly equal number of each.
I’m inclined to not think that’s quite correct, because while it may maximize arithmetic instructions / load instructions, it decreases the width of the B-micropanels, increasing the number of times it has to pass over the A-micropanel, thus increasing memory pressure. LoopVectorization currently doesn’t incorporate memory into its cost modeling.
The implication for the above tiling code is that it might choose different blocking parameters for both loops, which I don’t really intend.

But maybe it wont, I haven’t looked at it. I could probably also be reasonable and not combine packing A with multiplication, since I don’t really think that improves performance much - if at all - anyway. Maybe there’s a better way.

Anyway, one of my conclusions has been that at small sizes, column major performs very well. If the data already fits in cache, the strides between memory accesses aren’t so important.


It’s changed…in that PaddedMatrices no longer supports multi-threading.
I may add multithreaded support again, but that would probably be motivated by the same “proof of concept” desire of seeing how efficient I could get it, with gold standards to optimize against.

MKL’s multithreading is extremely efficient, and also kicks in at a very small size. Importantly, it also helps when it does, unlike OpenBLAS which tends to trip over its own threads.


Hi Elrod,

first of all, thanks for trying BLASFEO out, and for reporting the erratic behavior you observed.
Since I haven’t observed that before, I investigated the matter, and it turned out it was due to denormals being present in the memory around the matrices somehow present in your benchmark routines.
In case of the matrix size not multiple of the SIMD width, the kernel would load and perform FMA on the full SIMD width, and disregard the extra element when storing back to the result matrix.
However, it turned out that in case of denormals in the background memory, this would trigger the super-slow computation on denormals.
Now I fixed the issue, and the code in the current BLASFEO master looks like this (for matrix sized 2:24) on my machine (Intel Core i7 4810MQ)

Then, thanks for the interesting introduction to your work!
About the packing of A, I agree that in the implementation of the NN variant of dgemm it is not necessary as long as data fits in L1 or L2 cache, but you start already seeing some improvement for L3.
But it gets more beneficial when you implement other dgemm variants, especially the ones with A transposed. How do you handle such cases in your framework?
And how is such framework handling linear algebra routines with less regular access pattern, such as when one matrix is triangular, or factorizations?

In my work on BLASFEO, I also saw that packing A and/or B is much more beneficial on less powerful architectures than Intel Haswell/Skylake.
There lower cache associativity, smaller TLBs and simpler hardware prefetchers imply that packed matrix format such as the panel-major can give sizeable better performance also for small matrices.

About the egg-and-chicken issue with AVX512, I think there is more that this.
It is also due to Intel’s extreme market segmentation strategy: as an example, contemporary Celeron and Pentium have even AVX disabled, and Atom do not physically have it.
AVX512 is just considered an additional tier on top, with additional market segmentation between 1 and 2 512-bit FMA units.
By the way, you seem to have a nice “desktop” development machine :wink:


@giaf, This is a nice progress!

By the way AVX512 is now available on the latest generation of Laptops and will be available in the next generation of Desktop CPU’s (Intel’s).

Though I still think AMD’s approach is better, just have 2x ports of 256 Vector Units and build a bigger wider Front End to handle the extra ops pressure.

Regarding BLASFEO, I think what’s missing is a Batch Mode.
In that batch mode if we can send batch of small Matrix Multiplication and BLASFEO will handle it with different threads it will be a great way to enjoy BLASFEO great single thread performance in the case of multiplying many matrices.

Then it could be incorporated into Julia broadcasting of such case.

What do you think?

Ah, now I noticed a regression in my latest benchmark results and fixed them:

There was a fairly expensive check that used to be evaluated at compile time, but was instead evaluated at runtime in PaddedMatrices v0.1.6. I just released v0.1.7, which moved the check back to compile time. Overhead of the check is O(1), but that it was fairly expensive at small sizes.

I pulled blasfeo master, and reran the benchmarks:
i9-10980XE (Cascadelake-X):

i5-8350U CPU (Skylake):

i3-4010U CPU (Haswell):

Now PaddedMatrices is much faster over the 2:24 size range on the three computers I tested.

due to denormals being present in the memory around the matrices somehow present in your benchmark routines.

This is normal in Julia. If you read out of bounds memory, it will probably be junk. If you interpret the memory as floating point numbers, denormals are likely.
I use @llvm.masked.load and @llvm.masked.store intrinsics to avoid touching out of bounds memory, while still using SIMD instructions. These get lowered to vmaskmovp* instructions with AVX. The mask consumes a floating point register. With AVX512, it uses normal vmovup* instructions, but applies a bitmask, using one of the opmask registers instead of a floating point register.

Also, when it comes to such small sizes (e.g., 2:24) I’d recommend taking advantage of Julia’s JIT if possible.
Benchmarks that allowed specializing on the size of the arrays:













Of course, I don’t think Julia/a massive runtime is a good choice for embedded devices. But for someone already using Julia, it can provide a nice benefit.

I did not include NT because BLASFEO produced an incorrect result at 13x13.

The relevant code from PaddedMatrices does the following:

  1. If the first stride of A is not contiguous, pack.
  2. Heuristically, it will start packing if 73 > M with AVX512F, or 53 > M without it, unless the base of the array is aligned and stride(A,2) is also a multiple of the SIMD vector width.
  3. if mc * kc > M * K, where A is M x K and mc and kc are the blocking parameters, do pack A.

If it is packing A, it will also pack B if kc * nc ≤ K * N.

Therefore, it will generally pack A in T*. The exception there is if A's size is known at compile time (e.g., the FixedSizeArray), in which case it wont pack A if the number of rows is less than or equal to twice the SIMD vector width. I should probably make that check more sophisticated, e.g. with respect to the number of times the same elements from A will actually have to be reloaded.

When I get there, I’ll decide how data should be laid out. On the code-gen side of things, I still need to add support for triangular loops and handling dependencies between iterations.
But this focus on code gen means that once LoopVectorization understands these things, the macrokernels should be easily specified via simple loops, making details such as memory layout relatively easy to change.

I am a fan of the idea of alternative layouts. I’ve slowly been working on a DSL for probabilistic programming. One optimization I’d like to support some day is having it choose the memory layout of all underlying data structures to optimze performance.
Perhaps I’d be better off making internal arrays default to some sort of tile-major layout instead of column-major?

That is unfortunate, but I like the idea of people having the ability to choose something aligned with their usecase. E.g., reviews are currently overwhelmingly better for AMD, but I am happy to still have the choice to buy AVX512, because I can write software to actually take advantage of it.

But most people don’t/won’t, so why should they pay for silicon they’re not using?
But in terms of segmentation, I’d love to see something like a broad adoption of ARM’s scalar vector extensions (at least, broadly adopted among ARM CPUs), with different segments simply supporting different vector widths. Those who need HPC can buy wide vectors, and those who don’t can buy 128-bit, all within the same instruction set.
Unfortunately, as far as I know, the A64FX is the only CPU currently supporting SVE.

Thanks. The CPU is a 10980XE, which is marketed as “high end desktop”. It hits >2.1 terraflops with MKL’s dgemm, can build software from source quickly, and looks stellar in all the heavily-SIMD benchmarks I run, so IMO worth the expense for a hobbyist. AMD CPUs do well on compiling software, but none currently come close to the >120 double precision GFLOPS/core it achieves.
It’s worth more than my car, if you’re curious where my priorities are :wink:.