[ANN]: PaddedMatrices.jl, Julia BLAS and partially sized arrays

LOL. Thanks for trying it out and sharing the result. Let’s hope it’ll be fixed at some point…

I was mentioning \tilde B_p because, IIUC, sharing \tilde B_p introduces a non-parallelized part of the code (or one more synchronization, if it makes sense to parallelize the copying), is that right? I wondered if it’d be nice to somehow avoid it. I may be saying something very stupid, as this is only O(KN), but since loading cold memory is much slower than floating-point arithmetic, and using more threads may not help with it, I thought I’d ask about it.

I usually fetch or wait on tasks for synchronization. I guess it may be possible to come up with a better queue than Julia is using that is specific to the program you have. I don’t know what it’s like though.

Do you mean nested parallelism or something else? (IIUC nested parallelism doesn’t matter for your case because your function is at the “leaf” of the call tree. OTOH, something like threaded mapreduce needs to care about it since it may be calling something like threaded GEMM.)

1 Like

I made tweaks that, unfortunately, appear to have caused a performance regression for larger arrays.

I did however add specialized methods for arrays with static strides of 2 and 4, and added PtrArray to the sized benchmarks. (EDIT: These don’t actually seem to be better than just what LLVM does with unrolled code, so I’ve turned these back off on my local master).
As far as I understand, I should have more or less identical performance to PaddedArray. As you can see from the graph, my understanding is limited.

For 10,000 x 10,000 arrays, I’m not currently beating the GFLOPS I get for an 8x8 PtrArray.
But the larger kernel being used for those arrays should be much more efficient.
I think I need to add real prefetch support to LoopVectorization, so I can have it add them at better places. That’s the current reason I don’t have prefetching with AVX2 yet.

I also experimented a little with rearranging the elements of A while packing, into an order so that we load elements contiguously, hoping that triggers the hardware prefetcher and eliminates the need for software prefetching. Performance wasn’t as good.

I think we should parallelize the packing. Each thread that shares a \tilde{B}_p could pack a chunk, and then they must synchronize before resuming the multiplication.
If we don’t parallelize packing, we’d still need a synchronization step for the other threads to wait until packing is complete before reading from it.

The synchronization wouldn’t be needed if each thread had their own \tilde{B}_p, without packing. But synchronization should be cheaper than packing \tilde{A}_p, which is why I suspect this to be the better approach.

Packing \tilde{B}_p is O(KN), while packing \tilde{A}_p is O(MKN). While A itself is only MxK elements, it gets repacked for each \tilde{B}_p, i.e. roughly ⌈N/n_c⌉ times over the whole multiplication – hence the desire to make \tilde{B}_p larger.

I should really spend some time looking through some of the open-source BLAS libraries to see if I can figure out how they do it. I’m probably already doing some things wrong (I’m slower), and MKL especially has exceptional multithreaded performance.

Hmm. How will this work with subdividing the problem?

Referring back to packing \tilde{B}_p, I’d like several threads to be able to pack it, then proceed to start packing \tilde{A}_p before checking to confirm everyone has finished their chunk of \tilde{B}_p, and continuing on with calculating \tilde{A}_p \times \tilde{B}_p.

I could YOLO it and just assume that of course they’ll be done by the time \tilde{A}_p is done, but that’s bound to cause problems eventually.

If instead one thread runs things by @spawn and then fetch/wait, and then @spawns for the next task, those new @spawns may not end up on the same physical core?
Because \tilde{B}_p is supposed to live in (shared-between-cores) L3 instead of (core-local) L2, maybe that isn’t much of a problem. We just won’t be able to pack \tilde{A}_p before checking that \tilde{B}_p is done.

Alternatively, maybe I could implement it with a few refs or atomic variables, and busy-wait with while loops.
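Something like this minimal sketch is what I have in mind, with an atomic counter as the barrier and a hypothetical pack_B_chunk! doing the actual copying:

using Base.Threads

# Hypothetical sketch: each thread packs its chunk of B̃ₚ, bumps a shared
# atomic counter, then busy-waits until every chunk is done before reading
# from B̃ₚ. pack_B_chunk! is a placeholder, not a real function, and `done`
# must be reset to 0 before each packing phase.
function pack_and_sync!(B̃ₚ, B, done::Atomic{Int}, nchunks, chunkid)
    pack_B_chunk!(B̃ₚ, B, chunkid, nchunks)  # pack this thread's chunk
    atomic_add!(done, 1)                     # announce completion
    while done[] < nchunks                   # spin until all chunks are packed
        GC.safepoint()                       # keep the spinning thread GC-cooperative
    end
end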

Yeah, you’re right. The only real concern there is that we’d want it to be able to nest inside other parallel code. The other parallel code may also call parallel instances of jmult! that shouldn’t step on each other. But it doesn’t sound like that should be a real problem. The preallocated memory for packing is already set up to be thread local, for example.

1 Like

Nim’s laser package might be the easiest reading:

4 Likes

Yeah, that sounds like a much better approach than going off my guesses :+1:

Is \tilde A_p an array that concatenates \tilde A_i? So the plan is to create \tilde A_p as well as \tilde B_p for each iteration of the 3rd loop, while a sub-array \tilde A_i is loaded to L1 for each iteration of the 2nd loop? If that’s the case, why not spawn the tasks for creating \tilde A_p and \tilde B_p at once and then wait for them (2 * nthreads() tasks, I guess)?
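Something like this rough sketch is what I was imagining; pack_A_chunk! and pack_B_chunk! are hypothetical placeholders that pack the t-th of nt chunks into the preallocated buffers:

# Spawn all the packing tasks up front, then wait for every one of them
# before the macrokernel starts. The chunking helpers are made up for
# illustration.
function pack_all!(Ãₚ, A, B̃ₚ, B)
    nt = Threads.nthreads()
    tasks = Task[]
    for t in 1:nt
        push!(tasks, Threads.@spawn pack_B_chunk!(B̃ₚ, B, t, nt))
        push!(tasks, Threads.@spawn pack_A_chunk!(Ãₚ, A, t, nt))
    end
    foreach(wait, tasks)   # 2 * nthreads() tasks, all synchronized here
end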

There is no public API for even pinning a task to a thread. I guess you can write your own “task pool” to give some hints to partr:

julia> function demo()
           chans = map(1:Threads.nthreads()) do _
               Channel(1, spawn=true) do ch
                   for f in ch
                       f === nothing || f()
                   end
               end
           end
           push!(chans[1], () -> @show Threads.threadid())
           push!(chans[1], () -> @show Threads.threadid())
           return chans
       end
demo (generic function with 1 method)

julia> chans = demo()
       sleep(1)
Threads.threadid() = 1
Threads.threadid() = 1

julia> foreach(close, chans)

Though my impression is that Channel has higher overhead than @spawn/fetch.

1 Like

Thanks, worth taking a look at. @YingboMa has been walking me through the OpenBLAS source. On my desktop, it does very well over the size range 200-10,000.
MKL eventually overtakes it (and achieves mysteriously impressive speeds below 190x190, insensitive to factors such as memory layout), but OpenBLAS’s single-threaded performance over that range is what captured our interest/focus.

To clarify, the BLIS algorithm, which I’ve been using, is:

# C is M x N
# A is M x K
# B is K x N
for n in 1:n_c:N            # n_c-column block of B and C
    for k in 1:k_c:K        # k_c × n_c block of B
        # pack B_p (k_c × n_c, kept in L3, shared between cores)
        for m in 1:m_c:M    # m_c × k_c block of A
            # pack A_p (m_c × k_c, kept in a core's L2)
            # call macrokernel: C[m-block, n-block] += A_p * B_p
        end
    end
end

B_p is in L3, and A_p in L2. The macrokernel iterates over slices of B_p that fit in the L1 cache.
Because B_p is preferably very large, taking up more than a 1/num_cores fraction of the shared L3 cache, it makes sense to let it be large and share it among threads.

Because A_p is in the L2 cache (and thus local to one core), these should probably be unique per thread.

However, this is not the algorithm OpenBLAS uses. It seems to use finer-grained packing, closer to the microkernel. Aside from getting the best single-threaded performance (at least on this computer, over a large size range), this is also more compatible with your suggestions. I’ll have to look more closely, but it looks like they pack A_p into L2, and then pack B_p into L1 within the macrokernel (and don’t pack B outside of it).
Although, I haven’t actually looked at OpenBLAS’s threading code yet. Also worth pointing out that MKL’s threading is far better than OpenBLAS’s (while OpenBLAS’s would also be extremely difficult to match).

My primary interests here are

  1. Small-matrix performance
  2. Not suffering at large sizes, so that someone looking for small matrix performance doesn’t have to be concerned with a major trade-off.
  3. Enabling high-performance Julia Linear Algebra libraries well suited for smaller problems, such as RecursiveFactorization.jl
  4. Proof of concept that these libraries can perform well.

In terms of low-hanging fruit, large-matrix linear algebra offers limited pickings.

YingboMa has been working on MaBLAS.jl.
Benchmarks over the range 2:256:


Rather than being based on LoopVectorization (which it uses for clean-up loops), MaBLAS.jl uses custom code to generate kernels. Its API makes it easy to specify different kernels and BLIS-style blocking factors, with fine-grained control over packing.
I added three instances of MaBLAS to the mix, each with different kernels, as well as Gaius, OpenBLAS, MKL, and PaddedMatrices.
MaBLAS will probably switch to just using the clean-up loops below a certain size, as the LoopVectorization kernels do very well at handling dynamic sizes.

MKL is ahead from 40-190, while PaddedMatrices stays in second place through this range and in first from there through 256. MaBLAS’s kernels generate better assembly than LoopVectorization does: LoopVectorization’s kernels load pointers and/or offsets from memory on every iteration, and then use these to load the elements being operated on, while MaBLAS’s kernels do exactly what you want – no reloading of pointers, no incrementing of any integers aside from the loop induction variable.
For the life of me, I haven’t been able to get LLVM to do this correctly, but it doesn’t seem to be too big a deal.
The performance is also notably higher than it was in the opening post, e.g. PaddedMatrices now staying above 90 GFLOPS from 200-256, instead of dipping into the low 80s.

Larger arrays:


It’s hard to separate most of them over most of this range, while they hover around the 100 GFLOPS mark. OpenBLAS comes out ahead, while the 32x6 and 40x5 MaBLAS routines aren’t far behind.

I think this is a great showing from Julia BLAS!

9 Likes

This is great progress.
I wonder how much of the wonder of @avx should be imported into @simd (which, in my opinion, should be more conservative).

Regarding optimized kernels for small matrices, have a look at BLASFEO.
For single-threaded use they are much better than MKL up to sizes of ~400.

  1. I’d have to see benchmarks to believe that.
  2. They don’t support my architecture, so I’d be forced to use the Haswell target, which would limit my CPU to a theoretical peak of 70 GFLOPS, below the above results.

EDIT:
I’d like to add:
I tuned those libraries on the computer I ran the benchmarks on.
How well will they generalize to other computers? Hopefully reasonably well, but I wouldn’t necessarily expect them to so closely match OpenBLAS and MKL over that range on different architectures. Similarly, everyone’s benchmarks claim they’re the fastest. Polly claims it can optimize gemm to be close to OpenBLAS, but in my benchmarks Polly took an eternity to compile, maybe achieving a third the speed of OpenBLAS, not even matching icc or ifort applied to triple loops with gemm-detection disabled (and with more or less instantaneous compile times), so I quit bothering with it and removed Polly from my LoopVectorization benchmarks. Similarly, Eigen claims to have competitive performance, but doesn’t do well at all either. Unlike Polly, Eigen is pleasant to compile, so it gets to stay in the LoopVectorization benchmarks.
I’d have to run benchmarks locally to “believe” them. Same should apply to you or anyone reading these. Sure, I got good performance on my computer after tuning them for it, but will they get good performance on yours? I’d be skeptical until these have matured much more.

1 Like

They have benchmarks:


They don’t have AVX512 but they do have AVX2.
You may see that I compared them to multi-threaded MKL (MATLAB) and got a factor of ~2 on a 6-core machine for sizes up to N = K = M = 400. I think the results are impressive for the single-threaded case. Anyhow, they have a BLAS API, so you can try it yourself. I’d be interested to see a comparison of the single-threaded case made by you.

They also have an interesting packing of data for small matrices which could be useful. Give it a deeper look. It’s worth it.

Okay, yeah, may be worth taking a look at. Their dgemm_nt performance using the blasfeoapi is very good compared to their theoretical peak, and almost unbelievably constant as a function of matrix size.

Did they count packing time when measuring blasfeoapi?
I strongly suspect that they did not.

Unfortunately, I can’t really benchmark it on my computer, because of course it’ll look bad when running Haswell kernels. On my computer, the theoretical peak with a Haswell kernel is 68.8 GFLOPS (4.3 GHz * 2 fma/cycle * 4 numbers/vector * 2 ops/vector/fma), while the AVX512 theoretical peak is 131.2 GFLOPS (4.1 * 2 * 8 * 2).
I guess I could compare vs theoretical peak, instead of number of GFLOPS, but that also isn’t quite fair, because memory stalls aren’t as much of a problem with slower kernels.
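For reference, the arithmetic behind those two peak numbers:

# double-precision peak per core = clock * FMAs/cycle * doubles/vector * 2 flops/FMA
peak_gflops(ghz, fmas_per_cycle, doubles_per_vector) = ghz * fmas_per_cycle * doubles_per_vector * 2

peak_gflops(4.3, 2, 4)   # Haswell-style AVX2 kernel on this CPU:  68.8
peak_gflops(4.1, 2, 8)   # native AVX512 kernel:                  131.2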

I think that’s Eigen’s problem as well: they claim to support AVX-512, but they don’t have proper kernels for it. OpenBLAS finally seems to have proper support with release 0.3.9 (which I’ve been using, and which will ship with Julia 1.5).
One of the problems with these C and assembly libraries is that adding such support is harder.

Looking briefly at their source, one of the things they do that OpenBLAS does not is adjust their packing strategy based on the input matrix sizes.
PaddedMatrices already does this, as does MKL, starting in 2017. But MKL obviously does it better than PaddedMatrices.

Currently, what it’s doing is:

if K * N > n_c * k_c
    # pack A and B into memory-access order
elseif M <= 72
    # do not pack
elseif M <= 2.5m_c && iszero(M % VECTOR_WIDTH)
    # do not pack
else
    # pack A into column-major order, aligning the columns
end

Where

M, N = size(C)
K = size(B,1)

and m_c, k_c, and n_c are the tiling parameters (corresponding to the size of the packed parts of the arrays).

In the link I shared about performance, you may see the performance of the BLAS flavor, meaning the API is classic BLAS and the measurements take into account the overhead of packing and unpacking. That one is called blasapi, while blasfeoapi means the data is assumed to already be in the special packed format.

While Skylake is wider, with more ports than Haswell, something tuned for Haswell should run very well on Skylake. Moreover, for them Haswell is a codename for AVX2 more than for a specific architecture.

You have a system with AVX512, but most users don’t. Moreover, all modern compilers in their latest versions currently emit 256-bit ops even for AVX512 (utilizing what’s really important in AVX512, its flexibility, and not just the doubling of the width).

So I still think you should compile it and test its single-threaded performance against other implementations. To make it fair, you can disable the AVX512 code path in MKL.
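(If I recall correctly, MKL’s dispatch can be constrained with the MKL_ENABLE_INSTRUCTIONS environment variable, set before MKL is loaded; the exact value names are worth double-checking against the MKL documentation.)

# Possible way to restrict MKL to AVX2 code paths for a fair comparison;
# must be set before MKL initializes, and value names should be verified.
ENV["MKL_ENABLE_INSTRUCTIONS"] = "AVX2"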

1 Like

One additional OSS BLAS implementation that might have useful ideas (although it’s focused on mobile, it’s also pretty small and self-contained) is Google’s ruy.

2 Likes

Performance below a few hundred in each dimension should be better now.
Benchmarks vs SMatrix and MMatrix from StaticArrays (note that MMatrix switches to BLAS when M*K*N >= 14^3). A Haswell laptop where that BLAS is MKL (AVX2 system):


Skylake laptop where that BLAS is OpenBLAS (AVX2 system):

Skylake/Cascadelake-X where that BLAS is OpenBLAS (AVX512 system):

The blue line is PaddedMatrices.jmul! applied to base Array{Float64,2} arrays.

With AVX2, it was faster than SMatrix starting from 7x7 and beyond, while with AVX512 SMatrix had one last hurrah at 8x8. The statically sized arrays from PaddedMatrices are at least as fast in the sizes I tested, especially with AVX512.
Unfortunately, I don’t support an equivalent of SMatrix yet, meaning escaping FixedSizeArrays will cause heap-allocation.
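If anyone wants to reproduce the comparison locally, something like this should work (assuming jmul! takes the usual mul!-style (C, A, B) argument order, as used for the blue line above):

using PaddedMatrices, StaticArrays, BenchmarkTools

N = 8
C = zeros(N, N); A = rand(N, N); B = rand(N, N)
@btime PaddedMatrices.jmul!($C, $A, $B)   # plain Array{Float64,2}, the blue line

Sa = @SMatrix rand(8, 8)
Sb = @SMatrix rand(8, 8)
@btime $Sa * $Sb                          # StaticArrays SMatrix for comparison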

Most compilers suck at generating AVX512 code, which is why it makes sense not to bother. LLVM is much worse at this than GCC or ICC. For example, dot product benchmarks:


LLVM is very slow at cleaning up non-vectorized remainders, therefore the length of these remainders is the primary driver of performance. If you pick a bunch of sizes at random, cutting the width of vectors in half will also cut the average remainder length in half, probably boosting LLVM’s average performance.
A far better solution to that problem would be not being so slow at handling non-vectorized remainders in the first place.
GCC fails above because it doesn’t break up the dependency chain, even with -funroll-loops. Only the Intel compilers and LoopVectorization actually generate good code for something as simple as a sum or dot product.
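For concreteness, these are the kinds of loops being compared; the @avx version is where LoopVectorization picks the vector width, the unrolling, and the remainder handling itself:

using LoopVectorization

# Plain-Julia dot product: LLVM decides the vectorization, unrolling,
# and how to handle the scalar remainder.
function dot_simd(a, b)
    s = zero(eltype(a))
    @inbounds @simd for i in eachindex(a)
        s += a[i] * b[i]
    end
    s
end

# Same loop under @avx: LoopVectorization splits the reduction across
# several accumulators and handles the remainder with masked operations.
function dot_avx(a, b)
    s = zero(eltype(a))
    @avx for i in eachindex(a)
        s += a[i] * b[i]
    end
    s
end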

A justification for this may be that lots of applications mostly aren’t AVX – lots of logic is hard to vectorize/very branchy – and if you have a few random short loops sprinkled in, you’d be better off not encouraging transitions. This is just speculation on my behalf. I don’t know the reasoning.

LLVM does a bad job of taking advantage of AVX512’s flexibility, regardless of vector width.
Note that on the AVX512 machine that SMatrix{4,4} multiplication was actually slower than on the other two, while FixedSizeMatrix{4,4} multiplication was much faster.

Don’t mistake the compiler’s limitations for hardware limitations.
While Itanium turned out to be the Itanic because of how hard it was to write good compilers for it, I don’t really think that’s the case here.

EDIT:
ruy being optimized for ARM would also make it interesting to look at eventually, given that ARM optimization will become important in Julia once Apple starts shipping ARM MacBooks.

14 Likes

Since there were quite a few posts about BLASFEO (and I’m its main developer), I would like to clarify a little what its intended contribution is.

As the name states, the intended application domain is embedded optimization, hence the focus on performance for small matrices and single threaded applications.
In designing BLASFEO, I took the freedom to depart from the standard BLAS API conventions, and explored how alternative APIs could help in improving performance.
In embedded optimization applications, parallelization is generally better handled at the algorithmic level (instead of the linear algebra level), and it is generally possible to pack all matrices into any arbitrary format beforehand, outside the critical loop.
In that sense, BLASFEO defines a matrix format (named panel-major) which is used for all input and output arguments to linear algebra routines. Therefore, all packing of data is performed offline, and there is no need to perform any packing/unpacking between different linear algebra routines calls.
Hence, in such a framework it makes no sense to include the packing time when computing the linear algebra performance, as this is not its practical usage anyway.
If you like, you can find more details in this paper https://arxiv.org/abs/1704.02457

By leveraging my experience in implementing BLASFEO’s own API, I tried to apply it to the implementation of a standard BLAS API with column-major matrix format for the routine arguments, still focusing on performance for small matrices and single-threaded routines. The main ideas and some results are in another paper https://arxiv.org/abs/1902.08115

The approach used in the BLAS API of BLASFEO (as well as the BLASFEO API itself) turned out to give very good performance on a wide range of computer architectures and ISAs.
The code is in no way optimized to a specific machine (i.e. my computer), and (by focusing on small matrix performance) is also irrespective of e.g. cache size.
The currently “fastest” target is named Haswell, for the simple reason that this was the first computer architecture implementing AVX2 and FMA, and the code would have been optimized in the exact same way on more recent computer architectures such as Intel Skylake (client) or AMD Zen.
About AVX512 support, the availability of this ISA is still very limited (especially in embedded devices), and compared to AVX2+FMA you would get any speed up only on a processor with two 512-bit FMA units, which as of now is limited to some high-end Xeon processors (and no laptop or desktop processor).
Therefore (and unfortunately), after 7 years from its introduction the Haswell target is still the best target on the vast majority of computers.

As already mentioned by RoyiAvital, BLASFEO is open source on a GitHub repo, so I would be very happy if you tried it out yourself; I’m fully confident that the results will be as expected. :wink:

13 Likes

What’s the difference between Gaius.jl, PaddedMatrices.jl and OpenBlas?
When should we use one or the other?

Which one uses less memory for the same operation?

I’m the author of Gaius.jl. As it says on the tin, don’t use it for anything important. It was really just a little experiment I ran where I wanted to understand what it would take to get multi-threading working with LoopVectorization.jl matmul kernels. Once things got sufficiently complicated and I got sufficiently busy, I backed off and stopped working on the project.

@Elrod’s work on PaddedMatrices.jl is strictly more promising than Gaius.jl, even if its multi-threading isn’t as good as Gaius.jl’s (last I checked, maybe this changed?). I don’t expect that there are any fundamental problems with its approach to multi-threading, and with some elbow grease it should be great across many cores.

OpenBLAS is an external project to Julia, but Julia farms out its linear algebra work to OpenBLAS when advantageous. There is at least some interest in replacing OpenBLAS with something pure Julia, mostly out of ideological convictions that Julia solves the two-language problem and so we shouldn’t need something like OpenBLAS. That said, there are practical reasons to want to get rid of OpenBLAS despite its fantastic performance. Linking to OpenBLAS contributes at least somewhat to the time it takes to start up a Julia session, plus it does not handle arbitrary matrix eltypes, only hardware floating-point numbers and complex numbers.

For the time being, I would not expect any of these pure Julia implementations to replace OpenBLAS for at least a couple of years. These projects are more precursors of things to come, as well as existence proofs that LoopVectorization.jl is scalable to rather hard problems.

13 Likes

Hi, the desktop I use to run most of my benchmarks has 2 AVX512 FMA units/core. For this reason, single threaded dgemm results over the size range 2-300 were about as expected:

For reference, this computer runs AVX512 code at 4.1 GHz and AVX2 code at 4.3 GHz, making the respective theoretical peaks 131.2 and 68.8 double precision GFLOPS.

I used default compilation options (except that I set MACRO_LEVEL=2 in the Makefile.local). I compiled with gcc 10.1.1.
The wrapper is here. I was using the BLAS API (i.e., using column major arguments rather than panel major).
I’m sure performance would have improved a bit via using BLASFEO API.

Maybe you could create a BLASFEO_jll library for distributing binaries, and then write a wrapper library providing a convenient Julia API ;).

I am a fan of the idea of using performance considerations for choosing the default data layout of our data structures. It would be nice to have broader support for panel-major arrays.

I also have a Haswell laptop:

Same compilation options, except I used gcc 9.3.0.
Here, BLASFEO did much better relative to the competition than on my desktop, but performance was highly erratic for 55x55 matrices and below.

I can only tune the performance on computers I have access to, and appreciate it when folks point out that they’re seeing something different than I am, so that I can investigate it.
That said, because everyone compiles the code locally, I can make use of hardware features queried by CpuId.jl to guide optimization decisions in the library. E.g., the M_c, K_c, and N_c blocking parameters are a function of the cache sizes and vector width:

julia> PaddedMatrices.matmul_params(Float64)
(240, 328, 4941)

Are these optimal? Far from it. Even on my own computers, I’ve only tuned haphazardly.
LoopVectorization.jl chooses M_r and N_r.
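To illustrate the idea (this is not the actual implementation), block sizes can be derived from cachesize() roughly as below; every constant and name here is a made-up placeholder, not PaddedMatrices’ real heuristic:

using CpuId

# Illustrative only: BLIS-style blocking parameters from cache sizes,
# roughly "a k_c × n_r slice of B in L1, an m_c × k_c block of A in L2,
# a k_c × n_c block of B in L3".
function rough_matmul_params(::Type{T}, mᵣ, nᵣ) where {T}
    l1, l2, l3 = cachesize()                       # data-cache sizes in bytes
    kc = (l1 ÷ 2) ÷ (sizeof(T) * nᵣ)               # k_c × n_r slice of B in half of L1
    mc = ((l2 ÷ 2) ÷ (sizeof(T) * kc) ÷ mᵣ) * mᵣ   # m_c × k_c block of A in half of L2
    nc = ((l3 ÷ 2) ÷ (sizeof(T) * kc) ÷ nᵣ) * nᵣ   # k_c × n_c block of B in half of L3
    mc, kc, nc
end

rough_matmul_params(Float64, 40, 5)   # e.g. with a 40 × 5 register tile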

AVX512 doubles the size and number of registers, which allows you to use a larger microkernel. This means a wider micro-tile from B, thus reducing the number of times you have to pass over the elements of the A-tile, reducing bandwidth requirements.
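The register-budget arithmetic behind that, using the 40x5 AVX512 kernel mentioned earlier as an example:

# Vector registers consumed by an Mᵣ × Nᵣ microkernel with vector width W:
# the C accumulator tile, one column of A vectors, and one register for
# broadcasting B. Back-of-the-envelope only.
regs_used(Mᵣ, Nᵣ, W) = (Mᵣ ÷ W) * Nᵣ + (Mᵣ ÷ W) + 1

regs_used( 8, 6, 4)   # AVX2,   W = 4: 15, fits in the 16 available registers
regs_used(40, 5, 8)   # AVX512, W = 8: 31, fits in the 32 available registers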

This is kind of a self-fulfilling prophecy. AVX512 isn’t common because it doesn’t provide much performance benefit for most workloads, and most workloads don’t support AVX512 because it isn’t common.
MacBooks have recently begun shipping with it (until their switch to ARM).

I don’t know much about embedded, but I’m going to guess x86 isn’t a popular choice in that area.

The main contribution of PaddedMatrices.jl is that this is what the basic macrokernel looks like:

function loopmul!(C, A, B, ::Val{1}, ::Val{0}, (M, K, N) = matmul_sizes(C, A, B))
    @avx for m ∈ 1:M, n ∈ 1:N
        Cₘₙ = zero(eltype(C))
        for k ∈ 1:K
            Cₘₙ += A[m,k] * B[k,n]
        end
        C[m,n] = Cₘₙ
    end
    nothing
end

While I’m not convinced it’s a good idea, I also have macro-kernels that pack A:

function packaloopmul!(
    C::AbstractStrideMatrix{Mc,Nc},
    Ãₚ::AbstractStrideMatrix{Mc,Kc},
    A::AbstractStrideMatrix{Mc,Kc},
    B::AbstractStrideMatrix{Kc,Nc},
    ::Val{1}, ::Val{0}, (M, K, N) = matmul_sizes(C, A, B)
) where {Mc,Kc,Nc}
    Nᵣrange = VectorizationBase.StaticUnitRange{1,nᵣ}()
    @avx for m ∈ 1:M, n ∈ Nᵣrange
        Cₘₙ = zero(eltype(C))
        for k ∈ 1:K
            Aₘₖ = A[m,k]
            Cₘₙ += Aₘₖ * B[k,n]
            Ãₚ[m,k] = Aₘₖ 
        end
        C[m,n] = Cₘₙ
    end
    Nrange = VectorizationBase.StaticLowerUnitRange{1+nᵣ}(N)
    @avx for m ∈ 1:M, n ∈ Nrange
        Cₘₙ = zero(eltype(C))
        for k ∈ 1:K
            Cₘₙ += Ãₚ[m,k] * B[k,n]
        end
        C[m,n] = Cₘₙ
    end    
    nothing
end

Depending on the size of the arrays, it chooses between not packing, packing A, or packing both A and B.

The primary point of interest here is as a proof of concept for LoopVectorization.jl’s ability to generate reasonable macro-kernels, before applying it to other problems. It is, however, still a little limited in the kinds of loops it supports, but it is a work in progress.

It’s also not too difficult to achieve better performance than competing BLAS libraries at small sizes, e.g. via not packing at all. Most of my own work involves matrices in that size range, so optimizing for it and still getting reasonable performance elsewhere is another part of the motivation.

I experimented with tile-packing, e.g:

function packaloopmul!(
    C::AbstractStrideArray{Tuple{Mᵣ,Mᵢ,Nᵣ,Nᵢ}},
    Ãₚ::AbstractStrideArray{Tuple{Mᵣ,K,Mᵢ}},
    A::AbstractStrideArray{Tuple{Mᵣ,Mᵢ,K}},
    B::AbstractStrideArray{Tuple{Nᵣ,K,Nᵢ}},
    α, β
) where {Mᵣ,Mᵢ,K,Nᵣ,Nᵢ}
    Mᵣrange = VectorizationBase.StaticUnitRange{1,Mᵣ}()
    Nᵣrange = VectorizationBase.StaticUnitRange{1,Nᵣ}()
    @avx for mᵢ ∈ axes(Ãₚ,3), mᵣ ∈ Mᵣrange, nᵣ ∈ Nᵣrange
        Cₘₙ = zero(eltype(C))
        for k ∈ axes(Ãₚ,2)
            Aᵣₖᵢ = A[mᵣ,mᵢ,k]
            Cₘₙ += Aᵣₖᵢ * B[nᵣ,k]
            Ãₚ[mᵣ,k,mᵢ] = Aᵣₖᵢ
        end
        C[mᵣ,mᵢ,nᵣ] = α * Cₘₙ + β * C[mᵣ,mᵢ,nᵣ]
    end
    Nᵢrange = VectorizationBase.StaticLowerUnitRange{2}(size(C,4))
    @avx for nᵢ ∈ Nᵢrange, mᵢ ∈ axes(Ãₚ,3), mᵣ ∈ Mᵣrange, nᵣ ∈ Nᵣrange
        Cₘₙ = zero(eltype(C))
        for k ∈ axes(Ãₚ,2)
            Cₘₙ += Ãₚ[mᵣ,k,mᵢ] * B[nᵣ,k,nᵢ]
        end
        C[mᵣ,mᵢ,nᵣ,nᵢ] = α * Cₘₙ + β * C[mᵣ,mᵢ,nᵣ,nᵢ]
    end
    nothing
end

(I’m treating panel-major matrices as 3-d arrays: M×K -> panel_width × K × number_panels.)
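As a concrete picture of that layout, here is a naive (illustrative only) conversion from column-major, assuming M is a multiple of the panel width:

# Convert a column-major M×K matrix into the panel-major layout described
# above: a panel_width × K × npanels array, where consecutive panels cover
# consecutive row blocks. Naive reference version, not optimized.
function to_panel_major(A::AbstractMatrix{T}, panel_width::Integer) where {T}
    M, K = size(A)
    @assert M % panel_width == 0
    npanels = M ÷ panel_width
    Ap = Array{T}(undef, panel_width, K, npanels)
    @inbounds for p in 1:npanels, k in 1:K, r in 1:panel_width
        Ap[r, k, p] = A[(p - 1) * panel_width + r, k]
    end
    return Ap
end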

However, I’d need to make changes somewhere for the above to work as intended, because around a month ago LoopVectorization began checking for large strides between memory accesses, and inserting prefetches if it found them. It also incorporates the cost of those prefetches into choosing M_r and N_r. If memory accesses are contiguous, it implicitly assumes the hardware prefetcher should handle it, and ends up with very square microkernels, because it models vloadp(s/d) as equally costly as vbroadcasts(s/d), and therefore chooses parameters to yield a roughly equal number of each.
I’m inclined to not think that’s quite correct, because while it may maximize arithmetic instructions / load instructions, it decreases the width of the B-micropanels, increasing the number of times it has to pass over the A-micropanel, thus increasing memory pressure. LoopVectorization currently doesn’t incorporate memory into its cost modeling.
The implication for the above tiling code is that it might choose different blocking parameters for both loops, which I don’t really intend.

But maybe it won’t; I haven’t looked at it. I could probably also be reasonable and not combine packing A with multiplication, since I don’t really think that improves performance much - if at all - anyway. Maybe there’s a better way.

Anyway, one of my conclusions has been that at small sizes, column major performs very well. If the data already fits in cache, the strides between memory accesses aren’t so important.

3 Likes

It’s changed…in that PaddedMatrices no longer supports multi-threading.
I may add multithreaded support again, but that would probably be motivated by the same “proof of concept” desire of seeing how efficient I could get it, with gold standards to optimize against.

MKL’s multithreading is extremely efficient, and also kicks in at a very small size. Importantly, it also helps when it does, unlike OpenBLAS which tends to trip over its own threads.

4 Likes

Hi Elrod,

first of all, thanks for trying BLASFEO out, and for reporting the erratic behavior you observed.
Since I haven’t observed that before, I investigated the matter, and it turned out it was due to denormals somehow being present in the memory around the matrices in your benchmark routines.
In case of the matrix size not multiple of the SIMD width, the kernel would load and perform FMA on the full SIMD width, and disregard the extra element when storing back to the result matrix.
However, it turned out that in case of denormals in the background memory, this would trigger the super-slow computation on denormals.
Now I have fixed the issue, and the code in the current BLASFEO master looks like this (for matrix sizes 2:24) on my machine (Intel Core i7 4810MQ):

Then, thanks for the interesting introduction to your work!
About the packing of A, I agree that in the implementation of the NN variant of dgemm it is not necessary as long as the data fits in the L1 or L2 cache, but you already start seeing some improvement for L3.
But it gets more beneficial when you implement other dgemm variants, especially the ones with A transposed. How do you handle such cases in your framework?
And how is such framework handling linear algebra routines with less regular access pattern, such as when one matrix is triangular, or factorizations?

In my work on BLASFEO, I also saw that packing A and/or B is much more beneficial on less powerful architectures than Intel Haswell/Skylake.
There, lower cache associativity, smaller TLBs and simpler hardware prefetchers imply that a packed matrix format such as panel-major can give sizeably better performance even for small matrices.

About the chicken-and-egg issue with AVX512, I think there is more to it than this.
It is also due to Intel’s extreme market segmentation strategy: as an example, contemporary Celerons and Pentiums even have AVX disabled, and Atoms do not physically have it.
AVX512 is just considered an additional tier on top, with additional market segmentation between 1 and 2 512-bit FMA units.
By the way, you seem to have a nice “desktop” development machine :wink:

3 Likes

@giaf, this is nice progress!

By the way, AVX512 is now available in the latest generation of laptops and will be available in the next generation of desktop CPUs (Intel’s).

Though I still think AMD’s approach is better: just have 2x the ports of 256-bit vector units and build a bigger, wider front end to handle the extra op pressure.

Regarding BLASFEO, I think what’s missing is a batch mode.
In that batch mode, if we could send a batch of small matrix multiplications and BLASFEO handled it with different threads, it would be a great way to enjoy BLASFEO’s great single-threaded performance when multiplying many matrices.

Then it could be incorporated into Julia’s broadcasting for such cases.

What do you think?

Ah, now I’ve noticed a regression in my latest benchmark results and fixed it:

There was a fairly expensive check that used to be evaluated at compile time, but was instead evaluated at runtime in PaddedMatrices v0.1.6. I just released v0.1.7, which moved the check back to compile time. The overhead of the check is O(1), but it was fairly expensive at small sizes.

I pulled blasfeo master, and reran the benchmarks:
i9-10980XE (Cascadelake-X):


i5-8350U CPU (Skylake):

i3-4010U CPU (Haswell):

Now PaddedMatrices is much faster over the 2:24 size range on the three computers I tested.

due to denormals somehow being present in the memory around the matrices in your benchmark routines.

This is normal in Julia. If you read out of bounds memory, it will probably be junk. If you interpret the memory as floating point numbers, denormals are likely.
I use @llvm.masked.load and @llvm.masked.store intrinsics to avoid touching out of bounds memory, while still using SIMD instructions. These get lowered to vmaskmovp* instructions with AVX. The mask consumes a floating point register. With AVX512, it uses normal vmovup* instructions, but applies a bitmask, using one of the opmask registers instead of a floating point register.
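To make that concrete, reinterpreting arbitrary bytes as Float64 lands in the subnormal range every so often, which is presumably what the BLASFEO kernels were hitting:

# Junk memory reinterpreted as Float64 sometimes falls in the subnormal
# range, triggering the slow denormal path. The count below is often
# nonzero for recycled memory; freshly zeroed pages give 0.
buf = Vector{UInt8}(undef, 1 << 16)
xs  = reinterpret(Float64, buf)
count(x -> 0 < abs(x) < floatmin(Float64), xs)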

Also, when it comes to such small sizes (e.g., 2:24) I’d recommend taking advantage of Julia’s JIT if possible.
Benchmarks that allowed specializing on the size of the arrays:

NN

CascadelakeX:


Skylake:

Haswell:

TN

CascadelakeX:


Skylake:

Haswell:

TT

CascadelakeX:


Skylake:

Haswell:

Of course, I don’t think Julia/a massive runtime is a good choice for embedded devices. But for someone already using Julia, it can provide a nice benefit.

I did not include NT because BLASFEO produced an incorrect result at 13x13.

The relevant code from PaddedMatrices does the following:

  1. If the first stride of A is not contiguous, pack.
  2. Heuristically, it will start packing if 73 > M with AVX512F, or 53 > M without it, unless the base of the array is aligned and stride(A,2) is also a multiple of the SIMD vector width.
  3. if mc * kc > M * K, where A is M x K and mc and kc are the blocking parameters, do pack A.

If it is packing A, it will also pack B if kc * nc ≤ K * N.

Therefore, it will generally pack A in T*. The exception there is if A's size is known at compile time (e.g., the FixedSizeArray), in which case it won’t pack A if the number of rows is less than or equal to twice the SIMD vector width. I should probably make that check more sophisticated, e.g. with respect to the number of times the same elements from A will actually have to be reloaded.

When I get there, I’ll decide how data should be laid out. On the code-gen side of things, I still need to add support for triangular loops and handling dependencies between iterations.
But this focus on code gen means that once LoopVectorization understands these things, the macrokernels should be easily specified via simple loops, making details such as memory layout relatively easy to change.

I am a fan of the idea of alternative layouts. I’ve slowly been working on a DSL for probabilistic programming. One optimization I’d like to support some day is having it choose the memory layout of all underlying data structures to optimize performance.
Perhaps I’d be better off making internal arrays default to some sort of tile-major layout instead of column-major?

That is unfortunate, but I like the idea of people having the ability to choose something aligned with their usecase. E.g., reviews are currently overwhelmingly better for AMD, but I am happy to still have the choice to buy AVX512, because I can write software to actually take advantage of it.

But most people don’t/won’t, so why should they pay for silicon they’re not using?
But in terms of segmentation, I’d love to see something like a broad adoption of ARM’s Scalable Vector Extension (at least, broadly adopted among ARM CPUs), with different segments simply supporting different vector widths. Those who need HPC can buy wide vectors, and those who don’t can buy 128-bit, all within the same instruction set.
Unfortunately, as far as I know, the A64FX is the only CPU currently supporting SVE.

Thanks. The CPU is a 10980XE, which is marketed as “high end desktop”. It hits >2.1 teraflops with MKL’s dgemm, can build software from source quickly, and looks stellar in all the heavily-SIMD benchmarks I run, so IMO worth the expense for a hobbyist. AMD CPUs do well on compiling software, but none currently come close to the >120 double-precision GFLOPS/core it achieves.
It’s worth more than my car, if you’re curious where my priorities are :wink:.

8 Likes