Julia matrix-multiplication performance

Julia’s built-in matrix multiplication seems to be much faster than this LoopVectorization snippet?

julia> using LinearAlgebra, BenchmarkTools  # A_mul_B! below is the poster's own LoopVectorization-based kernel (not shown)

julia> r1 = rand(1000, 1000); r2 = rand(1000, 1000); c = zeros(1000, 1000);
julia> @belapsed $c .= $r1 * $r2
0.0257447
julia> @belapsed LinearAlgebra.mul!($c, $r1, $r2)
0.0233083
julia> @belapsed A_mul_B!($c, $r1, $r2)
0.0959804

Is the takeaway that Julia’s “normal” matrix multiplication calls very carefully tuned BLAS code, but LoopVectorization makes it surprisingly easy to get close to that performance?

2 Likes

Note that for 1000x1000, BLAS will be using all the threads on your computer. If you try on a smaller size (say 100x100) where BLAS will still only use 1 thread, the performance should be comparable. It’s possible to get Julia to thread this with Tullio, but it’s still a good example.
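
For instance (a sketch, not from the original post), you can pin OpenBLAS to a single thread before re-running the mul! benchmark, so that both sides of the comparison are single-threaded:

julia> using LinearAlgebra

julia> BLAS.set_num_threads(1)  # restrict OpenBLAS to one thread, then re-run the @belapsed lines above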

2 Likes

Yes, which leaves me with a good feeling whenever I do something for which there is no underlying BLAS routine. Many thanks to @Elrod and everyone else contributing to LoopVectorization and related packages :wink:.

3 Likes

And for all those memory-bound operations you come pretty close to the performance of BLAS, almost independently of the problem size.

1 Like

Well, it turns out that if you set Chris Elrod loose on the problem (via Octavian.jl), you can outstrip BLAS:

julia> using LinearAlgebra, BenchmarkTools, Octavian

julia> r1 = rand(1000, 1000); r2 = rand(1000, 1000); c = zeros(1000, 1000);

julia> @btime mul!($c, $r1, $r2);
  9.309 ms (0 allocations: 0 bytes)

julia> @btime Octavian.matmul!($c, $r1, $r2);
  7.671 ms (0 allocations: 0 bytes)
17 Likes

I was not aware. Thank you for posting.

LoopVectorization can produce a near-perfect microkernel, but multithreading isn’t the only thing it’s missing to beat BLAS for large matrices. It turns out that for large enough matrices, multiplication is so expensive that there are a lot of tricks that can be very profitable which LoopVectorization won’t do for you. In this context, you’d still want LoopVectorization for your microkernel, but there are outer loops and tricks you’d want to do on top of that to compete with BLAS even for single-threaded matmul of large matrices.
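
For concreteness, here is a minimal sketch of the kind of @avx microkernel being discussed, along the lines of the three-loop example in the LoopVectorization README (an illustration, not code from any of the packages above):

using LoopVectorization

function matmul_avx!(C, A, B)
    # plain three-loop matmul; @avx vectorizes and unrolls the loop nest
    @avx for i in axes(A, 1), j in axes(B, 2)
        cij = zero(eltype(C))
        for k in axes(A, 2)
            cij += A[i, k] * B[k, j]
        end
        C[i, j] = cij
    end
    return C
end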

PaddedMatrices.jl and its successor(?) Octavian.jl do these tricks on top of a LoopVectorization.jl microkernel, and it shows in their performance numbers.

7 Likes

For one thing, you need to optimize the memory-access patterns to improve cache re-use (e.g. by blocking the matrix multiply) as soon as the matrices exceed the cache size.
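
As a rough illustration (my own sketch, not code from any of these packages), blocking just means iterating over tiles small enough that the active pieces of A, B, and C stay resident in cache while they are reused:

function blocked_matmul!(C, A, B; blocksize = 64)
    m, n = size(A)
    n2, p = size(B)
    n == n2 && size(C) == (m, p) || error("dimension mismatch")
    fill!(C, zero(eltype(C)))
    for jj in 1:blocksize:p, kk in 1:blocksize:n, ii in 1:blocksize:m
        # multiply one tile of A by one tile of B, accumulating into a tile of C
        for j in jj:min(jj + blocksize - 1, p), k in kk:min(kk + blocksize - 1, n)
            @inbounds @simd for i in ii:min(ii + blocksize - 1, m)
                C[i, j] += A[i, k] * B[k, j]
            end
        end
    end
    return C
end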

1 Like

This is getting off topic, but I’ve enjoyed the little bits and spurts of pure Julia BLAS like codes being created, but it doesn’t seem there is a centralized place to follow the development occuring in various packages. It’d be fun to know where it’s discussed (Slack thread?).

Of course I’m not an expert in any of these things and don’t understand it so this is purely my idle curiosity and I suspect it’s just that there aren’t many public discussions about it.

This gets discussed from time to time on Zulip. E.g.

sum benchmarks

stack allocated arrays

More of the BLAS specific stuff ends up in the repos of the various packages though. E.g. this thread is a good read:

https://github.com/JuliaLinearAlgebra/Octavian.jl/issues/24

4 Likes

For example, here is a little recursive implementation of a cache-oblivious matrix multiplication that stays within a factor of 2 of the single-threaded OpenBLAS performance on my laptop up to 3000×3000 matrices:

using LoopVectorization  # provides @avx used in the base case

function add_matmul_rec!(m,n,p, i0,j0,k0, C,A,B)
    if m+n+p <= 256   # base case: naive matmult for sufficiently small matrices
        @avx for i = 1:m, k = 1:p
            c = zero(eltype(C))
            for j = 1:n
                @inbounds c += A[i0+i,j0+j] * B[j0+j,k0+k]
            end
            @inbounds C[i0+i,k0+k] += c
        end
    else
        m2 = m ÷ 2; n2 = n ÷ 2; p2 = p ÷ 2
        add_matmul_rec!(m2, n2, p2, i0, j0, k0, C, A, B)
        
        add_matmul_rec!(m-m2, n2, p2, i0+m2, j0, k0, C, A, B)
        add_matmul_rec!(m2, n-n2, p2, i0, j0+n2, k0, C, A, B)
        add_matmul_rec!(m2, n2, p-p2, i0, j0, k0+p2, C, A, B)
        
        add_matmul_rec!(m-m2, n-n2, p2, i0+m2, j0+n2, k0, C, A, B)
        add_matmul_rec!(m2, n-n2, p-p2, i0, j0+n2, k0+p2, C, A, B)
        add_matmul_rec!(m-m2, n2, p-p2, i0+m2, j0, k0+p2, C, A, B)
        
        add_matmul_rec!(m-m2, n-n2, p-p2, i0+m2, j0+n2, k0+p2, C, A, B)
    end
    return C
end

function matmul_rec!(C, A, B)
    m,n = size(A)
    n2,p = size(B)
    n == n2 || error("incompatible inner dimensions: size(A,2) = $n ≠ size(B,1) = $n2")
    size(C) == (m,p) || error("incorrect dimensions ", size(C), " ≠ $m × $p")
    fill!(C, zero(eltype(C)))
    return add_matmul_rec!(m,n,p, 0,0,0, C,A,B)
end


matmul_rec(A, B) = matmul_rec!(Array{promote_type(eltype(A), eltype(B))}(undef,
                                     size(A,1), size(B,2)),
                               A, B)
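
A quick sanity check against the built-in multiply (my own usage example, not from the original post):

using LinearAlgebra

A = rand(500, 300); B = rand(300, 400)
matmul_rec(A, B) ≈ A * B   # should be true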

In comparison, Octavian.jl is ~100× more code, and OpenBLAS is > 1000× more code (and only supports 4 arithmetic types, whereas this code is type-generic).

Here larger numbers = faster (the vertical axis is the rate of floating-point arithmetic operations), timings are for double precision, “naive” is a 3-loop Julia version with @avx, and BLAS is mul! with OpenBLAS:

[benchmark plot]

12 Likes

As an interested but uneducated (in this domain) observer, is all this effort (LoopVectorization, Gaius, Tullio, Octavian, etc.) building toward something? If the right people keep plugging away, are folks eventually going to arrive at a pure-Julia BLAS package that rivals other solutions (OpenBLAS, MKL) on most processors and matrix sizes? Or is it all just fun tuning that can win in some cases, on some CPUs, or single-threaded, but isn’t actually going to compete broadly?

I don’t think Julia needs such a thing, but it would be so cool!

1 Like

I can only speak from my personal experience, but I have been using these tools. I had an instance where a method did something that could not be expressed as a BLAS routine, and the standard code was terribly slow because complex numbers were involved (serial number crunching, no packed operations). In the end I could speed up that one method by a factor of 12, which would not have been possible without the effort of these experts.

4 Likes

Hard to say. If it is leading to a real pure-Julia BLAS that replaces OpenBLAS for us, that’s at the very least a couple of years off. There would be advantages, though, if we did manage to do it.

  • We could potentially support a huge variety of julia datatypes with such a BLAS library. LoopVectorization currently only supports our machine types, but my understanding is that when the rewrite happens, a lot of thought will be put into trying to make a design that properly supports semi-arbitrary structs. This could be a huge advantage.

  • Size. OpenBLAS is a hefty binary dependency, and its being linked to julia accounts for a non-trivial amount of julia’s startup time.

  • Latency and interprocedural optimizations. Having BLAS written in julia rather than using OpenBLAS means that the same compiler sees more of your code, which means more optimization opportunities.

  • Sovereignty. Sometimes it’s nice to have control over your dependencies. For example, OpenBLAS’ multithreading is not composable with julia’s multithreading; if you use both simultaneously, you might find yourself with worse-than-single-threaded performance. If we control our BLAS, we have more options to rectify this (owning the code is not strictly required for that; see for example FFTW, which is composable with julia’s threading, from what I understand).

It’s easy to overstate the importance of these points. They’re all nice, but none of them is actually a super compelling reason to devote a ton of energy to this. But there certainly are potential advantages.


Multithreading is already good in Tullio, and better than anyone else’s in Octavian for all but the most gigantic of matrices. Octavian also seems, so far, to work well on quite a broad range of CPUs, so I’d say this is already debunked.

10 Likes

LoopVectorization and Tullio, in particular, handle situations that are not addressed by BLAS — they aren’t limited to accelerating BLAS operations.
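
For example (my own illustration, not from the original post), Tullio will happily fuse an elementwise function into a contraction, something no single BLAS call expresses:

using Tullio, LoopVectorization

A = rand(200, 300); B = rand(300, 150)

# sum over k with a nonlinear function applied to B on the fly
@tullio C[i, j] := A[i, k] * log(B[k, j])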

10 Likes

Something worth pointing out is that single-threaded BLAS is really useful for lots of HPC. A lot of problems have high-level parallelism, so you turn off BLAS threading anyway.
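
For instance (a sketch of the common pattern, not code from the original post): one BLAS thread per Julia thread, with the parallelism expressed at the problem level instead:

using LinearAlgebra, Base.Threads

BLAS.set_num_threads(1)            # avoid oversubscribing cores

As = [rand(200, 200) for _ in 1:64]
Bs = [rand(200, 200) for _ in 1:64]
Cs = [zeros(200, 200) for _ in 1:64]

@threads for i in eachindex(As)    # high-level parallelism over independent multiplies
    mul!(Cs[i], As[i], Bs[i])
end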

7 Likes

StrideArrays.jl is the official successor of PaddedMatrices.jl, because PaddedMatrices really didn’t have anything to do with padding matrices anymore.

However, in the process I also factored the matrix multiplication algorithm out of StrideArrays and into Octavian.jl, which StrideArrays depends on.
The algorithm has improved a lot since I last posted about it on discourse, and as you note, most of the discussion is in GitHub issues like some initial benchmark comparisons.

And the compile times to match… I’ve often had around 20 s as the “time to first matmul”.
Based on the sizes of the arrays, it switches between a base case, packing A, and packing both A and B.
Packing here means copying sub-blocks of memory. For A, these blocks are stack allocated, while B uses a heap-allocated const global (with a lock around it to avoid multiple matmuls in different threads using the same preallocated memory at the same time). [For 32-bit Julia, the A blocks are also heap allocated, because 32-bit Julia seems to have a smaller stack? Stack allocation was causing stack-overflow errors on 32-bit builds; it’s not often I see those without recursion.]
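
(A hypothetical mini-example of what packing looks like, just to illustrate the idea; Octavian’s real packing code is considerably more involved:)

# Copy a sub-block of A into a contiguous buffer so the microkernel
# streams through sequential memory instead of strided columns of A.
function pack_block!(buf::AbstractMatrix, A::AbstractMatrix, irange, jrange)
    @inbounds for (jb, j) in enumerate(jrange), (ib, i) in enumerate(irange)
        buf[ib, jb] = A[i, j]
    end
    return buf
end

A = rand(1000, 1000)
Abuf = Matrix{Float64}(undef, 128, 64)   # block sized to stay resident in cache
pack_block!(Abuf, A, 1:128, 201:264)     # pack A[1:128, 201:264]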

When threading, for “small” matrices (compared to the amount of L3 cache), it subdivides blocks of C among the cores. If the corresponding blocks of A are large compared to the L2, these will be packed.
For matrices large compared to the L3, threading gets more complicated as it needs to synchronize.

That’s a brief overview of why there’s 100x more code: lots of branches on the size of the arrays to find a good approach.

The “aggressive subdivision” approaches don’t need cache tuning, since they’re for when the matrices fit in the caches. When using multiple cores, that region is much larger: 8× the cores means 8× the L1 and L2 cache.
But beyond that, I tried to develop a model to dynamically choose block sizes based on the hardware info and the sizes of the matrices.
Inputs are a hardware summary (sizes of the caches, SIMD-vector width), the dimensions of the matrices, and 4 Float64 tuning parameters that you can tune with gradient-free optimization methods like a GP sampler or particle swarm.
The model needs a lot of work before I can incorporate something like it into LoopVectorization.
Different CPUs like strikingly different tuning parameters, which suggests there is something missing that the model plus hardware information isn’t capturing.

A fully general algorithm that can take array sizes, loop lengths, and hardware details and return an optimal blocking strategy is the ultimate goal here, since then we’d be able to get near-optimal performance for all sorts of arbitrary tensor-style operations, like Tullio does.
That’s one of the nice things about matrix multiplication: it’s a problem with great reference solutions to compare against.
Extending to other tensor operations would then mean checking that we don’t lose GFLOPS compared to matmul.

Other libraries like OpenBLAS and BLIS use the CPU architecture to determine block sizes, e.g. BLIS has separate files for Haswell, Skylake-X, Zen, Zen2, etc…

So at least what we have so far is more general than that.

Octavian wasn’t tuned for Zen2 before producing this plot; I think @stillyslalom reported getting another 20 GFLOPS or so for larger matrices after some tuning:

[benchmark plot]

The Intel 7980XE, which was used to tune the parameters in the first place, also still eventually falls behind MKL, but hangs in there with OpenBLAS and BLIS despite using column-major packing instead of tile-major:

[benchmark plot]

20 Likes

I wasn’t criticizing Octavian.jl! Getting the last factor of 2 in performance is often really hard, not to mention the complication of multi-threading.

14 Likes

I just like talking about matrix multiplication and jumped on the opportunity to talk about what the code is doing. :slight_smile:

29 Likes

A naive question: why m+n+p <= 256 there instead of max(m * n, n * p) <= something? It seems to me the latter is more directly related to the cache size.