Note that for 1000x1000
, BLAS will be using all the threads on your computer. If you try on a smaller size (say 100x100
) where BLAS will still only use 1 thread, the performance should be comparable. It’s possible to get Julia to thread this with Tullio
, but it’s still a good example.
Yes, which leaves me with a good feeling, whenever I do something, for which there is no underlying BLAS
routine. Many thanks to @Elrod and everyone else contributing to LoopVectorization
and related packages .
And for all those memory bound operations you come pretty close to the performance of BLAS
quite independent of the problem size.
Well, it turns out that if you set Chris Elrod loose on the problem (via Octavian.jl), you can outstrip BLAS:
julia> using LinearAlgebra, BenchmarkTools, Octavian
julia> r1 = rand(1000, 1000); r2 = rand(1000, 1000); c = zeros(1000, 1000);
julia> @btime mul!($c, $r1, $r2);
9.309 ms (0 allocations: 0 bytes)
julia> @btime Octavian.matmul!($c, $r1, $r2);
7.671 ms (0 allocations: 0 bytes)
I was not aware. thank you for posting
LoopVectorization can produce a near perfect microkernel, but it’s not just that it’s missing multithreading to beat BLAS for large matrices. It turns out that for large enough matrices, multiplication is so expensive that there’s a lot of tricks that can be very profitable that LoopVectorization won’t do for you. In this context, you’d still want LoopVecotrization for your microkernel, but there are outer loops and tricks you’d want to do on top of that to compete with BLAS even for single threaded matmul of large matrices.
PaddedMatrices.jl and it’s successor(?) Octavian.jl do these tricks on top of a LoopVectorization.jl microkernel and it shows in their performance numbers,
For one thing, you need to optimize the memoryaccess patterns to improve cache reuse (e.g. by blocking the matrix multiply) as soon as the matrices exceed the cache size.
This is getting off topic, but I’ve enjoyed the little bits and spurts of pure Julia BLAS like codes being created, but it doesn’t seem there is a centralized place to follow the development occuring in various packages. It’d be fun to know where it’s discussed (Slack thread?).
Of course I’m not an expert in any of these things and don’t understand it so this is purely my idle curiosity and I suspect it’s just that there aren’t many public discussions about it.
This gets discussed from time to time on Zulip. E.g.
More of the BLAS specific stuff ends up in the repos of the various packages though. E.g. this thread is a good read:
For example, here is a little recursive implementation of a cacheoblivious matrix multiplication that stays within a factor of 2 of the singlethreaded OpenBLAS performance on my laptop up to 3000×3000 matrices:
function add_matmul_rec!(m,n,p, i0,j0,k0, C,A,B)
if m+n+p <= 256 # base case: naive matmult for sufficiently large matrices
@avx for i = 1:m, k = 1:p
c = zero(eltype(C))
for j = 1:n
@inbounds c += A[i0+i,j0+j] * B[j0+j,k0+k]
end
@inbounds C[i0+i,k0+k] += c
end
else
m2 = m ÷ 2; n2 = n ÷ 2; p2 = p ÷ 2
add_matmul_rec!(m2, n2, p2, i0, j0, k0, C, A, B)
add_matmul_rec!(mm2, n2, p2, i0+m2, j0, k0, C, A, B)
add_matmul_rec!(m2, nn2, p2, i0, j0+n2, k0, C, A, B)
add_matmul_rec!(m2, n2, pp2, i0, j0, k0+p2, C, A, B)
add_matmul_rec!(mm2, nn2, p2, i0+m2, j0+n2, k0, C, A, B)
add_matmul_rec!(m2, nn2, pp2, i0, j0+n2, k0+p2, C, A, B)
add_matmul_rec!(mm2, n2, pp2, i0+m2, j0, k0+p2, C, A, B)
add_matmul_rec!(mm2, nn2, pp2, i0+m2, j0+n2, k0+p2, C, A, B)
end
return C
end
function matmul_rec!(C, A, B)
m,n = size(A)
n,p = size(B)
size(C) == (m,p)  error("incorrect dimensions ", size(C), " ≠ $m × $p")
fill!(C, zero(eltype(C)))
return add_matmul_rec!(m,n,p, 0,0,0, C,A,B)
end
matmul_rec(A, B) = matmul_rec!(Array{promote_type(eltype(A), eltype(B))}(undef,
size(A,1), size(B,2)),
A, B)
In comparison, Octavian.jl is ~100× more code, and OpenBLAS is > 1000× more code (and only supports 4 arithmetic types, whereas this code is typegeneric).
Here larger numbers = faster (vertical axis is rate of floatingpoint arithmetic operations), timings are for double precision, “naive” is a 3loop Julia version with @avx
, and BLAS is mul!
with OpenBLAS:
As an interested but uneducated (in this domain) observer, is all this effort (LoopVectorization, Gaius, Tullio, Octavian, etc. etc.) building toward something? If the right people keep plugging away are folks eventual gonna arrive at a pure Julia BLAS package that rivals other solutions (OpenBLAS, MKL) on most processors/matrix sizes? Or is it all just fun tuning that can win in some cases on some CPUs or single threaded but isn’t actually going to compete broadly?
I don’t think Julia needs such a thing, but it would be so cool!
I can only speak from my personal experience, but I have been using these tools. I had an instance, where a method did something, which could not be expressed by a BLAS
routine and the standard code was terribly slow due to the fact that Complex
numbers were involved (serial number crunching and no packed operations). In the end I could speed up this one method by a factor of 12, which would not have been possible without the effort by these experts.
Hard to say. If it is leading to a real pure julia BLAS that replaces OpenBLAS for us, that’s at the very least a couple years off. There would be advantages though if we did manage to do it.

We could potentially support a huge variety of julia datatypes with such a BLAS library. LoopVectorization currently only supports our machine types, but my understanding is that when the rewrite happens, a lot of thought will be put into trying to make a design that properly supports semiarbitrary structs. This could be a huge advantage

Size. OpenBLAS is a hefty binary dependancy and it being linked to julia accounts for a nontrivial amount of julia’s startup time.

Latency and interproceedural optimizations. Having BLAS in julia rather than OpenBLAS means that the same compiler sees more of your code. More optimization opportunities.

Sovereignty. Sometimes, it’s nice to have control over your dependencies. For example, OpenBLAS’ multithreading is not composable with julia’s multithreading. If you use both simultaneously, you might find yourself with worse than single threaded performance. If we control our BLAS, we have more options to rectify this (not strictly required. See for example FFTW which is composable with julia from what I understand)
It’s easy to overstate the importance of these points. They’re all nice, but none are actually super compelling reasons to devote a ton of energy to this. But there certainly are potential advantages
Multithreading is already good in Tullio, and better than anyone else in Octavian for all but the most gigantic of matrices. Octavian also seems so far to work well on quite a broad range of CPUs, so I’d say this is already debunked.
LoopVectorization
and Tullio
, in particular, handle situations that are not addressed by BLAS — they aren’t limited to accelerating BLAS operations.
Something worth pointing out is that single threaded BLAS is really useful for lots of HPC. A lot of problems have high level parallelism, so you turn off BLAS threading anyway.
StrideArrays.jl is the official successor of PaddedMatrices.jl, because PaddedMatrices really didn’t have anything to do with padding matrices anymore.
However, in the process, I also factored out the matrix multiplication algorithm from StrideArrays and into Octavian.jl, which StrideArrays depends on.
The algorithm has improved a lot since I last posted about it on discourse, and as you note, most of the discussion is in GitHub issues like some initial benchmark comparisons.
And the compile times to match…I’ve often had around 20s as the “time to first matmul”.
Based on the sizes of the arrays, it switches between a base case, packing A
, and packing both A
and B
.
Packing here means copying subblocks of memory. For A
, these blocks are stack allocated, while B
uses a heap allocated const
global (with a lock around it to avoid multiple matmuls in different threads using the same preallocated memory at the same time). [For 32bit Julia, A
blocks are also heap allocated, because 32bit Julia seems to have a smaller stack? Stack allocation was causing stack overflow errors on 32bit builds – not often I see those without recursion].
When threading, for “small” matrices (compared to the amount of L3 cache), it subdivides blocks of C
among the cores. If the corresponding blocks of A are large compared to the L2, these will be packed.
For matrices large compared to the L3, threading gets more complicated as it needs to synchronize.
That’s a brief overview of why there’s 100x more code: lots of branches on the size of the arrays to find a good approach.
The “aggressive subdivision” approaches don’t need cachetuning, since it’s for when the matrices fit in the caches. When using multiple cores, that region is much larger; 8x the cores means 8x the L1 and L2 caches.
But beyond that, I tried to develop a model to dynamically choose block sizes based on the hardware info and sizes of the matrices.
Inputs are a hardware summary (sizes of the caches, SIMDvector width), dimensions of the matrices, and 4 Float64
tuning parameters that you can tune with gradient free optimization methods like a GP Sampler or Particle Swarm.
The model needs a lot of work before I can incorporate something like it into LoopVectorization.
Different CPUs like strikingly different tuning parameters, meaning there seems to something missing, that the model + hardware information isn’t capturing.
A fully general algorithm that can take array sizes, loop lengths, and hardware details and return an optimal blocking strategy is the ultimate goal here, since then we’d be able to get near optimal performance with all sorts of random tensorstyle operations like Tullio does.
One of the nice things about matrix multiplication. It’s a problem with great solutions you can use as a reference.
Extending to other tensor operations would then mean checking that we don’t lose GFLOPs compared to matmul.
Other libraries like OpenBLAS and BLIS use the CPU architecture to determine block sizes, e.g. BLIS has separate files for Haswell, SkylakeX, Zen, Zen2, etc…
So at least what we have so far is more general than that.
Octavian wasn’t tuned for Zen2 before producing this plot; I think @stillyslalom reported getting another 20 GFLOPS or so for larger matrices after some tuning:
Intel 7980XE, which was used to tune the parameters in the first place also still eventually falls behind MKL, but hangs in there with OpenBLAS and BLIS despite using columnmajor packing instead of tilemajor:
I wasn’t criticizing Octavian.jl! Getting the last factor of 2 in performance is often really hard, not to mention the complication of multithreading.
I just like talking about matrix multiplication and jumped on the opportunity to talk about what the code is doing.
A naive question: why m+n+p <= 256 there instead of max(m * n, n * p) <= something? seems to me the latter is related to the cache size
The base case is completely unrelated to the cache size. You want the base case to be just large enough that the recursion overhead is negligible in comparison, but much smaller than any cache.
Probably it should be m * n * p <= something
, therefore, since the relevant factor is the cost of the base case, which scales like \Theta(mnp). But since I was looking mostly at square matrices it didn’t really matter too much exactly how we implemented the criterion as long as we did a little tuning of the cutoff value.