Realistically, how close is Gaius.jl to becoming a full replacement for BLAS in Julia?

I the above and is amazed. I always thought that BLAS, MKL etc are so complex and so well optimised that they are pretty much irreplaceable.

Even if Gauis is slower for cases, how close are we to doing away with BLAS altogether even as an experiment.

E.g can I compile Julia from scratch but remove all the Blas bits and call Gauis instead.

It might be slow, but I want to know how close are we to that? Is that even achievable in the foreseeable future?

5 Likes

It’s still missing a bunch of kernels, but it’s quite usable today for the kernels it has. https://github.com/YingboMa/MaBLAS.jl and https://github.com/chriselrod/PaddedMatrices.jl are two other examples as well. It’ll take quite awhile to make these into a full BLAS mostly because no one is working on that full time. That said, not all of the traditional BLAS kernels need to be implemented for a Julia BLAS, because things like BLAS1 are just broadcast so it’s somewhat better to not do them (so they can fuse), so you can probably cut it down to just a few core functions of which matrix multiplication is the key and it’s already done.

16 Likes

You might want to check out [ANN]: PaddedMatrices.jl, Julia BLAS and partially sized arrays for some truly impressive benchmark results + discussion of PaddedMatrices.jl

4 Likes

Regarding Gaius.jl, I wrote a bit about this here: [ANN]: PaddedMatrices.jl, Julia BLAS and partially sized arrays

TLDR: I’m not working on Gaius right now and I have no real plans to. It was never a serious project, more just an exploration of what was possible to do at a high level without nitty gritty knowledge of BLAS kernels.


To be honest, I have no real interest in writing the sort of super specialized low level code that would be required to do this right. My intention with Gaius was to see how well you can do with the sorts of high level tools available to me at the time: LoopVectorization.jl and julia’s composable multi-threading which allowed for a multi-threaded recursive divide and conquer strategy.

Projects like PaddedMatrices.jl and MaBLAS.jl are much more promising avenues with a greater likelyhood of materializing into something that could actually be used as a real BLAS library, but these not the sorts of projects I can see myself being well enough equipped or motivated to help out with. I just don’t know enough about computer hardware or fancy BLAS iteration schemes to be able to help with those projects.

My real hope is that someone can devise a index notation based multidimensional tensor contraction library where BLAS just comes out as a special case, rather than building up the tensor contractions out of BLAS kernels. As far as I know, such a library is pretty speculative and rather far off.

16 Likes

After the new update to @mcabbott’s wonderful Tullio.jl, I really feel foolish for not at least giving it a shout-out above, especially in light of https://github.com/mcabbott/Tullio.jl/pull/16.

I’m sure there are many corner-cases to work out, but with the latest version, Tullio.jl has basically satisfied my above hope:

My real hope is that someone can devise a index notation based multidimensional tensor contraction library where BLAS just comes out as a special case, rather than building up the tensor contractions out of BLAS kernels. As far as I know, such a library is pretty speculative and rather far off.

Looks like I was perhaps overly pessimistic with the “rather far off” remark. On my machine, a multi-threaded Tullio is beating multi-threaded OpenBLAS over a gigantic range of matrix sizes, and even when it starts to lose at very large sizes, it’s doing so quite gracefully, scaling much better than Gaius.jl ever did.

julia> using Tullio, LoopVectorization #Note: explicitly using LoopVectorization is essential for performance here

julia> tmul!(C, A, B) = @tullio C[i, j] = A[i, k] * B[k, j]
tmul! (generic function with 1 method)

julia> foreach((2, 10, 50, 100, 200, 500, 1000, 2000, 5000, 10_000)) do N
           A, B = rand(N, N + 1), rand(N + 1, N + 2)
           @show N
           @btime tmul!(C, $A, $B) setup=(C=zeros($N, $N+2)) # Matmul with Tullio.jl
           @btime  mul!(C, $A, $B) setup=(C=zeros($N, $N+2)) # Matmul with OpenBLAS
       end
N = 2
  115.501 ns (0 allocations: 0 bytes)
  125.680 ns (0 allocations: 0 bytes)
N = 10
  277.834 ns (0 allocations: 0 bytes)
  371.707 ns (0 allocations: 0 bytes)
N = 50
  10.580 μs (0 allocations: 0 bytes)
  13.109 μs (0 allocations: 0 bytes)
N = 100
  24.370 μs (67 allocations: 3.80 KiB)
  38.840 μs (0 allocations: 0 bytes)
N = 200
  110.819 μs (100 allocations: 5.98 KiB)
  162.049 μs (0 allocations: 0 bytes)
N = 500
  1.623 ms (227 allocations: 9.95 KiB)
  2.003 ms (0 allocations: 0 bytes)
N = 1000
  13.422 ms (1236 allocations: 41.48 KiB)
  13.964 ms (0 allocations: 0 bytes)
N = 2000
  106.081 ms (9300 allocations: 293.48 KiB)
  104.685 ms (0 allocations: 0 bytes)
N = 5000
  1.750 s (147542 allocations: 4.51 MiB)
  1.589 s (0 allocations: 0 bytes)
N = 10000
  14.569 s (1179735 allocations: 36.01 MiB)
  12.745 s (0 allocations: 0 bytes)

Kudos to @mcabbott for this amazing library and kudos to @Elrod for making the plumbing that’s required for this library to get to this level of performance with LoopVectorization.jl.

But wait, there’s more! Not only is Tullio.jl able to produce efficient multi-threaded CPU code, but it actually ‘just works’ on the GPU as well (though it’s not yet nearly as fast as CuBLAS):

julia> using CUDA, KernelAbstractions # Necessary for GPU codegen

julia> tmul!(C, A, B) = @tullio C[i, j] = A[i, k] * B[k, j] # redefine the function now that we've loaded GPU packages
tmul! (generic function with 1 method)

julia> foreach((2, 10, 50, 100, 200, 500, 1000, 2000, 5000, 10_000)) do N
           C, A, B = cu(zeros(N, N + 2)), cu(rand(N, N + 1)), cu(rand(N + 1, N + 2))
           @show N
           CUDA.@time tmul!(C, A, B) # Matmul with Tullio.jl
           CUDA.@time  mul!(C, A, B) # Matmul with CuBLAS
       end
N = 2
  0.000184 seconds (132 CPU allocations: 4.375 KiB)
  0.000064 seconds (17 CPU allocations: 464 bytes)
N = 10
  0.000075 seconds (136 CPU allocations: 4.438 KiB)
  0.000043 seconds (14 CPU allocations: 400 bytes)
N = 50
  0.000078 seconds (154 CPU allocations: 4.719 KiB)
  0.000042 seconds (14 CPU allocations: 400 bytes)
N = 100
  0.000088 seconds (180 CPU allocations: 5.125 KiB)
  0.000035 seconds (14 CPU allocations: 400 bytes)
N = 200
  0.000161 seconds (324 CPU allocations: 7.375 KiB)
  0.000040 seconds (14 CPU allocations: 400 bytes)
N = 500
  0.001257 seconds (2.32 k CPU allocations: 38.500 KiB)
  0.000191 seconds (14 CPU allocations: 400 bytes)
N = 1000
  0.013693 seconds (26.43 k CPU allocations: 415.609 KiB)
  0.000689 seconds (14 CPU allocations: 400 bytes)
N = 2000
  0.172323 seconds (341.57 k CPU allocations: 5.215 MiB)
  0.004034 seconds (14 CPU allocations: 400 bytes)
N = 5000
  2.642689 seconds (5.27 M CPU allocations: 80.372 MiB, 0.34% gc time)
  0.040553 seconds (14 CPU allocations: 400 bytes)
N = 10000
 21.101418 seconds (33.87 M CPU allocations: 516.814 MiB, 0.26% gc time)
  0.359977 seconds (17 CPU allocations: 464 bytes)

After seeing the improvements to Tullio on the CPU, I have a lot of trust we’ll see it eventually become a GPU powerhouse as well (this will require low-level improvements to KernelAbstractions.jl, just like how the CPU version is very reliant on LoopVectorization.jl)

If anyone is interested in reproducing my timings, here’s my setup:

julia> versioninfo()
Julia Version 1.5.0
Commit 96786e22cc* (2020-08-01 23:44 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD Ryzen 5 2600 Six-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, znver1)
Environment:
  JULIA_NUM_THREADS = 6

(@v1.5) pkg> st Tullio LoopVectorization CUDA KernelAbstractions
Status `~/.julia/environments/v1.5/Project.toml`
  [052768ef] CUDA v1.2.1
  [63c18a36] KernelAbstractions v0.3.0
  [bdcacae8] LoopVectorization v0.8.24
  [bc48ee85] Tullio v0.2.1

and I’ve set LinearAlgebra.BLAS.set_num_threads(6) to get optimal multi-threaded performance out of OpenBLAS.


To answer the original question in this thread: no, I don’t think Gaius.jl will ever be a full BLAS replacement because I see no reason to work on it given what @mcabbott and @Elrod has achieved here.

Tullio has basically achieved the goal. Today. On the CPU side, it seems that all that’s needed next is clean up, polish and battle testing. The speculative side is basically done. We know now that it can beat OpenBLAS.

If you want to try making a julia binary that uses Tullio.jl rather than OpenBLAS, you’ll need to basically follow the proceedure that the MKL.jl folks did, except rather than just redirecting the low level APIs to an MKL binary, you’ll want to create julia functions using Tullio.jl that reproduce the expected BLAS functions.

34 Likes

Tullio? (WHOOPS I should have read farther down the page where you said that already )

1 Like

so Tullio.jl doesn’t use any BLAS calls underneath the hood at all?

1 Like

Yes

2 Likes

Why does Tullio allocate that much for largish arrays? I guess addressing allocations would improve performance?

1 Like

My understanding is that those allocations are inherent to multi-threading, and that’s why they only appear for large arrays (i.e. multi-threading is only turned on for sufficiently large arrays). If I understand correctly, OpenBLAS will be making the same sort of heap allocations for it’s multi-threading, but that’s not visible to things like @btime

If we turn off multi-threading, we see that the allocations go away, so the algorithm is properly in-place but of course performance suffers.

julia> let N = 1000
           X = rand(N, N); Y = rand(N, N);
           mul1!(C, A, B) = @tullio C[i, j] = A[i, k] * B[k, j] threads=false
           mul2!(C, A, B) = @tullio C[i, j] = A[i, k] * B[k, j]
       
           @btime $mul1!(C, $X, $Y) setup = (C = zeros($N, $N))
           @btime $mul2!(C, $X, $Y) setup = (C = zeros($N, $N))
       end
  1.359 s (0 allocations: 0 bytes)
  203.294 ms (1240 allocations: 41.63 KiB)

That said, improvements to Julia’s multi-threading (perhaps lowering it’s memory footprint?) can indeed improve the performance of Tullio.

3 Likes

I see a larger drop:

julia> foreach((2, 10, 50, 100, 200, 500, 1000, 2000, 5000, 10_000)) do N
           A, B = rand(N, N + 1), rand(N + 1, N + 2)
           @show N
           @btime tmul!(C, $A, $B) setup=(C=zeros($N, $N+2)) # Matmul with Tullio.jl
           @btime  mul!(C, $A, $B) setup=(C=zeros($N, $N+2)) # Matmul with OpenBLAS
       end
N = 2
  94.021 ns (0 allocations: 0 bytes)
  113.121 ns (0 allocations: 0 bytes)
N = 10
  137.812 ns (0 allocations: 0 bytes)
  207.589 ns (0 allocations: 0 bytes)
N = 50
  2.776 μs (0 allocations: 0 bytes)
  3.966 μs (0 allocations: 0 bytes)
N = 100
  19.314 μs (70 allocations: 3.16 KiB)
  24.613 μs (0 allocations: 0 bytes)
N = 200
  39.291 μs (333 allocations: 15.97 KiB)
  58.184 μs (0 allocations: 0 bytes)
N = 500
  263.154 μs (496 allocations: 21.06 KiB)
  371.641 μs (0 allocations: 0 bytes)
N = 1000
  1.778 ms (2007 allocations: 68.28 KiB)
  1.714 ms (0 allocations: 0 bytes)
N = 2000
  14.587 ms (14107 allocations: 446.41 KiB)
  10.510 ms (0 allocations: 0 bytes)
N = 5000
  318.725 ms (221465 allocations: 6.76 MiB)
  147.421 ms (0 allocations: 0 bytes)
N = 10000
  2.292 s (1769756 allocations: 54.01 MiB)
  1.095 s (0 allocations: 0 bytes)

julia> versioninfo()
Julia Version 1.6.0-DEV.643
Commit e30e6e2790* (2020-08-14 21:17 UTC)
Platform Info:
  OS: Linux (x86_64-generic-linux)
  CPU: Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, cascadelake)

But it has a sizeable advantage at 500, and is more or less tied at 1,000.
The relative results will vary by architecture.
I could try on my Haswell laptop, but would expect to see a larger drop there. Ryzen in particular seems to choke less on memory (this also goes for Zen2, with full-speed avx2).

Most of the applications I’ve been interested in historically involved huge numbers of operations with matrices smaller than these.

I think Tullio is very aggressive in launching tasks?
@mcabbott, how does it partition the iteration space?

4 Likes

Thanks Mason, of course much of the credit for this being faster than @einsum is due to @Elrod’s work, this is largely a convenient front-end!

On my machines it does not beat OpenBLAS quite so well, and doesn’t always stay within a factor of 2 of MKL. The comparison still seems like a useful diagnostic – nothing is written specially for the case of matrix multiplication, and the package aims to be most useful for operations not covered by BLAS.

What it does at the moment is this (perhaps inspired by Gaius.jl originally):

  1. Divide the longest output dimension in half, spawning a new thread, until either it has used up nthreads(), or the total number of iterations falls below a threshold (which depends on whether the expression contains expensive functions like log).
  2. Divide the longest remaining index in two, recursively, until the number of iterations is less than Tullio.TILE[], then call a kernel (which uses @avx).

At present threads=false turns off both of these.

There are some unavoidable allocations from @spawn, but the large allocation numbers for large arrays come from the second step. And I am not completely sure why to be honest, I didn’t think recursion should be expensive, I meant to look into this some more. The actual division is like this (in step 2):

julia> @btime Tullio.cleave(z[])  setup=(z=Ref((1:100, 1:100, 1:100)))
  11.989 ns (0 allocations: 0 bytes)
((1:48, 1:100, 1:100), (49:100, 1:100, 1:100))

This I also meant to have another look at, IIRC it was quite a bit slower than KernelAbstractions.jl’s example and I didn’t figure out why.

9 Likes

Don’t be too modest. @Elrod’s work is incredibly important, but I think we can celebrate what you’ve managed to achieve without diminishing his work :smile:

The way I see it, you’re attacking the problem from two opposite sides and have now essentially met in the middle. LoopVectorization.jl can automatically produce microkernels that rival those hand-written in assembly, and Tullio.jl can automatically figure out the correct array partitioning to make sure those kernels are running on all the available threads and aren’t waiting too long for data from the cache.

Both are vital for performance, and neither are something that we want random package developers doing themselves by hand (and that’s before even getting into the other goodies Tullio has!)

17 Likes

I second the praise here for Tullio and LoopVectorization. I tried it out on a whim and ended up with huge speed gains… I’m sticking with it.

15 Likes