Realistically, how close is Gaius.jl to becoming a full replacement for BLAS in Julia?

After the new update to @mcabbott’s wonderful Tullio.jl, I really feel foolish for not at least giving it a shout-out above, especially in light of the pull request “Slightly better multi-threading” (mcabbott/Tullio.jl#16 on GitHub).

I’m sure there are many corner cases to work out, but with the latest version, Tullio.jl has basically satisfied the hope I expressed above:

My real hope is that someone can devise an index-notation-based multidimensional tensor contraction library where BLAS just comes out as a special case, rather than building up the tensor contractions out of BLAS kernels. As far as I know, such a library is pretty speculative and rather far off.

Looks like I was perhaps overly pessimistic with the “rather far off” remark. On my machine, a multi-threaded Tullio is beating multi-threaded OpenBLAS over a gigantic range of matrix sizes, and even when it starts to lose at very large sizes, it’s doing so quite gracefully, scaling much better than Gaius.jl ever did.
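
To give a taste of why index notation subsumes BLAS as a special case, here are a few illustrative one-liners of my own (not part of the benchmark below): the same macro covers BLAS-like calls and contractions BLAS has no single name for.

using Tullio, LoopVectorization

# Illustrative examples (mine, not from the benchmark): one macro, many
# BLAS-like kernels. Each `:=` allocates the output with the inferred shape.
A, B, x = rand(4, 5), rand(5, 6), rand(5)

@tullio y[i] := A[i, k] * x[k]           # gemv-like: matrix-vector product
@tullio C[i, j] := A[i, k] * B[k, j]     # gemm-like: matrix-matrix product
@tullio G[i, j] := A[i, k] * A[j, k]     # syrk-like: A * A'
@tullio s := A[i, k] * A[i, k]           # full reduction: sum of squares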

julia> using Tullio, LoopVectorization # Note: explicitly loading LoopVectorization is essential for performance here

julia> using BenchmarkTools, LinearAlgebra # needed below for @btime and mul!

julia> tmul!(C, A, B) = @tullio C[i, j] = A[i, k] * B[k, j]
tmul! (generic function with 1 method)

julia> foreach((2, 10, 50, 100, 200, 500, 1000, 2000, 5000, 10_000)) do N
           A, B = rand(N, N + 1), rand(N + 1, N + 2)
           @show N
           @btime tmul!(C, $A, $B) setup=(C=zeros($N, $N+2)) # Matmul with Tullio.jl
           @btime  mul!(C, $A, $B) setup=(C=zeros($N, $N+2)) # Matmul with OpenBLAS
       end
N = 2
  115.501 ns (0 allocations: 0 bytes)
  125.680 ns (0 allocations: 0 bytes)
N = 10
  277.834 ns (0 allocations: 0 bytes)
  371.707 ns (0 allocations: 0 bytes)
N = 50
  10.580 μs (0 allocations: 0 bytes)
  13.109 μs (0 allocations: 0 bytes)
N = 100
  24.370 μs (67 allocations: 3.80 KiB)
  38.840 μs (0 allocations: 0 bytes)
N = 200
  110.819 μs (100 allocations: 5.98 KiB)
  162.049 μs (0 allocations: 0 bytes)
N = 500
  1.623 ms (227 allocations: 9.95 KiB)
  2.003 ms (0 allocations: 0 bytes)
N = 1000
  13.422 ms (1236 allocations: 41.48 KiB)
  13.964 ms (0 allocations: 0 bytes)
N = 2000
  106.081 ms (9300 allocations: 293.48 KiB)
  104.685 ms (0 allocations: 0 bytes)
N = 5000
  1.750 s (147542 allocations: 4.51 MiB)
  1.589 s (0 allocations: 0 bytes)
N = 10000
  14.569 s (1179735 allocations: 36.01 MiB)
  12.745 s (0 allocations: 0 bytes)

Kudos to @mcabbott for this amazing library, and kudos to @Elrod for LoopVectorization.jl, the plumbing required for this library to reach this level of performance.

But wait, there’s more! Not only is Tullio.jl able to produce efficient multi-threaded CPU code, but it actually ‘just works’ on the GPU as well (though it’s not yet nearly as fast as cuBLAS):

julia> using CUDA, KernelAbstractions # Necessary for GPU codegen

julia> tmul!(C, A, B) = @tullio C[i, j] = A[i, k] * B[k, j] # redefine the function now that we've loaded GPU packages
tmul! (generic function with 1 method)

julia> foreach((2, 10, 50, 100, 200, 500, 1000, 2000, 5000, 10_000)) do N
           C, A, B = cu(zeros(N, N + 2)), cu(rand(N, N + 1)), cu(rand(N + 1, N + 2))
           @show N
           CUDA.@time tmul!(C, A, B) # Matmul with Tullio.jl
           CUDA.@time  mul!(C, A, B) # Matmul with cuBLAS
       end
N = 2
  0.000184 seconds (132 CPU allocations: 4.375 KiB)
  0.000064 seconds (17 CPU allocations: 464 bytes)
N = 10
  0.000075 seconds (136 CPU allocations: 4.438 KiB)
  0.000043 seconds (14 CPU allocations: 400 bytes)
N = 50
  0.000078 seconds (154 CPU allocations: 4.719 KiB)
  0.000042 seconds (14 CPU allocations: 400 bytes)
N = 100
  0.000088 seconds (180 CPU allocations: 5.125 KiB)
  0.000035 seconds (14 CPU allocations: 400 bytes)
N = 200
  0.000161 seconds (324 CPU allocations: 7.375 KiB)
  0.000040 seconds (14 CPU allocations: 400 bytes)
N = 500
  0.001257 seconds (2.32 k CPU allocations: 38.500 KiB)
  0.000191 seconds (14 CPU allocations: 400 bytes)
N = 1000
  0.013693 seconds (26.43 k CPU allocations: 415.609 KiB)
  0.000689 seconds (14 CPU allocations: 400 bytes)
N = 2000
  0.172323 seconds (341.57 k CPU allocations: 5.215 MiB)
  0.004034 seconds (14 CPU allocations: 400 bytes)
N = 5000
  2.642689 seconds (5.27 M CPU allocations: 80.372 MiB, 0.34% gc time)
  0.040553 seconds (14 CPU allocations: 400 bytes)
N = 10000
 21.101418 seconds (33.87 M CPU allocations: 516.814 MiB, 0.26% gc time)
  0.359977 seconds (17 CPU allocations: 464 bytes)

After seeing the improvements to Tullio on the CPU, I’m confident we’ll eventually see it become a GPU powerhouse as well (this will require low-level improvements to KernelAbstractions.jl, just as the CPU version relies heavily on LoopVectorization.jl).
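
If you’re curious how much of the CPU performance comes from each layer, Tullio accepts per-call keyword options that switch the layers off individually (these options are documented in Tullio’s README; the example arrays here are my own, and timings will be whatever your machine gives you):

using Tullio, LoopVectorization

A, B = rand(500, 500), rand(500, 500)

@tullio C1[i, j] := A[i, k] * B[k, j]                # default: LoopVectorization kernel + Tullio threads
@tullio C2[i, j] := A[i, k] * B[k, j] avx=false      # disable the LoopVectorization kernel
@tullio C3[i, j] := A[i, k] * B[k, j] threads=false  # disable Tullio's own multi-threading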

If anyone is interested in reproducing my timings, here’s my setup:

julia> versioninfo()
Julia Version 1.5.0
Commit 96786e22cc* (2020-08-01 23:44 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD Ryzen 5 2600 Six-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, znver1)
Environment:
  JULIA_NUM_THREADS = 6

(@v1.5) pkg> st Tullio LoopVectorization CUDA KernelAbstractions
Status `~/.julia/environments/v1.5/Project.toml`
  [052768ef] CUDA v1.2.1
  [63c18a36] KernelAbstractions v0.3.0
  [bdcacae8] LoopVectorization v0.8.24
  [bc48ee85] Tullio v0.2.1

and I’ve called LinearAlgebra.BLAS.set_num_threads(6) to get optimal multi-threaded performance out of OpenBLAS.
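
For anyone copy-pasting, that thread setup is just:

using LinearAlgebra
BLAS.set_num_threads(6)  # match JULIA_NUM_THREADS = 6 so OpenBLAS also uses all six cores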


To answer the original question in this thread: no, I don’t think Gaius.jl will ever be a full BLAS replacement, because I see no reason to work on it given what @mcabbott and @Elrod have achieved here.

Tullio has basically achieved the goal. Today. On the CPU side, it seems that all that’s needed now is cleanup, polish, and battle-testing. The speculative part is basically done: we now know it can beat OpenBLAS.

If you want to try making a Julia binary that uses Tullio.jl rather than OpenBLAS, you’ll basically need to follow the procedure the MKL.jl folks used, except rather than just redirecting the low-level APIs to an MKL binary, you’ll want to create Julia functions using Tullio.jl that reproduce the expected BLAS functions.
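
For concreteness, here’s a minimal sketch of what one such wrapper could look like. The name tgemm! is my own invention, not an existing API, and a real drop-in would fuse the scaling into the kernel (avoiding the temporary) and handle the transpose flags that BLAS.gemm! takes; this is just the shape of the idea:

using Tullio, LoopVectorization

# Hypothetical sketch (tgemm! is my name, not an existing API): reproduce the
# semantics of BLAS.gemm!('N', 'N', alpha, A, B, beta, C), i.e. C = alpha*A*B + beta*C.
function tgemm!(alpha::Number, A::AbstractMatrix, B::AbstractMatrix,
                beta::Number, C::AbstractMatrix)
    @tullio T[i, j] := A[i, k] * B[k, j]  # the Tullio matmul kernel; `:=` allocates T
    @. C = alpha * T + beta * C           # apply the gemm scaling with an in-place broadcast
    return C
end

A production version would also need the rest of the BLAS surface (gemv!, syrk!, trsm!, and so on) written in the same style.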
