After the new update to @mcabbott's wonderful Tullio.jl, I really feel foolish for not at least giving it a shout-out above, especially in light of Slightly better multi-threading by mcabbott · Pull Request #16 · mcabbott/Tullio.jl · GitHub.
I'm sure there are many corner cases to work out, but with the latest version, Tullio.jl has basically satisfied my above hope:
My real hope is that someone can devise an index-notation-based multidimensional tensor contraction library where BLAS just comes out as a special case, rather than building up the tensor contractions out of BLAS kernels. As far as I know, such a library is pretty speculative and rather far off.
Looks like I was perhaps overly pessimistic with the "rather far off" remark. On my machine, multi-threaded Tullio is beating multi-threaded OpenBLAS over a gigantic range of matrix sizes, and even when it starts to lose at very large sizes, it's doing so quite gracefully, scaling much better than Gaius.jl ever did.
julia> using Tullio, LoopVectorization, LinearAlgebra, BenchmarkTools # explicitly loading LoopVectorization is essential for performance here; LinearAlgebra and BenchmarkTools provide mul! and @btime
julia> tmul!(C, A, B) = @tullio C[i, j] = A[i, k] * B[k, j]
tmul! (generic function with 1 method)
julia> foreach((2, 10, 50, 100, 200, 500, 1000, 2000, 5000, 10_000)) do N
A, B = rand(N, N + 1), rand(N + 1, N + 2)
@show N
@btime tmul!(C, $A, $B) setup=(C=zeros($N, $N+2)) # Matmul with Tullio.jl
@btime mul!(C, $A, $B) setup=(C=zeros($N, $N+2)) # Matmul with OpenBLAS
end
N = 2
115.501 ns (0 allocations: 0 bytes)
125.680 ns (0 allocations: 0 bytes)
N = 10
277.834 ns (0 allocations: 0 bytes)
371.707 ns (0 allocations: 0 bytes)
N = 50
10.580 μs (0 allocations: 0 bytes)
13.109 μs (0 allocations: 0 bytes)
N = 100
24.370 μs (67 allocations: 3.80 KiB)
38.840 μs (0 allocations: 0 bytes)
N = 200
110.819 μs (100 allocations: 5.98 KiB)
162.049 μs (0 allocations: 0 bytes)
N = 500
1.623 ms (227 allocations: 9.95 KiB)
2.003 ms (0 allocations: 0 bytes)
N = 1000
13.422 ms (1236 allocations: 41.48 KiB)
13.964 ms (0 allocations: 0 bytes)
N = 2000
106.081 ms (9300 allocations: 293.48 KiB)
104.685 ms (0 allocations: 0 bytes)
N = 5000
1.750 s (147542 allocations: 4.51 MiB)
1.589 s (0 allocations: 0 bytes)
N = 10000
14.569 s (1179735 allocations: 36.01 MiB)
12.745 s (0 allocations: 0 bytes)
Kudos to @mcabbott for this amazing library, and kudos to @Elrod for making the plumbing that's required for this library to get to this level of performance with LoopVectorization.jl.
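To make the "BLAS comes out as a special case" point concrete: the same one-line index notation covers other BLAS kernels, and also contractions that BLAS has no single kernel for. Here's a minimal sketch (the tgemv!, tdot, and tbmm! names are just mine for illustration, not part of Tullio):

using Tullio, LoopVectorization

# gemv (BLAS level 2): y = A*x is the same contraction with one fewer free index
tgemv!(y, A, x) = @tullio y[i] = A[i, k] * x[k]

# dot (BLAS level 1): `:=` tells Tullio to create the (scalar) output
tdot(x, y) = @tullio d := x[k] * y[k]

# A genuinely multidimensional contraction with no single BLAS kernel:
# batched matrix multiplication over a third index b
tbmm!(C, A, B) = @tullio C[i, j, b] = A[i, k, b] * B[k, j, b]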
But wait, there's more! Not only is Tullio.jl able to produce efficient multi-threaded CPU code, but it actually "just works" on the GPU as well (though it's not yet nearly as fast as CuBLAS):
julia> using CUDA, KernelAbstractions # Necessary for GPU codegen
julia> tmul!(C, A, B) = @tullio C[i, j] = A[i, k] * B[k, j] # redefine the function now that we've loaded GPU packages
tmul! (generic function with 1 method)
julia> foreach((2, 10, 50, 100, 200, 500, 1000, 2000, 5000, 10_000)) do N
C, A, B = cu(zeros(N, N + 2)), cu(rand(N, N + 1)), cu(rand(N + 1, N + 2))
@show N
CUDA.@time tmul!(C, A, B) # Matmul with Tullio.jl
CUDA.@time mul!(C, A, B) # Matmul with CuBLAS
end
N = 2
0.000184 seconds (132 CPU allocations: 4.375 KiB)
0.000064 seconds (17 CPU allocations: 464 bytes)
N = 10
0.000075 seconds (136 CPU allocations: 4.438 KiB)
0.000043 seconds (14 CPU allocations: 400 bytes)
N = 50
0.000078 seconds (154 CPU allocations: 4.719 KiB)
0.000042 seconds (14 CPU allocations: 400 bytes)
N = 100
0.000088 seconds (180 CPU allocations: 5.125 KiB)
0.000035 seconds (14 CPU allocations: 400 bytes)
N = 200
0.000161 seconds (324 CPU allocations: 7.375 KiB)
0.000040 seconds (14 CPU allocations: 400 bytes)
N = 500
0.001257 seconds (2.32 k CPU allocations: 38.500 KiB)
0.000191 seconds (14 CPU allocations: 400 bytes)
N = 1000
0.013693 seconds (26.43 k CPU allocations: 415.609 KiB)
0.000689 seconds (14 CPU allocations: 400 bytes)
N = 2000
0.172323 seconds (341.57 k CPU allocations: 5.215 MiB)
0.004034 seconds (14 CPU allocations: 400 bytes)
N = 5000
2.642689 seconds (5.27 M CPU allocations: 80.372 MiB, 0.34% gc time)
0.040553 seconds (14 CPU allocations: 400 bytes)
N = 10000
21.101418 seconds (33.87 M CPU allocations: 516.814 MiB, 0.26% gc time)
0.359977 seconds (17 CPU allocations: 464 bytes)
After seeing the improvements to Tullio on the CPU, I have a lot of faith that we'll eventually see it become a GPU powerhouse as well (this will require low-level improvements to KernelAbstractions.jl, just as the CPU version relies heavily on LoopVectorization.jl).
If anyone is interested in reproducing my timings, here's my setup:
julia> versioninfo()
Julia Version 1.5.0
Commit 96786e22cc* (2020-08-01 23:44 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: AMD Ryzen 5 2600 Six-Core Processor
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-9.0.1 (ORCJIT, znver1)
Environment:
JULIA_NUM_THREADS = 6
(@v1.5) pkg> st Tullio LoopVectorization CUDA KernelAbstractions
Status `~/.julia/environments/v1.5/Project.toml`
[052768ef] CUDA v1.2.1
[63c18a36] KernelAbstractions v0.3.0
[bdcacae8] LoopVectorization v0.8.24
[bc48ee85] Tullio v0.2.1
I've also set LinearAlgebra.BLAS.set_num_threads(6) to get optimal multi-threaded performance out of OpenBLAS.
To answer the original question in this thread: no, I don't think Gaius.jl will ever be a full BLAS replacement, because I see no reason to work on it given what @mcabbott and @Elrod have achieved here.
Tullio has basically achieved the goal. Today. On the CPU side, it seems all that's needed next is cleanup, polish, and battle-testing. The speculative part is basically done: we now know it can beat OpenBLAS.
If you want to try making a Julia binary that uses Tullio.jl rather than OpenBLAS, you'll basically need to follow the procedure the MKL.jl folks used, except that rather than just redirecting the low-level APIs to an MKL binary, you'll want to create Julia functions using Tullio.jl that reproduce the expected BLAS functions.
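As a rough illustration (not a tested recipe), here's the shape one such wrapper might take, mimicking the 5-argument gemm semantics C = α*A*B + β*C. The name tgemm! is mine, and a real replacement would also need to handle transposes, strides, the β == 0 special case, and the full set of BLAS element types:

using Tullio, LoopVectorization

function tgemm!(C, A, B, α, β)
    C .*= β                                   # scale the existing C by β
    @tullio C[i, j] += α * A[i, k] * B[k, j]  # accumulate α * A * B in place
    return C
end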