Does anyone know if it is possible to access the GPU of the Apple M1 chip from Julia?
I really just need Float32
matrix-matrix multiplication (SGEMM). Anything else would be a bonus.
Unfortunately not really, at least for the moment.
I found a temporary solution to my specific problem.
There is an undocumented and unregistered package by @Elrod that provides an interface to a few of the Apple Accelerate routines. It has a function AppleAccelerateLinAlgWrapper.gemm(A, B) that does A*B, which is all I need at the moment.
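Roughly, usage looks like this (a minimal sketch; the only function from the package assumed here is the gemm call just mentioned):
using AppleAccelerateLinAlgWrapper   # unregistered; add it by URL from GitHub
A = rand(Float32, 1000, 1000)
B = rand(Float32, 1000, 1000)
C = AppleAccelerateLinAlgWrapper.gemm(A, B)   # A*B through Accelerate
C ≈ A * B                                     # sanity check against the default BLAS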
It seems to be neither GPU nor CPU, but some hardware accelerator that Apple isn’t talking about?
PRs adding to docs or README always welcome.
But yes, it’s their undocumented matrix accelerator.
There seems to be one per chip for the M1 and M1 Pro/Max, so they're all about equal in matmul performance when using this instruction.
Since my use-case is a bit different, I created a new package, SetBlasInt, based on the idea in AppleAccelerateLinAlgWrapper, but which just replaces Julia’s BLAS methods instead of keeping the Int32 versions separate. This way, it is possible to switch to Accelerate without modifying existing code.
This is just a hack, but it does what I need it to do, for now.
Example:
using LinearAlgebra, BenchmarkTools, SetBlasInt
BLAS.lbt_forward("/System/Library/Frameworks/Accelerate.framework/Versions/A/Accelerate")  # forward BLAS calls to Accelerate via libblastrampoline
A = rand(Float32, 1000, 1000)
@btime A*A                  # 8.954 ms (2 allocations: 3.81 MiB)  -- still going through the 64-bit-integer wrappers
setblasint(Int32, :sgemm)   # replace Julia's sgemm wrapper with an Int32 (BlasInt) version so the call reaches Accelerate
@btime A*A                  # 1.592 ms (2 allocations: 3.81 MiB)
I find it interesting that Accelerate only seems to use the hardware accelerator when called from the M1 build of Julia. When called from x86 Julia (on an M1 processor), it runs on the CPU and is much slower.
That’s an enormous change. Are there any known downsides to this, like reduced accuracy?
julia> A = rand(Float32, 1000, 1000);
julia> @btime $A * $A;
min 1.459 ms, mean 1.758 ms (2 allocations, 3.81 MiB. GC mean 5.29%) # 1.8 + SetBlasInt
min 8.734 ms, mean 11.492 ms (2 allocations, 3.81 MiB. GC mean 1.96%) # 1.8 native
min 14.179 ms, mean 17.008 ms (2 allocations, 3.81 MiB. GC mean 0.18%) # 1.7 rosetta
julia> B = randn(100, 100);
julia> @btime $B * $B;
min 9.542 μs, mean 18.177 μs (2 allocations, 78.17 KiB. GC mean 17.99%)
min 63.458 μs, mean 167.130 μs (2 allocations, 78.17 KiB. GC mean 3.05%)
min 86.709 μs, mean 172.516 μs (2 allocations, 78.17 KiB. GC mean 0.60%)
I had not even thought about that. But yes, when I compare the results of Float32 matrix-matrix multiplication (using the OpenBLAS Float64 result as reference), Accelerate seems to be less accurate than OpenBLAS.
Edit: After posting, I also did some tests using BigFloat as the reference, like this:
A = randn(Float32, 1000, 1000)
Aref = Float32.(big.(A)*big.(A))                          # BigFloat reference product, rounded back to Float32
sqrt(sum(abs2, Aref .- A*A)/length(A)) / eps(Float32)     # RMS error measured in units of eps(Float32)
The results were 87.5 and 87.6 for two different runs of OpenBLAS and 149.6 and 150.1 for the same inputs with Accelerate. So, while OpenBLAS is clearly better, there's not a huge difference.
Note that when you use SetBlasInt with Julia 1.8, you are replacing those BLAS routines with code that I copied from Julia 1.7.0. This may or may not work.
Ok. Here’s another accuracy check, which I think shows almost no change:
julia> function countepsfrom(x::T, xtrue) where {T<:AbstractFloat}
target = T(xtrue)
for n in Iterators.flatten(zip(0:100, -1:-1:-100))
nextfloat(x, n) === target && return n
end
return round(Int, (target - x) / eps(x))
end;
julia> function count_star(A::Matrix)
B = A * A
Btrue = big.(A) * big.(A)
Cs = abs.(countepsfrom.(B, Btrue))
(; max=maximum(Cs), mean=mean(Cs))
end;
julia> using Statistics
julia> count_star(rand(100, 100))
(max = 8, mean = 1.5457) # max varies a bit; mean 1.2383 on 1.7 + rosetta
julia> count_star(rand(Float32, 100, 100))
(max = 7, mean = 1.5359)
It looks like this ignores BLAS.set_num_threads, so one place it can stumble (a little bit) is a multi-threaded loop of single-thread multiplications:
julia> C = randn(Float64, 100,100,50);
julia> @btime NNlib.batched_mul($C, $C);
min 1.016 ms, mean 1.254 ms (38 allocations, 3.82 MiB. GC mean 7.85%) # 1.8 + SetBlasInt
min 835.083 μs, mean 1.331 ms (38 allocations, 3.82 MiB. GC mean 12.26%) # 1.8 native
min 1.711 ms, mean 1.790 ms (38 allocations, 3.82 MiB. GC mean 1.60%) # 1.7 rosetta
julia> 50 .* (18.177, 167.130, 172.516) # extrapolate from means for B above
(908.85, 8356.5, 8625.8)
Yes. Since there is just one co-processor serving all four performance cores, there’s no benefit to multi-threading. (I read here that there is also a second, slower, one supporting the efficiency cores.)
But, since the co-processor outperforms multithreaded CPU-bound code all on its own, one can just write single-threaded code and not worry about it.
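For concreteness, here is a minimal sketch (function names are mine, not from any package) of the two patterns being compared: a multi-threaded loop of single-threaded multiplications, in the style of batched_mul, and the plain serial loop that the co-processor makes just as viable:
using LinearAlgebra, Base.Threads

# Threaded loop of per-slice multiplications (the batched_mul-style pattern).
function batched_mul_threaded!(C, A, B)
    @threads for k in axes(A, 3)
        mul!(view(C, :, :, k), view(A, :, :, k), view(B, :, :, k))
    end
    return C
end

# Plain serial loop; with a single shared matrix co-processor this is
# roughly as fast, so single-threaded code is often good enough.
function batched_mul_serial!(C, A, B)
    for k in axes(A, 3)
        mul!(view(C, :, :, k), view(A, :, :, k), view(B, :, :, k))
    end
    return C
end

C = randn(100, 100, 50); batched_mul_threaded!(similar(C), C, C);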
I’ve opened an issue here on how we can support this in julia itself: https://github.com/JuliaLang/julia/issues/43304
This is very important. I have written about it here: Convert a Current Installation of Julia to Use `BlasInt = Int32`.
I’m seeing your timings
julia> A=rand(Float32,1000,1000);
julia> @btime AppleAccelerateLinAlgWrapper.gemm($A,$A)
1.597 ms (2 allocations: 3.81 MiB)
but is there some reason that you can’t simply do
julia> import Base.*
julia> *(A::Matrix,B::Matrix) = AppleAccelerateLinAlgWrapper.gemm(A,B);
without making a separate package and copying blas from base?
When I redefine * I get
julia> @btime $A*$A;
1.472 ms (2 allocations: 3.81 MiB)
Leading to a question for @Elrod … is there a way your package can overload * and lu(!) automatically? Or is it safer to let people like me do that on our own?
Overloading Base.* would work, but would only affect code that calls that function. A lot of code will use LinearAlgebra.mul! instead, and then still use 64-bit BLAS. By replacing methods at the end of the chain, we get everything. (As it happens, some code that I'm using calls BLAS.syrk! directly, so I needed SetBlasInt to cover that.)
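To illustrate the dispatch point (a hypothetical method definition, not from any package): the overload catches expressions written with *, but mul! keeps its own methods and still ends up in the default 64-bit BLAS:
using LinearAlgebra
import Base: *

# Hypothetical overload: only calls spelled A*B reach this method.
*(A::Matrix{Float32}, B::Matrix{Float32}) = AppleAccelerateLinAlgWrapper.gemm(A, B)

A = rand(Float32, 1000, 1000); C = similar(A)
A * A          # dispatches to the overload above (Accelerate)
mul!(C, A, A)  # bypasses it: mul! still calls the stock BLAS gemm!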
SetBlasInt turned AppleAccelerateLinAlgWrapper into a quick fix for the problem that I had. It's quite nice to be able to replace parts of a standard library without even having to recompile Julia.
I thought I’d see how the results hold up for higher dimensions and Float64. For matrix multiply the benefits decline, but are still there. I also tried lu!.
For all the runs, tAA is the time with the AppleAccelerateLinAlgWrapper functions and topen is the time with OpenBLAS running in native mode.
Everything is in 1.8-beta3 on a 2020 M1 MacBook Pro.
The bottom line seems to be that AppleAccelerate is a lot better for small problems and, especially for lu!, more or less the same for large problems.
The matrix multiply timings come from
using BenchmarkTools, Printf                  # @belapsed, @printf
using AppleAccelerateLinAlgWrapper

function mmsweep(T = Float64)
    println("Report for mm at precision $T")
    for n in [128, 256, 512, 1024, 2048, 4096, 8192]
        mmtst(n, T)
    end
end

function mmtst(n, T = Float64)
    A = rand(T, n, n)
    B = copy(A)
    topen = @belapsed $A * $B                                     # OpenBLAS (native)
    tAA = @belapsed AppleAccelerateLinAlgWrapper.gemm($A, $B)     # Accelerate
    rpre = topen / tAA
    @printf("dim = %5d tAA = %5.2e topen = %5.2e ratio = %5.2e \n",
            n, tAA, topen, rpre)
end
and I got
julia> mmsweep(Float32)
Report for mm at precision Float32
dim = 128 tAA = 4.54e-06 topen = 5.15e-05 ratio = 1.13e+01
dim = 256 tAA = 2.66e-05 topen = 1.78e-04 ratio = 6.69e+00
dim = 512 tAA = 2.00e-04 topen = 1.16e-03 ratio = 5.78e+00
dim = 1024 tAA = 1.73e-03 topen = 8.07e-03 ratio = 4.67e+00
dim = 2048 tAA = 1.89e-02 topen = 5.74e-02 ratio = 3.03e+00
dim = 4096 tAA = 1.53e-01 topen = 3.35e-01 ratio = 2.19e+00
dim = 8192 tAA = 1.23e+00 topen = 2.75e+00 ratio = 2.23e+00
julia> mmsweep()
Report for mm at precision Float64
dim = 128 tAA = 1.36e-05 topen = 8.46e-05 ratio = 6.23e+00
dim = 256 tAA = 9.79e-05 topen = 3.71e-04 ratio = 3.79e+00
dim = 512 tAA = 8.05e-04 topen = 2.57e-03 ratio = 3.19e+00
dim = 1024 tAA = 7.92e-03 topen = 1.91e-02 ratio = 2.42e+00
dim = 2048 tAA = 6.83e-02 topen = 1.39e-01 ratio = 2.03e+00
dim = 4096 tAA = 5.76e-01 topen = 7.06e-01 ratio = 1.23e+00
dim = 8192 tAA = 4.68e+00 topen = 5.61e+00 ratio = 1.20e+00
This looks pretty good for single precision matrix multiply.
For lu! I did pretty much the same thing
using LinearAlgebra                           # lu!

function lusweep(T = Float64)
    println("Report for lu! at precision $T")
    for n in [128, 256, 512, 1024, 2048, 4096, 8192]
        lutst(n, T)
    end
end

function lutst(n, T = Float64)
    A = rand(T, n, n)
    B = copy(A)
    topen = @belapsed lu!($A)                                     # OpenBLAS (native)
    tAA = @belapsed AppleAccelerateLinAlgWrapper.lu!($B)          # Accelerate
    rpre = topen / tAA
    @printf("dim = %5d tAA = %5.2e topen = %5.2e ratio = %5.2e \n",
            n, tAA, topen, rpre)
end
The results are
julia> lusweep(Float32)
Report for lu! at precision Float32
dim = 128 tAA = 2.41e-05 topen = 5.85e-05 ratio = 2.43e+00
dim = 256 tAA = 9.68e-05 topen = 4.74e-04 ratio = 4.89e+00
dim = 512 tAA = 4.26e-04 topen = 1.28e-03 ratio = 3.00e+00
dim = 1024 tAA = 2.02e-03 topen = 5.06e-03 ratio = 2.50e+00
dim = 2048 tAA = 1.40e-02 topen = 2.95e-02 ratio = 2.12e+00
dim = 4096 tAA = 1.77e-01 topen = 1.68e-01 ratio = 9.50e-01
dim = 8192 tAA = 9.77e-01 topen = 1.03e+00 ratio = 1.06e+00
julia> lusweep()
Report for lu! at precision Float64
dim = 128 tAA = 3.01e-05 topen = 2.54e-04 ratio = 8.43e+00
dim = 256 tAA = 1.41e-04 topen = 5.52e-04 ratio = 3.90e+00
dim = 512 tAA = 8.17e-04 topen = 1.59e-03 ratio = 1.95e+00
dim = 1024 tAA = 4.85e-03 topen = 8.66e-03 ratio = 1.78e+00
dim = 2048 tAA = 4.11e-02 topen = 5.43e-02 ratio = 1.32e+00
dim = 4096 tAA = 2.67e-01 topen = 3.04e-01 ratio = 1.14e+00
dim = 8192 tAA = 2.08e+00 topen = 2.09e+00 ratio = 1.00e+00
This Apple dev thread has some discussion about this. I’m not convinced that the AMX engine is a whole lot better than conventional multithreading. However, see issue 43304 (linked above) for an indication that Apple’s LAPACK is seriously out of date, which could also be the issue, especially on a new architecture.
We could be a long way from getting all the M1 (2, 3, …) has to offer.
A few comments:
gflops(n, t) = 2n^3/(t*1.e9)
gives 516 GFLOPS for n = 8192 and t = 2.08 s. It is easier for me to remember performance in GFLOPS than in times.
Would be so cool to have full access to Apple silicon HW power in Julia! Although I have seen that the Metal API documentation is really scarce (I had the feeling that Apple does not really care about developers, who are much more welcome at other commercial firms, blue or green).
Just a quick update on Metal.jl: We’ve reached a milestone today making it possible to execute simple broadcasts on the GPU:
julia> using Metal
julia> a = MtlArray([1])
1-element MtlArray{Int64, 1}:
1
julia> b = a .+ 1;
julia> b
1-element MtlArray{Int64, 1}:
2
Seems simple, but a lot was involved to get this working. It of course doesn’t provide a simple high-performance interface just like that, as (in the context of this thread) we don’t have an optimized GEMM ready to go, but it seemed relevant anyway. See https://github.com/JuliaGPU/Metal.jl/pull/13 for some details.
Any update on this? I can’t get Metal.jl to work with Flux.jl to train a simple model on the GPU. I keep getting ERROR: ArgumentError: cannot take the CPU address of a MtlMatrix{Float32}. Any tips?