Apple M1 GPU from Julia?

Does anyone know if it is possible to access the GPU of the Apple M1 chip from Julia?

I really just need Float32 matrix-matrix multiplication (SGEMM). Anything else would be a bonus.


Unfortunately not really, at least for the moment.


I found a temporary solution to my specific problem.

There is an undocumented and unregistered package by @Elrod, AppleAccelerateLinAlgWrapper, that provides an interface to a few of the Apple Accelerate routines. It has a function AppleAccelerateLinAlgWrapper.gemm(A, B) that computes A*B, which is all I need at the moment.

It seems to be neither GPU nor CPU, but some hardware accelerator that Apple isn’t talking about?


PRs adding to docs or README always welcome.

But yes, it’s their undocumented matrix accelerator.
There seems to be one per chip for the M1 and M1 Pro/Max, so they’re all about equal in matmul performance when using this instruction.


Since my use-case is a bit different, I created a new package, SetBlasInt, based on the idea in AppleAccelerateLinAlgWrapper, but which just replaces Julia’s BLAS methods instead of keeping the Int32 versions separate. This way, it is possible to switch to Accelerate without modifying existing code.

This is just a hack, but it does what I need it to do, for now.


using LinearAlgebra, BenchmarkTools, SetBlasInt
A = rand(Float32, 1000, 1000)
@btime A*A  #  8.954 ms (2 allocations: 3.81 MiB)
setblasint(Int32, :sgemm)
@btime A*A  #  1.592 ms (2 allocations: 3.81 MiB)

I find it interesting that Accelerate only seems to use the hardware accelerator when called from the M1 build of Julia. When called from x86 Julia (on an M1 processor), it runs on the CPU and is much slower.

That’s an enormous change. Are there any known downsides to this, like reduced accuracy?

julia> A = rand(Float32, 1000, 1000);

julia> @btime $A * $A;
  min 1.459 ms, mean 1.758 ms (2 allocations, 3.81 MiB. GC mean 5.29%)   # 1.8 + SetBlasInt
  min 8.734 ms, mean 11.492 ms (2 allocations, 3.81 MiB. GC mean 1.96%)  # 1.8 native
  min 14.179 ms, mean 17.008 ms (2 allocations, 3.81 MiB. GC mean 0.18%) # 1.7 rosetta

julia> B = randn(100, 100);

julia> @btime $B * $B;
  min 9.542 μs, mean 18.177 μs (2 allocations, 78.17 KiB. GC mean 17.99%)
  min 63.458 μs, mean 167.130 μs (2 allocations, 78.17 KiB. GC mean 3.05%)
  min 86.709 μs, mean 172.516 μs (2 allocations, 78.17 KiB. GC mean 0.60%)

I had not even thought about that. But yes, when I compare the results of Float32 matrix-matrix multiplication (using the OpenBLAS Float64 result as reference), Accelerate seems to be less accurate than OpenBLAS.

Edit: After posting, I also did some tests using BigFloat as the reference, like this:

A = randn(Float32, 1000, 1000)
Aref = Float32.(big.(A) * big.(A))  # high-precision reference product
sqrt(sum(abs2, Aref .- A*A) / length(A)) / eps(Float32)  # RMS error in units of eps(Float32)

The results were 87.5 and 87.6 for two different runs of OpenBLAS and 149.6 and 150.1 for the same inputs with Accelerate. So, while OpenBLAS is clearly better, there’s not a huge difference.

Note that when you use SetBlasInt with Julia 1.8, you are replacing those BLAS routines with code that I copied from Julia 1.7.0. This may or may not work.

Ok. Here’s another accuracy check, which I think shows almost no change:

julia> function countepsfrom(x::T, xtrue) where {T<:AbstractFloat}
           target = T(xtrue)
           for n in Iterators.flatten(zip(0:100, -1:-1:-100))
               nextfloat(x, n) === target && return n
           end
           return round(Int, (target - x) / eps(x))
       end

julia> function count_star(A::Matrix)
           B = A * A
           Btrue = big.(A) * big.(A)
           Cs = abs.(countepsfrom.(B, Btrue))
           (; max=maximum(Cs), mean=mean(Cs))
       end

julia> using Statistics

julia> count_star(rand(100, 100))
(max = 8, mean = 1.5457)   # max varies a bit; mean 1.2383 on 1.7 + rosetta

julia> count_star(rand(Float32, 100, 100))
(max = 7, mean = 1.5359)

It looks like this ignores BLAS.set_num_threads, so one place it can stumble (a little bit) is a multi-threaded loop of single-thread multiplications:

julia> using NNlib

julia> C = randn(Float64, 100,100,50);

julia> @btime NNlib.batched_mul($C, $C);
  min 1.016 ms, mean 1.254 ms (38 allocations, 3.82 MiB. GC mean 7.85%)    # 1.8 + SetBlasInt
  min 835.083 μs, mean 1.331 ms (38 allocations, 3.82 MiB. GC mean 12.26%) # 1.8 native
  min 1.711 ms, mean 1.790 ms (38 allocations, 3.82 MiB. GC mean 1.60%)    # 1.7 rosetta

julia> 50 .* (18.177, 167.130, 172.516)  # extrapolate from means for B above
(908.85, 8356.5, 8625.8)
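The pattern in question, a multi-threaded loop of single-threaded multiplications, can be sketched like this (batched_mul_threaded is just an illustrative name for this post, not NNlib’s implementation):

```julia
using LinearAlgebra, Base.Threads

# Multiply each slice of a 3-d batch on its own Julia thread, with BLAS
# itself pinned to one thread to avoid oversubscription. An AMX-backed
# gemm that ignores BLAS.set_num_threads defeats this arrangement.
function batched_mul_threaded(A::Array{T,3}) where {T}
    C = similar(A)
    nblas = BLAS.get_num_threads()
    BLAS.set_num_threads(1)
    @threads for k in axes(A, 3)
        @views mul!(C[:, :, k], A[:, :, k], A[:, :, k])
    end
    BLAS.set_num_threads(nblas)   # restore the previous setting
    return C
end

A = randn(32, 32, 8)
C = batched_mul_threaded(A)
C[:, :, 3] ≈ A[:, :, 3] * A[:, :, 3]   # true
```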

Yes. Since there is just one co-processor serving all four performance cores, there’s no benefit to multi-threading. (I read here that there is also a second, slower, one supporting the efficiency cores.)

But, since the co-processor outperforms multithreaded CPU-bound code all on its own, one can just write single-threaded code and not worry about it.

I’ve opened an issue here on how we can support this in julia itself: Make USE_BLAS64 a runtime option? · Issue #43304 · JuliaLang/julia · GitHub


This is very important. I have written about it here: Convert a Current Installation of Julia to Use `BlasInt = Int32`.

I’m seeing your timings

julia> A=rand(Float32,1000,1000);

julia> @btime AppleAccelerateLinAlgWrapper.gemm($A,$A)
  1.597 ms (2 allocations: 3.81 MiB)

but is there some reason that you can’t simply do

julia> import Base.*

julia> *(A::Matrix,B::Matrix) = AppleAccelerateLinAlgWrapper.gemm(A,B);

without making a separate package and copying blas from base?

When I redefine * I get

julia> @btime $A*$A;
  1.472 ms (2 allocations: 3.81 MiB)

Leading to a question for @Elrod … is there a way your package can overload * and lu(!) automatically? Or Is it safer to let people like me do that on our own?

Overloading Base.* would work, but would only affect code that calls that function. A lot of code will use LinearAlgebra.mul! instead, and then still use 64-bit BLAS. By replacing methods at the end of the chain, we get everything. (As it happens, some code that I’m using calls BLAS.syrk! directly, so I needed SetBlasInt to cover that.)
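The dispatch gap is easy to demonstrate in plain Julia. Here the override just counts its calls instead of dispatching to Accelerate (star_calls is a made-up name for the demo):

```julia
using LinearAlgebra

const star_calls = Ref(0)   # how many times our override ran

# Type piracy, but fine for a demonstration.
import Base: *
function *(A::Matrix{Float32}, B::Matrix{Float32})
    star_calls[] += 1       # here one would call the accelerated gemm
    C = Matrix{Float32}(undef, size(A, 1), size(B, 2))
    return mul!(C, A, B)    # fall back to the ordinary BLAS path
end

A = rand(Float32, 4, 4)

A * A                       # goes through the override
mul!(similar(A), A, A)      # bypasses it: dispatches straight to BLAS

star_calls[]                # == 1, not 2
```

So any library that builds on mul! or the BLAS.* functions never sees the `*` override, which is why SetBlasInt replaces the methods further down the chain.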

SetBlasInt turned AppleAccelerateLinAlgWrapper into a quick fix for the problem that I had. It’s quite nice to be able to replace parts of a standard library without even having to re-compile Julia. :slight_smile:

I thought I’d see how the results hold up for higher dimensions and Float64. For matrix multiply the benefits decline, but are still there. I also tried lu!.

For all the runs, tAA is the time with the AppleAccelerateLinAlgWrapper functions and topen is the time with OpenBLAS running in native mode.

Everything is in 1.8-beta3 on a 2020 M1 MacBook Pro.

The bottom line seems to be that AppleAccelerate is a lot better for small problems and, especially for lu!, more or less the same for large problems.

The matrix multiply timings come from

# requires: using BenchmarkTools, Printf, AppleAccelerateLinAlgWrapper
function mmsweep(T = Float64)
    println("Report for mm at precision $T")
    for n in [128, 256, 512, 1024, 2048, 4096, 8192]
        mmtst(n, T)
    end
end

function mmtst(n, T = Float64)
    A = rand(T, n, n)
    B = copy(A)
    topen = @belapsed $A * $B
    tAA = @belapsed AppleAccelerateLinAlgWrapper.gemm($A, $B)
    rpre = topen / tAA
    @printf("dim = %5d    tAA = %5.2e    topen = %5.2e   ratio = %5.2e \n",
        n, tAA, topen, rpre)
end

and I got

julia> mmsweep(Float32)
Report for mm at precision Float32
dim =   128    tAA = 4.54e-06    topen = 5.15e-05   ratio = 1.13e+01 
dim =   256    tAA = 2.66e-05    topen = 1.78e-04   ratio = 6.69e+00 
dim =   512    tAA = 2.00e-04    topen = 1.16e-03   ratio = 5.78e+00 
dim =  1024    tAA = 1.73e-03    topen = 8.07e-03   ratio = 4.67e+00 
dim =  2048    tAA = 1.89e-02    topen = 5.74e-02   ratio = 3.03e+00 
dim =  4096    tAA = 1.53e-01    topen = 3.35e-01   ratio = 2.19e+00 
dim =  8192    tAA = 1.23e+00    topen = 2.75e+00   ratio = 2.23e+00 

julia> mmsweep()
Report for mm at precision Float64
dim =   128    tAA = 1.36e-05    topen = 8.46e-05   ratio = 6.23e+00 
dim =   256    tAA = 9.79e-05    topen = 3.71e-04   ratio = 3.79e+00 
dim =   512    tAA = 8.05e-04    topen = 2.57e-03   ratio = 3.19e+00 
dim =  1024    tAA = 7.92e-03    topen = 1.91e-02   ratio = 2.42e+00 
dim =  2048    tAA = 6.83e-02    topen = 1.39e-01   ratio = 2.03e+00 
dim =  4096    tAA = 5.76e-01    topen = 7.06e-01   ratio = 1.23e+00 
dim =  8192    tAA = 4.68e+00    topen = 5.61e+00   ratio = 1.20e+00 

This looks pretty good for single precision matrix multiply.

For lu! I did pretty much the same thing

function lusweep(T = Float64)
    println("Report for lu! at precision $T")
    for n in [128, 256, 512, 1024, 2048, 4096, 8192]
        lutst(n, T)
    end
end

function lutst(n, T = Float64)
    A = rand(T, n, n)
    B = copy(A)
    topen = @belapsed lu!($A)
    tAA = @belapsed AppleAccelerateLinAlgWrapper.lu!($B)
    rpre = topen / tAA
    @printf("dim = %5d    tAA = %5.2e    topen = %5.2e   ratio = %5.2e \n",
        n, tAA, topen, rpre)
end

The results are

julia> lusweep(Float32)
Report for lu! at precision Float32
dim =   128    tAA = 2.41e-05    topen = 5.85e-05   ratio = 2.43e+00 
dim =   256    tAA = 9.68e-05    topen = 4.74e-04   ratio = 4.89e+00 
dim =   512    tAA = 4.26e-04    topen = 1.28e-03   ratio = 3.00e+00 
dim =  1024    tAA = 2.02e-03    topen = 5.06e-03   ratio = 2.50e+00 
dim =  2048    tAA = 1.40e-02    topen = 2.95e-02   ratio = 2.12e+00 
dim =  4096    tAA = 1.77e-01    topen = 1.68e-01   ratio = 9.50e-01 
dim =  8192    tAA = 9.77e-01    topen = 1.03e+00   ratio = 1.06e+00 

julia> lusweep()
Report for lu! at precision Float64
dim =   128    tAA = 3.01e-05    topen = 2.54e-04   ratio = 8.43e+00 
dim =   256    tAA = 1.41e-04    topen = 5.52e-04   ratio = 3.90e+00 
dim =   512    tAA = 8.17e-04    topen = 1.59e-03   ratio = 1.95e+00 
dim =  1024    tAA = 4.85e-03    topen = 8.66e-03   ratio = 1.78e+00 
dim =  2048    tAA = 4.11e-02    topen = 5.43e-02   ratio = 1.32e+00 
dim =  4096    tAA = 2.67e-01    topen = 3.04e-01   ratio = 1.14e+00 
dim =  8192    tAA = 2.08e+00    topen = 2.09e+00   ratio = 1.00e+00 

This Apple dev thread has some discussion about this. I’m not convinced that the AMX engine is a whole lot better than conventional multithreading. However, see 43304 for an indication that Apple’s LAPACK is seriously out-of-date, which could also be the issue, especially on a new architecture.

We could be a long way from getting all the M1 (2, 3, …) has to offer.

A few comments

  • The thread title should maybe be modified, since the AMX engine is not the GPU
  • I have measured up to 500 GFLOPS for dgemm and 2 TFLOPS for sgemm using AMX on my M1 Pro. This is consistent with your timings: gflops(n, t) = 2n^3/(t*1e9) gives about 529 GFLOPS for n = 8192 and t = 2.08 s. It is easier for me to remember performance in GFLOPS than in times.
  • More importantly, AMX-based benchmarks seemed to consume much less energy than multithreaded CPU solutions. (It should be measured with a watt meter; I just rely on whether the fan starts to spin (CPU) or not (AMX).)
  • Last, I have made some tests with sparse matrix solvers, which are also much faster using Metal than what I can achieve in Julia…
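For reference, the GFLOPS conversion used in the second bullet is just:

```julia
# Effective throughput of an n×n matmul: ~2n^3 floating-point ops in t seconds
gflops(n, t) = 2 * n^3 / (t * 1e9)

gflops(8192, 2.08)   # ≈ 528.6 GFLOPS for n = 8192, t = 2.08 s
```

Note that for lu! the usual operation count is (2/3)n^3 rather than 2n^3, so the same time corresponds to a lower rate for a factorization than for a matmul.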

It would be so cool to have full access to Apple silicon HW power in Julia! Although I have seen that the Metal API documentation is really scarce (I had the feeling that Apple does not really care about developers, who are much more welcome at other commercial firms (blue or green)).


Just a quick update on Metal.jl: We’ve reached a milestone today making it possible to execute simple broadcasts on the GPU:

julia> using Metal

julia> a = MtlArray([1])
1-element MtlArray{Int64, 1}:
 1

julia> b = a .+ 1;

julia> b
1-element MtlArray{Int64, 1}:
 2

Seems simple, but a lot was involved to get this working. It of course doesn’t provide a simple high-performance interface just like that, as (in the context of this thread) we don’t have an optimized GEMM ready to go, but it seemed relevant anyway. See Rework metadata generation by maleadt · Pull Request #13 · JuliaGPU/Metal.jl · GitHub for some details.