Does anyone know if it is possible to access the GPU of the Apple M1 chip from Julia?

I really just need `Float32` matrix-matrix multiplication (SGEMM). Anything else would be a bonus.

Unfortunately not really, at least for the moment.

I found a temporary solution to my specific problem.

There is an undocumented and unregistered package by @Elrod that provides an interface to a few of the Apple Accelerate routines. It has a function `AppleAccelerateLinAlgWrapper.gemm(A,B)` that computes `A*B`, which is all I need at the moment.

It seems to use neither the GPU nor the ordinary CPU cores, but some hardware accelerator that Apple isn’t talking about?
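For the curious, the underlying trick amounts to a direct `ccall` into the Accelerate framework’s CBLAS, which takes 32-bit integer arguments. A minimal sketch, not the package’s actual code (the `102`/`111` constants are the standard CBLAS column-major/no-transpose enum values):

```julia
const libacc = "/System/Library/Frameworks/Accelerate.framework/Versions/A/Accelerate"

function acc_sgemm(A::Matrix{Float32}, B::Matrix{Float32})
    m, k = size(A)
    k2, n = size(B)
    k == k2 || throw(DimensionMismatch("inner dimensions must match"))
    C = Matrix{Float32}(undef, m, n)
    # cblas_sgemm(Order, TransA, TransB, M, N, K, alpha, A, lda, B, ldb, beta, C, ldc)
    # CblasColMajor = 102, CblasNoTrans = 111; all integer arguments are Cint (Int32)
    ccall((:cblas_sgemm, libacc), Cvoid,
          (Cint, Cint, Cint, Cint, Cint, Cint, Cfloat,
           Ptr{Float32}, Cint, Ptr{Float32}, Cint,
           Cfloat, Ptr{Float32}, Cint),
          102, 111, 111, m, n, k, 1f0, A, m, B, k2, 0f0, C, m)
    return C
end
```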

PRs adding to docs or README always welcome.

But yes, it’s their undocumented matrix accelerator.

There seems to be one per chip for the M1 and M1 Pro/Max, so they are all about equal in matmul performance when using these instructions.

Since my use-case is a bit different, I created a new package, SetBlasInt, based on the idea in AppleAccelerateLinAlgWrapper, but which just replaces Julia’s BLAS methods instead of keeping the Int32 versions separate. This way, it is possible to switch to Accelerate without modifying existing code.

This is just a hack, but it does what I need it to do, for now.

Example:

```
using LinearAlgebra, BenchmarkTools, SetBlasInt
BLAS.lbt_forward("/System/Library/Frameworks/Accelerate.framework/Versions/A/Accelerate")
A = rand(Float32, 1000, 1000)
@btime A*A # 8.954 ms (2 allocations: 3.81 MiB)
setblasint(Int32, :sgemm)
@btime A*A # 1.592 ms (2 allocations: 3.81 MiB)
```

I find it interesting that Accelerate only seems to use the hardware accelerator when called from the M1 build of Julia. When called from x86 Julia (on an M1 processor), it runs on the CPU and is much slower.

That’s an enormous change. Are there any known downsides to this, like reduced accuracy?

```
julia> A = rand(Float32, 1000, 1000);
julia> @btime $A * $A;
min 1.459 ms, mean 1.758 ms (2 allocations, 3.81 MiB. GC mean 5.29%) # 1.8 + SetBlasInt
min 8.734 ms, mean 11.492 ms (2 allocations, 3.81 MiB. GC mean 1.96%) # 1.8 native
min 14.179 ms, mean 17.008 ms (2 allocations, 3.81 MiB. GC mean 0.18%) # 1.7 rosetta
julia> B = randn(100, 100);
julia> @btime $B * $B;
min 9.542 μs, mean 18.177 μs (2 allocations, 78.17 KiB. GC mean 17.99%)
min 63.458 μs, mean 167.130 μs (2 allocations, 78.17 KiB. GC mean 3.05%)
min 86.709 μs, mean 172.516 μs (2 allocations, 78.17 KiB. GC mean 0.60%)
```

I had not even thought about that. But yes, when I compare the results of Float32 matrix-matrix multiplication (using the OpenBLAS Float64 result as reference), Accelerate seems to be less accurate than OpenBLAS.

Edit: After posting, I also did some tests using BigFloat as the reference, like this:

```
A = randn(Float32, 1000, 1000)
Aref = Float32.(big.(A)*big.(A))
sqrt(sum(abs2, Aref .- A*A)/length(A)) / eps(Float32)
```

The results were 87.5 and 87.6 for two different runs with OpenBLAS, and 149.6 and 150.1 for the same inputs with Accelerate. So, while OpenBLAS is clearly better, there is not a huge difference.

Note that when you use SetBlasInt with Julia 1.8, you are replacing those BLAS routines with code that I copied from Julia 1.7.0. This may or may not work.

Ok. Here’s another accuracy check, which I think shows almost no change:

```
julia> function countepsfrom(x::T, xtrue) where {T<:AbstractFloat}
           target = T(xtrue)
           for n in Iterators.flatten(zip(0:100, -1:-1:-100))
               nextfloat(x, n) === target && return n
           end
           return round(Int, (target - x) / eps(x))
       end;

julia> function count_star(A::Matrix)
           B = A * A
           Btrue = big.(A) * big.(A)
           Cs = abs.(countepsfrom.(B, Btrue))
           (; max = maximum(Cs), mean = mean(Cs))
       end;

julia> using Statistics

julia> count_star(rand(100, 100))
(max = 8, mean = 1.5457) # max varies a bit; mean 1.2383 on 1.7 + rosetta

julia> count_star(rand(Float32, 100, 100))
(max = 7, mean = 1.5359)
```

It looks like this ignores `BLAS.set_num_threads`, so one place it can stumble (a little bit) is a multi-threaded loop of single-threaded multiplications:

```
julia> C = randn(Float64, 100,100,50);
julia> @btime NNlib.batched_mul($C, $C);
min 1.016 ms, mean 1.254 ms (38 allocations, 3.82 MiB. GC mean 7.85%) # 1.8 + SetBlasInt
min 835.083 μs, mean 1.331 ms (38 allocations, 3.82 MiB. GC mean 12.26%) # 1.8 native
min 1.711 ms, mean 1.790 ms (38 allocations, 3.82 MiB. GC mean 1.60%) # 1.7 rosetta
julia> 50 .* (18.177, 167.130, 172.516) # extrapolate from means for B above
(908.85, 8356.5, 8625.8)
```
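One quick way to check whether the thread setting is actually ignored is to time the same multiplication at both extremes. A sketch, assuming the Accelerate forwarding from earlier in the thread is in place; under OpenBLAS the two timings differ substantially, while the AMX path should barely move:

```julia
using LinearAlgebra, BenchmarkTools

A = rand(Float32, 1000, 1000)

BLAS.set_num_threads(1)
@btime $A * $A   # single-threaded BLAS

BLAS.set_num_threads(Sys.CPU_THREADS)
@btime $A * $A   # multi-threaded BLAS
```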

Yes. Since there is just one co-processor serving all four performance cores, there’s no benefit to multi-threading. (I read here that there is also a second, slower one supporting the efficiency cores.)

But, since the co-processor outperforms multithreaded CPU-bound code all on its own, one can just write single-threaded code and not worry about it.

I’ve opened an issue here on how we can support this in julia itself: Make USE_BLAS64 a runtime option? · Issue #43304 · JuliaLang/julia · GitHub

This is very important. I have written about it here: Convert a Current Installation of Julia to Use `BlasInt = Int32`.

I’m seeing your timings

```
julia> A=rand(Float32,1000,1000);
julia> @btime AppleAccelerateLinAlgWrapper.gemm($A,$A)
1.597 ms (2 allocations: 3.81 MiB)
```

but is there some reason that you can’t simply do

```
julia> import Base.*
julia> *(A::Matrix,B::Matrix) = AppleAccelerateLinAlgWrapper.gemm(A,B);
```

without making a separate package and copying BLAS from Base?

When I redefine `*` I get

```
julia> @btime $A*$A;
1.472 ms (2 allocations: 3.81 MiB)
```

Leading to a question for @Elrod … is there a way your package can overload `*` and `lu(!)` automatically? Or is it safer to let people like me do that on our own?

Overloading `Base.*` would work, but would only affect code that calls that function. A lot of code uses `LinearAlgebra.mul!` instead, and would then still use 64-bit BLAS. By replacing methods at the end of the chain, we get everything. (As it happens, some code that I’m using calls `BLAS.syrk!` directly, so I needed SetBlasInt to cover that.)
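To illustrate the call-chain point, here is a small hypothetical sketch: an overload of `Base.*` is simply never reached by code that calls `mul!` directly, because `mul!` dispatches straight to the BLAS wrappers.

```julia
using LinearAlgebra

import Base: *
# Type piracy, for demonstration only: intercept Float32 matrix products.
function *(A::Matrix{Float32}, B::Matrix{Float32})
    println("custom * was called")
    # Fall back to the generic AbstractMatrix method to avoid recursion.
    invoke(*, Tuple{AbstractMatrix,AbstractMatrix}, A, B)
end

A = rand(Float32, 4, 4)
C = similar(A)

A * A          # prints "custom * was called"
mul!(C, A, A)  # goes directly to BLAS.gemm!; the overload is never hit
```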

SetBlasInt turned AppleAccelerateLinAlgWrapper into a quick fix for the problem that I had. It’s quite nice to be able to replace parts of a standard library without even having to re-compile Julia.

I thought I’d see how the results hold up for higher dimensions and `Float64`. For matrix multiply the benefits decline, but are still there. I also tried `lu!`.

For all the runs, tAA is the time with the AppleAccelerateLinAlgWrapper functions and topen the timing with OpenBLAS running in native mode. Everything is in 1.8-beta3 on a 2020 M1 MacBook Pro.

The bottom line seems to be that AppleAccelerate is a lot better for small problems and, especially for `lu!`, more or less the same for large problems.

The matrix multiply timings come from

```
using BenchmarkTools, Printf
using AppleAccelerateLinAlgWrapper

function mmsweep(T = Float64)
    println("Report for mm at precision $T")
    for n in [128, 256, 512, 1024, 2048, 4096, 8192]
        mmtst(n, T)
    end
end

function mmtst(n, T = Float64)
    A = rand(T, n, n)
    B = copy(A)
    topen = @belapsed $A * $B
    tAA = @belapsed AppleAccelerateLinAlgWrapper.gemm($A, $B)
    rpre = topen / tAA
    @printf("dim = %5d tAA = %5.2e topen = %5.2e ratio = %5.2e \n",
            n, tAA, topen, rpre)
end
```

and I got

```
julia> mmsweep(Float32)
Report for mm at precision Float32
dim = 128 tAA = 4.54e-06 topen = 5.15e-05 ratio = 1.13e+01
dim = 256 tAA = 2.66e-05 topen = 1.78e-04 ratio = 6.69e+00
dim = 512 tAA = 2.00e-04 topen = 1.16e-03 ratio = 5.78e+00
dim = 1024 tAA = 1.73e-03 topen = 8.07e-03 ratio = 4.67e+00
dim = 2048 tAA = 1.89e-02 topen = 5.74e-02 ratio = 3.03e+00
dim = 4096 tAA = 1.53e-01 topen = 3.35e-01 ratio = 2.19e+00
dim = 8192 tAA = 1.23e+00 topen = 2.75e+00 ratio = 2.23e+00
julia> mmsweep()
Report for mm at precision Float64
dim = 128 tAA = 1.36e-05 topen = 8.46e-05 ratio = 6.23e+00
dim = 256 tAA = 9.79e-05 topen = 3.71e-04 ratio = 3.79e+00
dim = 512 tAA = 8.05e-04 topen = 2.57e-03 ratio = 3.19e+00
dim = 1024 tAA = 7.92e-03 topen = 1.91e-02 ratio = 2.42e+00
dim = 2048 tAA = 6.83e-02 topen = 1.39e-01 ratio = 2.03e+00
dim = 4096 tAA = 5.76e-01 topen = 7.06e-01 ratio = 1.23e+00
dim = 8192 tAA = 4.68e+00 topen = 5.61e+00 ratio = 1.20e+00
```

This looks pretty good for single precision matrix multiply.

For lu! I did pretty much the same thing

```
using BenchmarkTools, Printf
using LinearAlgebra, AppleAccelerateLinAlgWrapper

function lusweep(T = Float64)
    println("Report for lu! at precision $T")
    for n in [128, 256, 512, 1024, 2048, 4096, 8192]
        lutst(n, T)
    end
end

function lutst(n, T = Float64)
    A = rand(T, n, n)
    B = copy(A)
    topen = @belapsed lu!($A)
    tAA = @belapsed AppleAccelerateLinAlgWrapper.lu!($B)
    rpre = topen / tAA
    @printf("dim = %5d tAA = %5.2e topen = %5.2e ratio = %5.2e \n",
            n, tAA, topen, rpre)
end
```

The results are

```
julia> lusweep(Float32)
Report for lu! at precision Float32
dim = 128 tAA = 2.41e-05 topen = 5.85e-05 ratio = 2.43e+00
dim = 256 tAA = 9.68e-05 topen = 4.74e-04 ratio = 4.89e+00
dim = 512 tAA = 4.26e-04 topen = 1.28e-03 ratio = 3.00e+00
dim = 1024 tAA = 2.02e-03 topen = 5.06e-03 ratio = 2.50e+00
dim = 2048 tAA = 1.40e-02 topen = 2.95e-02 ratio = 2.12e+00
dim = 4096 tAA = 1.77e-01 topen = 1.68e-01 ratio = 9.50e-01
dim = 8192 tAA = 9.77e-01 topen = 1.03e+00 ratio = 1.06e+00
julia> lusweep()
Report for lu! at precision Float64
dim = 128 tAA = 3.01e-05 topen = 2.54e-04 ratio = 8.43e+00
dim = 256 tAA = 1.41e-04 topen = 5.52e-04 ratio = 3.90e+00
dim = 512 tAA = 8.17e-04 topen = 1.59e-03 ratio = 1.95e+00
dim = 1024 tAA = 4.85e-03 topen = 8.66e-03 ratio = 1.78e+00
dim = 2048 tAA = 4.11e-02 topen = 5.43e-02 ratio = 1.32e+00
dim = 4096 tAA = 2.67e-01 topen = 3.04e-01 ratio = 1.14e+00
dim = 8192 tAA = 2.08e+00 topen = 2.09e+00 ratio = 1.00e+00
```

This Apple dev thread has some discussion about this. I’m not convinced that the AMX engine is a whole lot better than conventional multithreading. However, see issue 43304 for an indication that Apple’s LAPACK is seriously out of date, which could also be the issue, especially on a new architecture.

We could be a long way from getting all the M1 (2, 3, …) has to offer.

A few comments:

- The thread title should maybe be modified, since the AMX engine is not the GPU.
- I have measured up to 500 GFLOPS for dgemm and 2 TFLOPS for sgemm using AMX on my M1 Pro. This is consistent with your timings, where `gflops(n, t) = 2n^3/(t*1.e9)` gives 516 GFLOPS for `n = 8192` and `t = 2.08` s. It is easier for me to remember performance in GFLOPS than in times.
- More importantly, the AMX-based benchmarks seemed to consume much less energy than multi-threaded CPU solutions. (This should be measured with a watt meter, since I just rely on whether the fan starts to spin (CPU) or not (AMX).)
- Last, I have made some tests with sparse matrix solvers, which are also much faster using Metal than what I can achieve in Julia…

It would be so cool to have full access to Apple silicon hardware power in Julia! Although I have seen that the Metal API documentation is really scarce (I had the feeling that Apple does not really care about developers, who are much more welcome at other commercial firms (blue or green)).
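The GFLOPS helper mentioned above is handy to apply to the other timings in the thread as well. A small sketch using two numbers quoted earlier (the 1000×1000 Float32 Accelerate timing, and the Float64 gemm at `n = 8192`):

```julia
# An n-by-n matmul performs ~2n^3 floating-point operations.
gflops(n, t) = 2n^3 / (t * 1e9)

gflops(1000, 1.592e-3)   # ≈ 1256 GFLOPS: the 1000×1000 Float32 timing above
gflops(8192, 4.68)       # ≈ 235 GFLOPS: the Float64 gemm at n = 8192
```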

Just a quick update on Metal.jl: We’ve reached a milestone today making it possible to execute simple broadcasts on the GPU:

```
julia> using Metal
julia> a = MtlArray([1])
1-element MtlArray{Int64, 1}:
1
julia> b = a .+ 1;
julia> b
1-element MtlArray{Int64, 1}:
2
```

Seems simple, but a lot was involved to get this working. It of course doesn’t provide a simple high-performance interface just like that, as (in the context of this thread) we don’t have an optimized GEMM ready to go, but it seemed relevant anyway. See Rework metadata generation by maleadt · Pull Request #13 · JuliaGPU/Metal.jl · GitHub for some details.
