Taking advantage of Apple M1?

LaurentPlagne · February 16, 2023, 7:40am

128 GB with the M1 Ultra and 96 GB with the M2s. But this is clearly a drawback of this integrated SoC architecture.

Vitaliy_Yakovchuk · February 18, 2023, 5:02pm

And here are the Apple M2 Max results (30% faster compared to the M1 Max):

julia> include("SingleSpring.jl")
27.5     GFLOPS
132.0    GB/s
  7.324295 seconds (1.40 M allocations: 1.162 GiB, 0.77% gc time, 1.46% compilation time)

julia> versioninfo()
Julia Version 1.8.5
Commit 17cfb8e65e* (2023-01-08 06:45 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin22.1.0)
  CPU: 12 × Apple M2 Max
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, apple-m1)
  Threads: 8 on 8 virtual cores
Environment:
  JULIA_EDITOR = code

But what is interesting is that it is significantly better with Julia 1.9.0-beta4:

julia> include("SingleSpring.jl")
33.5	 GFLOPS
161.0	 GB/s
  6.690864 seconds (1.30 M allocations: 1.165 GiB, 0.87% gc time, 4.50% compilation time)

julia> versioninfo()
Julia Version 1.9.0-beta4
Commit b75ddb787ff (2023-02-07 21:53 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin21.5.0)
  CPU: 12 × Apple M2 Max
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, apple-m1)
  Threads: 8 on 8 virtual cores
Environment:
  JULIA_IMAGE_THREADS = 1
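
To put a number on "significantly better", the gain can be computed from the two runs quoted above. This is only a quick sketch: the figures are copied from the outputs above, and `relative_gain` is a helper name made up for the illustration.

```julia
# Figures copied from the two runs above (Julia 1.8.5 vs 1.9.0-beta4).
gflops_18, gflops_19 = 27.5, 33.5
bw_18, bw_19 = 132.0, 161.0   # GB/s

# Relative gain in percent; `relative_gain` is a hypothetical helper for this sketch.
relative_gain(old, new) = 100 * (new - old) / old

println("GFLOPS gain:    ", round(relative_gain(gflops_18, gflops_19); digits=1), " %")
println("Bandwidth gain: ", round(relative_gain(bw_18, bw_19); digits=1), " %")
```

Both figures come out at roughly 22%, taken from a single run of each version, so they should be read as indicative only.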

Vitaliy_Yakovchuk · February 18, 2023, 5:21pm

And the following benchmark:

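using BenchmarkTools

n = 500000;
x = rand(n);
y = zeros(n);

# Multi-threaded version: iterations are split across the available threads.
function threaded_exp!(y, x)
    Threads.@threads for i in eachindex(x)
        @inbounds y[i] = @inline exp(x[i])
    end
end

# Single-threaded reference version.
function sequential_exp!(y, x)
    for i in eachindex(x)
        @inbounds y[i] = @inline exp(x[i])
    end
end

# Minimum elapsed time in seconds for each version.
tseq = @belapsed sequential_exp!(y, x)
tmt  = @belapsed threaded_exp!(y, x)

# Speedup of the threaded version over the sequential one.
SpUp = tseq/tmt; Threads.nthreads()

@show tseq,tmt,SpUp;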

gives me:

(tseq, tmt, SpUp) = (0.000863083, 0.000123042, 7.014539750654248)
(0.000863083, 0.000123042, 7.014539750654248)

julia> versioninfo()
Julia Version 1.8.5
Commit 17cfb8e65e* (2023-01-08 06:45 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin22.1.0)
  CPU: 12 × Apple M2 Max
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, apple-m1)
  Threads: 8 on 8 virtual cores
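
As a rough reading of these numbers, the speedup can be turned into a parallel efficiency and a throughput. The sketch below only reuses the timings and thread count reported above; the variable names are chosen for the illustration.

```julia
# Timings in seconds, copied from the output above.
tseq, tmt = 0.000863083, 0.000123042
nthreads_used = 8          # "Threads: 8" in versioninfo()
n = 500_000                # number of exp evaluations per call

speedup    = tseq / tmt                 # same quantity as SpUp above
efficiency = speedup / nthreads_used    # fraction of ideal linear scaling
throughput = n / tmt                    # exp evaluations per second (threaded)

println("speedup    = ", round(speedup; digits=2))
println("efficiency = ", round(100 * efficiency; digits=1), " %")
println("throughput = ", round(throughput / 1e9; digits=2), " G exp/s")
```

With these figures the efficiency is a bit under 90% on 8 threads, which is close to linear scaling for this loop.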

LaurentPlagne · February 18, 2023, 5:28pm

Thanks for the feedback: the bandwidth increases again!

I would launch the SingleSpring.jl test twice to ensure that no compilation is included in the timing.
Is the GLMakie animation smooth?
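
The "run it twice" advice can be sketched as follows. The `workload` function is a hypothetical stand-in for the SingleSpring.jl computation, which is not shown in this thread; the point is only that the first call of a function pays for JIT compilation while later calls do not.

```julia
# Hypothetical stand-in workload (not the SingleSpring.jl code).
function workload(n)
    s = 0.0
    for i in 1:n
        s += sin(i)
    end
    return s
end

t_first  = @elapsed workload(10^6)  # first call: includes JIT compilation of `workload`
t_second = @elapsed workload(10^6)  # second call: already compiled, run time only

println("first call:  ", t_first, " s")
println("second call: ", t_second, " s")
```

For small kernels, `@belapsed` or `@btime` from BenchmarkTools handle the warm-up automatically by running the expression several times and reporting the minimum.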

Vitaliy_Yakovchuk · February 18, 2023, 5:43pm

Yes, it is smooth.

I’m working with more realistic Julia code, which generally uses a lot of memory, and it runs 2x-4x faster than on my 2019 Intel i9 MacBook Pro (I have no rigorous benchmark to prove it). It is CPU only, mostly Float32 Flux operations, and the Intel i9 MacBook gets significantly hotter and noisier under this workload.

ndinsmore · March 11, 2023, 5:01pm

@Ronis_BR can you share what tools you are using to take advantage of the shared memory?

Ronis_BR · March 11, 2023, 5:02pm

Hi @ndinsmore!

Nothing special, just Metal.jl. The only important thing is to create the arrays memory-aligned so that the same memory region can be used by both the CPU and the GPU.

Aakhash_Sundaresan · November 10, 2023, 8:45pm

Just put the Threads.@threads macro before the for loop that you want to parallelize in your code… Be mindful of data races: initialize the arrays with the appropriate size beforehand, and have each iteration of the multi-threaded loop write only to its own memory location, i.e. its own array element.
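
A minimal sketch of this advice, under the assumption that each iteration can be made independent. The function names `square_all!` and `threaded_sum` are made up for the illustration. The first function writes one element per iteration; the second shows one way to do a reduction without a shared accumulator, by giving each chunk its own preallocated slot.

```julia
using Base.Threads

# Safe: y is preallocated and iteration i writes only to y[i].
function square_all!(y, x)
    @threads for i in eachindex(x, y)
        @inbounds y[i] = x[i]^2
    end
    return y
end

# Safe reduction: one preallocated slot per chunk, combined at the end.
# (Accumulating into a single shared variable inside the @threads loop,
# e.g. `total += x[i]`, would be a data race.)
function threaded_sum(x; nchunks = nthreads())
    chunksize = max(1, cld(length(x), nchunks))
    chunks = collect(Iterators.partition(eachindex(x), chunksize))
    partial = zeros(eltype(x), length(chunks))
    @threads for c in eachindex(chunks)
        s = zero(eltype(x))
        for i in chunks[c]
            s += x[i]
        end
        partial[c] = s   # each iteration writes only its own slot
    end
    return sum(partial)
end

x = rand(1_000)
y = similar(x)
square_all!(y, x)
println(threaded_sum(x) ≈ sum(x))
```

Julia needs to be started with several threads (for example `julia -t 8`) for the loops to actually run in parallel; with a single thread the code still gives the same results.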
