Taking advantage of Apple M1?

128 GB with the M1 Ultra and 96 GB with the M2 models. But this is clearly a drawback of this integrated SoC architecture.

And the Apple M2 Max results (30% faster compared to the M1 Max):

julia> include("SingleSpring.jl")
27.5     GFLOPS
132.0    GB/s
  7.324295 seconds (1.40 M allocations: 1.162 GiB, 0.77% gc time, 1.46% compilation time)

julia> versioninfo()
Julia Version 1.8.5
Commit 17cfb8e65e* (2023-01-08 06:45 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin22.1.0)
  CPU: 12 × Apple M2 Max
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, apple-m1)
  Threads: 8 on 8 virtual cores
Environment:
  JULIA_EDITOR = code

What is interesting is that it is significantly faster with Julia 1.9.0-beta4:

julia> include("SingleSpring.jl")
33.5	 GFLOPS
161.0	 GB/s
  6.690864 seconds (1.30 M allocations: 1.165 GiB, 0.87% gc time, 4.50% compilation time)

julia> versioninfo()
Julia Version 1.9.0-beta4
Commit b75ddb787ff (2023-02-07 21:53 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin21.5.0)
  CPU: 12 × Apple M2 Max
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, apple-m1)
  Threads: 8 on 8 virtual cores
Environment:
  JULIA_IMAGE_THREADS = 1

And the following benchmark:

using BenchmarkTools

n = 500_000
x = rand(n)
y = zeros(n)

function threaded_exp!(y, x)
    Threads.@threads for i in eachindex(x)
        @inbounds y[i] = @inline exp(x[i])
    end
end

function sequential_exp!(y, x)
    for i in eachindex(x)
        @inbounds y[i] = @inline exp(x[i])
    end
end

# Interpolate the globals with `$` so BenchmarkTools does not include
# global-variable access overhead in the measurement.
tseq = @belapsed sequential_exp!($y, $x)
tmt  = @belapsed threaded_exp!($y, $x)

SpUp = tseq / tmt; Threads.nthreads()

@show tseq, tmt, SpUp;

gives me:

(tseq, tmt, SpUp) = (0.000863083, 0.000123042, 7.014539750654248)

julia> versioninfo()
Julia Version 1.8.5
Commit 17cfb8e65e* (2023-01-08 06:45 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin22.1.0)
  CPU: 12 × Apple M2 Max
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, apple-m1)
  Threads: 8 on 8 virtual cores

Thanks for the feedback: the bandwidth increases again!

I would run the SingleSpring.jl test twice to ensure that no compilation time is included in the timing.
Is the GLMakie animation smooth?
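
A minimal warm-up pattern (my sketch, not from the thread) for excluding JIT compilation from a timing:

```julia
# Stand-in kernel for the benchmark being timed.
f(x) = sum(abs2, x)

x = rand(10^6)
f(x)               # first call compiles the method (warm-up)
t = @elapsed f(x)  # second call measures only the run itself
```

Alternatively, BenchmarkTools' `@btime`/`@belapsed` handle warm-up runs automatically.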

Yes, it is smooth.

I’m working with more realistic Julia code, which generally uses a lot of memory, and it is 2x-4x faster than my 2019 Intel i9 MacBook Pro (without rigorous measurement 🙂). This is CPU only, mostly Float32 Flux operations, and the Intel i9 MacBook is significantly hotter and noisier for this workload.


@Ronis_BR can you share what tools you are using to take advantage of the shared memory?

Hi @ndinsmore !

Nothing special, just Metal.jl. The only important thing is creating the arrays memory-aligned so that the same memory region can be used by both the CPU and the GPU.
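
A hypothetical sketch of the alignment idea (the Metal.jl-specific wrapping call is omitted; the page size and helper name below are my assumptions, not from the thread). On unified-memory SoCs, a page-aligned buffer can back both a CPU array and a GPU view:

```julia
# Apple Silicon uses 16 KB pages (assumption; query the OS in real code).
const PAGESIZE = 16384

# Allocate a page-aligned Vector{T} via posix_memalign; `own = true`
# lets Julia free the buffer with the matching `free` when the array
# is garbage-collected.
function aligned_vector(::Type{T}, n::Integer) where T
    ptr = Ref{Ptr{Cvoid}}(C_NULL)
    rc = ccall(:posix_memalign, Cint,
               (Ref{Ptr{Cvoid}}, Csize_t, Csize_t),
               ptr, PAGESIZE, n * sizeof(T))
    rc == 0 || error("posix_memalign failed with code $rc")
    unsafe_wrap(Vector{T}, Ptr{T}(ptr[]), n; own = true)
end

v = aligned_vector(Float32, 1024)
@assert UInt(pointer(v)) % PAGESIZE == 0  # buffer starts on a page boundary
```

Consult the Metal.jl documentation for the exact API to expose such a buffer to the GPU.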


Just put the Threads.@threads macro before a for loop in your code, but be mindful of data races: preallocate the arrays with the appropriate size, and have each iteration write only to its own memory locations within the array.
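
A minimal sketch of that race-free pattern (my example, not from the thread): the output is preallocated and each iteration writes a distinct element, so no two tasks ever touch the same memory.

```julia
using Base.Threads

# Race-free: iteration i reads x[i] and writes y[i] only, so the
# loop body has no shared mutable state between tasks.
function squares!(y, x)
    @threads for i in eachindex(x, y)
        @inbounds y[i] = x[i]^2
    end
    return y
end

x = collect(1.0:8.0)
y = similar(x)
squares!(y, x)   # y[i] == x[i]^2 for every i
```

Reductions (e.g. a threaded sum) are the trickier case, since they accumulate into shared state; those need per-task accumulators or a package such as OhMyThreads.jl.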
