128 GB with the M1 Ultra and 96 GB with the M2 Max. But this is clearly a drawback of this integrated SoC architecture.
And Apple M2 Max results (30% faster compared to M1 Max):
julia> include("SingleSpring.jl")
27.5 GFLOPS
132.0 GB/s
7.324295 seconds (1.40 M allocations: 1.162 GiB, 0.77% gc time, 1.46% compilation time)
julia> versioninfo()
Julia Version 1.8.5
Commit 17cfb8e65e* (2023-01-08 06:45 UTC)
Platform Info:
OS: macOS (arm64-apple-darwin22.1.0)
CPU: 12 × Apple M2 Max
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-13.0.1 (ORCJIT, apple-m1)
Threads: 8 on 8 virtual cores
Environment:
JULIA_EDITOR = code
But what is interesting is that it is significantly faster with Julia 1.9.0-beta4:
julia> include("SingleSpring.jl")
33.5 GFLOPS
161.0 GB/s
6.690864 seconds (1.30 M allocations: 1.165 GiB, 0.87% gc time, 4.50% compilation time)
julia> versioninfo()
Julia Version 1.9.0-beta4
Commit b75ddb787ff (2023-02-07 21:53 UTC)
Platform Info:
OS: macOS (arm64-apple-darwin21.5.0)
CPU: 12 × Apple M2 Max
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-14.0.6 (ORCJIT, apple-m1)
Threads: 8 on 8 virtual cores
Environment:
JULIA_IMAGE_THREADS = 1
and the following snippet
using BenchmarkTools
n = 500_000;
x = rand(n);
y = zeros(n);
# Multi-threaded elementwise exp: each iteration writes only its own y[i],
# so the loop is safe to parallelize.
function threaded_exp!(y, x)
    Threads.@threads for i in eachindex(x)
        @inbounds y[i] = @inline exp(x[i])
    end
end
# Single-threaded reference version.
function sequential_exp!(y, x)
    for i in eachindex(x)
        @inbounds y[i] = @inline exp(x[i])
    end
end
tseq = @belapsed sequential_exp!(y, x)
tmt  = @belapsed threaded_exp!(y, x)
SpUp = tseq / tmt; Threads.nthreads()
@show tseq, tmt, SpUp;
gives me:
(tseq, tmt, SpUp) = (0.000863083, 0.000123042, 7.014539750654248)
julia> versioninfo()
Julia Version 1.8.5
Commit 17cfb8e65e* (2023-01-08 06:45 UTC)
Platform Info:
OS: macOS (arm64-apple-darwin22.1.0)
CPU: 12 × Apple M2 Max
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-13.0.1 (ORCJIT, apple-m1)
Threads: 8 on 8 virtual cores
Thanks for the feedback: the bandwidth increases again!
I would launch the SingleSpring.jl test twice to ensure that no compilation is included in the timing.
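For example, something like this, keeping the second run's numbers (this assumes that re-including the file merely re-runs the benchmark; if the script redefines its functions on each include, that would trigger fresh compilation):

julia> include("SingleSpring.jl")   # warm-up run: JIT compilation included
julia> include("SingleSpring.jl")   # keep these numbers; the @time line
                                    # should now show ~0% compilation time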
Is the GLMakie animation smooth?
Yes, it is smooth.
I’m working with more realistic Julia code, which generally uses a lot of memory and runs 2x-4x faster than on my 2019 Intel i9 MacBook Pro (without rigorous benchmarking). Just CPU, mostly Float32 Flux operations, and the Intel i9 MacBook is significantly hotter and noisier under the same load.
@Ronis_BR can you share what tools you are using to take advantage of the shared memory?
Hi @ndinsmore !
Nothing special, just Metal.jl. The only important thing is creating the arrays memory-aligned so that you can use the same memory region on both the CPU and the GPU.
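As a rough illustration of the idea, here is a minimal sketch. Note that the storage keyword (Metal.SharedStorage here) and the unsafe_wrap method are my assumptions about recent Metal.jl versions; the exact spellings have changed between releases, so check the Metal.jl docs for your version:

using Metal

n = 4096
# Allocate the buffer with shared (unified) storage so that CPU and GPU
# see the same memory. NB: the storage keyword is an assumption about
# recent Metal.jl versions.
a = MtlArray{Float32}(undef, n; storage=Metal.SharedStorage)

# View the very same buffer as an ordinary CPU Array (no copy involved);
# this unsafe_wrap method is also version-dependent.
a_cpu = unsafe_wrap(Array{Float32}, a)

a_cpu .= 1.0f0   # write from the CPU...
b = 2 .* a       # ...then compute on the GPU with the same data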
Just use the Threads.@threads macro before a for loop that you want to parallelize in your code… Be mindful of the data-race problem: initialize the arrays to the appropriate size up front, and have each thread write only to its own elements of the array, as sketched below.
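A minimal sketch of that pattern (the function name is just for this example):

using Base.Threads

# Preallocate the output so every thread writes to a distinct slot:
# y[i] is touched only by the iteration that owns index i, so there is
# no data race even though the loop runs on several threads.
function threaded_square!(y, x)
    @threads for i in eachindex(x, y)
        @inbounds y[i] = x[i]^2
    end
    return y
end

x = rand(10^6)
y = similar(x)          # appropriate size, allocated up front
threaded_square!(y, x)

# By contrast, accumulating into a single shared variable inside the
# loop (e.g. s += x[i]) WOULD race; use a per-thread reduction instead.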