# Taking advantage of Apple M1?

Can you give me an example of multi-threading? It is my first time with more than one core hahaha

Hi Romain,

This observation is a bit old (July 2022), and I was not able to reproduce the issue after updating my environment (Julia 1.8.5, Ventura, and up-to-date Julia packages). Makie OpenGL animations run smoothly now.

My issue concerning Metal.jl (slow broadcast copy in 2D · Issue #41 · JuliaGPU/Metal.jl · GitHub) remains, though.

1 Like

Well, multi-threading is not a trivial subject, and I guess you will have to study the documentation (Multi-Threading · The Julia Language), especially the section on data races (Multi-Threading · The Julia Language).
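
To make the data-race warning concrete, here is a minimal sketch (my own illustration, not from the manual): several tasks doing an unsynchronized read-modify-write on a shared variable produce a wrong, run-dependent result, while an `Atomic` accumulator stays correct.

```julia
using Base.Threads

# Racy: unsynchronized read-modify-write on a shared variable.
function racy_sum(n)
    s = 0
    @threads for i in 1:n
        s += i  # data race: with >1 thread, updates get lost
    end
    return s
end

# Safe: atomic updates serialize the read-modify-write.
function atomic_sum(n)
    acc = Atomic{Int}(0)
    @threads for i in 1:n
        atomic_add!(acc, i)
    end
    return acc[]
end

atomic_sum(10^5) == sum(1:10^5)  # true regardless of thread count
```

An atomic per iteration is slow; in real code you would give each task its own partial sum and reduce at the end, but the atomic version is the shortest correct fix.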

Here is a trivial example though.

``````
using BenchmarkTools

function sequential_exp!(y,x)
    for i in eachindex(x)
        @inbounds y[i]=exp(x[i])
    end
end

function threaded_exp!(y,x)
    Threads.@threads for i in eachindex(x)
        @inbounds y[i]=exp(x[i])
    end
end

n=500000
x=rand(n)
y=zeros(n)

tseq = @belapsed sequential_exp!(y,x)
tmt = @belapsed threaded_exp!(y,x)
SpUp = tseq/tmt
@show tseq,tmt,SpUp
``````

I obtain (M1 Max):

``````
julia> include("testmt.jl")
(tseq, tmt, SpUp) = (0.001626458, 0.000223959, 7.262302475006585)
(0.001626458, 0.000223959, 7.262302475006585)
``````
1 Like

Okay, now I can’t resist. Here is what I get with 8 threads on my 14nm Intel desktop:

``````
julia> using BenchmarkTools

julia> n=500000;

julia> x=rand(n);

julia> y=zeros(n);

julia> function threaded_exp!(y,x)
           Threads.@threads for i in eachindex(x)
               @inbounds y[i]=@inline exp(x[i])
           end
       end
threaded_exp! (generic function with 1 method)

julia> function sequential_exp!(y,x)
           for i in eachindex(x)
               @inbounds y[i]=@inline exp(x[i])
           end
       end
sequential_exp! (generic function with 1 method)

julia> tseq = @belapsed sequential_exp!(y,x)
0.000430829

julia> tmt = @belapsed threaded_exp!(y,x)
6.9831e-5

julia> Threads.nthreads()
8

julia> SpUp = tseq/tmt;

julia> @show tseq,tmt,SpUp;
(tseq, tmt, SpUp) = (0.000430829, 6.9831e-5, 6.1695951654709225)

julia> versioninfo()
Julia Version 1.10.0-DEV.410
Commit 62c1137e38* (2023-01-21 04:16 UTC)
Platform Info:
  OS: Linux (x86_64-redhat-linux)
  CPU: 36 × Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
``````

The most important bit:

``````
julia> @show tseq,tmt,SpUp;
(tseq, tmt, SpUp) = (0.000430829, 6.9831e-5, 6.1695951654709225)
``````

You can try rerunning with the `@inline`s like I have above.
Sure, I only got a 6.2x speedup, but

``````
julia> (0.001626458, 0.000223959) ./ (0.000430829, 6.9831e-5)
(3.775182264889318, 3.2071572797181767)
``````

far from being smoked, my old Intel CPU was over 3x faster on both 1 thread and 8 threads.

To make the M1 look good, you need benchmarks that either aren’t SIMD-friendly or are memory-bandwidth constrained.
To be fair, most code probably falls into one of those two categories, including your benchmark if we remove the `@inline`.
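
A quick way to see the distinction (my sketch, not Chris’s code): the `exp` kernel does real arithmetic per element, so SIMD width matters, while a plain copy moves the same bytes with almost no FLOPs and is limited purely by memory bandwidth.

```julia
using BenchmarkTools

n = 500_000
x = rand(n); y = zeros(n)

# Compute-bound and SIMD-friendly: real arithmetic on every element.
function expkernel!(y, x)
    @inbounds for i in eachindex(x)
        y[i] = exp(x[i])
    end
end

# Memory-bandwidth-bound: same traffic, almost no arithmetic, so wider
# SIMD units buy little; memory speed decides the winner.
function copykernel!(y, x)
    @inbounds for i in eachindex(x)
        y[i] = x[i]
    end
end

texp  = @belapsed expkernel!($y, $x)
tcopy = @belapsed copykernel!($y, $x)
```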

4 Likes

Thanks Chris, this is very interesting!
I am very surprised that the `@inline` makes such a difference here!

M1 Max (unplugged):

``````
Threads.nthreads() = 8
(tseq, tmt, SpUp) = (0.000936084, 0.000131709, 7.10721363004806)
(0.000936084, 0.000131709, 7.10721363004806)
``````

I agree that this example was not chosen to emphasize the M1’s strengths; it is just the simplest multi-threading example (showing a convincing speedup) I could think of.
I will try to find a better one.

1 Like

Here is a memory-bound example (wave propagation with a low-order finite-difference scheme) with a Makie (GLMakie) animation running during the computations.

SingleSpring.jl
``````
using GLMakie

Base.@kwdef struct SpringParam
    ls::Float64 # Spring length
    ms::Float64 # Mass
    ks::Float64 # Spring strength
    ns::Int     # Mass number
end

Base.@kwdef struct InitParam
    λ::Float64      # Wavelength of the initial deformation
    shift::Float64  # Maximal initial node displacement
    pos::Float64    # Central position of the node displacement
end

Base.@kwdef struct AnimParam
    δt::Float64      # Time step for the dynamic simulation
    nδt::Int         # Number of timesteps
    nδtperframe::Int # Number of timesteps per animation frame
end

steady_position(i, ls) = (i - 1) * ls

function initial_position(i,j,ls,λ,shift,posx,posy)
    xs=steady_position(i,ls)
    ys=steady_position(j,ls)
    dx=xs-posx
    dy=ys-posy
    dr=sqrt(dx^2+dy^2)
    ds=(-1/λ^2)*exp(-(dr/λ)^2/2)*shift
    xs+dx*ds,ys+dy*ds
end

function display_perf(td,xc,nδt)
    nflops=20 # Manual count of floating point ops
    nRW=12    # Manual count of array reads and writes
    T=eltype(xc) # i.e. Float64
    floatsize_inbytes=sizeof(T) # Float64 -> 8 bytes
    nbytes=nRW*floatsize_inbytes # Float64 -> 96 bytes

    mupns=length(xc)*nδt/(td*1.e9) # mass pos updates per ns
    println("$(round((mupns*nflops),sigdigits=3))\t GFLOPS")
    println("$(round((mupns*nbytes),sigdigits=3))\t GB/s")
end

# V is the type of the returned arrays (types are first-class citizens!)
function initialize_arrays(sp,ip,V)
    (;ls,ns) = sp
    (;λ,shift,pos) = ip

    # 2D array comprehensions
    #  -> cxs,cys,cxc,cyc are 2D arrays of size (ns,ns)
    cxs=[steady_position(i,ls) for i ∈ 1:ns, _ ∈ 1:ns]
    cys=[steady_position(j,ls) for _ ∈ 1:ns, j ∈ 1:ns]
    cxc=[initial_position(i,j,ls,λ,shift,pos,pos)[1] for i ∈ 1:ns, j ∈ 1:ns]
    cyc=[initial_position(i,j,ls,λ,shift,pos,pos)[2] for i ∈ 1:ns, j ∈ 1:ns]

    # Broadcast of the V ctor, equivalent to xs=V(cxs);ys=V(cys);xc=V(cxc);yc=V(cyc)
    xs,ys,xc,yc=V.((cxs,cys,cxc,cyc))
    # Assume zero initial velocities: xp(revious)=xc(urrent)
    fx,xt,xp=zero(xc),zero(xc),copy(xc)
    fy,yt,yp=zero(yc),zero(yc),copy(yc)

    xs,xc,xp,xt,fx,ys,yc,yp,yt,fy
end

function update_position!(xt::Array{T,2},xc::Array{T,2},xp::Array{T,2},
                          fx::Array{T,2},δt,ms) where {T}
    nsx,nsy=size(xc)
    coef = T(δt^2 / ms)
    t2 = T(2)
    Threads.@threads for j ∈ 1:nsy
        @inbounds for i ∈ 1:nsx
            xt[i,j] = t2*xc[i,j]-xp[i,j]+fx[i,j]*coef
        end
    end
end

function update_force!(fx::Array{T,2},xc::Array{T,2},ks) where {T}
    nsx,nsy=size(xc)
    mtks=T(-ks)
    t4=T(4)
    Threads.@threads for j ∈ 2:nsy-1
        @inbounds for i ∈ 2:nsx-1
            fx[i,j] = mtks*(t4*xc[i,j]
                            -xc[i,j-1]
                            -xc[i-1,j]
                            -xc[i+1,j]
                            -xc[i,j+1])
        end
    end
end

function density_shift!(dc,xc,yc,ls)
    ns=size(xc,1) # assuming a square array
    ds=1/(2ls)^2  # steady-state density, subtracted so the plot is centered on zero
    r0,r1,r2=1:ns-2,2:ns-1,3:ns # UnitRanges
    @views @. dc[r1,r1] = 1/((xc[r2,r1] - xc[r0,r1])*(yc[r1,r2] - yc[r1,r0])) - ds
end

function setup_scene(xaxis,dc,nf)
    # scene = Scene(resolution = (1500, 1500))
    dcnode = Observable(dc)

    xM=maximum(xaxis)
    dM=maximum(dc)
    fig = Figure(resolution = (2000, 1400), fontsize = 22)
    ax = Axis3(fig[1, 1], limits=(0,xM,0,xM,-dM/8,dM/8))

    # scale!(ax, 1, 1, 0.1*xM/dM)
    # translate_cam!(scene,(10,10,2))
    # update_cam!(scene, Vec3f0(0, 0, 5), Vec3f0(0.01, 0.01, 0))
    # rotate_cam!(scene,(π/4 ,-π/2 ,0))
    # αstep=-(π/4)/(nf÷2)
    # lift(d->rotate_cam!(scene,(0 ,αstep ,0)),dcnode)

    fig,dcnode
end

# Triscale colors ;)

function advance_nδtpf(xc, xp, xt, fx, sp, ap)
    (;ms, ks) = sp
    (;δt, nδtperframe) = ap

    for _ ∈ 1:nδtperframe # Anonymous iter variable (_)
        update_force!(fx, xc, ks)
        update_position!(xt, xc, xp, fx, δt, ms)
        xc, xp, xt = xt, xc, xp # Julia's swap
    end
    xc, xp, xt
end

function animate_makie(sp,ip,ap,V)
    (;δt,nδt,nδtperframe) = ap
    xs,xc,xp,xt,fx,ys,yc,yp,yt,fy=initialize_arrays(sp,ip,V)

    dc=zero(xc)
    cdc=Array(xc)
    density_shift!(dc,xc,yc,sp.ls)
    @. cdc = dc
    xaxis=Array(xs[:,1])
    yaxis=Array(ys[1,:])
    nf=nδt÷nδtperframe

    scene,dcnode=setup_scene(xaxis,cdc,nf)
    display(scene)

    t=0.0
    tdynamic=0.0
    for _ ∈ 1:nf
        tdynamic += @elapsed begin
            xc,xp,xt = advance_nδtpf(xc, xp, xt, fx, sp, ap)
            yc,yp,yt = advance_nδtpf(yc, yp, yt, fy, sp, ap)
        end
        density_shift!(dc,xc,yc,sp.ls)
        dcnode[] = dc
        sleep(0.01)
        t+=nδtperframe*δt
    end

    display_perf(tdynamic,xc,nδt) # Display GFLOPS and GB/s
end

function go()
    sp=SpringParam(ls=0.1,ms=1,ks=2.5,ns=1000)
    ip=InitParam(λ=3.0,shift=0.1,pos=sp.ls*sp.ns/2)
    ap=AnimParam(δt=0.2,nδt=5000,nδtperframe=20)

    V=Array{Float64,2}

    @time animate_makie(sp,ip,ap,V)
end
go()
``````

I obtain (M1 Max):

``````
julia> include("SingleSpring.jl")
21.1     GFLOPS
101.0    GB/s
8.506742 seconds (1.46 M allocations: 1.164 GiB, 0.86% gc time, 1.18% compilation time)
``````

with a final image:

8 Likes

Can we start a GoFundMe to get @Elrod an M2 MacBook Pro so he can use some of his wizardry to make it sing?

5 Likes

Maybe it’s worth mentioning that this:

``````
julia> @show tseq,tmt,SpUp;
(tseq, tmt, SpUp) = (0.000943125, 0.000267958, 3.5196747251434926)
``````

was run on an M1 without a fan, chilling at around 27 °C (7 °C above room temperature) right now with two 4K displays (via DisplayLink), 600 processes, and 3500 active threads, with a load average of around 4.

…but yes, roasting is something else

2 Likes

In this case, Chris’s wizard tricks may not improve the M1’s results.

I guess that the remaining factor of 2 in favor of Chris’s x86 chip is related to the AVX (256-bit-wide) instructions, which (AFAIK) have no counterpart on ARM (restricted to 128-bit-wide SIMD instructions).

This is specific to purely SIMD-friendly, CPU-bound problems like my first example.

The second example (waves) is memory bound, and I would be interested to see x86 results on that one too.

1 Like

Here are results from my Windows 11 machine:

``````
julia> include("SingleSpring.jl")
4.57     GFLOPS
22.0     GB/s
27.811275 seconds (1.47 M allocations: 1.182 GiB, 1.06% gc time, 0.68% compilation time: 2% of which was recompilation)

julia> versioninfo()
Julia Version 1.8.5
Commit 17cfb8e65e (2023-01-08 06:45 UTC)
Platform Info:
OS: Windows (x86_64-w64-mingw32)
CPU: 8 × Intel(R) Core(TM) i7-9700 CPU @ 3.00GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-13.0.1 (ORCJIT, skylake)
Threads: 8 on 8 virtual cores
Environment:
JULIA_EDITOR = code
``````
3 Likes

Thanks!

In this case you see the benefit of the M1’s architecture, with its high-speed unified memory: the MacBook Pro is 5 times faster.

2 Likes

Impressive! But does the timing also depend on the relative speed of the graphics cards in the two machines?

Nope. Pure CPU timings.

The graphics part (OpenGL rendering) is not measured by the GFLOPS and GB/s metrics here.

Note that the final time (in seconds) should not be taken too seriously, since I included a `sleep(0.01)` to limit the animation frame rate.

1 Like

I am surprised to see that my Mac with an i5 Ice Lake CPU beats the M1 by a hefty margin. It is more like an M1 Pro. I guess they never upgraded it to Tiger Lake because the difference vs. Apple silicon would have been quite small.

``````
julia> tseq = @belapsed sequential_exp!(y,x)
0.000633398

julia> tmt = @belapsed threaded_exp!(y,x)
0.000176527

julia> Threads.nthreads()
8

julia> SpUp = tseq/tmt;

julia> @show tseq,tmt,SpUp;
(tseq, tmt, SpUp) = (0.000633398, 0.000176527, 3.58810833470234)

julia> versioninfo()
Julia Version 1.8.0
Commit 5544a0fab76 (2022-08-17 13:38 UTC)
Platform Info:
OS: macOS (x86_64-apple-darwin21.4.0)
CPU: 8 × Intel(R) Core(TM) i5-1038NG7 CPU @ 2.00GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-13.0.1 (ORCJIT, icelake-client)
Threads: 8 on 8 virtual cores
Environment:
``````

I get about the same performance on 8x 14nm Intel cores.

``````
julia> include("/home/chriselrod/Documents/progwork/julia/benchmarks/SingleSpring.jl");
21.3     GFLOPS
102.0    GB/s
8.639054 seconds (1.32 M allocations: 1.166 GiB, 0.95% gc time, 4.84% compilation time)

julia> go()
21.8     GFLOPS
105.0    GB/s
8.459699 seconds (1.29 M allocations: 1.164 GiB, 0.30% gc time)

julia> versioninfo()
Julia Version 1.9.0-beta4
Commit b75ddb787f* (2023-02-07 21:53 UTC)
Platform Info:
OS: Linux (x86_64-redhat-linux)
CPU: 36 × Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
WORD_SIZE: 64
LIBM: libopenlibm
Threads: 8 on 36 virtual cores
``````

However, it is consuming far more power while doing it.
I do have a 7980XE at work that’s actually plugged into a monitor to measure wattage.
It’ll idle at >95 watts.

This CPU has AVX512 (as does Ice Lake), which is why they perform so well on SIMD workloads.

1 Like

We migrated to the M1 architecture for many analysis and simulation tasks. Our problem is sequential (simulation of a satellite attitude control subsystem), but we could parallelize some parts.

Just using the M1 provided a huge gain compared with our previous Intel processor, something like 3x faster. I have no idea why; I just ran the same code.

Now we are taking advantage of unified memory to allow some computations on the GPU. Previously, the cost of copying to GPU memory was so high that the gain was negligible. Now, since we can use the same memory, I was able to achieve more speed! This unified memory architecture is just amazing!
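
For readers curious what this looks like, here is a minimal Metal.jl sketch (my illustration, assuming the default shared storage mode; not the poster’s actual code):

```julia
using Metal  # Apple-silicon GPU programming; requires macOS on an M-series chip

# Metal GPUs prefer Float32. On Apple silicon the CPU and GPU share the
# same physical memory, so creating an MtlArray does not copy across a bus.
x = MtlArray(rand(Float32, 10^6))
y = similar(x)

y .= exp.(x)       # broadcast compiles to a Metal GPU kernel
result = Array(y)  # bring the results back as a plain CPU Array
```

The point of the post above is exactly that the `MtlArray`/`Array` round trip is cheap on this architecture, so even small kernels can come out ahead.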

1 Like

M1 systems top out at 64 GB of memory

1 Like

128 GB with the M1 Ultra, and 96 GB with the M2s. But this is clearly a drawback of this integrated SoC architecture.

And the Apple M2 Max results (30% faster compared to the M1 Max):

``````
julia> include("SingleSpring.jl")
27.5     GFLOPS
132.0    GB/s
7.324295 seconds (1.40 M allocations: 1.162 GiB, 0.77% gc time, 1.46% compilation time)

julia> versioninfo()
Julia Version 1.8.5
Commit 17cfb8e65e* (2023-01-08 06:45 UTC)
Platform Info:
OS: macOS (arm64-apple-darwin22.1.0)
CPU: 12 × Apple M2 Max
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-13.0.1 (ORCJIT, apple-m1)
Threads: 8 on 8 virtual cores
Environment:
JULIA_EDITOR = code
``````

But what is interesting is that it is significantly faster with Julia 1.9.0-beta4:

``````
julia> include("SingleSpring.jl")
33.5	 GFLOPS
161.0	 GB/s
6.690864 seconds (1.30 M allocations: 1.165 GiB, 0.87% gc time, 4.50% compilation time)

julia> versioninfo()
Julia Version 1.9.0-beta4
Commit b75ddb787ff (2023-02-07 21:53 UTC)
Platform Info:
OS: macOS (arm64-apple-darwin21.5.0)
CPU: 12 × Apple M2 Max
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-14.0.6 (ORCJIT, apple-m1)
Threads: 8 on 8 virtual cores
Environment:
``````
2 Likes

And the `exp` benchmark

``````
using BenchmarkTools

n=500000;
x=rand(n);
y=zeros(n);

function threaded_exp!(y,x)
    Threads.@threads for i in eachindex(x)
        @inbounds y[i]=@inline exp(x[i])
    end
end

function sequential_exp!(y,x)
    for i in eachindex(x)
        @inbounds y[i]=@inline exp(x[i])
    end
end

tseq = @belapsed sequential_exp!(y,x)
tmt = @belapsed threaded_exp!(y,x)
SpUp = tseq/tmt

@show tseq,tmt,SpUp;
``````

gives me:

``````
(tseq, tmt, SpUp) = (0.000863083, 0.000123042, 7.014539750654248)
(0.000863083, 0.000123042, 7.014539750654248)

julia> versioninfo()
Julia Version 1.8.5
Commit 17cfb8e65e* (2023-01-08 06:45 UTC)
Platform Info:
OS: macOS (arm64-apple-darwin22.1.0)
CPU: 12 × Apple M2 Max
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-13.0.1 (ORCJIT, apple-m1)
Threads: 8 on 8 virtual cores
``````
1 Like