# Taking advantage of Apple M1?

Can you give me an example of multi-threading? It is my first time with more than one core hahaha

Hi Romain,

This observation is a bit old (July 2022), and I was not able to reproduce the issue after updating my environment (Julia 1.8.5, Ventura, and up-to-date Julia packages). Makie OpenGL animations run smoothly now.

My issue concerning Metal.jl (slow broadcast copy in 2D · Issue #41 · JuliaGPU/Metal.jl · GitHub) remains, though.

1 Like

Well, multi-threading is not a trivial subject, and I guess you will have to study the documentation (Multi-Threading · The Julia Language), especially the section on data races (Multi-Threading · The Julia Language).
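
To make the data-race warning concrete, here is a minimal sketch (my own illustration, not from the manual): several tasks doing an unsynchronized read-modify-write on a shared variable produce a wrong, run-dependent result, while an `Atomic` accumulator stays correct.

```julia
using Base.Threads

# Racy: unsynchronized read-modify-write on a shared variable.
function racy_sum(n)
    s = 0
    @threads for i in 1:n
        s += i  # data race: with >1 thread, updates get lost
    end
    return s
end

# Safe: atomic updates serialize the read-modify-write.
function atomic_sum(n)
    acc = Atomic{Int}(0)
    @threads for i in 1:n
        atomic_add!(acc, i)
    end
    return acc[]
end

atomic_sum(10^5) == sum(1:10^5)  # true regardless of thread count
```

An atomic per iteration is slow; in real code you would give each task its own partial sum and reduce at the end, but the atomic version is the shortest correct fix.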

Here is a trivial example though.

``````
using BenchmarkTools

function sequential_exp!(y,x)
    for i in eachindex(x)
        @inbounds y[i]=exp(x[i])
    end
end

function threaded_exp!(y,x)
    Threads.@threads for i in eachindex(x)
        @inbounds y[i]=exp(x[i])
    end
end

n=500000
x=rand(n)
y=zeros(n)

tseq = @belapsed sequential_exp!(y,x)
tmt = @belapsed threaded_exp!(y,x)
SpUp = tseq/tmt
@show tseq,tmt,SpUp
``````

I obtain (M1 Max):

``````
julia> include("testmt.jl")
(tseq, tmt, SpUp) = (0.001626458, 0.000223959, 7.262302475006585)
(0.001626458, 0.000223959, 7.262302475006585)
``````
1 Like

Okay, now I can’t resist. Here is what I get with 8 threads on my 14nm Intel desktop:

``````
julia> using BenchmarkTools

julia> n=500000;

julia> x=rand(n);

julia> y=zeros(n);

julia> function threaded_exp!(y,x)
           Threads.@threads for i in eachindex(x)
               @inbounds y[i]=@inline exp(x[i])
           end
       end
threaded_exp! (generic function with 1 method)

julia> function sequential_exp!(y,x)
           for i in eachindex(x)
               @inbounds y[i]=@inline exp(x[i])
           end
       end
sequential_exp! (generic function with 1 method)

julia> tseq = @belapsed sequential_exp!(y,x)
0.000430829

julia> tmt = @belapsed threaded_exp!(y,x)
6.9831e-5

julia> Threads.nthreads()
8

julia> SpUp = tseq/tmt;

julia> @show tseq,tmt,SpUp;
(tseq, tmt, SpUp) = (0.000430829, 6.9831e-5, 6.1695951654709225)

julia> versioninfo()
Julia Version 1.10.0-DEV.410
Commit 62c1137e38* (2023-01-21 04:16 UTC)
Platform Info:
  OS: Linux (x86_64-redhat-linux)
  CPU: 36 × Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
``````

The most important bit:

``````
julia> @show tseq,tmt,SpUp;
(tseq, tmt, SpUp) = (0.000430829, 6.9831e-5, 6.1695951654709225)
``````

You can try rerunning with the `@inline`s like I have above.
Sure, I only got a 6.2x speedup, but

``````
julia> (0.001626458, 0.000223959) ./ (0.000430829, 6.9831e-5)
(3.775182264889318, 3.2071572797181767)
``````

far from being smoked, my old Intel CPU was over 3x faster on both 1 thread and 8 threads.

To make the M1 look good, you need benchmarks that either aren’t SIMD-friendly or are memory-bandwidth constrained.
To be fair, most code probably falls into one of those two categories, including your benchmark if we remove the `@inline`.
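
A quick way to see the distinction (my sketch, not Chris’s code): the `exp` kernel does real arithmetic per element, so SIMD width matters, while a plain copy moves the same bytes with almost no FLOPs and is limited purely by memory bandwidth.

```julia
using BenchmarkTools

n = 500_000
x = rand(n); y = zeros(n)

# Compute-bound and SIMD-friendly: real arithmetic on every element.
function expkernel!(y, x)
    @inbounds for i in eachindex(x)
        y[i] = exp(x[i])
    end
end

# Memory-bandwidth-bound: same traffic, almost no arithmetic, so wider
# SIMD units buy little; memory speed decides the winner.
function copykernel!(y, x)
    @inbounds for i in eachindex(x)
        y[i] = x[i]
    end
end

texp  = @belapsed expkernel!($y, $x)
tcopy = @belapsed copykernel!($y, $x)
```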

4 Likes

Thanks Chris, this is very interesting!
I am very surprised that the `@inline` makes such a difference here!

M1 Max (unplugged):

``````
Threads.nthreads() = 8
(tseq, tmt, SpUp) = (0.000936084, 0.000131709, 7.10721363004806)
(0.000936084, 0.000131709, 7.10721363004806)
``````

I agree that this example was not chosen to emphasize the M1’s strengths; it is just the simplest multi-threading example (showing a convincing speedup) I could think of.
I will try to find a better one.

1 Like

Here is a memory-bound example (wave propagation with a low-order finite-difference scheme) with a Makie (GLMakie) animation running during the computations.

SingleSpring.jl
``````
using GLMakie

Base.@kwdef struct SpringParam
    ls::Float64 # Spring length
    ms::Float64 # Mass
    ks::Float64 # Spring strength
    ns::Int     # Mass number
end

Base.@kwdef struct InitParam
    λ::Float64      # Wavelength of the initial deformation
    shift::Float64  # Maximal initial node displacement
    pos::Float64    # Central position of the node displacement
end

Base.@kwdef struct AnimParam
    δt::Float64      # Time step for the dynamic simulation
    nδt::Int         # Number of timesteps
    nδtperframe::Int # Number of timesteps per animation frame
end

steady_position(i, ls) = (i - 1) * ls

function initial_position(i,j,ls,λ,shift,posx,posy)
    xs=steady_position(i,ls)
    ys=steady_position(j,ls)
    dx=xs-posx
    dy=ys-posy
    dr=sqrt(dx^2+dy^2)
    ds=(-1/λ^2)*exp(-(dr/λ)^2/2)*shift
    xs+dx*ds,ys+dy*ds
end

function display_perf(td,xc,nδt)
    nflops=20 # Manual count of floating point ops
    nRW=12    # Manual count of array reads and writes
    T=eltype(xc) # i.e. Float64
    floatsize_inbytes=sizeof(T) # Float64 -> 8 bytes
    nbytes=nRW*floatsize_inbytes # Float64 -> 96 bytes

    mupns=length(xc)*nδt/(td*1.e9) # mass pos updates per ns
    println("$(round((mupns*nflops),sigdigits=3))\t GFLOPS")
    println("$(round((mupns*nbytes),sigdigits=3))\t GB/s")
end

# V is the type of the returned arrays (types are first-class citizens!)
function initialize_arrays(sp,ip,V)
    (;ls,ns) = sp
    (;λ,shift,pos) = ip

    # 2D array comprehensions
    #  -> cxs,cys,cxc,cyc are 2D arrays of size (ns,ns)
    cxs=[steady_position(i,ls) for i ∈ 1:ns, _ ∈ 1:ns]
    cys=[steady_position(j,ls) for _ ∈ 1:ns, j ∈ 1:ns]
    cxc=[initial_position(i,j,ls,λ,shift,pos,pos)[1] for i ∈ 1:ns, j ∈ 1:ns]
    cyc=[initial_position(i,j,ls,λ,shift,pos,pos)[2] for i ∈ 1:ns, j ∈ 1:ns]

    # Broadcast of the V ctor, equivalent to xs=V(cxs);ys=V(cys);xc=V(cxc);yc=V(cyc)
    xs,ys,xc,yc=V.((cxs,cys,cxc,cyc))
    # Assume zero initial velocities: xp(revious)=xc(urrent)
    fx,xt,xp=zero(xc),zero(xc),copy(xc)
    fy,yt,yp=zero(yc),zero(yc),copy(yc)

    xs,xc,xp,xt,fx,ys,yc,yp,yt,fy
end

function update_position!(xt::Array{T,2},xc::Array{T,2},xp::Array{T,2},
                          fx::Array{T,2},δt,ms) where {T}
    nsx,nsy=size(xc)
    coef = T(δt^2 / ms)
    t2 = T(2)
    Threads.@threads for j ∈ 1:nsy
        @inbounds for i ∈ 1:nsx
            xt[i,j] = t2*xc[i,j]-xp[i,j]+fx[i,j]*coef
        end
    end
end

function update_force!(fx::Array{T,2},xc::Array{T,2},ks) where {T}
    nsx,nsy=size(xc)
    mtks=T(-ks)
    t4=T(4)
    Threads.@threads for j ∈ 2:nsy-1
        @inbounds for i ∈ 2:nsx-1
            fx[i,j] = mtks*(t4*xc[i,j]
                            -xc[i,j-1]
                            -xc[i-1,j]
                            -xc[i+1,j]
                            -xc[i,j+1])
        end
    end
end

function density_shift!(dc,xc,yc,ls)
    ns=size(xc,1) # assuming a square array
    ds=1/(2ls)^2  # steady-state density, subtracted so the plot is centered on zero
    r0,r1,r2=1:ns-2,2:ns-1,3:ns # UnitRanges
    @views @. dc[r1,r1] = 1/((xc[r2,r1] - xc[r0,r1])*(yc[r1,r2] - yc[r1,r0])) - ds
end

function setup_scene(xaxis,dc,nf)
    # scene = Scene(resolution = (1500, 1500))
    dcnode = Observable(dc)

    xM=maximum(xaxis)
    dM=maximum(dc)
    fig = Figure(resolution = (2000, 1400), fontsize = 22)
    ax = Axis3(fig[1, 1], limits=(0,xM,0,xM,-dM/8,dM/8))

    # scale!(ax, 1, 1, 0.1*xM/dM)
    # translate_cam!(scene,(10,10,2))
    # update_cam!(scene, Vec3f0(0, 0, 5), Vec3f0(0.01, 0.01, 0))
    # rotate_cam!(scene,(π/4 ,-π/2 ,0))
    # αstep=-(π/4)/(nf÷2)
    # lift(d->rotate_cam!(scene,(0 ,αstep ,0)),dcnode)

    fig,dcnode
end

# Triscale colors ;)

function advance_nδtpf(xc, xp, xt, fx, sp, ap)
    (;ms, ks) = sp
    (;δt, nδtperframe) = ap

    for _ ∈ 1:nδtperframe # Anonymous iter variable (_)
        update_force!(fx, xc, ks)
        update_position!(xt, xc, xp, fx, δt, ms)
        xc, xp, xt = xt, xc, xp # Julia's swap
    end
    xc, xp, xt
end

function animate_makie(sp,ip,ap,V)
    (;δt,nδt,nδtperframe) = ap
    xs,xc,xp,xt,fx,ys,yc,yp,yt,fy=initialize_arrays(sp,ip,V)

    dc=zero(xc)
    cdc=Array(xc)
    density_shift!(dc,xc,yc,sp.ls)
    @. cdc = dc
    xaxis=Array(xs[:,1])
    yaxis=Array(ys[1,:])
    nf=nδt÷nδtperframe

    scene,dcnode=setup_scene(xaxis,cdc,nf)
    display(scene)

    t=0.0
    tdynamic=0.0
    for _ ∈ 1:nf
        tdynamic += @elapsed begin
            xc,xp,xt = advance_nδtpf(xc, xp, xt, fx, sp, ap)
            yc,yp,yt = advance_nδtpf(yc, yp, yt, fy, sp, ap)
        end
        density_shift!(dc,xc,yc,sp.ls)
        dcnode[] = dc
        sleep(0.01)
        t+=nδtperframe*δt
    end

    display_perf(tdynamic,xc,nδt) # Display GFLOPS and GB/s
end

function go()
    sp=SpringParam(ls=0.1,ms=1,ks=2.5,ns=1000)
    ip=InitParam(λ=3.0,shift=0.1,pos=sp.ls*sp.ns/2)
    ap=AnimParam(δt=0.2,nδt=5000,nδtperframe=20)

    V=Array{Float64,2}

    @time animate_makie(sp,ip,ap,V)
end
go()
``````

I obtain (M1 Max):

``````
julia> include("SingleSpring.jl")
21.1     GFLOPS
101.0    GB/s
8.506742 seconds (1.46 M allocations: 1.164 GiB, 0.86% gc time, 1.18% compilation time)
``````

with a final image:

8 Likes

Can we start a GoFundMe to get @Elrod an M2 MacBook Pro so he can use some of his wizardry to make it sing?

5 Likes

Maybe it’s worth mentioning that this:

``````
julia> @show tseq,tmt,SpUp;
(tseq, tmt, SpUp) = (0.000943125, 0.000267958, 3.5196747251434926)
``````

was run on an M1 without a fan, chilling at around 27 °C (7 °C above room temperature) right now with two 4K displays (via DisplayLink), 600 processes, and 3500 active threads, with a load average of around 4.

…but yes, roasting is something else

2 Likes

In this case, Chris’s wizard tricks may not improve the M1’s results.

I guess that the remaining factor of 2 in favor of Chris’s x86 chip is related to the AVX (256-bit-wide) instructions, which (AFAIK) have no counterpart on ARM (restricted to 128-bit-wide SIMD instructions).

This is specific to purely SIMD-friendly, CPU-bound problems like my first example.

The second example (waves) is memory bound, and I would be interested to see x86 results on that one too.

1 Like

Here are results from my Windows 11 machine:

``````
julia> include("SingleSpring.jl")
4.57     GFLOPS
22.0     GB/s
27.811275 seconds (1.47 M allocations: 1.182 GiB, 1.06% gc time, 0.68% compilation time: 2% of which was recompilation)

julia> versioninfo()
Julia Version 1.8.5
Commit 17cfb8e65e (2023-01-08 06:45 UTC)
Platform Info:
OS: Windows (x86_64-w64-mingw32)
CPU: 8 × Intel(R) Core(TM) i7-9700 CPU @ 3.00GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-13.0.1 (ORCJIT, skylake)
Threads: 8 on 8 virtual cores
Environment:
JULIA_EDITOR = code
``````
3 Likes

Thanks!

In this case you see the benefit of the M1’s architecture, with its high-speed unified memory: the MacBook Pro is 5 times faster.

2 Likes

Impressive! But does the timing also depend on the relative speed of the graphics cards in the two machines?

Nope. Pure CPU timings.

The graphics part (OpenGL rendering) is not measured by the GFLOPS and GB/s metrics here.

Note that the final time (in seconds) should not be taken too seriously, since I included a `sleep(0.01)` to limit the animation frame rate.

1 Like

I am surprised to see that my Mac with an i5 Ice Lake CPU beats the M1 by a hefty margin. It is more like an M1 Pro. I guess they never upgraded it to Tiger Lake because the difference vs. Apple silicon would have been quite small.

``````
julia> tseq = @belapsed sequential_exp!(y,x)
0.000633398

julia> tmt = @belapsed threaded_exp!(y,x)
0.000176527

julia> Threads.nthreads()
8

julia> SpUp = tseq/tmt;

julia> @show tseq,tmt,SpUp;
(tseq, tmt, SpUp) = (0.000633398, 0.000176527, 3.58810833470234)

julia> versioninfo()
Julia Version 1.8.0
Commit 5544a0fab76 (2022-08-17 13:38 UTC)
Platform Info:
OS: macOS (x86_64-apple-darwin21.4.0)
CPU: 8 × Intel(R) Core(TM) i5-1038NG7 CPU @ 2.00GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-13.0.1 (ORCJIT, icelake-client)
Threads: 8 on 8 virtual cores
Environment:
``````

I get about the same performance on 8x 14nm Intel cores.

``````
julia> include("/home/chriselrod/Documents/progwork/julia/benchmarks/SingleSpring.jl");
21.3     GFLOPS
102.0    GB/s
8.639054 seconds (1.32 M allocations: 1.166 GiB, 0.95% gc time, 4.84% compilation time)

julia> go()
21.8     GFLOPS
105.0    GB/s
8.459699 seconds (1.29 M allocations: 1.164 GiB, 0.30% gc time)

julia> versioninfo()
Julia Version 1.9.0-beta4
Commit b75ddb787f* (2023-02-07 21:53 UTC)
Platform Info:
OS: Linux (x86_64-redhat-linux)
CPU: 36 × Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
WORD_SIZE: 64
LIBM: libopenlibm
Threads: 8 on 36 virtual cores
``````

However, it is consuming far more power while doing it.
I do have a 7980XE at work that’s actually plugged into a monitor to measure wattage.
It’ll idle at >95 watts.

This CPU has AVX512 (as does Ice Lake), which is why they perform so well on SIMD workloads.

1 Like

We migrated to the M1 architecture for many analysis and simulation tasks. Our problem is sequential (simulation of a satellite attitude control subsystem), but we could parallelize some parts.

Just using the M1 provided a huge gain compared with our previous Intel processor, something like 3x faster. I have no idea why; I just ran the same code.

Now we are taking advantage of unified memory to allow some computations on the GPU. Previously, the cost of copying to GPU memory was so high that the gain was negligible. Now, since we can use the same memory, I was able to achieve more speed! This unified memory architecture is just amazing!
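
For readers curious what this looks like, here is a minimal Metal.jl sketch (my illustration, assuming the default shared storage mode; not the poster’s actual code):

```julia
using Metal  # Apple-silicon GPU programming; requires macOS on an M-series chip

# Metal GPUs prefer Float32. On Apple silicon the CPU and GPU share the
# same physical memory, so creating an MtlArray does not copy across a bus.
x = MtlArray(rand(Float32, 10^6))
y = similar(x)

y .= exp.(x)       # broadcast compiles to a Metal GPU kernel
result = Array(y)  # bring the results back as a plain CPU Array
```

The point of the post above is exactly that the `MtlArray`/`Array` round trip is cheap on this architecture, so even small kernels can come out ahead.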

1 Like

M1 systems top out at 64 GB of memory

1 Like

128 GB with the M1 Ultra, and 96 GB with the M2s. But this is clearly a drawback of this integrated SoC architecture.

And the Apple M2 Max results (30% faster compared to the M1 Max):

``````
julia> include("SingleSpring.jl")
27.5     GFLOPS
132.0    GB/s
7.324295 seconds (1.40 M allocations: 1.162 GiB, 0.77% gc time, 1.46% compilation time)

julia> versioninfo()
Julia Version 1.8.5
Commit 17cfb8e65e* (2023-01-08 06:45 UTC)
Platform Info:
OS: macOS (arm64-apple-darwin22.1.0)
CPU: 12 × Apple M2 Max
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-13.0.1 (ORCJIT, apple-m1)
Threads: 8 on 8 virtual cores
Environment:
JULIA_EDITOR = code
``````

But what is interesting is that it is significantly faster with Julia 1.9.0-beta4:

``````
julia> include("SingleSpring.jl")
33.5	 GFLOPS
161.0	 GB/s
6.690864 seconds (1.30 M allocations: 1.165 GiB, 0.87% gc time, 4.50% compilation time)

julia> versioninfo()
Julia Version 1.9.0-beta4
Commit b75ddb787ff (2023-02-07 21:53 UTC)
Platform Info:
OS: macOS (arm64-apple-darwin21.5.0)
CPU: 12 × Apple M2 Max
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-14.0.6 (ORCJIT, apple-m1)
Threads: 8 on 8 virtual cores
Environment:
``````
2 Likes

And the `exp` benchmark

``````
using BenchmarkTools

n=500000;
x=rand(n);
y=zeros(n);

function threaded_exp!(y,x)
    Threads.@threads for i in eachindex(x)
        @inbounds y[i]=@inline exp(x[i])
    end
end

function sequential_exp!(y,x)
    for i in eachindex(x)
        @inbounds y[i]=@inline exp(x[i])
    end
end

tseq = @belapsed sequential_exp!(y,x)
tmt = @belapsed threaded_exp!(y,x)
SpUp = tseq/tmt

@show tseq,tmt,SpUp;
``````

gives me:

``````
(tseq, tmt, SpUp) = (0.000863083, 0.000123042, 7.014539750654248)
(0.000863083, 0.000123042, 7.014539750654248)

julia> versioninfo()
Julia Version 1.8.5
Commit 17cfb8e65e* (2023-01-08 06:45 UTC)
Platform Info:
OS: macOS (arm64-apple-darwin22.1.0)
CPU: 12 × Apple M2 Max
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-13.0.1 (ORCJIT, apple-m1)
Threads: 8 on 8 virtual cores
``````
1 Like