Taking advantage of Apple M1?

Hi all.
So, I got my new laptop with an Apple M1 chip. I have heard it is great.

I wonder how you are all taking advantage of it and its cores/GPU: how you start Julia, use Metal.jl, and other tips to speed up your workflows.

All general tips, tricks and cheats are welcome :wink:

2 Likes

CPU: +++ The M1 cores are great with native ARM Julia. In particular, I noticed that multi-threading is much better without Rosetta. My M1 Pro laptop usually smokes x86 desktops without any noise :wink:
GPU: --- Metal.jl is way behind CUDA.jl. I also experienced short freezes during Makie OpenGL animations (which run smoothly with a small discrete NVIDIA card).
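
For the "how do you start Julia" part of the question, here is a minimal sketch (assuming Metal.jl is installed; the array names and sizes are just illustrative and exact output depends on your setup):

# Start Julia with several threads from a shell:
#   julia --threads=auto        # or set JULIA_NUM_THREADS before launching

julia> versioninfo()            # look for "aarch64-apple-darwin" => native ARM, not Rosetta

julia> Threads.nthreads()       # should match what you asked for

julia> using Metal              # ] add Metal first

julia> a = MtlArray(rand(Float32, 10_000));   # hand the data to the M1 GPU

julia> b = sqrt.(a) .+ 1f0;     # broadcasts run as Metal kernels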

5 Likes

Can you share some code for this, please?

1 Like

Can you give me an example of multi-threading? It is my first time with more than one core hahaha

Hi Romain,

This observation is a bit old (July 2022) and I was not able to reproduce the issue after updating my environment (Julia 1.8.5, Ventura, and up-to-date Julia packages). Makie OpenGL animations run smoothly now :partying_face:

My issue concerning Metal.jl (slow broadcast copy in 2D · Issue #41 · JuliaGPU/Metal.jl · GitHub) remains though.

1 Like

Well, multi-threading is not a trivial subject and I guess that you will have to study the documentation (Multi-Threading · The Julia Language), and especially data races (Multi-Threading · The Julia Language).
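
As a hedged illustration of the data-race pitfall (racy_sum and chunked_sum are made-up names for this sketch, not something from the docs): letting every thread mutate the same variable gives wrong answers, while giving each task its own chunk and partial result is safe.

function racy_sum(x)
    s = 0.0
    Threads.@threads for i in eachindex(x)
        s += x[i]   # data race: all threads read/modify/write `s` unsynchronized
    end
    return s        # usually wrong, and different on every run
end

function chunked_sum(x; ntasks = Threads.nthreads())
    # Each task gets its own contiguous chunk and its own partial sum.
    chunks = Iterators.partition(eachindex(x), cld(length(x), ntasks))
    tasks  = [Threads.@spawn sum(view(x, c)) for c in chunks]
    return sum(fetch.(tasks))
end

With several threads, chunked_sum(rand(10^7)) should agree with sum up to floating-point reordering, while racy_sum typically will not.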

Here is a trivial speedup example, though.

using BenchmarkTools

@show Threads.nthreads()

function sequential_exp!(y,x)
    for i in eachindex(x)
        @inbounds y[i]=exp(x[i])
    end
end

function threaded_exp!(y,x)
    Threads.@threads for i in eachindex(x)
        @inbounds y[i]=exp(x[i])
    end
end

n=500000
x=rand(n)
y=zeros(n)

tseq = @belapsed sequential_exp!(y,x)
tmt  = @belapsed threaded_exp!(y,x)
SpUp = tseq/tmt
@show tseq,tmt,SpUp

I obtain (M1 Max):

julia> include("testmt.jl")
Threads.nthreads() = 8
(tseq, tmt, SpUp) = (0.001626458, 0.000223959, 7.262302475006585)
(0.001626458, 0.000223959, 7.262302475006585)
1 Like

Okay, now I can't resist. Here is what I get with 8 threads on my 14nm Intel desktop:

julia> using BenchmarkTools

julia> n=500000;

julia> x=rand(n);

julia> y=zeros(n);

julia> function threaded_exp!(y,x)
           Threads.@threads for i in eachindex(x)
               @inbounds y[i]=@inline exp(x[i])
           end
       end
threaded_exp! (generic function with 1 method)

julia> function sequential_exp!(y,x)
           for i in eachindex(x)
               @inbounds y[i]=@inline exp(x[i])
           end
       end
sequential_exp! (generic function with 1 method)

julia> tseq = @belapsed sequential_exp!(y,x)
0.000430829

julia> tmt  = @belapsed threaded_exp!(y,x)
6.9831e-5

julia> SpUp = tseq/tmt; Threads.nthreads()
8

julia> @show tseq,tmt,SpUp;
(tseq, tmt, SpUp) = (0.000430829, 6.9831e-5, 6.1695951654709225)

julia> versioninfo()
Julia Version 1.10.0-DEV.410
Commit 62c1137e38* (2023-01-21 04:16 UTC)
Platform Info:
  OS: Linux (x86_64-redhat-linux)
  CPU: 36 × Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz

The most important bit:

julia> @show tseq,tmt,SpUp;
(tseq, tmt, SpUp) = (0.000430829, 6.9831e-5, 6.1695951654709225)

You can try rerunning with the @inlines like I have above.
Sure, I only got a 6.2x speedup, but

julia> (0.001626458, 0.000223959) ./ (0.000430829, 6.9831e-5)
(3.775182264889318, 3.2071572797181767)

far from being smoked, my old Intel CPU was over 3x faster on both 1 thread and 8 threads. :slight_smile:

To make the M1 look good, you need benchmarks that either aren't SIMD-friendly or are memory-bandwidth constrained. :wink:
To be fair, most code probably falls under one of those two categories. Including your benchmark, if we remove the @inline.
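
For illustration, a hedged sketch of what "memory-bandwidth constrained" means here (threaded_axpy! is a made-up name, not from this thread): a kernel that moves many bytes per floating-point operation, so memory traffic, not the SIMD units, sets the speed.

function threaded_axpy!(y, a, x)
    Threads.@threads for i in eachindex(y, x)
        @inbounds y[i] = a * x[i] + y[i]   # 2 flops per element...
    end                                    # ...but 24 bytes read/written per element
end

# using BenchmarkTools
# n = 10^8; x = rand(n); y = rand(n)
# @belapsed threaded_axpy!($y, 2.0, $x)    # bound by memory bandwidth, not core count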

4 Likes

Thanks Chris, this is very interesting!
I am very surprised that the @inline makes such a difference here!

M1 Max (unplugged):

Threads.nthreads() = 8
(tseq, tmt, SpUp) = (0.000936084, 0.000131709, 7.10721363004806)
(0.000936084, 0.000131709, 7.10721363004806)

I agree that this example was not chosen to emphasize the M1's strengths, but rather as the simplest multi-threading example (showing a convincing speedup) I could think of.
I will find a better one :wink:

1 Like

Here is a memory-bound example (wave propagation with a low-order finite-difference scheme) with a Makie (GLMakie) animation running during the computations.

SingleSpring.jl
using GLMakie
using PlotUtils #for colorant and color gradient (cgrad)

Base.@kwdef struct SpringParam
    ls::Float64 # Spring length
    ms::Float64 # Mass
    ks::Float64 # Spring Strength
    ns::Int     # Mass number
end

Base.@kwdef struct InitParam
    λ::Float64      # Wavelength of the initial deformation
    shift::Float64  # Maximal initial node displacement
    pos::Float64    # Central position of the node displacement
end

Base.@kwdef struct AnimParam
    δt::Float64      # Time step for the dynamic simulation
    nδt::Int         # Number of timesteps
    nδtperframe::Int # Number of timesteps per animation frame
end

steady_position(i, ls) = (i - 1) * ls
function initial_position(i,j,ls,λ,shift,posx,posy)
    xs=steady_position(i,ls)
    ys=steady_position(j,ls)
    dx=xs-posx
    dy=ys-posy
    dr=sqrt(dx^2+dy^2)
    ds=(-1/λ^2)*exp(-(dr/λ)^2/2)*shift
    xs+dx*ds,ys+dy*ds
end 

function display_perf(td,xc,nĪ“t)
    nflops=20 # Manual count floating point ops
    nRW=12    # Manual count Array Reads and Writes
    T=eltype(xc) # i.e. Float64
    floatsize_inbytes=sizeof(T) #Float64 -> 8 Bytes
    nbytes=nRW*floatsize_inbytes # Float64 -> 96 Bytes
    
    mupns=length(xc)*nδt/(td*1.e9) # mass pos updates per ns
    println("$(round((mupns*nflops),sigdigits=3))\t GFLOPS")
    println("$(round((mupns*nbytes),sigdigits=3))\t GB/s")
end

#V is the type of the returned arrays (type are first class citizens !)
function initialize_arrays(sp,ip,V) 
    (;ls,ns) = sp
    (;λ,shift,pos) = ip

    #2D Array comprehensions 
    #  -> cxs,xys,cxc,cyc are 2D arrays of size (ns,ns)
    cxs=[steady_position(i,ls) for i ∈ 1:ns, _ ∈ 1:ns]
    cys=[steady_position(j,ls) for _ ∈ 1:ns, j ∈ 1:ns]
    cxc=[initial_position(i,j,ls,λ,shift,pos,pos)[1] for i ∈ 1:ns, j ∈ 1:ns]
    cyc=[initial_position(i,j,ls,λ,shift,pos,pos)[2] for i ∈ 1:ns, j ∈ 1:ns]

    #Broadcast of V Ctor equivalent to xs=V(cxs);ys=V(cys);xc=V(cxc);yc=V(cyc)
    xs,ys,xc,yc=V.((cxs,cys,cxc,cyc))
    #Assume zero initial velocities : xp(revious)=xc(urrent)
    fx,xt,xp=zero(xc),zero(xc),copy(xc)
    fy,yt,yp=zero(yc),zero(yc),copy(yc)

    xs,xc,xp,xt,fx,ys,yc,yp,yt,fy
end

function update_position!(xt::Array{T,2},xc::Array{T,2},xp::Array{T,2},
    fx::Array{T,2},δt, ms) where {T}

    nsx,nsy=size(xc)
    coef = T(δt^2 / ms)
    t2 = T(2)
    Threads.@threads for j ∈ 1:nsy
        @inbounds for i ∈ 1:nsx
            xt[i,j] = t2*xc[i,j]-xp[i,j]+fx[i,j]*coef
        end
    end
end

function update_force!(fx::Array{T,2},xc::Array{T,2},ks) where {T}
    nsx,nsy=size(xc)
    mtks=T(-ks)
    t4=T(4)
    Threads.@threads for j ∈ 2:nsy-1
        @inbounds for i ∈ 2:nsx-1
            fx[i,j] = mtks*(t4*xc[i,j]
            -xc[i,j-1]
            -xc[i-1,j]
            -xc[i+1,j]
            -xc[i,j+1])
        end
    end
end

function density_shift!(dc,xc,yc,ls)
    ns=size(xc,1) #assuming square array
    r0,r1,r2=1:ns-2,2:ns-1,3:ns # UnitRange
    ds=1/(4ls^2)  #steady density
    @views @. dc[r1,r1] = 1/((xc[r2,r1] - xc[r0,r1])*(yc[r1,r2] - yc[r1,r0])) - ds
end

function setup_scene(xaxis,dc,nf)
    # scene = Scene(resolution = (1500, 1500))
    dcnode = Observable(dc)

    xM=maximum(xaxis)

    dM=maximum(dc)
    fig = Figure(resolution = (2000, 1400), fontsize = 22)

    ax = Axis3(fig[1, 1], limits=(0,xM,0,xM,-dM/8,dM/8))

    surface!(ax,xaxis,xaxis,lift(d->d,dcnode),colormap=corporate_gradient(),colorrange = (-0.1,0.1))
    # scale!(ax, 1, 1, 0.1*xM/dM)

    #   translate_cam!(scene,(10,10,2))
    #     update_cam!(scene, Vec3f0(0, 0, 5), Vec3f0(0.01, 0.01, 0))
    #     rotate_cam!(scene,(π/4 ,-π/2 ,0))
    #     αstep=-(π/4)/(nf÷2)
    #     lift(d->rotate_cam!(scene,(0 ,αstep ,0)),dcnode)



    fig,dcnode
end

#Triscale colors ;)
corporate_gradient() = cgrad([colorant"#177272",colorant"#fafafa",colorant"#ee6f40"])

function advance_nδtpf(xc, xp, xt, fx, sp, ap)
    (;ms, ks) = sp
    (;δt, nδtperframe) = ap

    for _ ∈ 1:nδtperframe # Anonymous iter variable (_)
        update_force!(fx, xc, ks)
        update_position!(xt, xc, xp, fx, δt, ms)
        xc, xp, xt = xt, xc, xp # Julia's swap
    end
    xc, xp, xt
end

function animate_makie(sp,ip,ap,V)
    (;δt,nδt,nδtperframe) = ap
    xs,xc,xp,xt,fx,ys,yc,yp,yt,fy=initialize_arrays(sp,ip,V)

    dc=zero(xc)
    cdc=Array(xc)
    density_shift!(dc,xc,yc,sp.ls)
    @. cdc = dc 
    xaxis=Array(xs[:,1])
    yaxis=Array(ys[1,:])
    nf=nδt÷nδtperframe

    scene,dcnode=setup_scene(xaxis,cdc,nf)
    display(scene)
    
    t=0.0
    tdynamic=0.0
    for _ ∈ 1:nf
        tdynamic+= @elapsed begin
            xc,xp,xt=advance_nδtpf(xc,xp,xt,fx,sp,ap)
            yc,yp,yt=advance_nδtpf(yc,yp,yt,fy,sp,ap)
        end
        density_shift!(dc,xc,yc,sp.ls)
        dcnode[] = dc
        sleep(0.01)
        t+=nδtperframe*δt
    end

    display_perf(tdynamic,xc,nδt) #Display GFLOPS and GB/s
end

function go()
    sp=SpringParam(ls=0.1,ms=1,ks=2.5,ns=1000)
    ip=InitParam(λ=3.0,shift=0.1,pos=sp.ls*sp.ns/2)
    ap=AnimParam(δt=0.2,nδt=5000,nδtperframe=20)

    V=Array{Float64,2}

    @time animate_makie(sp,ip,ap,V)
end
go()

I obtain (M1 Max):

julia> include("SingleSpring.jl")
21.1     GFLOPS
101.0    GB/s
  8.506742 seconds (1.46 M allocations: 1.164 GiB, 0.86% gc time, 1.18% compilation time)

with a final image:

8 Likes

Can we start a GoFundMe to get @Elrod an M2 MacBook Pro so he can use some of his wizardry to make it sing?

5 Likes

Maybe it's worth mentioning that this:

julia> @show tseq,tmt,SpUp;
(tseq, tmt, SpUp) = (0.000943125, 0.000267958, 3.5196747251434926)

was run on an M1 without a fan, chilling at around 27 °C (7 °C above room temperature) right now, with two 4K displays (via DisplayLink), 600 processes and 3500 active threads :wink: and a load of around 4.

…but yes, roasting is something else :laughing:

2 Likes

In this case, Chris's wizard tricks may not improve the M1's results.

I guess that the remaining factor of 2 in favor of Chris's x86 chip is related to the AVX (256-bit-wide) instructions, which (AFAIK) have no counterpart on ARM (restricted to 128-bit-wide SIMD instructions).

This is specific to purely SIMD-friendly, CPU-bound problems like my first example.

The second example (waves) is memory-bound, and I would be interested in x86 results on that one too :slight_smile:

1 Like

Here are results from my Windows 11 machine:

julia> include("SingleSpring.jl")
4.57     GFLOPS
22.0     GB/s
 27.811275 seconds (1.47 M allocations: 1.182 GiB, 1.06% gc time, 0.68% compilation time: 2% of which was recompilation)

julia> versioninfo()
Julia Version 1.8.5
Commit 17cfb8e65e (2023-01-08 06:45 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 8 × Intel(R) Core(TM) i7-9700 CPU @ 3.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, skylake)
  Threads: 8 on 8 virtual cores
Environment:
  JULIA_EDITOR = code
  JULIA_NUM_THREADS = 8
3 Likes

Thanks !

In this case you see the benefit of the M1's architecture with its high-speed unified memory: the MacBook Pro is about 5 times faster.

2 Likes

Impressive! But does the timing also depend on the relative speed of the graphics cards in the two machines?

Nope. Pure CPU timings.

The graphics part (OpenGL rendering) is not measured by the GFLOPS and GB/s metrics here.

Note that the final time (in seconds) should not be taken too seriously, since I have included a sleep(0.01) to limit the animation frame rate.

1 Like

I am surprised to see that my Mac with an i5 Ice Lake CPU beats the M1 by a hefty margin. It is more like an M1 Pro. I guess they never upgraded it to Tiger Lake because the difference vs. Apple silicon would have been quite small.

julia> tseq = @belapsed sequential_exp!(y,x)
0.000633398

julia> tmt  = @belapsed threaded_exp!(y,x)
0.000176527

julia> SpUp = tseq/tmt; Threads.nthreads()
8

julia> @show tseq,tmt,SpUp;
(tseq, tmt, SpUp) = (0.000633398, 0.000176527, 3.58810833470234)

julia> versioninfo()
Julia Version 1.8.0
Commit 5544a0fab76 (2022-08-17 13:38 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin21.4.0)
  CPU: 8 × Intel(R) Core(TM) i5-1038NG7 CPU @ 2.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, icelake-client)
  Threads: 8 on 8 virtual cores
Environment:
  JULIA_NUM_THREADS = 8

I get about the same performance on 8x 14nm Intel cores.

julia> include("/home/chriselrod/Documents/progwork/julia/benchmarks/SingleSpring.jl");
21.3     GFLOPS
102.0    GB/s
  8.639054 seconds (1.32 M allocations: 1.166 GiB, 0.95% gc time, 4.84% compilation time)

julia> go()
21.8     GFLOPS
105.0    GB/s
  8.459699 seconds (1.29 M allocations: 1.164 GiB, 0.30% gc time)

julia> versioninfo()
Julia Version 1.9.0-beta4
Commit b75ddb787f* (2023-02-07 21:53 UTC)
Platform Info:
  OS: Linux (x86_64-redhat-linux)
  CPU: 36 × Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, cascadelake)
  Threads: 8 on 36 virtual cores

However, it is consuming far more power while doing it.
I do have a 7980XE at work that's actually plugged into a monitor to measure wattage.
It'll idle at >95 watts.

This CPU has AVX512 (as does Ice Lake), which is why they perform so well on SIMD workloads.

1 Like

We migrated to the M1 architecture for many analysis and simulation tasks. Our problem is sequential (simulation of a satellite attitude control subsystem), but we could parallelize some parts.

Just using the M1 provided a huge gain compared with our previous Intel processor, something like 3x faster. I have no idea why; I just ran the same code.

Now, we are taking advantage of unified memory to run some computations on the GPU. Previously, the cost of copying to GPU memory was so high that the gain was negligible. Now, since we can use the same memory, I was able to achieve a real speedup! This unified memory architecture is just amazing!
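
For anyone curious, a minimal sketch of the pattern (not our actual simulation code; the names and sizes are made up, and how much copying the unified memory really saves depends on the Metal.jl version):

using Metal   # assumes Metal.jl is installed; Float32 is the best-supported element type

x  = rand(Float32, 10^7)
xg = MtlArray(x)            # hand the data to the GPU
yg = exp.(xg) .* 2f0        # elementwise kernels compile and run on the M1 GPU
y  = Array(yg)              # copy back only when the CPU needs the result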

1 Like

M1 systems top out at 64 GB of memory
:frowning:

1 Like