Taking advantage of Apple M1?

Hi all.
So, I got my new laptop with an Apple M1 chip. I have heard it is great.

I wonder how you are all taking advantage of it and its cores/GPU: how you start Julia, use Metal.jl, and other tips to speed up your workflows.

All general tips, tricks and cheats are welcome :wink:

2 Likes

CPU: +++ The M1 cores are great with native ARM Julia. In particular, I noticed that multi-threading is much better without Rosetta. My M1 Pro laptop usually smokes x86 desktops without any noise :wink:
GPU: --- Metal.jl is way behind CUDA.jl. I also experienced short freezes during Makie OpenGL animations (which run smoothly with a small discrete NVIDIA card).
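
For the "how do you start Julia" part of the question, here is a minimal sketch (assuming Metal.jl is installed; the array names and sizes are just illustrative and exact output depends on your setup):

# Start Julia with several threads from a shell:
#   julia --threads=auto        # or set JULIA_NUM_THREADS before launching

julia> versioninfo()            # look for "aarch64-apple-darwin" => native ARM, not Rosetta

julia> Threads.nthreads()       # should match what you asked for

julia> using Metal              # ] add Metal first

julia> a = MtlArray(rand(Float32, 10_000));   # hand the data to the M1 GPU

julia> b = sqrt.(a) .+ 1f0;     # broadcasts run as Metal kernels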

5 Likes

Can you share some code for this, please?

1 Like

Can you give me an example of multi-threading? It is my first time with more than one core hahaha

Hi Romain,

This observation is a bit old (July 2022) and I was not able to reproduce the issue after updating my environment (Julia 1.8.5, Ventura, and up-to-date Julia packages). Makie OpenGL animations run smoothly now :partying_face:

My issue concerning Metal.jl (slow broadcast copy in 2D · Issue #41 · JuliaGPU/Metal.jl · GitHub) remains though.

1 Like

Well, multi-threading is not a trivial subject and I guess that you will have to study the documentation (Multi-Threading · The Julia Language), and especially data races (Multi-Threading · The Julia Language).
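
As a hedged illustration of the data-race pitfall (racy_sum and chunked_sum are made-up names for this sketch, not something from the docs): letting every thread mutate the same variable gives wrong answers, while giving each task its own chunk and partial result is safe.

function racy_sum(x)
    s = 0.0
    Threads.@threads for i in eachindex(x)
        s += x[i]   # data race: all threads read/modify/write `s` unsynchronized
    end
    return s        # usually wrong, and different on every run
end

function chunked_sum(x; ntasks = Threads.nthreads())
    # Each task gets its own contiguous chunk and its own partial sum.
    chunks = Iterators.partition(eachindex(x), cld(length(x), ntasks))
    tasks  = [Threads.@spawn sum(view(x, c)) for c in chunks]
    return sum(fetch.(tasks))
end

With several threads, chunked_sum(rand(10^7)) should agree with sum up to floating-point reordering, while racy_sum typically will not.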

Here is a trivial speedup example, though.

using BenchmarkTools

@show Threads.nthreads()

function sequential_exp!(y,x)
    for i in eachindex(x)
        @inbounds y[i]=exp(x[i])
    end
end

function threaded_exp!(y,x)
    Threads.@threads for i in eachindex(x)
        @inbounds y[i]=exp(x[i])
    end
end

n=500000
x=rand(n)
y=zeros(n)

tseq = @belapsed sequential_exp!(y,x)
tmt  = @belapsed threaded_exp!(y,x)
SpUp = tseq/tmt
@show tseq,tmt,SpUp

I obtain (M1 Max):

julia> include("testmt.jl")
Threads.nthreads() = 8
(tseq, tmt, SpUp) = (0.001626458, 0.000223959, 7.262302475006585)
(0.001626458, 0.000223959, 7.262302475006585)
1 Like

Okay, now I can't resist. Here is what I get with 8 threads on my 14nm Intel desktop:

julia> using BenchmarkTools

julia> n=500000;

julia> x=rand(n);

julia> y=zeros(n);

julia> function threaded_exp!(y,x)
           Threads.@threads for i in eachindex(x)
               @inbounds y[i]=@inline exp(x[i])
           end
       end
threaded_exp! (generic function with 1 method)

julia> function sequential_exp!(y,x)
           for i in eachindex(x)
               @inbounds y[i]=@inline exp(x[i])
           end
       end
sequential_exp! (generic function with 1 method)

julia> tseq = @belapsed sequential_exp!(y,x)
0.000430829

julia> tmt  = @belapsed threaded_exp!(y,x)
6.9831e-5

julia> SpUp = tseq/tmt; Threads.nthreads()
8

julia> @show tseq,tmt,SpUp;
(tseq, tmt, SpUp) = (0.000430829, 6.9831e-5, 6.1695951654709225)

julia> versioninfo()
Julia Version 1.10.0-DEV.410
Commit 62c1137e38* (2023-01-21 04:16 UTC)
Platform Info:
  OS: Linux (x86_64-redhat-linux)
  CPU: 36 × Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz

The most important bit:

julia> @show tseq,tmt,SpUp;
(tseq, tmt, SpUp) = (0.000430829, 6.9831e-5, 6.1695951654709225)

You can try rerunning with the @inlines like I have above.
Sure, I only got a 6.2x speedup, but

julia> (0.001626458, 0.000223959) ./ (0.000430829, 6.9831e-5)
(3.775182264889318, 3.2071572797181767)

far from being smoked, my old Intel CPU was over 3x faster on both 1 thread and 8 threads. :slight_smile:

To make the M1 look good, you need benchmarks that either aren't SIMD-friendly or are memory-bandwidth constrained. :wink:
To be fair, most code probably falls under one of those two categories. Including your benchmark, if we remove the @inline.
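
For illustration, a hedged sketch of what "memory-bandwidth constrained" means here (threaded_axpy! is a made-up name, not from this thread): a kernel that moves many bytes per floating-point operation, so memory traffic, not the SIMD units, sets the speed.

function threaded_axpy!(y, a, x)
    Threads.@threads for i in eachindex(y, x)
        @inbounds y[i] = a * x[i] + y[i]   # 2 flops per element...
    end                                    # ...but 24 bytes read/written per element
end

# using BenchmarkTools
# n = 10^8; x = rand(n); y = rand(n)
# @belapsed threaded_axpy!($y, 2.0, $x)    # bound by memory bandwidth, not core count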

4 Likes

Thanks Chris, this is very interesting!
I am very surprised that the @inline makes such a difference here!

M1 Max (unplugged):

Threads.nthreads() = 8
(tseq, tmt, SpUp) = (0.000936084, 0.000131709, 7.10721363004806)
(0.000936084, 0.000131709, 7.10721363004806)

I agree that this example was not chosen to emphasize the M1's strengths, but rather as the simplest multi-threading example (showing a convincing speedup) I could think of.
I will find a better one :wink:

1 Like

Here is a memory-bound example (wave propagation with a low-order finite-difference scheme) with a Makie (GLMakie) animation running during the computations.

SingleSpring.jl
using GLMakie
using PlotUtils #for colorant and color gradient (cgrad)

Base.@kwdef struct SpringParam
    ls::Float64 # Spring length
    ms::Float64 # Mass
    ks::Float64 # Spring Strength
    ns::Int     # Mass number
end

Base.@kwdef struct InitParam
    λ::Float64      # Wavelength of the initial deformation
    shift::Float64  # Maximal initial node displacement
    pos::Float64    # Central position of the node displacement
end

Base.@kwdef struct AnimParam
    δt::Float64      # Time step for the dynamic simulation
    nδt::Int         # Number of timesteps
    nδtperframe::Int # Number of timesteps per animation frame
end

steady_position(i, ls) = (i - 1) * ls
function initial_position(i,j,ls,λ,shift,posx,posy)
    xs=steady_position(i,ls)
    ys=steady_position(j,ls)
    dx=xs-posx
    dy=ys-posy
    dr=sqrt(dx^2+dy^2)
    ds=(-1/λ^2)*exp(-(dr/λ)^2/2)*shift
    xs+dx*ds,ys+dy*ds
end 

function display_perf(td,xc,nĪ“t)
    nflops=20 # Manual count floating point ops
    nRW=12    # Manual count Array Reads and Writes
    T=eltype(xc) # i.e. Float64
    floatsize_inbytes=sizeof(T) #Float64 -> 8 Bytes
    nbytes=nRW*floatsize_inbytes # Float64 -> 96 Bytes
    
    mupns=length(xc)*nδt/(td*1.e9) # mass pos updates per ns
    println("$(round((mupns*nflops),sigdigits=3))\t GFLOPS")
    println("$(round((mupns*nbytes),sigdigits=3))\t GB/s")
end

#V is the type of the returned arrays (type are first class citizens !)
function initialize_arrays(sp,ip,V) 
    (;ls,ns) = sp
    (;λ,shift,pos) = ip

    #2D Array comprehensions 
    #  -> cxs,xys,cxc,cyc are 2D arrays of size (ns,ns)
    cxs=[steady_position(i,ls) for i ∈ 1:ns, _ ∈ 1:ns]
    cys=[steady_position(j,ls) for _ ∈ 1:ns, j ∈ 1:ns]
    cxc=[initial_position(i,j,ls,λ,shift,pos,pos)[1] for i ∈ 1:ns, j ∈ 1:ns]
    cyc=[initial_position(i,j,ls,λ,shift,pos,pos)[2] for i ∈ 1:ns, j ∈ 1:ns]

    #Broadcast of V Ctor equivalent to xs=V(cxs);ys=V(cys);xc=V(cxc);yc=V(cyc)
    xs,ys,xc,yc=V.((cxs,cys,cxc,cyc))
    #Assume zero initial velocities : xp(revious)=xc(urrent)
    fx,xt,xp=zero(xc),zero(xc),copy(xc)
    fy,yt,yp=zero(yc),zero(yc),copy(yc)

    xs,xc,xp,xt,fx,ys,yc,yp,yt,fy
end

function update_position!(xt::Array{T,2},xc::Array{T,2},xp::Array{T,2},
    fx::Array{T,2},δt, ms) where {T}

    nsx,nsy=size(xc)
    coef = T(δt^2 / ms)
    t2 = T(2)
    Threads.@threads for j ∈ 1:nsy
        @inbounds for i ∈ 1:nsx
            xt[i,j] = t2*xc[i,j]-xp[i,j]+fx[i,j]*coef
        end
    end
end

function update_force!(fx::Array{T,2},xc::Array{T,2},ks) where {T}
    nsx,nsy=size(xc)
    mtks=T(-ks)
    t4=T(4)
    Threads.@threads for j ∈ 2:nsy-1
        @inbounds for i ∈ 2:nsx-1
            fx[i,j] = mtks*(t4*xc[i,j]
            -xc[i,j-1]
            -xc[i-1,j]
            -xc[i+1,j]
            -xc[i,j+1])
        end
    end
end

function density_shift!(dc,xc,yc,ls)
    ns=size(xc,1) #assuming square array
    r0,r1,r2=1:ns-2,2:ns-1,3:ns # UnitRange
    ds=1/(4ls^2)  #steady density
    @views @. dc[r1,r1] = 1/((xc[r2,r1] - xc[r0,r1])*(yc[r1,r2] - yc[r1,r0])) - ds
end

function setup_scene(xaxis,dc,nf)
    # scene = Scene(resolution = (1500, 1500))
    dcnode = Observable(dc)

    xM=maximum(xaxis)

    dM=maximum(dc)
    fig = Figure(resolution = (2000, 1400), fontsize = 22)

    ax = Axis3(fig[1, 1], limits=(0,xM,0,xM,-dM/8,dM/8))

    surface!(ax,xaxis,xaxis,lift(d->d,dcnode),colormap=corporate_gradient(),colorrange = (-0.1,0.1))
    # scale!(ax, 1, 1, 0.1*xM/dM)

    #   translate_cam!(scene,(10,10,2))
    #     update_cam!(scene, Vec3f0(0, 0, 5), Vec3f0(0.01, 0.01, 0))
    #     rotate_cam!(scene,(π/4 ,-π/2 ,0))
    #     αstep=-(π/4)/(nf÷2)
    #     lift(d->rotate_cam!(scene,(0 ,αstep ,0)),dcnode)



    fig,dcnode
end

#Triscale colors ;)
corporate_gradient() = cgrad([colorant"#177272",colorant"#fafafa",colorant"#ee6f40"])

function advance_nδtpf(xc, xp, xt, fx, sp, ap)
    (;ms, ks) = sp
    (;δt, nδtperframe) = ap

    for _ ∈ 1:nδtperframe # Anonymous iter variable (_)
        update_force!(fx, xc, ks)
        update_position!(xt, xc, xp, fx, δt, ms)
        xc, xp, xt = xt, xc, xp # Julia's swap
    end
    xc, xp, xt
end

function animate_makie(sp,ip,ap,V)
    (;δt,nδt,nδtperframe) = ap
    xs,xc,xp,xt,fx,ys,yc,yp,yt,fy=initialize_arrays(sp,ip,V)

    dc=zero(xc)
    cdc=Array(xc)
    density_shift!(dc,xc,yc,sp.ls)
    @. cdc = dc 
    xaxis=Array(xs[:,1])
    yaxis=Array(ys[1,:])
    nf=nδt÷nδtperframe

    scene,dcnode=setup_scene(xaxis,cdc,nf)
    display(scene)
    
    t=0.0
    tdynamic=0.0
    for _ ∈ 1:nf
        tdynamic+= @elapsed begin
            xc,xp,xt=advance_nδtpf(xc,xp,xt,fx,sp,ap)
            yc,yp,yt=advance_nδtpf(yc,yp,yt,fy,sp,ap)
        end
        density_shift!(dc,xc,yc,sp.ls)
        dcnode[] = dc
        sleep(0.01)
        t+=nδtperframe*δt
    end

    display_perf(tdynamic,xc,nδt) #Display GFLOPS and GB/s
end

function go()
    sp=SpringParam(ls=0.1,ms=1,ks=2.5,ns=1000)
    ip=InitParam(λ=3.0,shift=0.1,pos=sp.ls*sp.ns/2)
    ap=AnimParam(δt=0.2,nδt=5000,nδtperframe=20)

    V=Array{Float64,2}

    @time animate_makie(sp,ip,ap,V)
end
go()

I obtain (M1 Max):

julia> include("SingleSpring.jl")
21.1     GFLOPS
101.0    GB/s
  8.506742 seconds (1.46 M allocations: 1.164 GiB, 0.86% gc time, 1.18% compilation time)

with a final image:

8 Likes

Can we start a GoFundMe to get @Elrod an M2 MacBook Pro so he can use some of his wizardry to make it sing?

5 Likes

Maybe it's worth mentioning that this:

julia> @show tseq,tmt,SpUp;
(tseq, tmt, SpUp) = (0.000943125, 0.000267958, 3.5196747251434926)

was run on an M1 without a fan, chilling at around 27 °C (7 °C above room temperature) right now, with two 4K displays (via DisplayLink), 600 processes and 3500 active threads :wink: and a load of around 4.

…but yes, roasting is something else :laughing:

2 Likes

In this case, Chris's wizard tricks may not improve the M1's results.

I guess that the remaining factor of 2 in favor of Chris's x86 chip is related to the AVX (256-bit-wide) instructions, which (AFAIK) have no counterpart on ARM (restricted to 128-bit-wide SIMD instructions).

This is specific to purely SIMD-friendly, CPU-bound problems like my first example.

The second example (waves) is memory-bound, and I would be interested in x86 results on that one too :slight_smile:

1 Like

Here are results from my Windows 11 machine:

julia> include("SingleSpring.jl")
4.57     GFLOPS
22.0     GB/s
 27.811275 seconds (1.47 M allocations: 1.182 GiB, 1.06% gc time, 0.68% compilation time: 2% of which was recompilation)

julia> versioninfo()
Julia Version 1.8.5
Commit 17cfb8e65e (2023-01-08 06:45 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 8 × Intel(R) Core(TM) i7-9700 CPU @ 3.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, skylake)
  Threads: 8 on 8 virtual cores
Environment:
  JULIA_EDITOR = code
  JULIA_NUM_THREADS = 8
3 Likes

Thanks !

In this case you see the benefit of the M1's architecture with its high-speed unified memory: the MacBook Pro is about 5 times faster.

2 Likes

Impressive! But does the timing also depend on the relative speed of the graphics cards in the two machines?

Nope. Pure CPU timings.

The graphics part (OpenGL rendering) is not measured by the GFLOPS and GB/s metrics here.

Note that the final time (in seconds) should not be taken too seriously, since I have included a sleep(0.01) to limit the animation frame rate.

1 Like

I am surprised to see that my Mac with an i5 Ice Lake CPU beats the M1 by a hefty margin. It is more like an M1 Pro. I guess they never upgraded it to Tiger Lake because the difference vs. Apple silicon would have been quite small.

julia> tseq = @belapsed sequential_exp!(y,x)
0.000633398

julia> tmt  = @belapsed threaded_exp!(y,x)
0.000176527

julia> SpUp = tseq/tmt; Threads.nthreads()
8

julia> @show tseq,tmt,SpUp;
(tseq, tmt, SpUp) = (0.000633398, 0.000176527, 3.58810833470234)

julia> versioninfo()
Julia Version 1.8.0
Commit 5544a0fab76 (2022-08-17 13:38 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin21.4.0)
  CPU: 8 × Intel(R) Core(TM) i5-1038NG7 CPU @ 2.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, icelake-client)
  Threads: 8 on 8 virtual cores
Environment:
  JULIA_NUM_THREADS = 8

I get about the same performance on 8x 14nm Intel cores.

julia> include("/home/chriselrod/Documents/progwork/julia/benchmarks/SingleSpring.jl");
21.3     GFLOPS
102.0    GB/s
  8.639054 seconds (1.32 M allocations: 1.166 GiB, 0.95% gc time, 4.84% compilation time)

julia> go()
21.8     GFLOPS
105.0    GB/s
  8.459699 seconds (1.29 M allocations: 1.164 GiB, 0.30% gc time)

julia> versioninfo()
Julia Version 1.9.0-beta4
Commit b75ddb787f* (2023-02-07 21:53 UTC)
Platform Info:
  OS: Linux (x86_64-redhat-linux)
  CPU: 36 × Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, cascadelake)
  Threads: 8 on 36 virtual cores

However, it is consuming far more power while doing it.
I do have a 7980XE at work that's actually plugged into a monitor to measure wattage.
It'll idle at >95 watts.

This CPU has AVX512 (as does Ice Lake), which is why they perform so well on SIMD workloads.

1 Like

We migrated to the M1 architecture for many analysis and simulation tasks. Our problem is sequential (simulation of a satellite attitude control subsystem), but we could parallelize some parts.

Just using the M1 provided a huge gain compared with our previous Intel processor, something like 3x faster. I have no idea why; I just ran the same code.

Now, we are taking advantage of unified memory to run some computations on the GPU. Previously, the cost of copying to GPU memory was so high that the gain was negligible. Now, since we can use the same memory, I was able to achieve a real speedup! This unified memory architecture is just amazing!
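
For anyone curious, a minimal sketch of the pattern (not our actual simulation code; the names and sizes are made up, and how much copying the unified memory really saves depends on the Metal.jl version):

using Metal   # assumes Metal.jl is installed; Float32 is the best-supported element type

x  = rand(Float32, 10^7)
xg = MtlArray(x)            # hand the data to the GPU
yg = exp.(xg) .* 2f0        # elementwise kernels compile and run on the M1 GPU
y  = Array(yg)              # copy back only when the CPU needs the result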

1 Like

M1 systems top out at 64 GB of memory
:frowning:

1 Like