AMD Rome vs Intel Xeon: poor scaling with threads on AMD

I ran my 3D finite difference stencil benchmark https://github.com/Chiil/MicroHH.jl/blob/main/test/dynamics_kernel.jl on an AMD Rome 7H12 and an Intel Xeon Platinum 8360Y with grid size (itot = 256; jtot = 256; ktot = 1024). The @fast3d macro generates a nested 3D loop with a @tturbo annotation in front of it. I noticed that scaling on the AMD degrades dramatically beyond 4 threads, whereas the Intel scales well up to 16 cores.
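
For context, here is a minimal sketch of the kind of loop @fast3d generates (this is not the actual macro expansion; the 7-point diffusion stencil, field names, and viscosity value are illustrative only):

using LoopVectorization, BenchmarkTools

function diff_kernel!(at, a, visc, dxi2, dyi2, dzi2, is, ie, js, je, ks, ke)
    # innermost loop over i (first index) to match Julia's column-major layout
    @tturbo for k in ks:ke, j in js:je, i in is:ie
        at[i, j, k] += visc * (
            (a[i-1, j, k] - 2 * a[i, j, k] + a[i+1, j, k]) * dxi2 +
            (a[i, j-1, k] - 2 * a[i, j, k] + a[i, j+1, k]) * dyi2 +
            (a[i, j, k-1] - 2 * a[i, j, k] + a[i, j, k+1]) * dzi2)
    end
end

itot, jtot, ktot = 256, 256, 1024
igc = jgc = kgc = 1                      # ghost cells, as in the linked script
a  = rand(itot + 2 * igc, jtot + 2 * jgc, ktot + 2 * kgc)
at = zero(a)
is, ie = igc + 1, igc + itot
js, je = jgc + 1, jgc + jtot
ks, ke = kgc + 1, kgc + ktot
@btime diff_kernel!($at, $a, 0.1, 1.0, 1.0, 1.0, $is, $ie, $js, $je, $ks, $ke)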

Any idea what is going on here? The AMD system is our production machine, so I would be very happy to get better scaling there.

AMD results:

chiel@tcn408:~/MicroHH.jl/test$ julia --project -O3 -t1 dynamics_kernel.jl
  437.382 ms (0 allocations: 0 bytes)
chiel@tcn408:~/MicroHH.jl/test$ julia --project -O3 -t2 dynamics_kernel.jl 
  228.276 ms (0 allocations: 0 bytes)
chiel@tcn408:~/MicroHH.jl/test$ julia --project -O3 -t4 dynamics_kernel.jl 
  110.916 ms (0 allocations: 0 bytes)
chiel@tcn408:~/MicroHH.jl/test$ julia --project -O3 -t8 dynamics_kernel.jl 
  73.895 ms (0 allocations: 0 bytes)
chiel@tcn408:~/MicroHH.jl/test$ julia --project -O3 -t16 dynamics_kernel.jl 
  85.145 ms (0 allocations: 0 bytes)

Intel results:

chiel@gcn35:~/MicroHH.jl/test$ julia --project -O3 -t1 dynamics_kernel.jl 
  402.697 ms (0 allocations: 0 bytes)
chiel@gcn35:~/MicroHH.jl/test$ julia --project -O3 -t2 dynamics_kernel.jl 
  201.879 ms (0 allocations: 0 bytes)
chiel@gcn35:~/MicroHH.jl/test$ julia --project -O3 -t4 dynamics_kernel.jl 
  101.593 ms (0 allocations: 0 bytes)
chiel@gcn35:~/MicroHH.jl/test$ julia --project -O3 -t8 dynamics_kernel.jl 
  52.347 ms (0 allocations: 0 bytes)
chiel@gcn35:~/MicroHH.jl/test$ julia --project -O3 -t16 dynamics_kernel.jl 
  29.518 ms (0 allocations: 0 bytes)

The cores are grouped into subsets of 4 (“core complexes”) and beyond that it seems some work is needed to optimize memory traffic. I stumbled across a related paper: CFD Application on AMD Epyc Rome by Szustak et al. Perhaps @tkf has a suggestion for getting Julia threads to act like their OpenMP work teams.

I can’t guess too much given the information in the OP. But I’m curious what you’d get with JULIA_EXCLUSIVE=1 and still with the explicit number of threads specified by -t as in the OP (e.g., JULIA_EXCLUSIVE=1 julia --project -O3 -t4 dynamics_kernel.jl).

There’s https://github.com/JuliaConcurrent/SyncBarriers.jl if you want to write SPMD-flavoured code for multi-threaded Julia.

JULIA_EXCLUSIVE=1 makes performance worse, and watching htop, it seems to put multiple threads on one core, which I do not understand. @tkf which extra information would you need?

Echoing what Ralph says about 4 threads per core complex.
In the BIOS, the NUMA nodes per socket (NPS) setting can be altered.
Here is a good write-up on this from Dell:
https://www.dell.com/support/kbdoc/en-uk/000137696/amd-rome-is-it-for-real-architecture-and-initial-hpc-performance

That’s interesting. I think I’ve only seen improvements with JULIA_EXCLUSIVE=1 (although somewhat rare) when using Threads.@spawn.

@Elrod does @tturbo do something special with JULIA_EXCLUSIVE=1?

I just meant that there’s nothing I can guess. To say anything useful, it’d probably require me to understand the code, run it, and do some profiling with perf etc.

FWIW, on AMD EPYC 7502, I get

$ julia --project -O3 -t2 test/dynamics_kernel.jl
  102.258 ms (0 allocations: 0 bytes)
$ julia --project -O3 -t4 test/dynamics_kernel.jl
  50.678 ms (0 allocations: 0 bytes)
$ julia --project -O3 -t8 test/dynamics_kernel.jl
  22.772 ms (0 allocations: 0 bytes)
$ julia --project -O3 -t16 test/dynamics_kernel.jl
  16.875 ms (0 allocations: 0 bytes)
$ julia --project -O3 -t32 test/dynamics_kernel.jl
  14.039 ms (0 allocations: 0 bytes)
$ JULIA_EXCLUSIVE=1 julia --project -O3 -t2 test/dynamics_kernel.jl
  107.643 ms (0 allocations: 0 bytes)
$ JULIA_EXCLUSIVE=1 julia --project -O3 -t4 test/dynamics_kernel.jl
  53.208 ms (0 allocations: 0 bytes)
$ JULIA_EXCLUSIVE=1 julia --project -O3 -t8 test/dynamics_kernel.jl
  40.422 ms (0 allocations: 0 bytes)
$ JULIA_EXCLUSIVE=1 julia --project -O3 -t16 test/dynamics_kernel.jl
  22.268 ms (0 allocations: 0 bytes)
$ JULIA_EXCLUSIVE=1 julia --project -O3 -t32 test/dynamics_kernel.jl
  14.014 ms (0 allocations: 0 bytes)

i.e., no slowdown like the AMD Rome 7H12 in the OP, but I do see a slowdown with JULIA_EXCLUSIVE=1 for nthreads = 8 and 16.

That is interesting. I’d be happy to learn how to get more control over the thread distribution, similar to OpenMP…

The easiest way to control the distribution of Julia tasks to OS threads is to use

Threads.@threads :static for tid in 1:Threads.nthreads()
    # run code for the `tid`-th thread here
end

It works on the current Julia version (as of 1.8) but I don’t know if it will continue to work.
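
For example (a hedged sketch: the chunking scheme and the process_k_level! helper are illustrative, not from the benchmark), you can give each thread one contiguous chunk of k-levels, similar to an OpenMP static schedule:

using Base.Threads

function spmd_over_k(ktot)
    nt = Threads.nthreads()
    Threads.@threads :static for tid in 1:nt
        # contiguous chunk of k-levels owned by the `tid`-th thread
        krange = (1 + (tid - 1) * ktot ÷ nt):(tid * ktot ÷ nt)
        for k in krange
            # process_k_level!(k)   # hypothetical per-level work
        end
    end
end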

To control the distribution of OS threads to CPUs, you can use https://github.com/carstenbauer/ThreadPinning.jl (pinning Julia threads to cores). Or simply setting JULIA_EXCLUSIVE=1 may be fine, unless you have multiple processes sharing the same machine.
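
A hedged sketch of the ThreadPinning.jl route (assuming the pinthreads/threadinfo API of recent versions; check the package README):

using ThreadPinning

pinthreads(:cores)   # pin each Julia thread to its own physical core
threadinfo()         # print the resulting thread-to-core mapping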

Maybe it’s helpful to look at @carstenbauer’s benchmarking packages https://github.com/JuliaPerf/BandwidthBenchmark.jl and https://github.com/JuliaPerf/STREAMBenchmark.jl

No. It should be more or less the same as Threads.@threads :static.
ThreadingUtilities starts sticky tasks on threads 2:Threads.nthreads(), which under JULIA_EXCLUSIVE=1 should correspond to cores 1:Threads.nthreads(), while the main thread would then be running on core 0.
These are then the tasks @tturbo would run on.
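
One way to verify that placement on Linux (a hedged sketch using the glibc sched_getcpu call, not a LoopVectorization API):

using Base.Threads

Threads.@threads :static for tid in 1:Threads.nthreads()
    cpu = ccall(:sched_getcpu, Cint, ())   # CPU the calling thread is currently on
    println("Julia thread $tid is running on CPU $cpu")
end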

Side note:
At the OS level, the upcoming Linux kernel 5.18 will have a better scheduler:

  • A patch entitled “sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs”
  • “What’s exciting though is the end result and that is with an AMD Zen 3 platform he’s been testing, the OpenMP-parallelized Stream memory benchmark was 173~272% faster depending upon the memory operation tested.”

Read more:

Irrelevant for the OP, but the M1 Max does not scale well beyond 4 threads either.

## User input
itot = 256; jtot = 256; ktot = 1024;
igc = 1; jgc = 1; kgc = 1;
julia --project -O3 -t1 test/dynamics_kernel.jl
  150.445 ms (0 allocations: 0 bytes)
julia --project -O3 -t2 test/dynamics_kernel.jl
  77.839 ms (0 allocations: 0 bytes)
julia --project -O3 -t4 test/dynamics_kernel.jl
  54.055 ms (0 allocations: 0 bytes)
julia --project -O3 -t8 test/dynamics_kernel.jl
  53.369 ms (0 allocations: 0 bytes)

That’s different. The M1 only has 4 high-performance cores, and IIRC you can’t use the high-performance cores together with the efficiency cores.

This is an M1 Max (10 cores, 8 performance + 2 efficiency).

LoopVectorization.jl may not know that. Check LoopVectorization.lv_max_num_threads().
I hard-coded Apple silicon to use only 4 threads.

I’ll need to add a way to check for the actual number of performance cores.
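
A quick diagnostic (lv_max_num_threads is the internal function mentioned above, not documented public API):

using LoopVectorization

@show Threads.nthreads()
@show LoopVectorization.lv_max_num_threads()   # upper bound on the threads @tturbo will use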

Didn’t know that!

static(4)

:wink:

Just to double-check: does Sys.CPU_THREADS == 8? If so, I can now use Sys.CPU_THREADS as the number of big cores on Apple AArch64.

Yes, I get 8 with the ARM master version.
FWIW, I get 10 with the x86 1.7.1 build.

Apple silicon + 1.7 segfaults regularly with LoopVectorization anyway, so I’m not overly concerned with supporting it. 1.8 and master should be handled better now:
https://github.com/JuliaSIMD/CPUSummary.jl/commit/6f7c35be60c75ff793980aade5a347702c2b41e0

Mind rerunning the benchmark after updating, to confirm it scales?
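
To pull in that fix before rerunning (standard Pkg usage; nothing specific to CPUSummary beyond the name):

using Pkg
Pkg.update("CPUSummary")                      # or, to track the latest commit:
# Pkg.add(name="CPUSummary", rev="master")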

The update allows @tturbo to use more threads!
Unfortunately, this specific kernel does not scale well on this machine.

 Row │ nthreads  time_in_ms
     │ Int64     Float64
─────┼──────────────────────
   1 │        1     153.043
   2 │        2      79.634
   3 │        3      53.592
   4 │        4      42.201
   5 │        5      38.589
   6 │        6      35.744
   7 │        7      36.54
   8 │        8      45.309
                              times (ms)                 
              ┌                                        ┐ 
            1 ┤■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■ 153.043   
            2 ┤■■■■■■■■■■■■■■■■ 79.634                   
            3 ┤■■■■■■■■■■■ 53.592                        
   nthreads 4 ┤■■■■■■■■■ 42.201                          
            5 ┤■■■■■■■■ 38.589                           
            6 ┤■■■■■■■ 35.744                            
            7 ┤■■■■■■■ 36.54                             
            8 ┤■■■■■■■■■ 45.309  

This screenshot shows the evolution of the CPU usage:

The Julia script used to launch the benchmark with different thread counts:
using UnicodePlots
using DataFrames

# from @tkf https://discourse.julialang.org/t/collecting-all-output-from-shell-commands
function communicate(cmd::Cmd, input)
    @show cmd
    inp = Pipe()
    out = Pipe()
    err = Pipe()

    process = run(pipeline(cmd, stdin=inp, stdout=out, stderr=err), wait=false)
    close(out.in)
    close(err.in)

    stdout = @async String(read(out))
    stderr = @async String(read(err))
    write(process, input)
    close(inp)
    wait(process)
    return (
        stdout = fetch(stdout),
        stderr = fetch(stderr),
        code = process.exitcode
    )
end

function launch(n_threads)
    df = DataFrame()
    for t ∈ n_threads
        c = communicate(`../julia/usr/bin/julia --project -t$t test/dynamics_kernel.jl`, "")
        # the benchmark prints e.g. "  53.369 ms (0 allocations: 0 bytes)";
        # element 3 of the split is the time in ms
        tms = parse(Float64, split(c[1], " ")[3])
        append!(df, DataFrame(nthreads = t, time_in_ms = tms))
    end
    p = barplot(df.nthreads, df.time_in_ms, title = "times (ms)", ylabel = "nthreads")
    df, p
end

df, p = launch(1:8)
@show df
p