AMD Rome vs Intel Xeon shows bad scaling with threads for AMD

Chiil · March 10, 2022, 1:13pm

I ran my 3D finite difference stencil benchmark https://github.com/Chiil/MicroHH.jl/blob/main/test/dynamics_kernel.jl on AMD Rome 7H12 and Intel Xeon Platinum 8360Y with a grid size of itot = 256; jtot = 256; ktot = 1024. The @fast3d macro writes a nested 3D loop with a @tturbo decorator in front. I noticed that the AMD code scales poorly with threads beyond 4 (it even gets slower going from 8 to 16), whereas the Intel scales well up to 16 cores.

Any idea what is going on here? The AMD is our production machine and I would be very happy with better scaling.

AMD results:

  437.382 ms (0 allocations: 0 bytes)
chiel@tcn408:~/MicroHH.jl/test$ julia --project -O3 -t2 dynamics_kernel.jl 
  228.276 ms (0 allocations: 0 bytes)
chiel@tcn408:~/MicroHH.jl/test$ julia --project -O3 -t4 dynamics_kernel.jl 
  110.916 ms (0 allocations: 0 bytes)
chiel@tcn408:~/MicroHH.jl/test$ julia --project -O3 -t8 dynamics_kernel.jl 
  73.895 ms (0 allocations: 0 bytes)
chiel@tcn408:~/MicroHH.jl/test$ julia --project -O3 -t16 dynamics_kernel.jl 
  85.145 ms (0 allocations: 0 bytes)

Intel results:

chiel@gcn35:~/MicroHH.jl/test$ julia --project -O3 -t1 dynamics_kernel.jl 
  402.697 ms (0 allocations: 0 bytes)
chiel@gcn35:~/MicroHH.jl/test$ julia --project -O3 -t2 dynamics_kernel.jl 
  201.879 ms (0 allocations: 0 bytes)
chiel@gcn35:~/MicroHH.jl/test$ julia --project -O3 -t4 dynamics_kernel.jl 
  101.593 ms (0 allocations: 0 bytes)
chiel@gcn35:~/MicroHH.jl/test$ julia --project -O3 -t8 dynamics_kernel.jl 
  52.347 ms (0 allocations: 0 bytes)
chiel@gcn35:~/MicroHH.jl/test$ julia --project -O3 -t16 dynamics_kernel.jl 
  29.518 ms (0 allocations: 0 bytes)
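
To put numbers on the scaling, the timings above can be converted to speedup and parallel efficiency. This is only a sketch over the posted figures; it assumes the first AMD timing (437.382 ms), whose command line is missing above, is the single-thread run, and `scaling` is a helper name introduced here for illustration.

```julia
# Wall-clock times in ms, copied from the runs above (threads => time).
# Assumption: the first AMD timing (437.382 ms) is the -t1 run.
amd   = [1 => 437.382, 2 => 228.276, 4 => 110.916, 8 => 73.895, 16 => 85.145]
intel = [1 => 402.697, 2 => 201.879, 4 => 101.593, 8 => 52.347, 16 => 29.518]

# Speedup relative to the single-thread run; efficiency = speedup / threads.
function scaling(results)
    t1 = last(first(results))
    return [(threads = n, speedup = t1 / t, efficiency = t1 / (t * n)) for (n, t) in results]
end

for (name, results) in ("AMD Rome 7H12" => amd, "Intel Xeon Platinum 8360Y" => intel)
    println(name)
    for r in scaling(results)
        println("  -t", r.threads, ": speedup ", round(r.speedup, digits = 2),
                ", efficiency ", round(100 * r.efficiency, digits = 1), "%")
    end
end
```

With these numbers the AMD speedup stalls at roughly 5x to 6x for 8 and 16 threads, while the Intel machine reaches roughly 13.6x at 16 threads.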

Ralph_Smith · March 11, 2022, 4:21am

The cores are grouped into subsets of 4 (“core complexes”) and beyond that it seems some work is needed to optimize memory traffic. I stumbled across a related paper: CFD Application on AMD Epyc Rome by Szustak et al. Perhaps @tkf has a suggestion for getting Julia threads to act like their OpenMP work teams.
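
As a rough illustration of the grouping described above, the sketch below counts the minimum number of core complexes a run has to span, assuming one thread per core and the 4 cores per complex mentioned in this post. `min_ccx` is a hypothetical helper; the actual placement is up to the OS scheduler and may touch more complexes than this lower bound.

```julia
# Cores per core complex (CCX), as stated in the post above.
const CORES_PER_CCX = 4

# Hypothetical helper: lower bound on the number of CCXs touched
# when each thread runs on its own core.
min_ccx(nthreads) = cld(nthreads, CORES_PER_CCX)

for n in (1, 2, 4, 8, 16)
    println("-t", n, ": at least ", min_ccx(n), " core complex(es)")
end
```

Under that assumption, any run with more than 4 threads necessarily crosses a core complex boundary, which matches the point where the AMD timings in the OP stop improving.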

tkf · March 11, 2022, 5:38am

I can’t guess too much given the information in the OP. But I’m curious what you’d get with JULIA_EXCLUSIVE=1 and still with the explicit number of threads specified by -t as in the OP (e.g., JULIA_EXCLUSIVE=1 julia --project -O3 -t4 dynamics_kernel.jl).

There’s JuliaConcurrent/SyncBarriers.jl (on GitHub) if you want to write SPMD-flavoured code for multi-threaded Julia.

Chiil · March 11, 2022, 6:38am

JULIA_EXCLUSIVE=1 makes performance worse, and watching htop, it seems to put multiple threads on one core, which I do not understand. @tkf which extra information would you need?

johnh · March 11, 2022, 7:20am

Echoing what Ralph says about 4 threads per core complex.
In the BIOS, the NUMA Per Socket (NPS) setting can be altered.
Here is a good writeup on this from Dell
https://www.dell.com/support/kbdoc/en-uk/000137696/amd-rome-is-it-for-real-architecture-and-initial-hpc-performance

tkf · March 11, 2022, 7:49am

That’s interesting. I think I’ve only seen improvements with JULIA_EXCLUSIVE=1 (although somewhat rare) when using Threads.@spawn.

@Elrod does @tturbo do something special with JULIA_EXCLUSIVE=1?

I just meant that there’s nothing I can guess. To say anything useful, it’d probably require me to understand the code, run it, and do some profiling with perf etc.

tkf · March 11, 2022, 7:55am

FWIW, on AMD EPYC 7502, I get

$ julia --project -O3 -t2 test/dynamics_kernel.jl
  102.258 ms (0 allocations: 0 bytes)
$ julia --project -O3 -t4 test/dynamics_kernel.jl
  50.678 ms (0 allocations: 0 bytes)
$ julia --project -O3 -t8 test/dynamics_kernel.jl
  22.772 ms (0 allocations: 0 bytes)
$ julia --project -O3 -t16 test/dynamics_kernel.jl
  16.875 ms (0 allocations: 0 bytes)
$ julia --project -O3 -t32 test/dynamics_kernel.jl
  14.039 ms (0 allocations: 0 bytes)
$ JULIA_EXCLUSIVE=1 julia --project -O3 -t2 test/dynamics_kernel.jl
  107.643 ms (0 allocations: 0 bytes)
$ JULIA_EXCLUSIVE=1 julia --project -O3 -t4 test/dynamics_kernel.jl
  53.208 ms (0 allocations: 0 bytes)
$ JULIA_EXCLUSIVE=1 julia --project -O3 -t8 test/dynamics_kernel.jl
  40.422 ms (0 allocations: 0 bytes)
$ JULIA_EXCLUSIVE=1 julia --project -O3 -t16 test/dynamics_kernel.jl
  22.268 ms (0 allocations: 0 bytes)
$ JULIA_EXCLUSIVE=1 julia --project -O3 -t32 test/dynamics_kernel.jl
  14.014 ms (0 allocations: 0 bytes)

i.e., no slow down like AMD Rome 7H12 in the OP but I do see the slow down with JULIA_EXCLUSIVE=1 for nthreads = 8, 16.

Chiil · March 11, 2022, 8:29am

That is interesting. I’d be happy to learn how to get more control over thread distribution, similar to OpenMP…

tkf · March 11, 2022, 8:59am

The easiest way to control the distribution of Julia tasks to OS threads is to use

Threads.@threads :static for tid in 1:Threads.nthreads()
    # run code on `tid`-th thread
end

It works on the current Julia version (as of 1.8) but I don’t know if it will continue to work.
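
A minimal way to check this on a given Julia version is to record which thread each iteration ran on. This is a sketch, not taken from the benchmark; the array name `ids` is made up for the example, and it should be run at the top level of a script started with `-t N`.

```julia
using Base.Threads

# One slot per thread; each iteration records the id of the thread it ran on.
ids = zeros(Int, nthreads())

@threads :static for i in 1:nthreads()
    ids[i] = threadid()
end

# With the :static schedule every iteration should land on a different thread.
println("thread ids per iteration: ", ids)
println("all distinct: ", length(unique(ids)) == nthreads())
```

If the second line prints `false`, the one-iteration-per-thread mapping no longer holds on that version.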

To control the distribution of OS threads to CPUs, you can use carstenbauer/ThreadPinning.jl (Pinning Julia threads to cores). Or simply setting JULIA_EXCLUSIVE=1 may be fine unless you have multiple processes using the same machine.

Maybe it’s helpful to look at @carstenbauer’s benchmarking packages https://github.com/JuliaPerf/BandwidthBenchmark.jl and https://github.com/JuliaPerf/STREAMBenchmark.jl

Elrod · March 11, 2022, 7:30pm

No. It should be more or less the same as Threads.@threads :static.
ThreadingUtilities starts sticky tasks on threads 2:Threads.nthreads(), which under JULIA_EXCLUSIVE=1 should correspond to cores 1:Threads.nthreads(), while the main thread would then be running on core 0.
These are then the tasks @tturbo would run on.

ImreSamu · March 11, 2022, 10:28pm

Side note: at the OS level, Linux kernel 5.18 will have a better scheduler.

A patch entitled "sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs":

"What’s exciting though is the end result and that is with an AMD Zen 3 platform he’s been testing, the OpenMP-parallelized Stream memory benchmark was 173~272% faster depending upon the memory operation tested."

read more:

LaurentPlagne · March 12, 2022, 9:07am

Irrelevant for the OP, but the M1 Max does not scale well beyond 4 threads either.

## User input
itot = 256; jtot = 256; ktot = 1024;
igc = 1; jgc = 1; kgc = 1;

julia --project -O3 -t1 test/dynamics_kernel.jl
  150.445 ms (0 allocations: 0 bytes)
julia --project -O3 -t2 test/dynamics_kernel.jl
  77.839 ms (0 allocations: 0 bytes)
julia --project -O3 -t4 test/dynamics_kernel.jl
  54.055 ms (0 allocations: 0 bytes)
julia --project -O3 -t8 test/dynamics_kernel.jl
  53.369 ms (0 allocations: 0 bytes)

Oscar_Smith · March 12, 2022, 9:09am

that’s different. the m1 only has 4 high performance cores, and iirc you can’t use high perf cores with low perf cores.

LaurentPlagne · March 12, 2022, 9:11am

this is M1 Max (10 cores 8+2)

Elrod · March 12, 2022, 12:55pm

LoopVectorization.jl may not know that. Check LoopVectorization.lv_max_num_threads().
I hard coded apple silicon to only use 4 threads.

I’ll need to add ways to check for the actual number of performance cores.

LaurentPlagne · March 12, 2022, 1:55pm

Didn’t know that !

static(4)

Elrod · March 12, 2022, 4:59pm

Just to double check, does Sys.CPU_THREADS == 8? If so, I can now use Sys.CPU_THREADS as the number of big cores on apple aarch64.

LaurentPlagne · March 12, 2022, 5:06pm

Yes I get 8 with arm master version.
Fwiw I get 10 with x86 1.7.1

Elrod · March 12, 2022, 5:23pm

Apple silicon + 1.7 segfaults regularly with LoopVectorization anyway, so I’m not overly concerned with supporting it. 1.8 and master should be handled better now:
https://github.com/JuliaSIMD/CPUSummary.jl/commit/6f7c35be60c75ff793980aade5a347702c2b41e0

Mind rerunning the benchmark after updating to confirm it scales?

LaurentPlagne · March 13, 2022, 8:33am

The update allows tturbo to use more threads !
Unfortunately this specific kernel does not scale well on this machine.

 Row │ nthreads  time_in_ms
     │ Int64     Float64
─────┼──────────────────────
   1 │        1     153.043
   2 │        2      79.634
   3 │        3      53.592
   4 │        4      42.201
   5 │        5      38.589
   6 │        6      35.744
   7 │        7      36.54
   8 │        8      45.309
                              times (ms)                 
              ┌                                        ┐ 
            1 ┤■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■ 153.043   
            2 ┤■■■■■■■■■■■■■■■■ 79.634                   
            3 ┤■■■■■■■■■■■ 53.592                        
   nthreads 4 ┤■■■■■■■■■ 42.201                          
            5 ┤■■■■■■■■ 38.589                           
            6 ┤■■■■■■■ 35.744                            
            7 ┤■■■■■■■ 36.54                             
            8 ┤■■■■■■■■■ 45.309

This screenshot shows the evolution of the cpu usage:

The Julia script used to launch the benchmark with different thread counts:

using UnicodePlots
using DataFrames

# from @tkf https://discourse.julialang.org/t/collecting-all-output-from-shell-commands
function communicate(cmd::Cmd, input)
    @show cmd
    inp = Pipe()
    out = Pipe()
    err = Pipe()

    process = run(pipeline(cmd, stdin=inp, stdout=out, stderr=err), wait=false)
    close(out.in)
    close(err.in)

    stdout = @async String(read(out))
    stderr = @async String(read(err))
    write(process, input)
    close(inp)
    wait(process)
    return (
        stdout = fetch(stdout),
        stderr = fetch(stderr),
        code = process.exitcode
    )
end

function launch(n_threads)
    df=DataFrame()
    for t ∈ n_threads
        c=communicate(`../julia/usr/bin/julia --project -t$t test/dynamics_kernel.jl`,"")
        tms=parse(Float64,split(c[1]," ")[3])
        append!(df,DataFrame(nthreads = t,time_in_ms=tms))
    end
    p=barplot( df.nthreads, df.time_in_ms,title="times (ms)",ylabel="nthreads")
    df,p
end

df,p=launch(1:8)
@show df
p
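
The script above takes the timing from the third space-separated field of the output, which relies on the exact leading whitespace and on the result being printed in ms. A slightly more defensive variant could look like the sketch below; `parse_btime_ms` and `TO_MS` are hypothetical names that are not part of the original script, and the unit strings are assumed to be the ones BenchmarkTools prints.

```julia
# Conversion factors from the printed time units to milliseconds.
const TO_MS = Dict("ns" => 1e-6, "μs" => 1e-3, "ms" => 1.0, "s" => 1e3)

# Hypothetical helper: extract the first "<number> <unit>" pair from a line
# such as "  437.382 ms (0 allocations: 0 bytes)" and return the time in ms.
function parse_btime_ms(output::AbstractString)
    m = match(r"(\d+(?:\.\d+)?)\s*(ns|μs|ms|s)\b", output)
    m === nothing && error("no timing found in: ", repr(output))
    return parse(Float64, m[1]) * TO_MS[String(m[2])]
end

println(parse_btime_ms("  437.382 ms (0 allocations: 0 bytes)"))
```

In `launch`, this would replace the `parse(Float64, split(c[1], " ")[3])` line with `parse_btime_ms(c.stdout)`.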
