I ran my 3D finite difference stencil benchmark https://github.com/Chiil/MicroHH.jl/blob/main/test/dynamics_kernel.jl on an AMD Rome 7H12 and an Intel Xeon Platinum 8360Y with grid size (itot = 256; jtot = 256; ktot = 1024). The @fast3d macro expands into a nested 3D loop with a @tturbo decorator in front. I noticed that the AMD scaling degrades quite dramatically beyond 4 threads, whereas the Intel scales well up to 16 cores.
Any idea what is going on here? The AMD system is our production machine, so I would be very happy with better scaling.
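For context, a minimal sketch of the kind of loop nest @fast3d expands into (assuming LoopVectorization.jl; the real macro in MicroHH.jl builds the stencil from the expression you hand it, so the details differ):

```julia
using LoopVectorization

# Hypothetical stand-in for a @fast3d-generated kernel: a triple loop over the
# interior points, decorated with @tturbo so LoopVectorization vectorizes and
# threads it.
function gradient_x!(ut, u, dxi, itot, jtot, ktot)
    @tturbo for k in 2:ktot-1, j in 2:jtot-1, i in 2:itot-1
        ut[i, j, k] += dxi * (u[i+1, j, k] - u[i-1, j, k])
    end
    return ut
end
```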
AMD results:
chiel@tcn408:~/MicroHH.jl/test$ julia --project -O3 -t1 dynamics_kernel.jl
437.382 ms (0 allocations: 0 bytes)
chiel@tcn408:~/MicroHH.jl/test$ julia --project -O3 -t2 dynamics_kernel.jl
228.276 ms (0 allocations: 0 bytes)
chiel@tcn408:~/MicroHH.jl/test$ julia --project -O3 -t4 dynamics_kernel.jl
110.916 ms (0 allocations: 0 bytes)
chiel@tcn408:~/MicroHH.jl/test$ julia --project -O3 -t8 dynamics_kernel.jl
73.895 ms (0 allocations: 0 bytes)
chiel@tcn408:~/MicroHH.jl/test$ julia --project -O3 -t16 dynamics_kernel.jl
85.145 ms (0 allocations: 0 bytes)
Intel results:
chiel@gcn35:~/MicroHH.jl/test$ julia --project -O3 -t1 dynamics_kernel.jl
402.697 ms (0 allocations: 0 bytes)
chiel@gcn35:~/MicroHH.jl/test$ julia --project -O3 -t2 dynamics_kernel.jl
201.879 ms (0 allocations: 0 bytes)
chiel@gcn35:~/MicroHH.jl/test$ julia --project -O3 -t4 dynamics_kernel.jl
101.593 ms (0 allocations: 0 bytes)
chiel@gcn35:~/MicroHH.jl/test$ julia --project -O3 -t8 dynamics_kernel.jl
52.347 ms (0 allocations: 0 bytes)
chiel@gcn35:~/MicroHH.jl/test$ julia --project -O3 -t16 dynamics_kernel.jl
29.518 ms (0 allocations: 0 bytes)
On this CPU the cores are grouped into sets of 4 ("core complexes"), and beyond that size it seems some work is needed to optimize memory traffic. I stumbled across a related paper: CFD Application on AMD Epyc Rome by Szustak et al. Perhaps @tkf has a suggestion for getting Julia threads to act like their OpenMP work teams.
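If it really is a memory-traffic/core-complex issue, one thing worth trying (a rough sketch using LoopVectorization.jl directly, not the actual @fast3d code) is to parallelize only the outer k loop with Julia threads and keep @turbo for SIMD on the inner loops, so each thread owns a contiguous slab, somewhat like an OpenMP work team:

```julia
using LoopVectorization, Base.Threads

# Sketch: thread-level decomposition over k-slabs, serial SIMD (@turbo) inside.
# Each statically scheduled thread works on a contiguous range of k-planes, which
# keeps its memory traffic mostly within its own slab of the arrays.
function gradient_x_slabs!(ut, u, dxi, itot, jtot, ktot)
    @threads :static for k in 2:ktot-1
        @turbo for j in 2:jtot-1, i in 2:itot-1
            ut[i, j, k] += dxi * (u[i+1, j, k] - u[i-1, j, k])
        end
    end
    return ut
end
```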
I can’t guess too much given the information in the OP. But I’m curious what you’d get with JULIA_EXCLUSIVE=1 and still with the explicit number of threads specified by -t as in the OP (e.g., JULIA_EXCLUSIVE=1 julia --project -O3 -t4 dynamics_kernel.jl).
JULIA_EXCLUSIVE=1 makes performance worse, and watching htop, it seems to put multiple threads on one core, which I do not understand. @tkf, what extra information would you need?
That’s interesting. I think I’ve only seen improvements with JULIA_EXCLUSIVE=1 (although somewhat rare) when using Threads.@spawn.
@Elrod does @tturbo do something special with JULIA_EXCLUSIVE=1?
I just meant that there’s nothing I can guess. To say anything useful, it’d probably require me to understand the code, run it, and do some profiling with perf etc.
No. It should be more or less the same as Threads.@threads :static.
ThreadingUtilities starts sticky tasks on threads 2:Threads.nthreads(), which under JULIA_EXCLUSIVE=1 should correspond to cores 1:Threads.nthreads(), while the main thread would then be running on core 0.
These are then the tasks @tturbo would run on.
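One way to check whether that thread-to-core mapping actually holds on a given machine (a small Linux-only sketch, not part of ThreadingUtilities) is to ask the kernel which core each Julia thread is running on:

```julia
using Base.Threads

# sched_getcpu(3) reports the core the calling thread is currently running on
# (Linux/glibc only).
current_cpu() = Int(ccall(:sched_getcpu, Cint, ()))

# With :static scheduling each thread gets exactly one iteration.
@threads :static for _ in 1:nthreads()
    println("Julia thread $(threadid()) is running on core $(current_cpu())")
end
```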
Side note: at the OS level, the upcoming Linux 5.18 kernel will have a better scheduler. A patch entitled "sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs" has been merged, and: "What's exciting though is the end result and that is with an AMD Zen 3 platform he's been testing, the OpenMP-parallelized Stream memory benchmark was 173~272% faster depending upon the memory operation tested."