Hyperthreading in HPC

Hi everyone,

I have a question regarding the use of hyperthreading/SMT on HPC systems. A quick Google search for "HPC hyperthreading on off" turns up some vague hints that, in the past, it was accepted wisdom that enabling hyperthreading would decrease performance for most applications. Now it's 2023 and there can be more than 128 hyperthreads available to the user. In my Julia applications, when I simply slap on the @threads macro, I tend to see a benefit from the hyperthreads. BLAS behaves differently, which I also don't fully understand (?). Does anyone have experience with, or the possibility to compare, hyperthreading off vs. on on a big cluster? Or are there any good blog posts with benchmarks detailing the real-world disadvantage of hyperthreading on modern architectures?

Thanks in advance :slight_smile:

Hyperthreading is local to a CPU core; it doesn't really have anything to do with "HPC" or not. Maybe a little, in the sense that you can ask whether hyperthreading is enabled/beneficial on HPC-targeting CPU models, but that's about it.

(open)BLAS already uses multi-threading, so if you nest @threads with BLAS calls, it can’t magically be faster.
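For example (just a sketch; the `batched_mul!` name and the `Cs`/`As`/`Bs` arguments are made up for illustration), the usual pattern is to pin BLAS to a single thread and let the Julia threads provide the outer parallelism, so the two levels don't fight over the same cores:

using LinearAlgebra

BLAS.set_num_threads(1)  # one BLAS thread per Julia task, avoids oversubscription

function batched_mul!(Cs, As, Bs)
    Threads.@threads for i in eachindex(Cs, As, Bs)
        mul!(Cs[i], As[i], Bs[i])  # each iteration runs a single-threaded gemm
    end
    return Cs
end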

Hyperthreading is usually helpful unless you run a super tight CPU-bound inner loop. In that case you only want to run 1 thread per physical CPU core, because that is already enough to pin the core at 100%. (Or, of course, whenever you have lock contention or some other bottleneck that worsens as you add threads, but that's not specifically caused by hyperthreading.)

Sorry, maybe I wasn’t very clear in my questions…

This is exactly my question, yes. I know hyperthreading works per core, on one CPU, on one node. My question is: why do HPC admins disable it? Where is the proof that it's a performance penalty? Can I convince my local HPC admin not to do it, or should I be happy they disabled it?

I know it's not good to mix BLAS threads and Julia threads. My question is: why can I get more performance from hyperthreading with Julia threads, while BLAS does not/cannot benefit from hyperthreading?

The general rule is that hyperthreads make memory-bound (or branchy) code faster but don't help CPU-bound code.
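To make the rule concrete, here is a purely illustrative pair of kernels (names and constants invented for this sketch) that one could time with the thread count set to the number of physical cores and then to the full hyperthread count:

using Base.Threads

# Compute-bound: many arithmetic operations per element. One thread per physical
# core usually saturates the execution units, so SMT siblings add little.
function cpu_bound!(out, x)
    @threads for i in eachindex(out, x)
        v = x[i]
        for _ in 1:1_000
            v = muladd(v, 1.0000001, 1e-9)
        end
        out[i] = v
    end
    return out
end

# Memory-bound: a random gather per element, mostly waiting on cache misses,
# which a second hardware thread per core can help hide.
function memory_bound!(out, data, idx)
    @threads for i in eachindex(out, idx)
        out[i] = data[idx[i]]
    end
    return out
end

On, say, a 24-core/48-hyperthread node one would expect cpu_bound! to gain little (or even lose slightly) going from 24 to 48 threads, while memory_bound! may still improve.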

3 Likes

Hyperthreading works by making it very fast to context switch when a task stalls waiting for something.

When you've got something like BLAS, the only thing it will ever wait for is a cache miss, and it's written to minimize those. There's no loading data from disk, no hitting a semaphore while signalling another thread, etc., that would stall the computation. In general, HPC workloads simply don't have these kinds of stalls often enough for the hyperthread to have anything to do.

Hyperthreading makes desktop usage more responsive because there’s a lot of waiting for user input or storage or network events or whatever and then a context switch is very fast because all the context is already loaded into the hyperthread.

3 Likes

Thanks for these insights. Assuming a cluster with a fairly heterogeneous user base, i.e. not everyone runs optimized code that is purely CPU-bound, is there still an advantage to having hyperthreading disabled? Is the disadvantage of hyperthreading really that big? Experienced high-performance users can simply pin their threads to the physical cores of their allocation, I believe?

Also this comment by @StefanKarpinski made me wonder these things:

Sorry for another reply. There doesn't seem to be much overhead from using hyperthreads in a tight loop, in my quick test:

without hyperthreads:

julia> using BenchmarkTools, ThreadPinning

julia> Threads.nthreads()
24

julia> pinthreads(24:24+23)

julia> threadinfo()

System: 48 cores (2-way SMT), 2 sockets, 2 NUMA domains

| 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,
  16,17,18,19,20,21,22,23,48,49,50,51,52,53,54,55,
  56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71 |
| 24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,
  40,41,42,43,44,45,46,47,72,73,74,75,76,77,78,79,
  80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95 |

# = Julia thread, # = HT, # = Julia thread on HT, | = Socket seperator

Julia threads: 24
├ Occupied CPU-threads: 24
└ Mapping (Thread => CPUID): 1 => 24, 2 => 25, 3 => 26, 4 => 27, 5 => 28, ...

julia> function mygemmth!(C, A, B)
           Threads.@threads for m ∈ axes(A,1)
               for n ∈ axes(B,2)
                   Cmn = zero(eltype(C))
                   for k ∈ axes(A,2)
                       Cmn += A[m,k] * B[k,n]
                   end
                   C[m,n] = Cmn
               end
           end
       end
mygemmth! (generic function with 1 method)

julia> M, K, N = 3000, 3000, 3000;

julia> C1 = Matrix{Float64}(undef, M, N); A = randn(M, K); B = randn(K, N);

julia> C2 = similar(C1); C3 = similar(C1);

julia> @benchmark mygemmth!($C1, $A, $B)
BenchmarkTools.Trial: 2 samples with 1 evaluation.
 Range (min … max):  2.881 s …   2.949 s  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.915 s              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.915 s ± 47.535 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █                                                       █
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  2.88 s         Histogram: frequency by time        2.95 s <

 Memory estimate: 13.62 KiB, allocs estimate: 147.

With hyperthreads:

julia> using BenchmarkTools, ThreadPinning

julia> Threads.nthreads()
48

julia> pinthreads(vcat(24:24+23,72:72+23))

julia> threadinfo()

System: 48 cores (2-way SMT), 2 sockets, 2 NUMA domains

| 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,
  16,17,18,19,20,21,22,23,48,49,50,51,52,53,54,55,
  56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71 |
| 24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,
  40,41,42,43,44,45,46,47,72,73,74,75,76,77,78,79,
  80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95 |

# = Julia thread, # = HT, # = Julia thread on HT, | = Socket seperator

Julia threads: 48
├ Occupied CPU-threads: 48
└ Mapping (Thread => CPUID): 1 => 24, 2 => 25, 3 => 26, 4 => 27, 5 => 28, ...

julia> function mygemmth!(C, A, B)
           Threads.@threads for m ∈ axes(A,1)
               for n ∈ axes(B,2)
                   Cmn = zero(eltype(C))
                   for k ∈ axes(A,2)
                       Cmn += A[m,k] * B[k,n]
                   end
                   C[m,n] = Cmn
               end
           end
       end
mygemmth! (generic function with 1 method)

julia> M, K, N = 3000, 3000, 3000;

julia> C1 = Matrix{Float64}(undef, M, N); A = randn(M, K); B = randn(K, N);

julia> C2 = similar(C1); C3 = similar(C1);

julia> @benchmark mygemmth!($C1, $A, $B)
BenchmarkTools.Trial: 2 samples with 1 evaluation.
 Range (min … max):  2.857 s …    2.858 s  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.857 s               ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.857 s ± 715.678 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █                                                        █
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  2.86 s         Histogram: frequency by time         2.86 s <

 Memory estimate: 27.22 KiB, allocs estimate: 293.

EDIT: the colors in the threadinfo output are not shown here. Basically you would see that the first run has Julia threads only on CPU IDs 24-47, whereas the second run has Julia threads on 24-47 plus Julia threads on the hyperthreads 72-95.

Of course there is no benefit in this example, but it also shows that enabling HT and using it does not come with a big penalty?

@fgerick I think you have started a good discussion. The conventional wisdom in HPC is that hyperthreading is/was not good for performance, and it is switched off in most cases; as @dlakelan says, HPC codes should be utilising the CPU heavily. Certainly in any company I have worked for which configures HPC systems, we switch it off.
I agree though that it is 2023 and you should approach things with an open mind.

On a Linux system you can dynamically take CPU threads offline, e.g. the second hardware thread of each core, which is not exactly the same as disabling hyperthreading in the BIOS but has much the same effect.
So you can experiment with this without a reboot.
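For reference, a minimal sketch of that (Linux only, reasonably recent kernels, needs root; the sysfs path below is the standard kernel interface, nothing Julia-specific):

# Toggle SMT at runtime via sysfs, without rebooting into the BIOS.
smt = "/sys/devices/system/cpu/smt/control"
println("SMT is currently: ", strip(read(smt, String)))
# write(smt, "off")   # take all SMT siblings offline, one hardware thread per core remains
# write(smt, "on")    # bring them back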

ps. Use 'lstopo' or 'numactl' to examine the mapping of CPU numbers to CCDs or sockets.
Shall I just say, archly, that there are some surprises there on recent CPUs, i.e. it does not automatically follow that CPU numbers 0,1,2,3,4,5,6… are all on socket 1.

2 Likes

Also, as we are on the topic, I think you will see bigger effects by looking at the Nodes Per Socket (NPS) setting on AMD processors.

It definitely has a performance effect, and at Dell I would always present to customers the NPS setting we would use, and advise the customer to try different settings with their workloads.

ps. hwloc and its associated utility lstopo are very good for showing the interior layout of your systems. They should be available on an HPC cluster.

1 Like

Thanks for this. I was actually misinformed and thought that CPUs 0,1,2,3,…,N-1 were always the physical cores and N…2N-1 were the "hyperthreads". I see via lstopo that that's not true at all on my machine!

If I set JULIA_EXCLUSIVE I think it pins threads to cores. Is it pinning properly to separate cores?

No need to leave Julia and lose interactivity: just use ThreadPinning.jl. It shows this correctly. And if you really want to use hwloc, there is Hwloc.jl.
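For instance, a small sketch querying the machine layout from within Julia (check the Hwloc.jl docs for the exact function names in your version):

using Hwloc

Hwloc.num_physical_cores()  # number of physical cores
Hwloc.num_virtual_cores()   # number of hardware threads (2x cores with 2-way SMT)
Hwloc.topology_info()       # summary of packages, NUMA nodes, caches, cores, PUs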

2 Likes

Note that while it is fine to pin manually to specific CPU thread IDs, are you aware of pinthreads(:cores) and pinthreads(:cputhreads)? See ?pinthreads for more information and options.

You may want to try threadinfo(; color=false).
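In case it helps, a minimal usage sketch of the two symbol-based options:

using ThreadPinning

pinthreads(:cores)        # one Julia thread per physical core; SMT siblings stay free
# pinthreads(:cputhreads) # or pack threads onto hardware threads, siblings included
threadinfo()              # check the resulting Thread => CPUID mapping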

2 Likes

No. AFAIK, it pins to the first N CPU threads. This may often mean different cores, but it's not guaranteed. Use pinthreads(:cores) from ThreadPinning.jl and you're safe.

Thanks, yes I was aware of these, but I now discovered this one pinthreads(socket(2)) :slight_smile:

You seem to be working on putting Julia on HPC platforms. Have you had the chance to try a cluster with hyperthreading enabled and compare real-life code examples that are not just highly optimized BLAS routines?

I have seen that there is a complex latency structure on a two-socket AMD Epyc node, using ThreadPinning.bench_core2core_latency().

:+1: You can even combine those, e.g. pinthreads(numa(1, 2:4), socket(2, 1:3; compact=true)) pins the first 3 Julia threads to the second, third, and fourth physical core in the first NUMA domain and then the next 3 Julia threads compactly to the first three CPU threads (not cores) of the second socket. Compactly means that if, say, your system has 2 CPU threads per core (2-way SMT), the two Julia threads will occupy the two CPU threads in the first core (of the second socket) and a third Julia thread will occupy one CPU thread in the second CPU core (of the second socket).

Since this can become pretty complicated/hard to explain in words on larger systems, I created threadinfo() which tries to visualise this nicely.

Yes I have and in my (limited) experience I’d say that whether HT helps or not depends very much on the application. But generally, I lean towards “don’t use it if you don’t know what you’re doing / you haven’t benchmarked your specific application”. And I second @johnh, most clusters I use these days just have it disabled (or maybe opt-in).

I do not, BTW, understand Stefan's vague and very general statement in the GitHub issue. Julia threads aren't inherently "better" at HT than BLAS threads. They are both pthreads under the hood (at least for our default OpenBLAS), so they use the same technology. Whether your Julia threads benefit from HT just depends on what you do with them.

Yeah, AMD CPUs have more NUMA nodes than standard Intel CPUs (although that's configurable). BTW, note that there is threadinfo(; groupby=:numa) to visualise the thread pinning with respect to NUMA nodes instead of sockets (the default).

1 Like