JULIA_NUM_THREADS and physical cores

In the documentation about JULIA_NUM_THREADS it states:
“If $JULIA_NUM_THREADS exceeds the number of available physical CPU cores, then the number of threads is set to the number of cores.”

What exactly is a physical core?
My PC:

Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
Address sizes:       39 bits physical, 48 bits virtual
CPU(s):              8
On-line CPU(s) list: 0-7
Thread(s) per core:  2
Core(s) per socket:  4
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               94
Model name:          Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
Stepping:            3
CPU MHz:             800.059
CPU max MHz:         4300.0000
CPU min MHz:         800.0000
BogoMIPS:            8016.00
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            8192K
NUMA node0 CPU(s):   0-7
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l1d

I interpret “Core(s) per socket: 4” and “Thread(s) per core: 2” to mean that I have only 4 physical cores and 8 logical cores; is that right?
Yet after setting JULIA_NUM_THREADS=99, I get Threads.nthreads() == 8, not 4?
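For reference, you can check the relevant counts from within Julia itself. A minimal sketch (assuming Julia ≥ 1.0; note that `Sys.CPU_THREADS` reports *logical* CPUs, i.e. hardware threads, not physical cores):

```julia
# Sys.CPU_THREADS counts logical CPUs (hardware threads) -- 8 on the
# machine above -- which matches the value JULIA_NUM_THREADS gets
# clamped to, rather than the 4 physical cores.
println(Sys.CPU_THREADS)

# The number of threads this Julia session actually started with:
println(Threads.nthreads())
```

Running this as `JULIA_NUM_THREADS=99 julia script.jl` should reproduce the behavior described above, with `Threads.nthreads()` reporting the logical CPU count.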


I think either the doc is misleading or the implementation is wrong. In your case you have 4 physical cores, and with hyper-threading each core can run 2 hardware threads. If you’re on Windows, 8 is the number of logical processors the system displays in Task Manager.

However, for computation-intensive work you should expect each thread to hit ~100% usage, so hyper-threading would only hurt performance (the CPU has to switch context between the two threads sharing a core).


I used to make that argument too but I no longer think that it’s generally true. Hyperthreading is weird and benefits seem to be application specific. It can sometimes improve performance even when all threads run at 100%. My uninformed guess is that it varies with things like multithreaded scalability of the problem, cache usage and hit rates, CPU clock speed throttling, etc.

For example, my work machine has 8 cores and can run 16 hyperthreads. I’ve seen applications that use all cores run 15-30% faster when hyperthreading is enabled. On the other hand, in my field (optimization models) I use commercial solvers like CPLEX Barrier, which should in principle scale excellently with cores/threads, but which don’t improve at all beyond 3 threads. You just never know; it’s mystifying and sometimes frustrating.


I am not confident that this comment is precisely accurate, so (to any readers) please correct anything that’s wrong!

Each physical core can execute at most 4 or 5 instructions per cycle (4 for Intel, 5 for AMD). There are also narrower limits within this: 2 loads per cycle (a vector load crossing a 64-byte boundary counts as 2 loads; note that aligned loads never cross a boundary). Many CPUs can issue 2 add/mul/fma instructions per cycle, although 14 nm and 12 nm Ryzen can do only one 256-bit op per cycle, Haswell only one 256-bit add per cycle, etc.
You’d have to look up your architecture here.

Well optimized code – like a good BLAS – can often hit this limit with a single thread. Two threads on one core thus cannot help.
Code that is less well optimized, whether that be in terms of cache misses or not taking advantage of superscalar parallelism, can benefit from two threads.

Quoting Wikipedia:

Hyper-threading works by duplicating certain sections of the processor—those that store the architectural state—but not duplicating the main execution resources. This allows a hyper-threading processor to appear as the usual “physical” processor and an extra “logical” processor to the host operating system (HTT-unaware operating systems see two “physical” processors), allowing the operating system to schedule two threads or processes simultaneously and appropriately. When execution resources would not be used by the current task in a processor without hyper-threading, and especially when the processor is stalled, a hyper-threading equipped processor can use those execution resources to execute another scheduled task. (The processor may stall due to a cache miss, branch misprediction, or data dependency.)[8]

So, at least as I understand it, how much benefit you’ll get from hyperthreading depends on how well a single thread already takes advantage of those execution resources.

“How well optimized” may not be the best way for me to describe things, in that a function like

function mydot(a,b)
    d = zero(promote_type(eltype(a),eltype(b)))
    @inbounds @simd for i in eachindex(a,b)
        d += a[i] * b[i]
    end
    return d
end
is about as well optimized as this particular function (a single-threaded dot product) can get, but it hits the load limit when a and b are short: the loop gets unrolled 4x, so each unrolled iteration performs 8 loads, 4 fmas, and 1 loop-counter increment. If the vectors are aligned, that requires at least 4 cycles, but those are only 13 ops out of a potential 16. If the vectors are short, they often will not be aligned:

julia> a = rand(64); reinterpret(Int, pointer(a)) % 64  # alignment offset of a's data, in bytes

For this a (when the reported offset is not a multiple of 32), every other load will cross a 64-byte boundary, so only four 32-byte vector loads can begin every 3 cycles.
If a and b are long, it will instead become memory bound, with the CPU waiting for memory to arrive before it can actually execute the loads. If another thread has memory available, it could use the execution resources then.
I think in most cases with high memory pressure it’s best not to use more threads than the number of physical cores, to limit the total amount of cache the threads need.

The #1 killer of superscalar parallelism is probably dependency chains (instruction latency means it will take some number of cycles before the result is available for use), so code with long dependency chains that you can’t unroll to alleviate that may be a good candidate for benefiting from hyper threading.
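As a sketch of the dependency-chain point (the function names here are just for illustration, not from any package): a plain sequential reduction makes every addition wait on the previous result, while @simd lets the compiler reassociate the sum into several independent partial accumulators that hide the add latency:

```julia
# One long dependency chain: iteration i+1 cannot start its addition
# until iteration i's result is available, so throughput is limited by
# add latency rather than by the number of execution units.
function sum_chained(x)
    s = zero(eltype(x))
    for v in x
        s += v
    end
    return s
end

# @simd permits floating-point reassociation, so the compiler can keep
# several independent partial sums in flight and better saturate the
# execution units.
function sum_reassoc(x)
    s = zero(eltype(x))
    @inbounds @simd for i in eachindex(x)
        s += x[i]
    end
    return s
end
```

Both compute the same sum (up to floating-point reassociation error); the chained version is the kind of code that might leave enough execution resources idle for a second hyperthread to use.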

All that said, the easiest and most reliable way to find out what is fastest for your problem on your machine is to just try it.
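For instance (a hypothetical benchmarking sketch; `threaded_sum` is not from the discussion above), you could start Julia with `-t 4` and then `-t 8` (or the corresponding JULIA_NUM_THREADS values) and time the same workload under each:

```julia
using Base.Threads

# Split the work into one chunk per thread so each thread accumulates
# into its own slot (no shared accumulator across threads).
function threaded_sum(x)
    nchunks = max(nthreads(), 1)
    chunks = collect(Iterators.partition(eachindex(x), cld(length(x), nchunks)))
    partials = zeros(eltype(x), length(chunks))
    @threads for c in 1:length(chunks)
        s = zero(eltype(x))
        @inbounds @simd for i in chunks[c]
            s += x[i]
        end
        partials[c] = s
    end
    return sum(partials)
end

x = rand(10^7)
@time threaded_sum(x)   # compare timings across `julia -t 4` vs `julia -t 8`
```

Whether the 8-thread run beats the 4-thread run on a 4-core/8-thread machine like the one above is exactly the application-specific question this thread is about.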