I am not confident that this comment is precisely accurate, so (to any readers) please correct anything that’s wrong!
Each physical core has a limit of 4 or 5 instructions/cycle (4 for Intel, 5 for AMD). There are also narrower limits within this, e.g. 2 loads/cycle (vector loads crossing a 64-byte boundary count as 2 loads; note that aligned loads will never cross a boundary). Many CPUs can issue 2 add/mul/fma instructions per cycle, although 14nm and 12nm Ryzen have only 1 256-bit op/cycle, Haswell has only 1 256-bit add per cycle, etc.
You’d have to look up your architecture here.
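For example, `Sys.CPU_NAME` reports the LLVM name of the host CPU, which makes a convenient key for looking your chip up in per-architecture throughput tables (e.g. Agner Fog's instruction tables or uops.info):

```julia
# Print the LLVM target name of the CPU Julia is running on,
# e.g. "haswell", "skylake", or "znver1"; output varies by machine.
println(Sys.CPU_NAME)
```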
Well-optimized code – like a good BLAS – can often hit this limit with a single thread. Two threads on one core thus cannot help.
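As a rough illustration (a sketch, not a rigorous benchmark, and assuming for concreteness a machine with 4 physical cores / 8 hyperthreads), you can see this with the gemm-based `peakflops`:

```julia
using LinearAlgebra
# Assumes 4 physical cores / 8 hyperthreads; adjust for your machine.
BLAS.set_num_threads(4)                    # one thread per physical core
flops_cores = LinearAlgebra.peakflops(4096)
BLAS.set_num_threads(8)                    # one thread per hyperthread
flops_ht = LinearAlgebra.peakflops(4096)   # typically no faster
@show flops_cores flops_ht
```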
Code that is less well optimized, whether that be in terms of cache misses or not taking advantage of superscalar parallelism, can benefit from two threads.
Hyper-threading works by duplicating certain sections of the processor—those that store the architectural state—but not duplicating the main execution resources. This allows a hyper-threading processor to appear as the usual “physical” processor and an extra “logical” processor to the host operating system (HTT-unaware operating systems see two “physical” processors), allowing the operating system to schedule two threads or processes simultaneously and appropriately. When execution resources would not be used by the current task in a processor without hyper-threading, and especially when the processor is stalled, a hyper-threading equipped processor can use those execution resources to execute another scheduled task. (The processor may stall due to a cache miss, branch misprediction, or data dependency.)
So at least as I understand it, how much benefit you’ll get from hyperthreading is based on how well a single thread takes advantage of those execution resources.
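(You can check whether the OS is actually seeing hyperthreads by comparing logical and physical core counts; a small sketch using the third-party Hwloc.jl package:)

```julia
using Hwloc  # third-party package: ] add Hwloc
nlogical = Sys.CPU_THREADS              # CPUs the OS schedules on
nphysical = Hwloc.num_physical_cores()  # actual physical cores
println(nlogical > nphysical ? "hyperthreading visible" : "no hyperthreading",
        ": $nlogical logical / $nphysical physical")
```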
“How well optimized” may not be the best way for me to describe things, in that a function like
```julia
function mydot(a, b)  # single-threaded dot product
    d = zero(promote_type(eltype(a), eltype(b)))
    @inbounds @simd for i in eachindex(a, b)
        d += a[i] * b[i]
    end
    return d
end
```
is about as well optimized as this particular function (a single-threaded dot product) can get, but suffers from hitting the load limit when `a` and `b` are short: the loop gets unrolled 4x, so each loop iteration has 8x loads, 4x fma, and 1x incrementing the loop counter. If the vectors are aligned, that will require at least 4 cycles, but there are only 13 ops out of the potential 16. If the vectors are short, they often will not be aligned:
```julia
julia> a = rand(64); reinterpret(Int, pointer(a)) % 64
```
If that offset is not a multiple of 32 (as will often be the case for short vectors like `a`), every other load will cross a 64-byte boundary, so only four 32-byte vector loads can begin every 3 cycles.
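To make the counting concrete, here is a hypothetical little helper (my own illustration, not part of the dot product above) that tallies how many 32-byte loads from a given starting offset straddle a 64-byte line:

```julia
# Illustration only: count 32-byte loads that cross a 64-byte boundary
# when loading nloads consecutive vectors starting at `offset` bytes.
crossings(offset, nloads; width = 32, line = 64) =
    count(i -> (offset + i * width) % line + width > line, 0:nloads-1)

crossings(0, 4)   # 0: 32-byte-aligned, no split loads
crossings(16, 4)  # 2: every other load is split, as described above
```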
When `a` and `b` are long, the loop will instead become memory bound, with the CPU waiting for memory to arrive before it can actually execute the loads. If another thread has its memory available, it could use the execution resources in the meantime.
I think in most cases where you have high memory pressure, it’s best not to use more threads than the number of physical cores, to try to limit the amount of cache you need.
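If you want to enforce that, one option is the third-party ThreadPinning.jl package (Linux only), which can pin one Julia thread per physical core so the hyperthread siblings stay idle:

```julia
using ThreadPinning  # third-party package: ] add ThreadPinning
pinthreads(:cores)   # pin one Julia thread per physical core
threadinfo()         # print the resulting thread-to-core mapping
```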
The #1 killer of superscalar parallelism is probably dependency chains (instruction latency means it takes some number of cycles before a result is available for use), so code with long dependency chains that you can’t unroll to alleviate them may be a good candidate for benefiting from hyperthreading.
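For intuition, here is a hand-unrolled sketch of the same idea `@simd` exploits in the dot product above: four independent accumulators split one long `d += ...` chain into four short ones, so the fmas don’t all wait on each other:

```julia
# Sketch: 4 accumulators break the single dependency chain, hiding
# instruction latency. (@simd lets the compiler do this automatically.)
function dot4(a, b)
    T = promote_type(eltype(a), eltype(b))
    d1 = d2 = d3 = d4 = zero(T)
    n = length(a); i = 1
    @inbounds while i + 3 <= n
        d1 += a[i]   * b[i]
        d2 += a[i+1] * b[i+1]
        d3 += a[i+2] * b[i+2]
        d4 += a[i+3] * b[i+3]
        i += 4
    end
    @inbounds while i <= n   # handle the remainder
        d1 += a[i] * b[i]
        i += 1
    end
    return (d1 + d2) + (d3 + d4)
end
```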
All that said, the easiest and most reliable way to find out what is fastest for your problem on your machine is to just try it.
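For example, with the third-party BenchmarkTools.jl package, at sizes you actually care about:

```julia
using BenchmarkTools, LinearAlgebra
a = rand(10_000); b = rand(10_000);
@btime mydot($a, $b)  # the loop from above
@btime dot($a, $b)    # built-in (BLAS-backed) dot for comparison
# ...then repeat with threaded variants at different `julia -t N` settings.
```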