Sorry for another reply. In my quick test, there does not seem to be much overhead from using hyperthreads in a tight loop:
Without hyperthreads:
julia> using BenchmarkTools, ThreadPinning
julia> Threads.nthreads()
24
julia> pinthreads(24:24+23)
julia> threadinfo()
System: 48 cores (2-way SMT), 2 sockets, 2 NUMA domains
| 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,
16,17,18,19,20,21,22,23,48,49,50,51,52,53,54,55,
56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71 |
| 24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,
40,41,42,43,44,45,46,47,72,73,74,75,76,77,78,79,
80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95 |
# = Julia thread, # = HT, # = Julia thread on HT, | = Socket seperator
Julia threads: 24
├ Occupied CPU-threads: 24
└ Mapping (Thread => CPUID): 1 => 24, 2 => 25, 3 => 26, 4 => 27, 5 => 28, ...
julia> function mygemmth!(C, A, B)
    Threads.@threads for m ∈ axes(A,1)
        for n ∈ axes(B,2)
            Cmn = zero(eltype(C))
            for k ∈ axes(A,2)
                Cmn += A[m,k] * B[k,n]
            end
            C[m,n] = Cmn
        end
    end
end
mygemmth! (generic function with 1 method)
julia> M, K, N = 3000, 3000, 3000;
julia> C1 = Matrix{Float64}(undef, M, N); A = randn(M, K); B = randn(K, N);
julia> C2 = similar(C1); C3 = similar(C1);
julia> @benchmark mygemmth!($C1, $A, $B)
BenchmarkTools.Trial: 2 samples with 1 evaluation.
Range (min … max): 2.881 s … 2.949 s ┊ GC (min … max): 0.00% … 0.00%
Time (median): 2.915 s ┊ GC (median): 0.00%
Time (mean ± σ): 2.915 s ± 47.535 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
█ █
█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
2.88 s Histogram: frequency by time 2.95 s <
Memory estimate: 13.62 KiB, allocs estimate: 147.
With hyperthreads:
julia> using BenchmarkTools, ThreadPinning
julia> Threads.nthreads()
48
julia> pinthreads(vcat(24:24+23,72:72+23))
julia> threadinfo()
System: 48 cores (2-way SMT), 2 sockets, 2 NUMA domains
| 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,
16,17,18,19,20,21,22,23,48,49,50,51,52,53,54,55,
56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71 |
| 24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,
40,41,42,43,44,45,46,47,72,73,74,75,76,77,78,79,
80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95 |
# = Julia thread, # = HT, # = Julia thread on HT, | = Socket seperator
Julia threads: 48
├ Occupied CPU-threads: 48
└ Mapping (Thread => CPUID): 1 => 24, 2 => 25, 3 => 26, 4 => 27, 5 => 28, ...
julia> function mygemmth!(C, A, B)
    Threads.@threads for m ∈ axes(A,1)
        for n ∈ axes(B,2)
            Cmn = zero(eltype(C))
            for k ∈ axes(A,2)
                Cmn += A[m,k] * B[k,n]
            end
            C[m,n] = Cmn
        end
    end
end
mygemmth! (generic function with 1 method)
julia> M, K, N = 3000, 3000, 3000;
julia> C1 = Matrix{Float64}(undef, M, N); A = randn(M, K); B = randn(K, N);
julia> C2 = similar(C1); C3 = similar(C1);
julia> @benchmark mygemmth!($C1, $A, $B)
BenchmarkTools.Trial: 2 samples with 1 evaluation.
Range (min … max): 2.857 s … 2.858 s ┊ GC (min … max): 0.00% … 0.00%
Time (median): 2.857 s ┊ GC (median): 0.00%
Time (mean ± σ): 2.857 s ± 715.678 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
█ █
█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
2.86 s Histogram: frequency by time 2.86 s <
Memory estimate: 27.22 KiB, allocs estimate: 293.
EDIT: the colors from threadinfo() are not shown here. Basically, you would see that in the first case the Julia threads occupy only CPU IDs 24-47, whereas in the second case they occupy 24-47 plus the hyperthreads 72-95 (marked as "Julia thread on HT").
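Since the colors get lost when copy-pasting, one way to check the mapping without them is ThreadPinning's getcpuids(), which returns the CPU ID each Julia thread currently runs on. A minimal sketch (the expected counts follow from the pinthreads calls above, not from output I reproduce here):

using ThreadPinning
cpuids = getcpuids()     # CPU ID each Julia thread is currently running on
count(>=(72), cpuids)    # threads sitting on the hyperthread IDs 72-95
                         # (should be 0 in the first run and 24 in the second)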
Of course there is no benefit in this example, but it also shows that enabling HT and using it does not come at a significant cost, right?
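One caveat: with the default 5 s time budget, BenchmarkTools only collected 2 samples per run here, so the comparison is rough. The budget can be raised via the seconds keyword of @benchmark, e.g. (a sketch, assuming each evaluation keeps taking roughly 3 s):

using BenchmarkTools
# Larger time budget: roughly 10 samples instead of 2 at ~3 s per evaluation
@benchmark mygemmth!($C1, $A, $B) seconds=30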