Strategy to get best performance with @turbo and multithreading in real applications?

A simple example is shown at the end of the post. For the smaller vectors, the multithreading in @turbo makes no difference. For the larger vectors, increasing the thread count improves performance, but 4 and 8 threads differ only slightly. (I'm also curious about the ~5× performance difference between 1 and 2 threads, but that's another story.)
That is to say, the performance boost from multithreading depends strongly on the size of the vector, which is quite natural. The problem is: given this, how should we fine-tune the performance of a piece of code?
My actual use case involves processing a large N×M matrix. The algorithm is roughly like the following:

for i_m = 1 : M
    v = view(mat, :, i_m)
    @turbo for i_n = 1 : N
        v[i_n] = aux1[i_n] * aux2[i_n] # the actual code is more complex
    end
end
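A self-contained version of that loop (my sketch, not the original code): `aux1`/`aux2` are assumed to be length-N vectors shared across all columns, and `@inbounds @simd` stands in for `@turbo` so the sketch runs without any packages.

```julia
# Sketch of the column loop above. aux1/aux2 are assumed length-N vectors
# shared by every column; @inbounds @simd stands in for @turbo here.
function process_columns!(mat, aux1, aux2)
    N, M = size(mat)
    for i_m in 1:M
        v = view(mat, :, i_m)
        @inbounds @simd for i_n in 1:N
            v[i_n] = aux1[i_n] * aux2[i_n]   # the actual code is more complex
        end
    end
    return mat
end
```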

That is to say, the work along the M dimension is completely independent.

The matrix may be so large that memory becomes a serious problem. Thus, users are allowed to specify a maximum memory budget, and the matrix will be split into n*M blocks (blocked only along the N dimension), each block small enough to fit within that budget. All of the following quantities may differ from run to run:

  1. N and M values
  2. the maximum memory
  3. number of threads used for calculation
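To illustrate the blocking step, here is a minimal helper (hypothetical names, my sketch; assumes Float64 storage) that picks the block height n from the memory cap:

```julia
# Hypothetical helper (not from the original code): choose the block
# height n so that one n×M block of Float64 stays under max_bytes,
# and count how many N-dimension blocks that implies.
function block_height(N, M, max_bytes)
    bytes_per_row = M * sizeof(Float64)        # one row spans the full M dimension
    return Int(clamp(max_bytes ÷ bytes_per_row, 1, N))
end

num_blocks(N, n) = cld(N, n)   # ceiling division: number of blocks along N
```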

Combining these parameters, I suspect the best performance can only be achieved by carefully tuning the split of threads between the M dimension and @turbo, which is roughly like the following:

chunks = get_start_end_M_for_threads(M, nt_M)
tasks = map(chunks) do chunk
    Threads.@spawn my_f_thread(chunk[1], chunk[2], nn, block_mat, nt_N)
end
_ = fetch.(tasks)
# no need to reduce here

function my_f_thread(ms, me, nn, block_mat, nt_N)
    for i_m = ms : me
        v = view(block_mat, :, i_m)
        @turbo thread=nt_N for i_n = 1 : nn
            v[i_n] = aux1[i_n] * aux2[i_n]
        end
    end
end
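For reference, one possible implementation of the `get_start_end_M_for_threads` helper assumed above (a simple contiguous split; this is my sketch, not the original code):

```julia
# Sketch: split 1:M into at most nt_M contiguous, nearly equal chunks,
# returned as (start, stop) pairs for the @spawn loop above.
function get_start_end_M_for_threads(M, nt_M)
    step = cld(M, nt_M)   # ceiling division
    return [(s, min(s + step - 1, M)) for s in 1:step:M]
end
```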

However, I find difficulties when actually implementing it:

  1. How does @turbo multithreading behave on different hardware? That is to say, is it even possible to find a good heuristic to determine the best nt_M and nt_N?
  2. As a macro argument, @turbo thread=nt_N raises an error when nt_N is a variable, since it is only known at run time. Is it even possible to implement my_f_thread without eval, which I suspect would cause even greater performance problems?
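One workaround I can think of (my sketch, not a confirmed LoopVectorization idiom): if the set of allowed thread counts is known ahead of time, generate one method per count with a top-level `@eval` loop, which runs once at definition time rather than per call, then select the method at run time via `Val` dispatch. A plain `@inbounds @simd` loop stands in for the `@turbo` kernel so the sketch runs without packages.

```julia
# Generate one specialization per allowed thread count. This @eval loop
# runs once at definition/load time, not per call, so it avoids runtime eval.
for nt in (1, 2, 4, 8)
    @eval function kernel_val!(v, aux1, aux2, nn, ::Val{$nt})
        @inbounds @simd for i_n in 1:nn   # real code: @turbo thread=$nt for i_n = 1:nn
            v[i_n] = aux1[i_n] * aux2[i_n]
        end
        return v
    end
end

# Runtime selection: map the requested thread count to a compiled method.
function kernel!(v, aux1, aux2, nn, nt_N::Int)
    nt_N <= 1 ? kernel_val!(v, aux1, aux2, nn, Val(1)) :
    nt_N == 2 ? kernel_val!(v, aux1, aux2, nn, Val(2)) :
    nt_N <= 4 ? kernel_val!(v, aux1, aux2, nn, Val(4)) :
                kernel_val!(v, aux1, aux2, nn, Val(8))
end
```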

I cannot combine N and M into one dimension, because some auxiliary vectors only have the N dimension.

I cannot block the large matrix in the M dimension, because the aforementioned function is only a part of a larger processing pipeline, and the later stages require the full M dimension. Only the N dimension can be blocked.

julia> using MKL, LoopVectorization, BenchmarkTools, Random, LinearAlgebra

julia> n1 = 2 ^ 14; n2 = 2 ^ 17;

julia> a1 = zeros(n1); b1 = zeros(n1); rand!(a1); rand!(b1);

julia> a2 = zeros(n2); b2 = zeros(n2); rand!(a2); rand!(b2);

julia> function my_dot_1(x, y, n)
           s = zero(eltype(x))
           @turbo for i = 1 : n
               s += x[i] * y[i]
           end
           return s
       end
my_dot_1 (generic function with 1 method)

julia> function my_dot_2(x, y, n)
           s = zero(eltype(x))
           @turbo thread=2 for i = 1 : n
               s += x[i] * y[i]
           end
           return s
       end
my_dot_2 (generic function with 1 method)

julia> function my_dot_4(x, y, n)
           s = zero(eltype(x))
           @turbo thread=4 for i = 1 : n
               s += x[i] * y[i]
           end
           return s
       end
my_dot_4 (generic function with 1 method)

julia> function my_dot_8(x, y, n)
           s = zero(eltype(x))
           @turbo thread=8 for i = 1 : n
               s += x[i] * y[i]
           end
           return s
       end
my_dot_8 (generic function with 1 method)

julia> @btime my_dot_1($a1, $b1, $n1)
  1.487 μs (0 allocations: 0 bytes)
4086.504852746797

julia> @btime my_dot_2($a1, $b1, $n1)
  1.573 μs (0 allocations: 0 bytes)
4086.504852746797

julia> @btime my_dot_4($a1, $b1, $n1)
  1.576 μs (0 allocations: 0 bytes)
4086.504852746797

julia> @btime my_dot_8($a1, $b1, $n1)
  1.492 μs (0 allocations: 0 bytes)
4086.504852746797

julia> @btime dot($a1, $b1)
  1.615 μs (1 allocation: 16 bytes)
4086.5048527467975

julia> @btime my_dot_1($a2, $b2, $n2)
  75.687 μs (0 allocations: 0 bytes)
32625.7939963322

julia> @btime my_dot_2($a2, $b2, $n2)
  16.270 μs (0 allocations: 0 bytes)
32625.79399633219

julia> @btime my_dot_4($a2, $b2, $n2)
  6.024 μs (0 allocations: 0 bytes)
32625.7939963322

julia> @btime my_dot_8($a2, $b2, $n2)
  5.492 μs (0 allocations: 0 bytes)
32625.7939963322

julia> @btime dot($a2, $b2)
  75.348 μs (1 allocation: 16 bytes)
32625.7939963322

julia> using VectorizationBase

julia> VectorizationBase.num_cores()
static(24)

julia> Threads.nthreads()
8

It might simply be the case that you already saturated the memory bandwidth? What system are you on?

$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                96
On-line CPU(s) list:   0-95
Thread(s) per core:    2
Core(s) per socket:    24
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 85
Model name:            Intel(R) Xeon(R) Gold 6240R CPU @ 2.40GHz
Stepping:              7
CPU MHz:               2962.939
CPU max MHz:           4000.0000
CPU min MHz:           1000.0000
BogoMIPS:              4800.00
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              1024K
L3 cache:              36608K
NUMA node0 CPU(s):     0-23,48-71
NUMA node1 CPU(s):     24-47,72-95
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb cat_l3 cdp_l3 invpcid_single intel_ppin ssbd mba rsb_ctxsw ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req pku ospke avx512_vnni md_clear spec_ctrl intel_stibp flush_l1d arch_capabilities
julia> versioninfo()
Julia Version 1.9.4
Commit 8e5136fa297 (2023-11-14 08:46 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 96 × Intel(R) Xeon(R) Gold 6240R CPU @ 2.40GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, cascadelake)
  Threads: 8 on 96 virtual cores
Environment:
  LD_LIBRARY_PATH = /public/opt/tools/julia/julia-1.9.4/lib/julia:/public/opt/tools/julia/julia-1.9.4/lib
  JULIA_PKG_SERVER = https://mirrors.pku.edu.cn/julia
  JULIA_PKG_USE_CLI_GIT = true
julia> @btime my_dot_1($a1, $b1, $n1)
  1.902 μs (0 allocations: 0 bytes)
4117.29392973325

julia> @btime my_dot_2($a1, $b1, $n1)
  1.070 μs (0 allocations: 0 bytes)
4117.293929733249

julia> @btime my_dot_4($a1, $b1, $n1)
  1.074 μs (0 allocations: 0 bytes)
4117.293929733249

julia> @btime my_dot_8($a1, $b1, $n1)
  1.071 μs (0 allocations: 0 bytes)
4117.293929733249

# AMD Ryzen 9 3900X 12-Core Processor

For me, at 2 threads I saturated something else; I suspect that for you, 1 thread was somehow already enough.