Is there a (ideally, standard) test of threaded performance?

I am looking at a puzzle, where two Linux systems give very different
scaling of parallel computations with threads. It would be great to figure
out which one (if any) works as expected.

I plan to work on multithreading until summer. So far see this old post of mime - one of the first things I plan is to update this…
To figure out the differences of the systems you can install hwloc on them. This also contains lstopo which gives a graphical overview of the architecture. There is also a Julia package Hwloc.jl (try Hwloc.topology_graphical() … )

AFAIK cache sizes and number of NUMA nodes are the main data to watch. In particular the number of NUMA nodes gives the number of independent pathways to RAM. For large problems, all threads will compete for this bottleneck. More expensive non-laptops may have more than one, which immediatly shows up in the multithreading performance. One can see that clearly in the last graph in that old thread.

Here is a simple example: integrating a function along an interval.
Each thread is given plenty of work by multiplying the work for each subinterval
by repeatedly evaluating the contribution to the integral.

My intention is to test a few systems to solve the puzzle from the thread

module thr_integrate
function _integrate_subinterval(f, x1, x2, xi, w, nloops)
J = (x2 - x1) / 2
r = zero(typeof(x1))
for l in 1:nloops
for j in eachindex(xi)
x = x1 * (xi[j] - (+1)) / (-1 - (+1)) + x2 * (xi[j] - (-1)) / (+1 - (-1))
fj = f(x)
r += fj * w[j] * J
end
end
return r / nloops
end
using Base.Threads
function test()
f(x) = -2 + (-x) + x^2 - 0.01 * x^3
xa = 0.0
xb = 20.0
true_result = -2 * (xb - xa) -(1/2) * (xb - xa)^2 + (1/3) * (xb - xa)^3 - (1/4) * 0.01 * (xb - xa)^4
ni = 100_000 # Number of intervals
d = (xb - xa) / ni
nloops = 20000
xi = vec(
[
-0.973906528517171
-0.865063366688985
-0.679409568299025
-0.433395394129247
-0.148874338981631
0.148874338981631
0.433395394129247
0.679409568299024
0.865063366688984
0.973906528517172
],
)
w = vec(
[
0.066671344308688
0.149451349150581
0.219086362515981
0.269266719309996
0.295524224714752
0.295524224714753
0.269266719309996
0.219086362515982
0.149451349150581
0.066671344308688
],
)
tstart = time();
nth = Base.Threads.nthreads() # Number of threads to use
results = fill(zero(typeof(xa)), nth)
Threads.@threads for k in 1:ni
results[Threads.threadid()] += _integrate_subinterval(f, xa + (k - 1) * d, xa + (k) * d, xi, w, nloops)
end
r = sum(results)
println("Result: $(r) compared to $(true_result)")
total_time = time() - tstart
println("With $(nth) threads: $(round(total_time, digits=2)) seconds")
total_time
end
nothing
end
using .thr_integrate: test
test()