# Simple performance test of threaded execution

Is there an (ideally, standard) test of threaded performance?

I am looking at a puzzle, where two Linux systems give very different
scaling of parallel computations with threads. It would be great to figure
out which one (if any) works as expected.

Obviously, I would like to find Julia code.

I plan to work on multithreading until summer. For now, see this old post of mine - one of the first things I plan to do is update it…
To figure out the differences between the systems, you can install `hwloc` on them. It also contains `lstopo`, which gives a graphical overview of the architecture. There is also a Julia package, Hwloc.jl (try `Hwloc.topology_graphical()` …).

AFAIK cache sizes and the number of NUMA nodes are the main data to watch. In particular, the number of NUMA nodes gives the number of independent pathways to RAM. For large problems, all threads will compete for this bottleneck. More expensive non-laptop machines may have more than one NUMA node, which immediately shows up in multithreading performance. One can see that clearly in the last graph in that old thread.
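Before reaching for Hwloc.jl, a couple of coarse checks are available from the Julia standard library alone. This sketch (my addition, not from the thread) just reports the logical CPU count and the number of Julia threads; for cache sizes and NUMA node counts you would still need `hwloc`/Hwloc.jl:

```julia
# Coarse host information from the standard library alone.
# Cache sizes and NUMA node counts require Hwloc.jl; Base only
# exposes the logical CPU count and the Julia thread count.
function describe_host()
    println("Logical CPUs (Sys.CPU_THREADS): ", Sys.CPU_THREADS)
    println("Julia threads (Threads.nthreads()): ", Threads.nthreads())
    return (cpus = Sys.CPU_THREADS, threads = Threads.nthreads())
end

describe_host()
```

Comparing `Sys.CPU_THREADS` on the two systems against the thread counts where scaling flattens out is a cheap first clue before digging into the NUMA layout.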


Here is a simple example: integrating a function over an interval.
Each thread is given plenty of work: the computation for each subinterval
is inflated by repeatedly evaluating its contribution to the integral.

My intention is to test a few systems to solve the puzzle from the thread.

``````module thr_integrate

function _integrate_subinterval(f, x1, x2, xi, w, nloops)
    J = (x2 - x1) / 2
    r = zero(typeof(x1))
    for l in 1:nloops
        for j in eachindex(xi)
            # Map the Gauss point xi[j] from the reference interval [-1, +1] to [x1, x2]
            x = x1 * (xi[j] - (+1)) / (-1 - (+1)) + x2 * (xi[j] - (-1)) / (+1 - (-1))
            fj = f(x)
            r += fj * w[j] * J
        end
    end
    return r / nloops
end

function test()
    f(x) = -2 + (-x) + x^2 - 0.01 * x^3
    xa = 0.0
    xb = 20.0
    true_result = -2 * (xb - xa) - (1/2) * (xb - xa)^2 + (1/3) * (xb - xa)^3 - (1/4) * 0.01 * (xb - xa)^4
    ni = 100_000 # Number of subintervals
    d = (xb - xa) / ni
    nloops = 20000 # Number of repetitions of the work per subinterval
    # 10-point Gauss-Legendre abscissae and weights on [-1, +1]
    xi = [
        -0.973906528517171
        -0.865063366688985
        -0.679409568299025
        -0.433395394129247
        -0.148874338981631
        0.148874338981631
        0.433395394129247
        0.679409568299024
        0.865063366688984
        0.973906528517172
    ]
    w = [
        0.066671344308688
        0.149451349150581
        0.219086362515981
        0.269266719309996
        0.295524224714752
        0.295524224714753
        0.269266719309996
        0.219086362515982
        0.149451349150581
        0.066671344308688
    ]

    nth = Threads.nthreads()

    tstart = time()

    # One partial sum per thread; :static scheduling pins each iteration
    # to a thread, so indexing by threadid() is safe here.
    results = fill(zero(typeof(xa)), nth)
    Threads.@threads :static for k in 1:ni
        results[Threads.threadid()] += _integrate_subinterval(f, xa + (k - 1) * d, xa + k * d, xi, w, nloops)
    end

    r = sum(results)

    println("Result: $(r) compared to $(true_result)")
    total_time = time() - tstart
    println("With $(nth) threads: $(round(total_time, digits=2)) seconds")

    total_time
end

nothing
end

using .thr_integrate: test
test()

``````
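The mapping inside `_integrate_subinterval` sends each Gauss point from the reference interval [-1, +1] to the subinterval [x1, x2], with Jacobian (x2 - x1)/2. As a sanity check of that mapping in isolation, here is a minimal sketch with a 2-point Gauss-Legendre rule (nodes at ±1/√3, unit weights, exact for cubics); the name `gauss2_integrate` is mine, not from the code above:

```julia
# 2-point Gauss-Legendre rule on [-1, +1]: nodes ±1/sqrt(3), unit weights.
# Exact for polynomials up to degree 3.
const XI2 = (-1 / sqrt(3.0), 1 / sqrt(3.0))
const W2 = (1.0, 1.0)

# Integrate f over [x1, x2]: map each reference node xi to
# x = (x1 * (1 - xi) + x2 * (1 + xi)) / 2, scale by the Jacobian (x2 - x1) / 2.
function gauss2_integrate(f, x1, x2)
    J = (x2 - x1) / 2
    r = zero(J)
    for (xi, w) in zip(XI2, W2)
        x = (x1 * (1 - xi) + x2 * (1 + xi)) / 2
        r += w * f(x) * J
    end
    return r
end

gauss2_integrate(x -> x^3, 0.0, 2.0)  # ≈ 4.0, i.e. ∫₀² x³ dx, exact up to roundoff
```

The simplified form `(x1 * (1 - xi) + x2 * (1 + xi)) / 2` is algebraically the same as the Lagrange-interpolation expression used in `_integrate_subinterval`.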

On this particular system I get decent scaling:

``````[pkrysl@horntail data]$ for nth in 1 2 4 8 16 32; do ./julia-1.9.0-beta4/bin/julia -t $nth ./thr_integrate.jl ; done
Result: 2026.6666666666981 compared to 2026.6666666666665
Result: 2026.66666666669 compared to 2026.6666666666665
Result: 2026.6666666666658 compared to 2026.6666666666665
Result: 2026.6666666666758 compared to 2026.6666666666665
Result: 2026.6666666666747 compared to 2026.6666666666665
Result: 2026.6666666666758 compared to 2026.6666666666665
``````
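One way to judge runs like these is to convert the reported wall times into speedup and parallel efficiency relative to the single-thread run. A small helper for that (the timing numbers below are hypothetical placeholders, not measurements from this system):

```julia
# Speedup and parallel efficiency relative to the 1-thread wall time t1,
# given the n-thread wall time tn.
speedup(t1, tn) = t1 / tn
efficiency(t1, tn, n) = speedup(t1, tn) / n

# Hypothetical example: 100 s on 1 thread, 30 s on 4 threads.
speedup(100.0, 30.0)        # ≈ 3.33x
efficiency(100.0, 30.0, 4)  # ≈ 0.83, i.e. 83% parallel efficiency
```

An efficiency that stays near 1.0 up to the physical core count and then degrades is what one would hope for; a sharp drop at a low thread count on only one of the two systems would point at the memory-pathway bottleneck discussed above.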

With

``````julia> versioninfo()
Julia Version 1.9.0-beta4
Commit b75ddb787ff (2023-02-07 21:53 UTC)
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 64 × AMD Opteron(tm) Processor 6380
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-14.0.6 (ORCJIT, bdver1)
Threads: 1 on 64 virtual cores
``````