Simple performance test of threaded execution

Is there an (ideally standard) test of threaded performance?

I am looking at a puzzle, where two Linux systems give very different
scaling of parallel computations with threads. It would be great to figure
out which one (if any) works as expected.

Obviously, I would like to find Julia code for this.

I plan to work on multithreading until summer. So far, see this old post of mine; one of the first things I plan to do is update this…
To figure out the differences between the systems, you can install hwloc on them. It also contains lstopo, which gives a graphical overview of the architecture. There is also a Julia package, Hwloc.jl (try Hwloc.topology_graphical() … )

AFAIK, cache sizes and the number of NUMA nodes are the main things to watch. In particular, the number of NUMA nodes gives the number of independent pathways to RAM. For large problems, all threads compete for this bottleneck. More expensive non-laptop machines may have more than one NUMA node, which immediately shows up in the multithreading performance. One can see that clearly in the last graph in that old thread.
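
As a rough first check that needs no extra packages, the NUMA node count can also be read from Linux sysfs. This is only a minimal sketch: it assumes a Linux system that exposes /sys/devices/system/node, and the helper name count_numa_nodes is made up for illustration (it is not part of Hwloc.jl).

```julia
# Hypothetical helper: count NUMA nodes by listing the Linux sysfs directory.
# Returns `nothing` when the directory is absent (non-Linux or restricted system).
function count_numa_nodes(sysdir::AbstractString = "/sys/devices/system/node")
    isdir(sysdir) || return nothing
    # NUMA nodes appear as subdirectories named node0, node1, ...
    return count(name -> occursin(r"^node\d+$", name), readdir(sysdir))
end

println("Logical CPUs seen by Julia: ", Sys.CPU_THREADS)
println("Julia threads in use:       ", Threads.nthreads())
println("NUMA nodes (sysfs):         ", something(count_numa_nodes(), "unknown"))
```

Running this on both machines gives a quick comparison; hwloc/lstopo remains the more complete tool, since it also reports cache sizes and how cores map to nodes.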

Here is a simple example: integrating a function along an interval.
Each thread is given plenty of work: the contribution of each subinterval
to the integral is evaluated repeatedly (nloops times) and then averaged.

My intention is to test a few systems to solve the puzzle from the thread:

module thr_integrate

function _integrate_subinterval(f, x1, x2, xi, w, nloops)
    J = (x2 - x1) / 2
    r = zero(typeof(x1))
    for l in 1:nloops
        for j in eachindex(xi)
            x = x1 * (xi[j] - (+1)) / (-1 - (+1)) + x2 * (xi[j] - (-1)) / (+1 - (-1))
            fj = f(x)
            r += fj * w[j] * J
        end
    end
    return r  / nloops
end

using Base.Threads

function test()
    f(x) = -2 + (-x) + x^2 - 0.01 * x^3
    xa = 0.0
    xb = 20.0
    true_result = -2 * (xb - xa) -(1/2) * (xb - xa)^2 + (1/3) * (xb - xa)^3 - (1/4) * 0.01 * (xb - xa)^4
    ni = 100_000 # Number of intervals
    d = (xb - xa) / ni
    nloops = 20000
    xi = vec(
        [
        -0.973906528517171
        -0.865063366688985
        -0.679409568299025
        -0.433395394129247
        -0.148874338981631
        0.148874338981631
        0.433395394129247
        0.679409568299024
        0.865063366688984
        0.973906528517172
        ],
        )
    w = vec(
        [
        0.066671344308688
        0.149451349150581
        0.219086362515981
        0.269266719309996
        0.295524224714752
        0.295524224714753
        0.269266719309996
        0.219086362515982
        0.149451349150581
        0.066671344308688
        ],
        )

    tstart = time();
    nth = Base.Threads.nthreads() # Number of threads to use

    results = fill(zero(typeof(xa)), nth)
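    # :static assigns each block of iterations to one fixed thread, so
    # Threads.threadid() stays constant within an iteration and no slot is shared.
    # (Assumes Julia was started with -t N only, i.e. no interactive threads.)
    Threads.@threads :static for k in 1:ni
        results[Threads.threadid()] += _integrate_subinterval(f, xa + (k - 1) * d, xa + (k) * d, xi, w, nloops)
    end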

    r = sum(results)

    println("Result: $(r) compared to $(true_result)")
    total_time = time() - tstart
    println("With $(nth) threads: $(round(total_time, digits=2)) seconds")

    total_time
end

nothing
end

using .thr_integrate: test
test()
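
A pattern that avoids indexing by Threads.threadid() altogether is to split the index range into contiguous chunks, spawn one task per chunk, and sum the partial results that the tasks return. The following is a minimal sketch of that idea, not the code that produced the timings in this thread; the name chunked_sum and the default of one chunk per thread are assumptions made for illustration.

```julia
using Base.Threads: @spawn, nthreads

# Hypothetical helper: sum g(k) for k in 1:n using one task per chunk.
function chunked_sum(g, n::Integer; nchunks::Integer = nthreads())
    nchunks = max(1, min(nchunks, n))  # never more chunks than items
    tasks = map(1:nchunks) do c
        # Contiguous, non-overlapping index range for chunk c
        lo = 1 + div((c - 1) * n, nchunks)
        hi = div(c * n, nchunks)
        @spawn begin
            acc = 0.0  # task-local accumulator, not shared with other tasks
            for k in lo:hi
                acc += g(k)
            end
            acc
        end
    end
    return sum(fetch, tasks)
end

println(chunked_sum(k -> 1 / k^2, 1_000_000))
```

In the integration test, g(k) would be the call to _integrate_subinterval for the k-th subinterval. Each task accumulates into its own local variable, so no array is shared between threads during the loop.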

On this particular system I get decent scaling:

[pkrysl@horntail data]$ for nth in 1 2 4 8 16 32; do ./julia-1.9.0-beta4/bin/julia -t $nth ./thr_integrate.jl ; done
Result: 2026.6666666666981 compared to 2026.6666666666665
With 1 threads: 99.63 seconds
Result: 2026.66666666669 compared to 2026.6666666666665
With 2 threads: 50.02 seconds
Result: 2026.6666666666658 compared to 2026.6666666666665
With 4 threads: 25.69 seconds
Result: 2026.6666666666758 compared to 2026.6666666666665
With 8 threads: 13.38 seconds
Result: 2026.6666666666747 compared to 2026.6666666666665
With 16 threads: 8.12 seconds
Result: 2026.6666666666758 compared to 2026.6666666666665
With 32 threads: 5.81 seconds
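
To put a number on "decent", the timings can be converted to speedup and parallel efficiency (speedup divided by thread count). A short sketch using only the wall-clock times printed above; the variable names are arbitrary.

```julia
# Wall-clock times (seconds) copied from the runs reported above.
nthreads_used = [1, 2, 4, 8, 16, 32]
seconds = [99.63, 50.02, 25.69, 13.38, 8.12, 5.81]

speedup = seconds[1] ./ seconds          # relative to the 1-thread run
efficiency = speedup ./ nthreads_used    # 1.0 would be perfect scaling

for (n, s, e) in zip(nthreads_used, speedup, efficiency)
    println("threads = ", lpad(n, 2),
            "  speedup = ", round(s, digits = 2),
            "  efficiency = ", round(100 * e, digits = 1), " %")
end
```

By this measure the efficiency stays above 90 % up to 8 threads and then falls to roughly 77 % at 16 threads and 54 % at 32 threads.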

The Julia build and hardware for these runs:

julia> versioninfo()
Julia Version 1.9.0-beta4
Commit b75ddb787ff (2023-02-07 21:53 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 64 × AMD Opteron(tm) Processor 6380
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, bdver1)
  Threads: 1 on 64 virtual cores