BenchmarkTools for benchmarking thread scalability of functions

I use BenchmarkTools.jl to benchmark functions in my code. That works great!
But what I am missing is a good interface to analyze thread scalability. Is there an option in BenchmarkTools.jl or another package?

As far as I know, the number of threads cannot be changed at runtime.
Therefore, I wrote a script that starts a second Julia script several times with an increasing number of threads. The second script benchmarks my function and writes the result to a file.
At the end, I plot the benchmark results from the file to get an idea of the performance for different numbers of threads. That works, but it is laborious. Does anyone have a suggestion for how to do this better?
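For reference, the driver is essentially just this (a minimal sketch; bench.jl is a placeholder name for the second script):

for n in (1, 2, 4, 8, 16)
    # start a fresh process per thread count, since --threads is fixed at startup
    run(`julia --threads=$n bench.jl`)
end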

Actually, I was surprised that I could not find anything for this in BenchmarkTools.jl, since thread scalability is an important issue when it comes to speed-up.

Thanks

It is fairly common that the number of threads that is optimal for performance is not the maximum number of threads available. Thus, it is nice to decouple the number of chunks a function splits its work into from the actual number of threads. Doing that, you can easily evaluate the scalability from a single Julia session.

For example:

julia> function splitter(n,nchunks,ichunk)
           n_per_chunk = div(n,nchunks) # only works for multiples
           first = (ichunk-1)*n_per_chunk+1
           last = ichunk*n_per_chunk
           return first:last
       end
       function sumstuff(x; nchunks=Threads.nthreads())
           partial_sums = fill(zero(eltype(x)), nchunks)
           Threads.@threads for ichunk in 1:nchunks
               for i in splitter(length(x), nchunks, ichunk)
                   @inbounds partial_sums[ichunk] += x[i]
               end
           end
           return sum(partial_sums)
       end
sumstuff (generic function with 1 method)

julia> x = rand(10^7);

julia> sum(x) ≈ sumstuff(x)
true

julia> @btime sum($x)
  4.913 ms (0 allocations: 0 bytes)
5.001204526098089e6

julia> Threads.nthreads()
8

julia> @btime sumstuff($x; nchunks=2)
  6.053 ms (50 allocations: 4.44 KiB)
5.001204526097896e6

julia> @btime sumstuff($x; nchunks=4)
  3.158 ms (50 allocations: 4.45 KiB)
5.001204526098112e6

julia> @btime sumstuff($x; nchunks=8)
  2.855 ms (50 allocations: 4.48 KiB)
5.001204526098093e6

Of course you can use a much smarter and more general splitter, etc., but that is the idea. With that, you can benchmark the scalability of the code from a single session (a sweep over chunk counts is sketched below). Additionally, there are cases where using more chunks than threads is useful, particularly if the workload is uneven. For instance, it seems here that this is slightly faster:

julia> @btime sumstuff($x; nchunks=32)
  2.630 ms (50 allocations: 4.69 KiB)
5.001204526098115e6
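To sweep all the chunk counts in a single session, a loop like this works (a sketch; the $-interpolation works inside loops too):

julia> for n in (1, 2, 4, 8, 16, 32)
           print("nchunks = $n: ")
           @btime sumstuff($x; nchunks=$n)
       end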

That said, I am not against having a good package to benchmark scalability of codes in general.

Thanks for your advice! First I had a look at ThreadPooling.jl to restrict the number of threads available to the loop. But your solution is more useful to me, especially since it is easy to adapt for use with Distributed.jl. I haven’t done that yet, but I will definitely have a go at it and compare the result to thread parallelisation (a rough sketch of what I have in mind is below). For me that is a good way to quantify general statements like ‘parallelisation type x works well if the problem size is big enough’, and so on…
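Untested, but the Distributed.jl version I have in mind would look roughly like this (sumstuff_distributed and the worker count are my own choices; note that pmap ships x to the workers, so communication cost becomes part of the comparison):

using Distributed
addprocs(4) # worker processes instead of threads

# same splitter as above, made available on all workers
@everywhere function splitter(n, nchunks, ichunk)
    n_per_chunk = div(n, nchunks) # only works for multiples
    return (ichunk-1)*n_per_chunk+1 : ichunk*n_per_chunk
end

function sumstuff_distributed(x; nchunks=nworkers())
    # one task per chunk; pmap distributes the tasks over the workers
    partial_sums = pmap(1:nchunks) do ichunk
        sum(@view x[splitter(length(x), nchunks, ichunk)])
    end
    return sum(partial_sums)
end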

Skeptical person that I am, I implemented this splitter method and ran it with different numbers of threads. It gave consistent results: e.g., with 1 thread, chunks do not improve the timing; with 2 threads, 2 chunks gave the most speedup; and so on. I could also see in htop that only as many threads are active as there are chunks in use.

I implemented a bigger computation to have a look at how thread parallelisation improves with the problem size. I expected to see the biggest speedup with the biggest problem, but this was not the case. Any idea why?

Here is the code to reproduce it (I ran it with 16 threads; note: execution takes about a minute):

using BenchmarkTools
using Plots
using DataFrames
using Colors, ColorSchemes

nrofproblems = 1000
nrofparticleslist = [10, 500, 5000]

function splitter(n,nchunks,ichunk)
    n_per_chunk = div(n,nchunks) # only works for multiples
    first = (ichunk-1)*n_per_chunk+1
    last = ichunk*n_per_chunk
    return first:last
end

function generalsplitter(n::Int64,nchunks::Int64,ichunk::Int64)
    return (floor(Int, (ichunk-1)*(n/nchunks))+1:floor(Int, (ichunk)*n/nchunks))
end

# my function that solves the entire nrofproblems
function calcminimumpotential(x; nchunks=Threads.nthreads())
    resultperchunk = zeros(Float64,nchunks)
    Threads.@threads for ichunk in 1:nchunks
        #for i in splitter(length(x), nchunks, ichunk)
        for i in generalsplitter(length(x), nchunks, ichunk)
            @inbounds resultperchunk[ichunk] = calcpotential(x[i])
        end
    end
    return minimum(resultperchunk)
end

# function that solves one problem -> to be distributed
function calcpotential(ax::Array{Float64})
    elements = length(ax)
    potential = 0
    for i in 1:elements
        for j in 1:(i-1)
            potential -= 1/abs(ax[i]-ax[j])
        end
    end
    return potential
end

# DataFrame to store results
df = DataFrame()
# loop through parameter points
for nrofparticles in nrofparticleslist
    # create list of problems
    global x = [ rand(Float64, nrofparticles) for i in 1:nrofproblems]
    # bm the solution of a single problem
    bmres = @benchmark calcpotential(x[1])
    bmsingleproblem = minimum(bmres.times)
    # create list of chunks to be analysed -> the problemlist will be chunkged up in this amount of subproblemlists
    thenchunks = [1,2,4,6,10,16,24]
    # prepare a list in which the result of each subproblemslist is stored (in this case we only care about the minimum of the value of each subproblem)
result = zeros(length(thenchunks)) # Float64, since the benchmark times are Float64
    for (i, nchunks) in enumerate(thenchunks)
        print("/ chunks= ",nchunks,"...")
        bmres = @benchmark calcminimumpotential($x; nchunks=$nchunks)
        result[i] = minimum(bmres.times)
        #result[i]= @btime sumstuff($x; nchunks=$nchunks)
    end
    # create tmpdf with results and append to df
    tmpdf  =   DataFrame(  nthreads      =          [Threads.nthreads()  for i in thenchunks],
                        nrofparticles    =          [nrofparticles       for i in thenchunks],         
                        nrofproblems     =          [nrofproblems        for i in thenchunks],                   
                        bmsingleproblem  =          [bmsingleproblem     for i in thenchunks],                   
                        nchunks          =          thenchunks,
                        speedup          =          result[1]./result)
    append!(df, tmpdf)
end

uniquenrofparticles = unique(df,2).nrofparticles # vector of unique particle counts
mycolor = ColorScheme(range(colorant"red", colorant"black", length=length(uniquenrofparticles)))

p = plot(xlabel="nchunks", ylabel="speedup", title="speed up in N-particle interactions (nrth=" * string(Threads.nthreads()) * ")", xlim=[1,maximum(df.nchunks)], ylim=[1,1.5*maximum(df.speedup)], legend=:topleft)
for (i, nrofparticle) in enumerate(uniquenrofparticles)
    subdf = filter(:nrofparticles => ==(nrofparticle), df)
    if i==1 plot!(p,subdf.nchunks, subdf.nchunks, label="perfect scaling", line=(:dash,3), color="black") end
    plot!(p, subdf.nchunks, subdf.speedup, label="t(single problem)/ms=" * string(subdf.bmsingleproblem[1] / 10^6), linewidth=3, color=mycolor[i])
end
display(p)

Something else that was surprising: when I execute the code above several times, I get slightly different results; the speedup factor can change from e.g. 5 to 6 for the same parameter point.
But I guess that could be taken care of by tuning BenchmarkTools to the size of the problem.
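For instance, something like this (the samples and seconds values are just guesses, not tuned):

bmres = @benchmark calcminimumpotential($x; nchunks=$nchunks) samples=20 seconds=120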

Anyway, the thing that puzzles me the most is why the “biggest problem” did not speed up the most with parallelisation!

Did you try increasing the number of chunks even further? (To more, much more, than the number of threads?)

Shouldn’t resultperchunk start at +Inf, with the running minimum kept within each chunk?
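Something along these lines (a sketch of the fix I mean, reusing generalsplitter and calcpotential from your code):

function calcminimumpotential(x; nchunks=Threads.nthreads())
    resultperchunk = fill(+Inf, nchunks) # start each chunk at +Inf ...
    Threads.@threads for ichunk in 1:nchunks
        for i in generalsplitter(length(x), nchunks, ichunk)
            # ... and keep the running minimum within the chunk
            resultperchunk[ichunk] = min(resultperchunk[ichunk], calcpotential(x[i]))
        end
    end
    return minimum(resultperchunk)
end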

PS: are you actually computing particle interaction potentials?

Indeed, I see the same:

julia> x = [ rand(Float64, 5000) for i in 1:1000];

julia> @btime calcminimumpotential($x; nchunks=1)
  19.088 s (51 allocations: 4.45 KiB)
-2.4539686463132015e10

julia> @btime calcminimumpotential($x; nchunks=8)
  7.731 s (52 allocations: 4.55 KiB)
-1.1744933514693645e10

julia> 19.088/7.731
2.4690208252489976

julia> x = [ rand(Float64, 500) for i in 1:1000];

julia> @btime calcminimumpotential($x; nchunks=1)
  190.480 ms (50 allocations: 4.42 KiB)
-1.1044675757726028e9

julia> @btime calcminimumpotential($x; nchunks=8)
  51.732 ms (51 allocations: 4.52 KiB)
-1.1044675757726028e9

julia> 190.48/51.732
3.6820536611768344

Note that for the largest problems the benchmark only runs the problem once, so that can cause the fluctuations. But I don’t see an obvious reason here for the speedup of the largest problem to be worse than the others, I agree.

Small styling suggestions: Int64 is usually written just as Int, and you probably want Vector{Float64} where you have written Array{Float64}; Vector{Float64} is an alias for Array{Float64,1}.

Also, initialize potential with potential = 0.0 or potential = zero(eltype(ax)), so that it has the same type as the result. Otherwise the accumulator starts as an Int and is promoted to Float64 inside the loop, which makes the variable type-unstable. This has an important performance implication. With potential = 0.0:

julia> @btime calcminimumpotential($x; nchunks=8)
  33.961 ms (50 allocations: 4.48 KiB)
-7.668986000931947e7

vs. with the original potential = 0:

julia> @btime calcminimumpotential($x; nchunks=8)
  59.813 ms (50 allocations: 4.48 KiB)
-7.668986000931947e7

That alone can change the benchmarks quite a lot.
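(You can check for this kind of instability with @code_warntype; for the unfixed version it should mark the accumulator as Union{Float64, Int64}:)

julia> @code_warntype calcpotential(rand(10))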

Edit:

The benchmarks with the fixed instability (the relative scalings do not change):

julia> # function that solves one problem -> to be distributed
       function calcpotential(ax::Vector{Float64})
           elements = length(ax)
           potential = 0.0
           for i in 1:elements
               for j in 1:(i-1)
                   potential -= 1/abs(ax[i]-ax[j])
               end
           end
           return potential
       end
calcpotential (generic function with 1 method)

julia> x = [ rand(Float64, 500) for i in 1:1000];

julia> @btime calcminimumpotential($x; nchunks=1)
  126.951 ms (51 allocations: 4.45 KiB)
-5.856750892180462e9

julia> @btime calcminimumpotential($x; nchunks=8)
  34.019 ms (50 allocations: 4.48 KiB)
-5.856750892180462e9

julia> 126/34
3.7058823529411766

julia> x = [ rand(Float64, 5000) for i in 1:1000];

julia> @btime calcminimumpotential($x; nchunks=1)
  12.628 s (51 allocations: 4.45 KiB)
-1.0457137932630594e10

julia> @btime calcminimumpotential($x; nchunks=8)
  4.443 s (52 allocations: 4.55 KiB)
-1.0457137932630594e10

julia> 12.628 / 4.443
2.842223722709881

I ran it again with 1,000,000 (much smaller) problems and I got the expected relation between speedup and problem size.
But it showed that the general rule that larger individual problems get the best speedup from thread parallelisation is not always true (unless there is some other problem hidden in my code). I guess it is worth checking for each use case where it matters.

I am still pretty new to Julia. Always happy to get some advice on best practices!

Thx, did not know that!

No, it is just the first thing that came to my mind that is easily scalable and expensive. I am more of a (quantum) fields person than a particles person.


I don’t see any problem, but the rule should still be considered correct for general code design. These toy benchmarks are somewhat misleading. It may be that the memory accesses are a significant part of the time, and that the way the memory is shared between CPUs is causing this strange effect.

For instance, in this other machine, with 6 physical cores, the larger problem scales better up to 6 chunks:

julia> x = [ rand(Float64, 5000) for i in 1:1000];

julia> @btime calcminimumpotential($x; nchunks=1)
  10.919 s (75 allocations: 6.61 KiB)
-4.0351050843096066e8

julia> @btime calcminimumpotential($x; nchunks=2)
  5.686 s (74 allocations: 6.59 KiB)
-4.0351050843096066e8

julia> @btime calcminimumpotential($x; nchunks=6)
  2.152 s (74 allocations: 6.62 KiB)
-5.820788044719601e8

julia> 10.919/2.152
5.073884758364312

julia> x = [ rand(Float64, 500) for i in 1:1000];

julia> @btime calcminimumpotential($x; nchunks=1)
  108.800 ms (74 allocations: 6.58 KiB)
-3.736233044582224e6

julia> @btime calcminimumpotential($x; nchunks=2)
  54.440 ms (74 allocations: 6.59 KiB)
-3.736233044582224e6

julia> @btime calcminimumpotential($x; nchunks=6)
  37.240 ms (74 allocations: 6.62 KiB)
-5.845620497574008e6

julia> 108.8 / 37.24
2.9215896885069816