It is fairly common that the optimal number of threads for performance in a code is not the maximum number of threads available. Thus, it is convenient to decouple the number of chunks of work used by a function from the actual number of threads. Doing that, you can easily evaluate the scalability from a single Julia session.
For example:
julia> function splitter(n, nchunks, ichunk)
           n_per_chunk = div(n, nchunks) # only works when n is a multiple of nchunks
           first = (ichunk - 1) * n_per_chunk + 1
           last = ichunk * n_per_chunk
           return first:last
       end
       function sumstuff(x; nchunks = Threads.nthreads())
           partial_sums = fill(zero(eltype(x)), nchunks)
           Threads.@threads for ichunk in 1:nchunks
               for i in splitter(length(x), nchunks, ichunk)
                   @inbounds partial_sums[ichunk] += x[i]
               end
           end
           return sum(partial_sums)
       end
sumstuff (generic function with 1 method)
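To make the chunking concrete, here is a small illustrative example (not part of the benchmark below): when the length is a multiple of nchunks, the ranges produced by splitter partition the indices into equal, non-overlapping blocks:

```julia
# splitter as defined above, repeated so this snippet is self-contained
function splitter(n, nchunks, ichunk)
    n_per_chunk = div(n, nchunks) # only works when n is a multiple of nchunks
    first = (ichunk - 1) * n_per_chunk + 1
    last = ichunk * n_per_chunk
    return first:last
end

# 12 elements in 3 chunks: each chunk gets 4 consecutive indices
ranges = [splitter(12, 3, ichunk) for ichunk in 1:3]
println(ranges) # [1:4, 5:8, 9:12]
```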
julia> x = rand(10^7);
julia> sum(x) ≈ sumstuff(x)
true
julia> using BenchmarkTools

julia> @btime sum($x)
4.913 ms (0 allocations: 0 bytes)
5.001204526098089e6
julia> Threads.nthreads()
8
julia> @btime sumstuff($x; nchunks=2)
6.053 ms (50 allocations: 4.44 KiB)
5.001204526097896e6
julia> @btime sumstuff($x; nchunks=4)
3.158 ms (50 allocations: 4.45 KiB)
5.001204526098112e6
julia> @btime sumstuff($x; nchunks=8)
2.855 ms (50 allocations: 4.48 KiB)
5.001204526098093e6
Of course you can use a smarter and more general splitter, but that is the idea. With that, you can benchmark the scalability of the code in a single session. Additionally, there are cases where using more chunks than threads is useful, particularly if the workload is uneven. For instance, here this seems to be slightly faster:
julia> @btime sumstuff($x; nchunks=32)
2.630 ms (50 allocations: 4.69 KiB)
5.001204526098115e6
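As an aside, a more general splitter that does not require the length to be a multiple of nchunks could look like the sketch below (general_splitter is a hypothetical name): the idea is to give one extra element to each of the first rem(n, nchunks) chunks.

```julia
# Sketch of a splitter that handles any n: distribute the remainder
# among the first rem(n, nchunks) chunks, one extra element each.
function general_splitter(n, nchunks, ichunk)
    base, r = divrem(n, nchunks)
    first = (ichunk - 1) * base + min(ichunk - 1, r) + 1
    last = first + base - 1 + (ichunk <= r ? 1 : 0)
    return first:last
end

# 10 elements in 3 chunks: sizes 4, 3, 3
println([general_splitter(10, 3, i) for i in 1:3]) # [1:4, 5:7, 8:10]
```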
That said, I am not against having a good package for benchmarking the scalability of codes in general.
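For uneven workloads, one option along these lines is to spawn one task per chunk instead of using @threads, so that Julia's scheduler distributes the chunks across threads dynamically. This is a sketch only (sumstuff_spawn is a hypothetical name, and it assumes, as above, that length(x) is a multiple of nchunks):

```julia
# splitter as above; assumes n is a multiple of nchunks
function splitter(n, nchunks, ichunk)
    n_per_chunk = div(n, nchunks)
    return ((ichunk - 1) * n_per_chunk + 1):(ichunk * n_per_chunk)
end

# One task per chunk: with nchunks > nthreads, threads that finish early
# pick up the remaining chunks, which helps when chunks take unequal time.
function sumstuff_spawn(x; nchunks = 4 * Threads.nthreads())
    tasks = map(1:nchunks) do ichunk
        Threads.@spawn begin
            s = zero(eltype(x))
            for i in splitter(length(x), nchunks, ichunk)
                @inbounds s += x[i]
            end
            s
        end
    end
    return sum(fetch, tasks)
end
```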