BenchmarkTools for benchmarking thread scalability of functions

It is fairly common that the number of threads that is optimal for performance is not the maximum number of threads available. Thus, it is nice to decouple the number of chunks a function works on from the actual number of threads. Doing that, you can easily evaluate the scalability from a single Julia session.

For example:

julia> function splitter(n,nchunks,ichunk)
           n_per_chunk = div(n,nchunks) # only exact when n is a multiple of nchunks
           first = (ichunk-1)*n_per_chunk+1
           last = ichunk*n_per_chunk
           return first:last
       end
       function sumstuff(x; nchunks=Threads.nthreads())
           partial_sums = fill(zero(eltype(x)), nchunks)
           Threads.@threads for ichunk in 1:nchunks
               for i in splitter(length(x), nchunks, ichunk)
                   @inbounds partial_sums[ichunk] += x[i]
               end
           end
           return sum(partial_sums)
       end
sumstuff (generic function with 1 method)
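As an aside, the splitter above assumes that the length is a multiple of `nchunks`. A version that also handles remainders could look like this (a sketch; the remainder-spreading logic is my addition, not part of the original code):

```julia
# Sketch of a splitter that handles lengths that are not multiples of
# nchunks, by giving the first `rem` chunks one extra element each.
function splitter(n, nchunks, ichunk)
    base, rem = divrem(n, nchunks)
    # chunks 1..rem get base+1 elements, the remaining chunks get base
    first = (ichunk - 1) * base + min(ichunk - 1, rem) + 1
    last = first + base - 1 + (ichunk <= rem ? 1 : 0)
    return first:last
end
```

For example, `splitter(10, 3, i)` for `i = 1, 2, 3` yields `1:4`, `5:7`, `8:10`, which together cover `1:10` without overlap.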

julia> using BenchmarkTools

julia> x = rand(10^7);

julia> sum(x) ≈ sumstuff(x)
true

julia> @btime sum($x)
  4.913 ms (0 allocations: 0 bytes)
5.001204526098089e6

julia> Threads.nthreads()
8

julia> @btime sumstuff($x; nchunks=2)
  6.053 ms (50 allocations: 4.44 KiB)
5.001204526097896e6

julia> @btime sumstuff($x; nchunks=4)
  3.158 ms (50 allocations: 4.45 KiB)
5.001204526098112e6

julia> @btime sumstuff($x; nchunks=8)
  2.855 ms (50 allocations: 4.48 KiB)
5.001204526098093e6

Of course you can use a much smarter and more general splitter, but that is the idea. With this, you can benchmark the scalability of the code from a single session. Additionally, there are cases where using more chunks than threads is useful, particularly if the workload is uneven. For instance, it seems that here this is slightly faster:

julia> @btime sumstuff($x; nchunks=32)
  2.630 ms (50 allocations: 4.69 KiB)
5.001204526098115e6

That said, I am not against having a good package for benchmarking the scalability of code in general.
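In the meantime, the sweep over chunk counts can itself be scripted. A rough sketch (the helper name `scan_chunks` and its keyword interface are my own; it uses Base's `@elapsed` with a minimum over a few runs so it is self-contained, but BenchmarkTools' `@belapsed` would give more reliable timings):

```julia
# Rough sketch: time a function for several chunk counts in one session.
# `f` must accept an `nchunks` keyword, like `sumstuff` above.
function scan_chunks(f, x; chunk_counts = (1, 2, 4, 8), samples = 5)
    times = Float64[]
    for nchunks in chunk_counts
        # take the minimum over a few runs to reduce timing noise
        t = minimum(@elapsed(f(x; nchunks = nchunks)) for _ in 1:samples)
        push!(times, t)
    end
    return collect(chunk_counts), times
end
```

Calling `scan_chunks(sumstuff, x; chunk_counts = (1, 2, 4, 8, 16, 32))` would then return the chunk counts and the corresponding best times, ready for plotting or tabulating.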