Multi-threaded benchmarking is slow

Motivation

Running many benchmarks can be slow.
Since almost all of them are independent, they can theoretically run easily on parallel.

Problem

It turns out running the benchmarks in parallel takes more or less equal time with running them serially. They have more allocations, more garbage collection and more compilation time.

My approach

I was experimenting on how to run a BenchmarkTools.BenchmarkGroup on parallel.
I came up with using the tags to identify which groups should run on which threads.

The benchmarking problem instance

For example, let’s say we have the following benchmark suite, with "t1", "t2" and "t3" being the tags signifying in which thread will the benchmark group run.

using BenchmarkTools
threadtags = ["t1", "t2", "t3"]
bg = BenchmarkGroup([], 
		    "sum" => BenchmarkGroup([],
			      "1d" => BenchmarkGroup([threadtags[1]]),
			      "2d" => BenchmarkGroup([threadtags[2]])),
		    "prod" => BenchmarkGroup([threadtags[3]]));
foreach(100:100:500) do k
	r = rand(k)
	r2 = rand(k, k)
	bg["sum"]["1d"][k] = @benchmarkable sum($r)
	bg["sum"]["2d"][k] = @benchmarkable sum($r2)
	bg["prod"][k] = @benchmarkable prod($r)
end

giving

julia> bg
2-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "sum" => 2-element BenchmarkTools.BenchmarkGroup:
          tags: []
          "1d" => 5-element BenchmarkTools.BenchmarkGroup:
                  tags: ["t1"]
                  200 => Benchmark(evals=1, seconds=5.0, samples=10000)
                  300 => Benchmark(evals=1, seconds=5.0, samples=10000)
                  500 => Benchmark(evals=1, seconds=5.0, samples=10000)
                  100 => Benchmark(evals=1, seconds=5.0, samples=10000)
                  400 => Benchmark(evals=1, seconds=5.0, samples=10000)
          "2d" => 5-element BenchmarkTools.BenchmarkGroup:
                  tags: ["t2"]
                  200 => Benchmark(evals=1, seconds=5.0, samples=10000)
                  300 => Benchmark(evals=1, seconds=5.0, samples=10000)
                  500 => Benchmark(evals=1, seconds=5.0, samples=10000)
                  100 => Benchmark(evals=1, seconds=5.0, samples=10000)
                  400 => Benchmark(evals=1, seconds=5.0, samples=10000)
  "prod" => 5-element BenchmarkTools.BenchmarkGroup:
          tags: ["t3"]
          200 => Benchmark(evals=1, seconds=5.0, samples=10000)
          300 => Benchmark(evals=1, seconds=5.0, samples=10000)
          500 => Benchmark(evals=1, seconds=5.0, samples=10000)
          100 => Benchmark(evals=1, seconds=5.0, samples=10000)
          400 => Benchmark(evals=1, seconds=5.0, samples=10000)

Parallelizing it

I used the typical Base.Threads module to run the benchmarks on parallel.
First create a channel with the threads

c = Channel{String}(ch -> foreach(i -> put!(ch, i), threadtags), 1)

and then collect the results on a different channel

bgresvec = Channel{BenchmarkGroup}(length(threadtags)) do ch 
	runbenc(ttag) = put!(ch, run(bg[@tagged ttag]))
	Threads.foreach(runbenc, c)
end |> collect

I know packages like the loving Transducers.jl can simplify the process a lot, but I wanted as less dependencies as possible; so I basically followed the documentation.

Restoring the BenchmarkGroup

bgresvec is a vector of BenchmarkGroup containing Trials.
To reconstruct the BenchmarkGroup I defined the function

# original, results in vector, thread tags
function reconstructbenchmarkgroup(bgorg, bgres, ttags)
	bg = deepcopy(bgorg)
	for t in (ttags)
		# find bgres with that tag
		i = findfirst(bg -> length(bg[@tagged t]) !== 0 , bgresvec)
		for (k,_) in leaves(bg[@tagged t])
			bg[k] = bgres[i][k]
		end
	end
	return bg
end

and now I happily get the BenchmarkGroup as the orginal but with the trials instead

julia> bgres = reconstructbenchmarkgroup(bg, bgresvec, threadtags)
2-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "sum" => 2-element BenchmarkTools.BenchmarkGroup:
          tags: []
          "1d" => 5-element BenchmarkTools.BenchmarkGroup:
                  tags: ["t1"]
                  200 => Trial(46.000 ns)
                  300 => Trial(61.000 ns)
                  500 => Trial(69.000 ns)
                  100 => Trial(37.000 ns)
                  400 => Trial(66.000 ns)
          "2d" => 5-element BenchmarkTools.BenchmarkGroup:
                  tags: ["t2"]
                  200 => Trial(3.274 μs)
                  300 => Trial(7.278 μs)
                  500 => Trial(29.118 μs)
                  100 => Trial(768.000 ns)
                  400 => Trial(13.176 μs)
  "prod" => 5-element BenchmarkTools.BenchmarkGroup:
          tags: ["t3"]
          200 => Trial(38.000 ns)
          300 => Trial(54.000 ns)
          500 => Trial(54.000 ns)
          100 => Trial(29.000 ns)
          400 => Trial(64.000 ns)

Comparison with serial benchmarking

Sadly, I realized that all this effort led to similar or worse times for the parallel benchmarking.

julia> @info "Single threaded"
[ Info: Single threaded

julia> @time for t in threadtags
       @time run(bg[@tagged t])
       @show t
       end
  2.718555 seconds (701.11 k allocations: 26.902 MiB, 98.62% gc time, 0.14% compilation time)
t = "t1"
  3.480109 seconds (700.54 k allocations: 26.864 MiB, 81.91% gc time)
t = "t2"
  2.570467 seconds (700.50 k allocations: 26.861 MiB, 98.66% gc time)
t = "t3"
  8.769646 seconds (2.10 M allocations: 80.639 MiB, 91.99% gc time, 0.04% compilation time)

julia> @info "Multi threaded"
[ Info: Multi threaded

julia> @time Threads.@threads for t in threadtags
       @time run(bg[@tagged t])
       @show t
       end
  7.509834 seconds (1.28 M allocations: 48.385 MiB, 99.39% gc time, 0.76% compilation time)
t = "t3"
  7.972059 seconds (1.58 M allocations: 59.616 MiB, 99.18% gc time, 0.15% compilation time)
t = "t1"
  9.627295 seconds (2.10 M allocations: 79.783 MiB, 93.64% gc time, 0.36% compilation time)
t = "t2"
  9.669746 seconds (2.13 M allocations: 81.953 MiB, 93.23% gc time, 1.03% compilation time)

You can see the numbers. Parallel benchmarking uses more memory, more garble collection and more compilation time. Especially the compilation time, I cannot explain… anyone cares demystifying this ?

The benchmarks certainly run on parallel but each of them are on average 2-3x slower.
Do you know why and how to avoid this ?
Do your machines maybe run faster w.r.t the serial procedure ?

Ofc, the benchmarking results are also inconsistent between the serial and parallel evaluation.

julia> bgresserial = run(bg);

julia> judge(median(bgresserial), median(bgres))
2-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "sum" => 2-element BenchmarkTools.BenchmarkGroup:
          tags: []
          "1d" => 5-element BenchmarkTools.BenchmarkGroup:
                  tags: ["t1"]
                  200 => TrialJudgement(-44.23% => improvement)
                  300 => TrialJudgement(-9.52% => improvement)
                  500 => TrialJudgement(-4.26% => invariant)
                  100 => TrialJudgement(+0.00% => invariant)
                  400 => TrialJudgement(-4.17% => invariant)
          "2d" => 5-element BenchmarkTools.BenchmarkGroup:
                  tags: ["t2"]
                  200 => TrialJudgement(-9.82% => improvement)
                  300 => TrialJudgement(-46.43% => improvement)
                  500 => TrialJudgement(+6.79% => regression)
                  100 => TrialJudgement(+0.72% => invariant)
                  400 => TrialJudgement(+4.73% => invariant)
  "prod" => 5-element BenchmarkTools.BenchmarkGroup:
          tags: ["t3"]
          200 => TrialJudgement(-9.76% => improvement)
          300 => TrialJudgement(-27.14% => improvement)
          500 => TrialJudgement(-1.69% => invariant)
          100 => TrialJudgement(-9.38% => improvement)
          400 => TrialJudgement(-7.69% => improvement)

I also tried using ThreadPinning.jl to pin the threads to separate cores.

using ThreadPinning
pinthreads(:cores)

but it didn’t make a difference.

System settings

$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 39 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 12
On-line CPU(s) list: 0-11
Vendor ID: GenuineIntel
Model name: 12th Gen Intel(R) Core™ i5-1235U
CPU family: 6
Model: 154
Thread(s) per core: 2
Core(s) per socket: 10
Socket(s): 1
Stepping: 4
CPU(s) scaling MHz: 66%
CPU max MHz: 4400.0000
CPU min MHz: 400.0000
BogoMIPS: 4992.00
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb intel_pt sha_ni xsaveopt xsavec xgetbv1 xsaves split_lock_detect avx_vnni dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp hwp_pkg_req hfi umip pku ospke waitpkg gfni vaes vpclmulqdq rdpid movdiri movdir64b fsrm md_clear serialize arch_lbr ibt flush_l1d arch_capabilities
Virtualization features:
Virtualization: VT-x
Caches (sum of all):
L1d: 352 KiB (10 instances)
L1i: 576 KiB (10 instances)
L2: 6.5 MiB (4 instances)
L3: 12 MiB (1 instance)
NUMA:
NUMA node(s): 1
NUMA node0 CPU(s): 0-11
Vulnerabilities:
Gather data sampling: Not affected
Itlb multihit: Not affected
L1tf: Not affected
Mds: Not affected
Meltdown: Not affected
Mmio stale data: Not affected
Retbleed: Not affected
Spec rstack overflow: Not affected
Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Spectre v2: Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Srbds: Not affected
Tsx async abort: Not affected

julia> ThreadPinning.threadinfo()
System: 10 cores (no SMT), 1 sockets, 1 NUMA domains
| 0,2,4,5,6,7,8,9,10,11,1,3 | 
# = Julia thread, # = HT, # = Julia thread on HT, | = Socket seperator
Julia threads: 3
├ Occupied CPU-threads: 3
└ Mapping (Thread => CPUID): 1 => 0, 2 => 2, 3 => 4,

julia> versioninfo()
Julia Version 1.9.4
Commit 8e5136fa297 (2023-11-14 08:46 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 12 × 12th Gen Intel(R) Core(TM) i5-1235U
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, alderlake)
  Threads: 3 on 12 virtual cores

(benchs) pkg> st
Status `~/code/julia/benchs/Project.toml`
  [6e4b80f9] BenchmarkTools v1.3.2
  [811555cd] ThreadPinning v0.7.17

Fearing that there might be some repercussions that the “views” of the overall BenchmarkGroup was shared between the threads, I moved on to a simpler calculation where BenchmarkGroups were defined separately.

using BenchmarkTools
using ThreadPinning

pinthreads(:cores)

const bgsum1d = BenchmarkGroup(["sum1d"])
const bgsum2d = BenchmarkGroup(["sum2d"])
const bgprod = BenchmarkGroup(["prod"])

foreach(100:100:500) do k
	r = rand(k)
	r2 = rand(k, k)
	bgsum1d[k] = @benchmarkable sum($r)
	bgsum2d[k] = @benchmarkable sum($r2)
	bgprod[k] = @benchmarkable prod($r)
end

@info "Single threaded"
@time for bgr in [bgsum1d, bgsum2d, bgprod]
	@time run(bgr)
	@info first(bgr.tags)
end

println();

@info "Multi threaded"
@time Threads.@threads for bgr in [bgsum1d, bgsum2d, bgprod] 
	@time run(bgr)
	@info first(bgr.tags)
end

with results

julia> include("parbenchsep.jl")
[ Info: Single threaded
  2.020265 seconds (1.20 M allocations: 61.005 MiB, 85.76% gc time, 12.98% compilation time)
[ Info: sum1d
  2.399186 seconds (786.71 k allocations: 32.780 MiB, 71.77% gc time, 3.34% compilation time)
[ Info: sum2d
  1.832098 seconds (799.71 k allocations: 33.608 MiB, 93.31% gc time, 4.67% compilation time)
[ Info: prod
  6.259793 seconds (2.79 M allocations: 127.433 MiB, 82.49% gc time, 6.96% compilation time)

[ Info: Multi threaded
  3.215584 seconds (1.39 M allocations: 52.394 MiB, 98.42% gc time, 0.86% compilation time)
[ Info: prod
  3.577352 seconds (1.60 M allocations: 60.792 MiB, 98.13% gc time)
[ Info: sum1d
  4.963936 seconds (2.13 M allocations: 80.937 MiB, 87.77% gc time, 1.12% compilation time)
[ Info: sum2d
  5.030964 seconds (2.19 M allocations: 85.061 MiB, 86.60% gc time, 2.34% compilation time)

So, now the benchmark is a bit better than before, but again there is this unexplainable to me ratio of 2x slower when benchmarks run on threads.

Do you know what you’re trying to measure? You run a benchmark suite with run, but then discard the results and instead just print the total run time and other total statistics with @time. I think you’re probably not aware of that, because I can’t imagine why’d you want to do that.

Think of this like meta-benchmarking, i.e. measuring with @time the real benchmarks. I always benchmark the same utility, but I do it with either using multi-threading or not. That’s what the @time measurements are about; how much gain do I get for benchmarking the same operations using multiple threads vs serially.
In the original post the benchmarking of the applications are stored in bgres using multiple threads and in bgresserial with a single one.
In the second post, I do not even store the results and I am only printing the meta-benchmarks measurements with @time, because that’s the interest in this discussion.

I don’t quite undertand your goals, but if you want to compare benchmark runs one to another, I think the first thing to do is to ensure that the number of samples and evaluations is constant, e.g. by setting a very high time budget that is sure not to get spent, and then checking that all the numbers of samples and evaluations match. In addition to ensuring that the number of evaluations is sufficient, as always.

That said, trying to run multiple benchmarks in parallel really seems antagonistic to the very concept of benchmarking.

1 Like

I don’t quite undertand your goals
…trying to run multiple benchmarks in parallel really seems antagonistic to the very concept of benchmarking.

I rephrase. This is a toy example, and in reality I have many serial benchmarks that can take a looong time to finish. All of them evaluate serial code. So I figured I will use all my cores to finish the benchmarking sooner. I hope this makes more sense now ?

I think the first thing to do is to ensure that the number of samples and evaluations is constant

Good point. I hardcoded them as you see in the following code.

using BenchmarkTools
using ThreadPinning
using Test

pinthreads([3,2,1])

const bgsum1d = BenchmarkGroup(["sum1d"])
const bgsum2d = BenchmarkGroup(["sum2d"])
const bgprod = BenchmarkGroup(["prod"])

foreach(100:100:500) do k
	r = rand(k)
	r2 = rand(k, k)
	bgsum1d[k] = @benchmarkable sum($r) samples=100 evals=100 seconds=10000
	bgsum2d[k] = @benchmarkable sum($r2) samples=100 evals=100 seconds=10000
	bgprod[k] = @benchmarkable prod($r) samples=100 evals=100 seconds=10000
end

@info "Single threaded: threadid => coreid $(Threads.threadid()) => $(getcpuid())"
brser = @time [ 
	let 
		@time r = run(bgr)
		@info first(bgr.tags)
		r
	end for bgr in [bgsum1d, bgsum2d, bgprod] ]

println(); println();

@info "Multi threaded"
c = Channel{BenchmarkGroup}(ch -> foreach(i -> put!(ch, i), [bgsum1d, bgsum2d, bgprod]), 1)
brpar = @time Channel{BenchmarkGroup}(3) do ch 
	function runbenc(bg) 
		res = @time run(bg)
		@info "$(first(bg.tags)): threadid => coreid $(Threads.threadid()) => $(getcpuid())"
		put!(ch, res)
	end
	Threads.foreach(runbenc, c)
end |> collect

# testing equal bench params
for (brs, brp)  in zip(leaves.(brser), leaves.(brpar))
	for (s,p) in zip(getindex.(brs, 2), getindex.(brp, 2))
		@info @test s.params == p.params
	end
end

giving

julia> include("file.jl")
[ Info: Single threaded: threadid => coreid 1 => 3
  1.830508 seconds (71.34 k allocations: 4.691 MiB, 95.03% gc time, 4.84% compilation time)
[ Info: sum1d
  2.481550 seconds (66.46 k allocations: 4.364 MiB, 70.09% gc time, 3.00% compilation time)
[ Info: sum2d
  1.830897 seconds (71.34 k allocations: 4.691 MiB, 95.03% gc time, 4.83% compilation time)
[ Info: prod
  6.182202 seconds (250.05 k allocations: 16.502 MiB, 84.42% gc time, 4.69% compilation time)


[ Info: Multi threaded
  1.157849 seconds (9.63 k allocations: 439.297 KiB, 99.73% gc time, 31.24% compilation time)
  1.731301 seconds (7.18 k allocations: 267.273 KiB, 99.79% gc time)
[ Info: sum1d: threadid => coreid 3 => 1
[ Info: prod: threadid => coreid 1 => 3
  4.703555 seconds (24.02 k allocations: 985.664 KiB, 85.77% gc time, 7.10% compilation time)
[ Info: sum2d: threadid => coreid 2 => 2
  4.766122 seconds (71.14 k allocations: 4.176 MiB, 84.65% gc time, 9.20% compilation time)
# All tests pass

Btw I shielded 1-3 CPUs from the system processes with cset shield --kthreads=on, to mitigate any noise. I also tried switching cores around with ThreadPinning.jl, but the results are similar.

So now, in this toy example, I start getting a bit of an improvement by benchmarking using all my threads.
However, I would expect them to finish at ~ 2.5 secs or just a bit more, which is the time of the longest benchmarking in the serial (sum2d).
Actually, now sum1d and prod do more or less the same; so the question is why sum2d is 2x slower in the multi-threading case ?

Maybe I should just accept it as multi-threading offset noise, and move on :grimacing:

The benchmark judge for sum2d are not that much different between the serial and parallel

# judge(median(serial["sum2d"]) , median(parallel["sum2d"]))
5-element BenchmarkTools.BenchmarkGroup:
  tags: ["sum2d"]
  200 => TrialJudgement(-3.49% => invariant)
  300 => TrialJudgement(-4.21% => invariant)
  500 => TrialJudgement(-3.89% => invariant)
  100 => TrialJudgement(+0.58% => invariant)
  400 => TrialJudgement(-4.34% => invariant)

So maybe, indeed there is a lot of signaling going on between the benchmarks but the benchmarks themselves run quite normally (?).

btw the related github issue parallel execution · Issue #186 · JuliaCI/BenchmarkTools.jl · GitHub

From the related Github issue that you linked, comment by @Sukera:

An issue may be thermal throttling for intensive benchmarks, as well as cache pollution for shared cache architectures.

That’s why I said:

trying to run multiple benchmarks in parallel really seems antagonistic to the very concept of benchmarking.

Whenever you’re benchmarking something, you want to have the benchmark be the only significant workload on the system, to prevent measurement error.

2 Likes

Yeah, the BenchmarksTools.jl docs also have a section about this. I didn’t configure anything special towards that. I understand that this is a greater concern for multi-core benchmarking but I don’t think my example hits on that, because sum2d is consistently about 2x slower.

So, you think that multi-threaded benchmarking is a bad idea overall ?

2 Likes

BenchmarkTools.jl does more than only running your function. It also:

This will not be a good way to benchmark, there will be way too much noise from the other test runners that will influence the results of each other.

3 Likes

To be clear, sometimes one wants to measure the performance of a parallel program or similar, and that’s perfectly OK, but that’s not what you’re doing (trying to do multiple benchmarks in parallel).

As someone who has run benchmarks on HPC systems for customer acceptance,
the normal way to do benchmarks is to have that benchmark as the only user-level task running.
Indeed you would normally flush all caches before a benchmark run.
You also would pay attention to BIOS setttings, and may make special BIOS Settings and document these.

As people have said, running different benchmarks together would cause confusion.
Having said that, mayeb your concept could be worked up into a real world stress test - ‘what if different users run X Y Z on the ssytem’

2 Likes

Okey. At this point I am already persuaded but I after the GC thingy, I thought what if the threads were in different processes. Then, each process will have a local GC and it might work out better.

So, as a final experiment, I built the processes such that the BenchmarkGroup is also defined inside each process and then benchmarked as well.

Funnily when I do

@spawnat 2 runbench(getbenchgroup1)
@spawnat 2 runbench(getbenchgroup2)
@spawnat 2 runbench(getbenchgroup3)

i.e. I use the same process and runs serially
I get the typical one thread results ~[1.7, 1.7, 2.4]

But when I give it to all processes e.g.

@spawnat 2 runbench(getbenchgroup1)
@spawnat 3 runbench(getbenchgroup2)
@spawnat 4 runbench(getbenchgroup3)

I get something like ~[2.5, 2.5, 3.1]

Conclusion: I am over my head and parallel benchmarking is tough to be consistent and there is no easy way around. Thanks everybody !

1 Like