Motivation
Running many benchmarks can be slow.
Since almost all of them are independent, they should in principle run easily in parallel.
Problem
It turns out that running the benchmarks in parallel takes more or less the same time as running them serially, with more allocations, more garbage collection and more compilation time.
My approach
I was experimenting with how to run a BenchmarkTools.BenchmarkGroup in parallel. I came up with using the tags to identify which groups should run on which threads.
The benchmarking problem instance
For example, let's say we have the following benchmark suite, with "t1", "t2" and "t3" being the tags signifying on which thread each benchmark group will run.
using BenchmarkTools

threadtags = ["t1", "t2", "t3"]

bg = BenchmarkGroup([],
    "sum" => BenchmarkGroup([],
        "1d" => BenchmarkGroup([threadtags[1]]),
        "2d" => BenchmarkGroup([threadtags[2]])),
    "prod" => BenchmarkGroup([threadtags[3]]));

foreach(100:100:500) do k
    r = rand(k)
    r2 = rand(k, k)
    bg["sum"]["1d"][k] = @benchmarkable sum($r)
    bg["sum"]["2d"][k] = @benchmarkable sum($r2)
    bg["prod"][k] = @benchmarkable prod($r)
end
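As an aside, the $ interpolation in the @benchmarkable calls matters: it splices the value in when the benchmark is defined, so the measurement does not also include untyped global-variable access. A minimal illustration:

x = rand(1000)
bglobal = @benchmarkable sum(x)   # x is resolved as a global at every evaluation
binterp = @benchmarkable sum($x)  # the vector is captured at definition time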
The assembled suite then looks like this:
julia> bg
2-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "sum" => 2-element BenchmarkTools.BenchmarkGroup:
    tags: []
    "1d" => 5-element BenchmarkTools.BenchmarkGroup:
      tags: ["t1"]
      200 => Benchmark(evals=1, seconds=5.0, samples=10000)
      300 => Benchmark(evals=1, seconds=5.0, samples=10000)
      500 => Benchmark(evals=1, seconds=5.0, samples=10000)
      100 => Benchmark(evals=1, seconds=5.0, samples=10000)
      400 => Benchmark(evals=1, seconds=5.0, samples=10000)
    "2d" => 5-element BenchmarkTools.BenchmarkGroup:
      tags: ["t2"]
      200 => Benchmark(evals=1, seconds=5.0, samples=10000)
      300 => Benchmark(evals=1, seconds=5.0, samples=10000)
      500 => Benchmark(evals=1, seconds=5.0, samples=10000)
      100 => Benchmark(evals=1, seconds=5.0, samples=10000)
      400 => Benchmark(evals=1, seconds=5.0, samples=10000)
  "prod" => 5-element BenchmarkTools.BenchmarkGroup:
    tags: ["t3"]
    200 => Benchmark(evals=1, seconds=5.0, samples=10000)
    300 => Benchmark(evals=1, seconds=5.0, samples=10000)
    500 => Benchmark(evals=1, seconds=5.0, samples=10000)
    100 => Benchmark(evals=1, seconds=5.0, samples=10000)
    400 => Benchmark(evals=1, seconds=5.0, samples=10000)
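Indexing with the @tagged macro is what makes the split possible: bg[@tagged "t1"] keeps only the subgroups whose tags (or keys) match the predicate. For example:

# keeps only the "1d" subtree, since it is the one tagged "t1"
bg_t1 = bg[@tagged "t1"]

# tag predicates can be combined with &&, || and !
bg_t1_or_t3 = bg[@tagged "t1" || "t3"]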
Parallelizing it
I used the standard Base.Threads module to run the benchmarks in parallel. First, create a channel holding the thread tags
c = Channel{String}(ch -> foreach(i -> put!(ch, i), threadtags), 1)
and then collect the results into a different channel
bgresvec = Channel{BenchmarkGroup}(length(threadtags)) do ch
    runbenc(ttag) = put!(ch, run(bg[@tagged ttag]))
    Threads.foreach(runbenc, c)
end |> collect
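For reference, here is the same producer/consumer pattern in isolation, without BenchmarkTools (the names and the uppercase workload are made up for illustration):

# producer: a buffered channel that a feeder task pre-fills with work items
work = Channel{String}(ch -> foreach(t -> put!(ch, t), ["a", "b", "c"]), 1)

# consumers: Threads.foreach spawns tasks (by default one per thread) that
# keep taking items off the channel until it is closed
results = Channel{String}(3) do out
    Threads.foreach(t -> put!(out, uppercase(t)), work)
end |> collect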
I know packages like the lovely Transducers.jl can simplify the process a lot, but I wanted as few dependencies as possible, so I basically followed the documentation.
Restoring the BenchmarkGroup
bgresvec is a vector of BenchmarkGroups containing Trials. To reconstruct the original BenchmarkGroup I defined the function
# original suite, vector of results, thread tags
function reconstructbenchmarkgroup(bgorg, bgres, ttags)
    bg = deepcopy(bgorg)
    for t in ttags
        # find the result group that carries this tag
        i = findfirst(g -> length(g[@tagged t]) != 0, bgres)
        # copy each trial over the corresponding leaf of the original layout
        for (k, _) in leaves(bg[@tagged t])
            bg[k] = bgres[i][k]
        end
    end
    return bg
end
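This relies on two BenchmarkTools behaviors: leaves iterates (keypath, value) pairs over all nested entries, and a BenchmarkGroup can be indexed by such a key-path vector, which is what lets bg[k] = bgres[i][k] reach across the nesting. A quick check:

# leaves yields pairs like (Any["sum", "1d", 100], <benchmark>)
for (keypath, bench) in leaves(bg)
    # indexing with the key-path vector walks down the nested groups
    @assert bg[keypath] === bench
end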
With that in place, I happily get a BenchmarkGroup shaped like the original, but with the trials in place of the benchmarks
julia> bgres = reconstructbenchmarkgroup(bg, bgresvec, threadtags)
2-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "sum" => 2-element BenchmarkTools.BenchmarkGroup:
    tags: []
    "1d" => 5-element BenchmarkTools.BenchmarkGroup:
      tags: ["t1"]
      200 => Trial(46.000 ns)
      300 => Trial(61.000 ns)
      500 => Trial(69.000 ns)
      100 => Trial(37.000 ns)
      400 => Trial(66.000 ns)
    "2d" => 5-element BenchmarkTools.BenchmarkGroup:
      tags: ["t2"]
      200 => Trial(3.274 μs)
      300 => Trial(7.278 μs)
      500 => Trial(29.118 μs)
      100 => Trial(768.000 ns)
      400 => Trial(13.176 μs)
  "prod" => 5-element BenchmarkTools.BenchmarkGroup:
    tags: ["t3"]
    200 => Trial(38.000 ns)
    300 => Trial(54.000 ns)
    500 => Trial(54.000 ns)
    100 => Trial(29.000 ns)
    400 => Trial(64.000 ns)
Comparison with serial benchmarking
Sadly, I realized that all this effort led to similar or worse times for the parallel benchmarking.
julia> @info "Single threaded"
[ Info: Single threaded
julia> @time for t in threadtags
           @time run(bg[@tagged t])
           @show t
       end
2.718555 seconds (701.11 k allocations: 26.902 MiB, 98.62% gc time, 0.14% compilation time)
t = "t1"
3.480109 seconds (700.54 k allocations: 26.864 MiB, 81.91% gc time)
t = "t2"
2.570467 seconds (700.50 k allocations: 26.861 MiB, 98.66% gc time)
t = "t3"
8.769646 seconds (2.10 M allocations: 80.639 MiB, 91.99% gc time, 0.04% compilation time)
julia> @info "Multi threaded"
[ Info: Multi threaded
julia> @time Threads.@threads for t in threadtags
           @time run(bg[@tagged t])
           @show t
       end
7.509834 seconds (1.28 M allocations: 48.385 MiB, 99.39% gc time, 0.76% compilation time)
t = "t3"
7.972059 seconds (1.58 M allocations: 59.616 MiB, 99.18% gc time, 0.15% compilation time)
t = "t1"
9.627295 seconds (2.10 M allocations: 79.783 MiB, 93.64% gc time, 0.36% compilation time)
t = "t2"
9.669746 seconds (2.13 M allocations: 81.953 MiB, 93.23% gc time, 1.03% compilation time)
You can see the numbers: parallel benchmarking uses more memory, more garbage collection and more compilation time. The compilation time especially I cannot explain… does anyone care to demystify this?
The benchmarks certainly run in parallel, but each of them is on average 2-3x slower.
Do you know why, and how to avoid it?
Or does the parallel run perhaps come out faster than the serial one on your machines?
Of course, the benchmarking results are also inconsistent between the serial and the parallel evaluation.
julia> bgresserial = run(bg);
julia> judge(median(bgresserial), median(bgres))
2-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "sum" => 2-element BenchmarkTools.BenchmarkGroup:
    tags: []
    "1d" => 5-element BenchmarkTools.BenchmarkGroup:
      tags: ["t1"]
      200 => TrialJudgement(-44.23% => improvement)
      300 => TrialJudgement(-9.52% => improvement)
      500 => TrialJudgement(-4.26% => invariant)
      100 => TrialJudgement(+0.00% => invariant)
      400 => TrialJudgement(-4.17% => invariant)
    "2d" => 5-element BenchmarkTools.BenchmarkGroup:
      tags: ["t2"]
      200 => TrialJudgement(-9.82% => improvement)
      300 => TrialJudgement(-46.43% => improvement)
      500 => TrialJudgement(+6.79% => regression)
      100 => TrialJudgement(+0.72% => invariant)
      400 => TrialJudgement(+4.73% => invariant)
  "prod" => 5-element BenchmarkTools.BenchmarkGroup:
    tags: ["t3"]
    200 => TrialJudgement(-9.76% => improvement)
    300 => TrialJudgement(-27.14% => improvement)
    500 => TrialJudgement(-1.69% => invariant)
    100 => TrialJudgement(-9.38% => improvement)
    400 => TrialJudgement(-7.69% => improvement)
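To read these judgements: judge(a, b) compares the first estimate against the second as the baseline, so the negative percentages mean the serial medians came out faster here. A minimal self-contained example:

a = run(@benchmarkable sum($(rand(100))))
b = run(@benchmarkable sum($(rand(100))))

# judge(x, y) measures x against baseline y; a median-time ratio more than
# the default 5% tolerance below 1.0 is reported as an improvement of x
judge(median(a), median(b))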
I also tried using ThreadPinning.jl to pin the threads to separate cores.
using ThreadPinning
pinthreads(:cores)
but it didn't make a difference.
System settings
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 39 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 12
On-line CPU(s) list: 0-11
Vendor ID: GenuineIntel
Model name: 12th Gen Intel(R) Core™ i5-1235U
CPU family: 6
Model: 154
Thread(s) per core: 2
Core(s) per socket: 10
Socket(s): 1
Stepping: 4
CPU(s) scaling MHz: 66%
CPU max MHz: 4400.0000
CPU min MHz: 400.0000
BogoMIPS: 4992.00
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb intel_pt sha_ni xsaveopt xsavec xgetbv1 xsaves split_lock_detect avx_vnni dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp hwp_pkg_req hfi umip pku ospke waitpkg gfni vaes vpclmulqdq rdpid movdiri movdir64b fsrm md_clear serialize arch_lbr ibt flush_l1d arch_capabilities
Virtualization features:
Virtualization: VT-x
Caches (sum of all):
L1d: 352 KiB (10 instances)
L1i: 576 KiB (10 instances)
L2: 6.5 MiB (4 instances)
L3: 12 MiB (1 instance)
NUMA:
NUMA node(s): 1
NUMA node0 CPU(s): 0-11
Vulnerabilities:
Gather data sampling: Not affected
Itlb multihit: Not affected
L1tf: Not affected
Mds: Not affected
Meltdown: Not affected
Mmio stale data: Not affected
Retbleed: Not affected
Spec rstack overflow: Not affected
Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Spectre v2: Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Srbds: Not affected
Tsx async abort: Not affected
julia> ThreadPinning.threadinfo()
System: 10 cores (no SMT), 1 sockets, 1 NUMA domains
| 0,2,4,5,6,7,8,9,10,11,1,3 |
# = Julia thread, # = HT, # = Julia thread on HT, | = Socket seperator
Julia threads: 3
├ Occupied CPU-threads: 3
└ Mapping (Thread => CPUID): 1 => 0, 2 => 2, 3 => 4,
julia> versioninfo()
Julia Version 1.9.4
Commit 8e5136fa297 (2023-11-14 08:46 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 12 × 12th Gen Intel(R) Core(TM) i5-1235U
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-14.0.6 (ORCJIT, alderlake)
Threads: 3 on 12 virtual cores
(benchs) pkg> st
Status `~/code/julia/benchs/Project.toml`
[6e4b80f9] BenchmarkTools v1.3.2
[811555cd] ThreadPinning v0.7.17