Has anyone tried to get BenchmarkTools.jl to play nicely with MPI? I've been struggling with intermittent hangs*. Without having thought about it too hard, it seems like it should be possible to make @benchmark MPI-aware fairly easily (just do all the control flow on rank-0), so I might have a look at making a package to do that. Has anyone done it already, or does anyone have simpler/better solutions?

* My current guess is that slight differences in timing mean that different processes try to run different numbers of evaluations inside an @benchmark call. For the moment I'm working around this just by setting the seconds kwarg to Inf.
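For concreteness, here is a minimal sketch of that workaround (my_mpi_function is just a stand-in for the real code I'm benchmarking, and I'm assuming MPI.jl's Allreduce(obj, op, comm) form): with seconds=Inf the time budget never cuts a run short, so the fixed samples and evals make every rank execute exactly the same number of evaluations.

using MPI
using BenchmarkTools

MPI.Init()
comm = MPI.COMM_WORLD

# Stand-in for the real code under test; it contains a collective call,
# so every rank has to enter it the same number of times.
my_mpi_function() = MPI.Allreduce(1.0, +, comm)

# seconds=Inf disables the time budget, so the run is governed entirely by
# samples and evals, which are identical on every rank.
b = @benchmark my_mpi_function() samples=100 evals=1 seconds=Inf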
You can try forcing an MPI barrier at the end of each sample using the teardown keyword. Also make sure that all processes do the same number of evaluations per sample (by setting evals), since the teardown expression is only executed at the end of each sample.
The following example works for me (tested here on 4 MPI processes). Note that the amount of work in f
is proportional to the MPI rank + 1.
using MPI
using BenchmarkTools

function f(n)
    x = 0.0
    for i = 1:n
        x += sin(i)
    end
    x
end
MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)
n = 10_000_000 * (rank + 1)  # amount of work to do by this process
b = @benchmark f($n) evals=1 teardown=(MPI.Barrier(comm))
sleep(0.5 * rank) # make sure not all processes print at the same time
show(stdout, MIME("text/plain"), b)
println()
Without teardown, each process performs a different number of samples (16/8/5/4 in my tests). With teardown, all processes run the same number of samples (4), since the barrier makes every sample take as long as the slowest rank, so the time budget allows the same number of samples everywhere:
# > mpirun -n 4 julia --project mpibench.jl

## Rank 0
BenchmarkTools.Trial: 4 samples with 1 evaluation.
 Range (min … max):  336.318 ms … 342.024 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     339.971 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   339.571 ms ±   2.892 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  [histogram omitted]
  336 ms           Histogram: frequency by time          342 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

## Rank 1
BenchmarkTools.Trial: 4 samples with 1 evaluation.
 Range (min … max):  700.438 ms … 718.264 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     710.761 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   710.056 ms ±   7.768 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  [histogram omitted]
  700 ms           Histogram: frequency by time          718 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

## Rank 2
BenchmarkTools.Trial: 4 samples with 1 evaluation.
 Range (min … max):  1.063 s … 1.111 s     ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.083 s               ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.085 s ± 21.831 ms   ┊ GC (mean ± σ):  0.00% ± 0.00%

  [histogram omitted]
  1.06 s           Histogram: frequency by time          1.11 s <

 Memory estimate: 0 bytes, allocs estimate: 0.

## Rank 3
BenchmarkTools.Trial: 4 samples with 1 evaluation.
 Range (min … max):  1.405 s … 1.501 s     ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.461 s               ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.457 s ± 40.362 ms   ┊ GC (mean ± σ):  0.00% ± 0.00%

  [histogram omitted]
  1.41 s           Histogram: frequency by time           1.5 s <

 Memory estimate: 0 bytes, allocs estimate: 0.
Thanks @jipolanco! Sorry, I didn't give enough details in my original post. For the case I was looking at, I'd already set the number of evals to 1 (because I was benchmarking code that takes a fairly long time, ~100 ms, to run), and I'd tried having an MPI.Barrier() in (a function called in) teardown. Most of the time the @benchmark run was OK, but especially on larger numbers of processes (I was going up to 48) it would occasionally hang. For those runs I'd called something like
@benchmark(my_long_mpi_function(),
           setup=my_setup(),
           teardown=my_teardown(),  # note: includes at least one MPI.Barrier()
           seconds=60,
           samples=100,
           evals=1)
I guess what happened is that just occasionally some rank(s) would try to run one more sample than the others and cause a deadlock…
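One way to check that hypothesis on the runs that do complete (a sketch for illustration, not something from my original runs; it assumes comm = MPI.COMM_WORLD as in the earlier example) is to compare the per-rank sample counts once the benchmark returns:

using MPI

# ... after b = @benchmark(...) has returned on every rank:
nsamp = length(b.times)                     # samples this rank actually took
nmin  = MPI.Allreduce(nsamp, MPI.MIN, comm)
nmax  = MPI.Allreduce(nsamp, MPI.MAX, comm)
if MPI.Comm_rank(comm) == 0 && nmin != nmax
    @warn "ranks took different numbers of samples" nmin nmax
end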
I tried to make a package that would monkey-patch BenchmarkTools.jl to make the places where a timer is evaluated MPI-compatible (by broadcasting the timing from rank-0 to all processes): https://github.com/johnomotani/BenchmarkToolsMPI.jl
Unfortunately that approach seems not to work: trying it on my test case gave strange errors, so I think overwriting functions from another module breaks compilation. The "fix" for the BenchmarkTools.jl functions is pretty simple (it modifies one line in each of two functions), but I don't know how to inject it in a sensible way. I don't think BenchmarkTools.jl should depend on MPI.jl, so I can't change it upstream, but I would rather not copy-paste the entire code just to slightly modify two small functions. Since I have a workaround, I'm going to give up here, unless anyone has a smart suggestion!
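To make the idea behind that change concrete, here is a self-contained sketch outside of BenchmarkTools (the loop and names are mine, not BenchmarkTools internals; note that the root argument of MPI.bcast is a keyword in recent MPI.jl, while older versions take it positionally): rank 0 decides whether to take another sample and broadcasts that decision, so no rank can run an extra sample and deadlock in a collective call.

using MPI

MPI.Init()
comm = MPI.COMM_WORLD

# Rank 0 evaluates the stopping condition; broadcasting the result means
# every rank leaves the sampling loop on the same iteration.
function keep_sampling(start, seconds, nsamples, maxsamples, comm)
    decision = (time() - start) < seconds && nsamples < maxsamples
    return MPI.bcast(decision, comm; root=0)  # older MPI.jl: MPI.bcast(decision, 0, comm)
end

function run_samples(f, comm; seconds=5.0, maxsamples=100)
    start = time()
    times = Float64[]
    while keep_sampling(start, seconds, length(times), maxsamples, comm)
        MPI.Barrier(comm)
        push!(times, @elapsed f())
    end
    return times
end

times = run_samples(() -> sum(sin, 1:10_000_000), comm)  # stand-in workload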