Benchmarking MPI programs?

Has anyone tried to get BenchmarkTools.jl to play nicely with MPI? I’ve been struggling with intermittent hangs*. Without having thought too hard about it, it seems like it should be possible to make @benchmark MPI-aware fairly easily (just do all the control flow on rank 0), so I might have a look at making a package to do that. Has anyone done this already, or does anyone have simpler/better solutions?

* My current guess is that slight differences in timing mean that different processes try to run different numbers of evaluations inside an @benchmark call. For the moment I’m working around this by setting the seconds keyword argument to Inf.

You can try forcing an MPI barrier at the end of each sample using teardown. Also make sure that all processes do the same number of evaluations per sample (by setting evals), since the teardown expression is only executed at the end of each sample.

The following example works for me (tested here on 4 MPI processes). Note that the amount of work in f is proportional to the MPI rank + 1.

using MPI
using BenchmarkTools

function f(n)
    x = 0.0
    for i = 1:n
        x += sin(i)
    end
    x
end

MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)
n = 10_000_000 * (rank + 1)  # amount of work to do by this process

b = @benchmark f($n) evals=1 teardown=(MPI.Barrier(comm))

sleep(0.5 * rank)  # make sure not all processes print at the same time
show(stdout, MIME("text/plain"), b)
println()

Without teardown, each process performs a different number of samples (16/8/5/4 in my tests). With teardown, all processes run the same number of samples (4):

# > mpirun -n 4 julia --project mpibench.jl

## Rank 0
BenchmarkTools.Trial: 4 samples with 1 evaluation.
 Range (min … max):  336.318 ms … 342.024 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     339.971 ms               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   339.571 ms Β±   2.892 ms  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

  ▁                ▁                                          β–ˆ
  β–ˆβ–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–ˆβ–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–ˆ ▁
  336 ms           Histogram: frequency by time          342 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

## Rank 1
BenchmarkTools.Trial: 4 samples with 1 evaluation.
 Range (min … max):  700.438 ms … 718.264 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     710.761 ms               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   710.056 ms Β±   7.768 ms  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

  β–ˆ                       β–ˆ                     β–ˆ             β–ˆ
  β–ˆβ–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–ˆβ–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–ˆβ–β–β–β–β–β–β–β–β–β–β–β–β–β–ˆ ▁
  700 ms           Histogram: frequency by time          718 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

## Rank 2
BenchmarkTools.Trial: 4 samples with 1 evaluation.
 Range (min … max):  1.063 s …   1.111 s  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     1.083 s              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   1.085 s Β± 21.831 ms  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

  β–ˆ        β–ˆ                         β–ˆ                    β–ˆ
  β–ˆβ–β–β–β–β–β–β–β–β–ˆβ–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–ˆβ–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–ˆ ▁
  1.06 s         Histogram: frequency by time        1.11 s <

 Memory estimate: 0 bytes, allocs estimate: 0.

## Rank 3
BenchmarkTools.Trial: 4 samples with 1 evaluation.
 Range (min … max):  1.405 s …   1.501 s  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     1.461 s              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   1.457 s Β± 40.362 ms  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

  β–ˆ                         β–ˆ            β–ˆ                β–ˆ
  β–ˆβ–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–ˆβ–β–β–β–β–β–β–β–β–β–β–β–β–ˆβ–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–ˆ ▁
  1.41 s         Histogram: frequency by time         1.5 s <

 Memory estimate: 0 bytes, allocs estimate: 0.

Thanks @jipolanco! Sorry, I didn’t give enough details in my original post. For the case I was looking at, I’d already set evals to 1 (because I was benchmarking code that takes a fairly long time, ~100 ms, to run). I’d also tried having an MPI.Barrier() in (a function called in) teardown. Most of the time the @benchmark run was OK, but especially on larger numbers of processes (I was going up to 48) it would occasionally hang. For those runs I’d called something like

@benchmark(my_long_mpi_function(),
           setup=my_setup(),
           teardown=my_teardown(), # Note - includes at least one MPI.Barrier()
           seconds=60,
           samples=100,
           evals=1)

I guess what happened is that, just occasionally, some rank(s) would try to run one more sample than the others, causing a deadlock…
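That kind of deadlock can be avoided by taking the stopping decision out of each rank’s hands entirely: only rank 0 checks the time budget, and its decision is broadcast so every rank always runs the same number of samples. A minimal sketch of such a manual loop, bypassing @benchmark altogether (mpi_benchmark and its keyword arguments are hypothetical helpers, not part of BenchmarkTools.jl; it assumes MPI.jl’s MPI.Bcast!(buf, root, comm) form and must be called after MPI.Init()):

```julia
using MPI

# Hypothetical manual sampling loop: rank 0 alone checks the time
# budget, and broadcasts its decision, so no rank can run an extra
# sample and block alone in a collective call.
function mpi_benchmark(f; comm=MPI.COMM_WORLD, seconds=60.0, samples=100)
    times = Float64[]
    t_start = time_ns()
    for _ in 1:samples
        MPI.Barrier(comm)                  # synchronise before each sample
        t0 = time_ns()
        f()
        push!(times, (time_ns() - t0) / 1e9)
        # Only rank 0's check matters; other ranks' values are overwritten.
        flag = Ref{Cint}((time_ns() - t_start) / 1e9 < seconds ? 1 : 0)
        MPI.Bcast!(flag, 0, comm)          # all ranks take rank 0's decision
        flag[] == 1 || break
    end
    return times
end
```

All ranks see rank 0’s flag, so they all leave the loop together, after the same number of samples.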

I tried to make a package that monkey-patches BenchmarkTools.jl, making the places where the timer is evaluated MPI-compatible (by broadcasting the timing from rank 0 to all processes): https://github.com/johnomotani/BenchmarkToolsMPI.jl

Unfortunately that approach doesn’t seem to work: trying it on my test case gave strange errors, so I suspect that overwriting functions from another module breaks precompilation. The ‘fix’ to the BenchmarkTools.jl functions is pretty simple (it modifies one line in each of two functions), but I don’t know how to inject it in a sensible way. I don’t think BenchmarkTools.jl should depend on MPI.jl, so I can’t change it upstream, but I’d rather not copy-paste the entire code just to slightly modify two small functions. Since I have a workaround, I’m going to give up here, unless anyone has a smart suggestion!
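For the record, the one-line change in each function amounts to replacing the local clock reading with a value broadcast from rank 0, along these lines (a sketch of the idea only, not the actual BenchmarkTools.jl internals; synchronized_elapsed is a made-up name, and it assumes MPI.jl’s MPI.Bcast!(buf, root, comm) form):

```julia
using MPI

# Sketch: make an elapsed-time measurement MPI-consistent by using
# rank 0's clock on every rank. All ranks then agree on whether the
# time budget is exhausted and stop after the same number of samples.
function synchronized_elapsed(start_time_ns, comm)
    elapsed = Ref((time_ns() - start_time_ns) / 1e9)
    MPI.Bcast!(elapsed, 0, comm)  # overwrite with rank 0's value
    return elapsed[]
end
```

Substituting something like this wherever the sampling loop compares elapsed time against the seconds budget is enough to keep all ranks in lockstep.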