How to benchmark distributed function calls?

Hello Julia community,

I used pmap for some cross validation routine. Upon benchmarking on 4 cores, I noticed ~4 times speedup compared to single core. Then I wanted to know how much overhead is associated with running 4 cores, but somehow it says I’m only using 273KB, which is impossible.

On a single core:

BenchmarkTools.Trial:
  memory estimate:  243.36 MiB
  allocs estimate:  77524
  --------------
  minimum time:     42.674 s (0.04% GC)
  median time:      42.885 s (0.19% GC)
  mean time:        42.885 s (0.19% GC)
  maximum time:     43.097 s (0.34% GC)
  --------------
  samples:          2
  evals/sample:     1

On 4 cores:

BenchmarkTools.Trial:
  memory estimate:  274.38 KiB
  allocs estimate:  676
  --------------
  minimum time:     10.939 s (0.00% GC)
  median time:      11.268 s (0.00% GC)
  mean time:        11.251 s (0.00% GC)
  maximum time:     11.456 s (0.00% GC)
  --------------
  samples:          6
  evals/sample:     1

Is there an easy way to know how much memory I used?

Most likely, you’re what you’re seeing here is only the memory consumption of some launching process (which has to wait for spawned tasks to complete, and can thus measure the correct elapsed time).

In order to get a total memory consumption, I don’t know of any other way than to benchmark individual spawned tasks.

Here is a complete example:

Sequential version

First, a plain, sequential version to be used as reference:

julia> using BenchmarkTools

julia> f_seq(n) = map(1:n) do i
           sleep(1)
           sum([0.1*j for j in 1:10_000*i])
       end
f_seq (generic function with 1 method)

julia> @btime f_seq(10)
  10.028 s (72 allocations: 4.20 MiB)
10-element Array{Float64,1}:
 5.0005e6  
 2.0001e7  
 4.50015e7 
 8.0002e7  
 1.250025e8
 1.80003e8 
 2.450035e8
 3.20004e8 
 4.050045e8
 5.00005e8 

pmap-based version

Like you showed, a simple parallel version allows measuring the speedup, but not the memory usage:

julia> f_par(n) = pmap(1:n) do i
           sleep(1)
           sum([0.1*j for j in 1:10_000*i])
       end
f_par (generic function with 1 method)

julia> @btime f_par(10)
  3.014 s (743 allocations: 30.97 KiB)
10-element Array{Float64,1}:
 5.0005e6  
 2.0001e7  
 4.50015e7 
 8.0002e7  
 1.250025e8
 1.80003e8 
 2.450035e8
 3.20004e8 
 4.050045e8
 5.00005e8 

Timed parallel version

If you time each task, then collect the statistics, you can retrieve total memory usage. In the following, the pmap function body is wrapped in @timed to get resources usage; the global result of pmap is then post-processed to retrieve calculation results and accumulate statistics:

julia> f_par_timed(n) = pmap(1:n) do i
           @timed begin
               sleep(1)
               sum([0.1*j for j in 1:10_000*i])
           end
       end |> post
f_par_timed (generic function with 1 method)

julia> post(res) = begin
           total_time = sum(x[2] for x in res)
           @show total_time
       
           total_bytes = sum(x[3] for x in res)
           @show total_bytes
       
           [x[1] for x in res]
       end
post (generic function with 1 method)
julia> f_par_timed(10)
total_time = 10.028683662
total_bytes = 4402720
10-element Array{Float64,1}:
 5.0005e6  
 2.0001e7  
 4.50015e7 
 8.0002e7  
 1.250025e8
 1.80003e8 
 2.450035e8
 3.20004e8 
 4.050045e8
 5.00005e8 

We get back here a cumulated elapsed time of ~10s and cumulated memory usage of ~4.4MB (as in the sequential version above). The ~31kB from the benchmarking of pmap above should probably be added to this.

I do not know whether there are other significant sources of memory usage which are neither captured by a benchmarking pmap itself nor the function bodies. And I should also add that these measurements are based on @timed, which means that they are probably less reliable than what BenchmarkTools would provide.

2 Likes