# How to benchmark distributed function calls?

Hello Julia community,

I used `pmap` for a cross-validation routine. Benchmarking on 4 cores, I saw a ~4x speedup compared to a single core. Then I wanted to know how much overhead is associated with running on 4 cores, but somehow the memory estimate says I'm only using ~274 KiB, which is impossible.

On a single core:

``````
BenchmarkTools.Trial:
memory estimate:  243.36 MiB
allocs estimate:  77524
--------------
minimum time:     42.674 s (0.04% GC)
median time:      42.885 s (0.19% GC)
mean time:        42.885 s (0.19% GC)
maximum time:     43.097 s (0.34% GC)
--------------
samples:          2
evals/sample:     1
``````

On 4 cores:

``````
BenchmarkTools.Trial:
memory estimate:  274.38 KiB
allocs estimate:  676
--------------
minimum time:     10.939 s (0.00% GC)
median time:      11.268 s (0.00% GC)
mean time:        11.251 s (0.00% GC)
maximum time:     11.456 s (0.00% GC)
--------------
samples:          6
evals/sample:     1
``````

Is there an easy way to know how much memory I used?

Most likely, what you're seeing here is only the memory consumption of the launching process (which has to wait for the spawned tasks to complete, and can thus measure the correct elapsed time).

To get the total memory consumption, I don't know of any way other than benchmarking the individual spawned tasks.
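For reference, `@timed` returns the computed value along with the elapsed time and the number of bytes allocated; this is the primitive the timed version below relies on. A minimal sketch (positional indexing works both for the plain tuple returned by older Julia versions and for the named tuple returned by recent ones):

``````julia
# `@timed` captures the value, elapsed time, and bytes allocated
# by a single expression:
t = @timed sum([0.1*j for j in 1:10_000])

value   = t[1]  # the computed result (should be 5.0005e6)
elapsed = t[2]  # elapsed time in seconds
bytes   = t[3]  # bytes allocated while evaluating the expression
``````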

Here is a complete example:

### Sequential version

First, a plain, sequential version to be used as reference:

``````
julia> using BenchmarkTools

julia> f_seq(n) = map(1:n) do i
           sleep(1)
           sum([0.1*j for j in 1:10_000*i])
       end
f_seq (generic function with 1 method)

julia> @btime f_seq(10)
10.028 s (72 allocations: 4.20 MiB)
10-element Array{Float64,1}:
5.0005e6
2.0001e7
4.50015e7
8.0002e7
1.250025e8
1.80003e8
2.450035e8
3.20004e8
4.050045e8
5.00005e8
``````

### pmap-based version

As you showed, a simple parallel version lets you measure the speedup, but not the memory usage:

``````
julia> f_par(n) = pmap(1:n) do i
           sleep(1)
           sum([0.1*j for j in 1:10_000*i])
       end
f_par (generic function with 1 method)

julia> @btime f_par(10)
3.014 s (743 allocations: 30.97 KiB)
10-element Array{Float64,1}:
5.0005e6
2.0001e7
4.50015e7
8.0002e7
1.250025e8
1.80003e8
2.450035e8
3.20004e8
4.050045e8
5.00005e8
``````

### Timed parallel version

If you time each task and then collect the statistics, you can retrieve the total memory usage. In the following, the `pmap` function body is wrapped in `@timed` to capture resource usage; the overall result of `pmap` is then post-processed to retrieve the computed results and accumulate the statistics:

``````
julia> f_par_timed(n) = pmap(1:n) do i
           @timed begin
               sleep(1)
               sum([0.1*j for j in 1:10_000*i])
           end
       end |> post
f_par_timed (generic function with 1 method)

julia> post(res) = begin
           total_time = sum(x[2] for x in res)   # x[2]: per-task elapsed time
           @show total_time

           total_bytes = sum(x[3] for x in res)  # x[3]: per-task allocated bytes
           @show total_bytes

           [x[1] for x in res]                   # x[1]: per-task return value
       end
post (generic function with 1 method)
``````
``````
julia> f_par_timed(10)
total_time = 10.028683662
total_bytes = 4402720
10-element Array{Float64,1}:
5.0005e6
2.0001e7
4.50015e7
8.0002e7
1.250025e8
1.80003e8
2.450035e8
3.20004e8
4.050045e8
5.00005e8
``````

Here we get back a cumulative elapsed time of ~10 s and a cumulative memory usage of ~4.4 MB (matching the sequential version above). The ~31 KiB from benchmarking `pmap` itself should probably be added to this.

I do not know whether there are other significant sources of memory usage that are captured neither by benchmarking `pmap` itself nor by timing the function bodies. I should also add that these measurements are based on `@timed`, so they are probably less reliable than what `BenchmarkTools` would provide.
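If you want both sides in one measurement, one option — a sketch along the lines above, not something I've validated against `BenchmarkTools`' numbers — is to wrap the whole `pmap` call itself in `@timed` on the driver and add its allocation count to the workers' total (`f_par_total` is a hypothetical name):

``````julia
using Distributed

function f_par_total(n)
    # Time the entire pmap call on the driver...
    driver = @timed pmap(1:n) do i
        # ...and each task body on the workers.
        @timed begin
            sleep(1)
            sum([0.1*j for j in 1:10_000*i])
        end
    end
    per_task     = driver[1]                    # vector of per-task @timed results
    worker_bytes = sum(x[3] for x in per_task)  # bytes allocated inside the tasks
    driver_bytes = driver[3]                    # bytes allocated by pmap on the driver
    (results = [x[1] for x in per_task],
     total_bytes = worker_bytes + driver_bytes)
end
``````

With no workers added, `pmap` falls back to running the tasks on the driver process, so this can be tried even without `addprocs`.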
