Pmap performance regression: pmap(x->f(x,y), X) creates copies of y

Updating some code from 0.5 to 1.0 massively slowed pmap calls for our use case.

Briefly, distributing the computation of f(x,arg) over the set X seems to copy and send arg during each iteration. This becomes a problem when the parameters in arg include large objects.
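To illustrate why a captured argument is expensive to ship, here is a minimal sketch that measures how many bytes a closure takes once serialized. It assumes Julia 1.0+ (for the Serialization standard library); `serialized_bytes` and `closure_sizes` are hypothetical helpers, not part of any library. Distributed uses its own serializer built on top of Serialization, so this only approximates what pmap sends.

```julia
using Serialization

# Hypothetical helper: number of bytes `obj` occupies once serialized.
function serialized_bytes(obj)
    io = IOBuffer()
    serialize(io, obj)
    return length(take!(io))
end

# Build both closures inside a function so that `big` is a captured local.
function closure_sizes()
    big = ones(10^6)                      # about 8 MB of Float64
    with_capture = x -> length(big) + x   # carries `big` along with it
    without_capture = x -> x + 1          # carries no data
    return serialized_bytes(with_capture), serialized_bytes(without_capture)
end

captured, plain = closure_sizes()
println("closure capturing the array: ", captured, " bytes")
println("closure capturing nothing:   ", plain, " bytes")
```

The first number should be at least the size of the array itself, which is the payload that would have to travel with the function every time it is sent to a worker.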

This can be reproduced in 0.6+ (tested on 0.6.4 and 1.0.0). The benchmarks below are from a fresh 1.0 install on a Windows machine (also reproduced on a Linux HPC):

using BenchmarkTools
VERSION.major < 1 || using Distributed
addprocs()  # adds 4 workers on this machine
@everywhere begin
  bigarr = ones(10^8)
  f_passall(a,x) = length(x) + a
end
its = 1:20
julia> @btime map(x->f_passall(x,bigarr), its);
  940.280 ns (27 allocations: 736 bytes)
julia> @btime pmap(x->f_passall(x,bigarr), its);
  2.283 s (1560 allocations: 97.86 KiB)

Redefining f to use bigarr as a global variable seems to fix the issue, at a cost:

  @everywhere f_globals(a) = length(bigarr) + a
  julia> @btime map(x->f_globals(x), its);
    1.391 μs (47 allocations: 1.03 KiB)
  julia> @btime pmap(x->f_globals(x), its);
    881.018 μs (1493 allocations: 96.64 KiB)

Increasing the number of iterations slows down the pmap call further, roughly in proportion:

its = 1:50;
  julia> @btime pmap(x->f_passall(x,bigarr), its);
    5.676 s (3834 allocations: 185.53 KiB)
  julia> @btime pmap(x->f_globals(x), its);
    2.169 ms (3658 allocations: 182.25 KiB)
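As a rough check on "in proportion", the timings above can be divided by the number of iterations. This is only arithmetic on the numbers already reported (variable names are made up for illustration), not a new measurement:

```julia
# Timings copied from the @btime output above, in seconds.
passall_20, passall_50 = 2.283, 5.676
globals_20, globals_50 = 881.018e-6, 2.169e-3

# Cost per iteration for 20 and 50 iterations.
passall_per_iter = (passall_20 / 20, passall_50 / 50)
globals_per_iter = (globals_20 / 20, globals_50 / 50)

println("f_passall: ", round.(passall_per_iter; digits=3), " s per iteration")
println("f_globals: ", round.(globals_per_iter .* 1e6; digits=1), " μs per iteration")
```

Both versions scale about linearly with the number of iterations, but f_passall pays on the order of a tenth of a second per iteration while f_globals pays tens of microseconds.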

The issue did not seem to occur as of 0.5.0: f_passall and f_globals have comparable performance, and most of the time is spent on overhead (remaining about constant with greater its).

  julia> @time pmap(x->f_passall(x,bigarr), 1:20);
    0.290894 seconds (422.72 k allocations: 17.810 MB, 2.42% gc time)

  julia> @time pmap(x->f_passall(x,bigarr), 1:50);
    0.290469 seconds (427.01 k allocations: 17.937 MB, 2.49% gc time)

  julia> @time pmap(x->f_globals(x), 1:20);
    0.276240 seconds (422.46 k allocations: 17.765 MB)

  julia> @time pmap(x->f_globals(x), 1:50);
    0.288293 seconds (426.70 k allocations: 17.921 MB, 2.39% gc time)

Happy to create an issue if this is unintended behavior by pmap.

I don’t have 0.5 or 1.0 installed, so I can’t test your code, but is it possible that most of the time in execution is coming from the fact that you’re benchmarking with globals rather than interpolating when using @btime?

For example, you should probably do:

@btime pmap(x->f_passall(x,$bigarr), $its);

This will get you more accurate timing metrics.

Hello, I only used @btime to document this post; using @time will create similar benchmarks. I’ll edit the benchmarks to correct my @btime usage later, but that shouldn’t be at the root of the issue.

My understanding was that you shouldn’t benchmark with @time, period, because you pick up overhead that you would never see inside a function (where everything should happen anyway). It’s possible that what you’re picking up is a difference in global-variable overhead between the two versions (which doesn’t really matter).

For what it’s worth, the slowdown was obvious (a couple of orders of magnitude) when upgrading an application to 1.0, long before I ran benchmarks using this simplified example. Given the stark pattern of slow pmap calls with f_passall while the other benchmarks are unaffected (compare with map and pmap(f_globals,...)), I am not sure measurement error is a likely culprit. I can re-run benchmarks as needed tomorrow, though hopefully someone will be able to reproduce this by then.

You should be able to reproduce this in 0.6 and 0.7 as well.

Regarding interpolation of globals with @btime: this leads to a 10x drop in recorded speed.

In v0.6.4

julia> @btime pmap(x->f_passall(x,$bigarr), $its);
  56.612 s (2201 allocations: 149.03 KiB)

julia> @btime pmap(x->f_globals(x), $its);
  10.034 ms (1997 allocations: 134.09 KiB)

In v1.0.0

julia> @btime pmap(x->f_passall(x,$bigarr), $its);
  39.688 s (1699 allocations: 108.81 KiB)
julia> @btime pmap(x->f_globals(x), $its);
  910.485 μs (1496 allocations: 97.09 KiB)

Have you tried specifically using the cache pool? I seem to remember having to explicitly create one and telling pmap to use it. Maybe that would help.

FWIW, I think I encountered this problem in v0.6 and just didn’t really care much to dive deeper into it. I figured I was just using pmap wrong haha

I haven’t tinkered too much with it; ultimately, I retooled the affected application using lower-level Tasks and remote channels. There are probably a number of good solutions that still use pmap, like passing a pointer. That said, I would argue that if pmap is supposed to be a quick-and-easy tool for distributed computing, then these kinds of performance issues are probably undesirable, or at least worth documenting.

I think this is a good suggestion.

For me the CachingPool version is comparable to the global version.

pool = CachingPool(workers())
pmap(pool, x->f_passall(x,bigarr), its)

Here are my timings using Julia v0.6.4:

Julia-0.6.4> @btime pmap(x->f_passall(x,$bigarr), its);
  52.705 s (2176 allocations: 148.63 KiB)

Julia-0.6.4> @btime pmap(pool, x->f_passall(x,$bigarr), its);
  1.591 ms (1937 allocations: 124.23 KiB)

Julia-0.6.4> @btime pmap(x->f_globals(x), its);
  1.931 ms (1998 allocations: 134.13 KiB)
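One caveat, as far as I understand the CachingPool documentation: the pool caches the closure object itself, so the large data should be captured by the closure (via a let block or by building the closure inside a function) rather than referenced as a global. A minimal sketch under those assumptions, for Julia 1.0+ with two local workers; `run_cached` is a hypothetical helper and the array is kept small so it runs quickly:

```julia
using Distributed

# Assumption: two local workers are enough for illustration.
nprocs() == 1 && addprocs(2)

# Defined on every process so the workers can call it.
@everywhere f_passall(a, x) = length(x) + a

# Hypothetical helper: builds the closure inside a function so that `arr`
# is captured as a local, then maps it over `its` through a CachingPool.
function run_cached(arr, its)
    pool = CachingPool(workers())
    g = x -> f_passall(x, arr)   # a single closure object reused for every item
    result = pmap(pool, g, its)
    clear!(pool)                 # drop the cached closure on the workers
    return result
end

result = run_cached(ones(10^6), 1:20)
println(result[1:3])
```

With the pool, the closure and the array it captures should reach each worker once rather than once per element; calling clear! afterwards releases the cached copy on the workers.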

Thanks for testing this (and @tbeason for suggesting it). Using the cache pool does seem like the way to go. Based on this thread, it seems there was some debate about making pmap use a CachingPool by default.