Could you please explain the reason why this code is failing and how I could fix it? This is an example of an embarrassingly parallel loop with no data movement, the easiest thing possible in parallel programming:
function bigfunc(iterator)
    localvar = 2.5
    @everywhere function smallfunc(elm)
        # hypothetical (expensive) operation with elm
        # (in my code this is solving a linear system)
        localvar * elm
    end
    pmap(smallfunc, iterator)
end
bigfunc(1:10)
ERROR: UndefVarError: localvar not defined
Never mind, I had a typo in the code. But I still have a question: do we have to add an @everywhere in front of each variable used inside smallfunc? Does this copy the object to the other processes or just create a reference? How expensive is it?
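For reference, here is a minimal sketch (at top level, not inside a function) of the @everywhere-per-variable approach I am asking about; the names just mirror the example above:

addprocs()

# @everywhere evaluates its expression at top level on every process,
# so anything smallfunc refers to must also be defined everywhere:
@everywhere localvar = 2.5
@everywhere smallfunc(elm) = localvar * elm

pmap(smallfunc, 1:10)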
addprocs()

function bigfunc(iterator)
    localvar = 2.5
    smallfunc = function (elm)
        # hypothetical (expensive) operation with elm
        # (in my code this is solving a linear system)
        localvar * elm
    end
    pmap(smallfunc, iterator)
end

bigfunc(1:10)
Note that if you do this, the data captured in the closure will be sent to the workers each time. To make this more efficient, you should use a CachingPool.
Example:
addprocs()

function bigfunc(iterator)
    localvar = 2.5
    smallfunc = function (elm)
        # hypothetical (expensive) operation with elm
        # (in my code this is solving a linear system)
        localvar * elm
    end
    wp = CachingPool(workers())
    pmap(wp, smallfunc, iterator)
end

bigfunc(1:10)
Note that this should become the default soon. Details:
Awesome @ChrisRackauckas, are anonymous functions equivalent to lambda expressions like x -> 2x?
The trick is that anonymous functions capture the variables in this so-called closure, right? Will the CachingPool be available in Julia v0.7, do you think?
Yes, x -> 2x and function (x); 2x; end mean the same thing, with the second being the multi-line version of the first (like the difference between f(x) = 2x vs function f(x); 2x; end, except with anonymous functions).
Yes, for sure. The question is whether this will be done automatically or whether the user will still need to do it for performance. It looks like it will be the latter, but your code will still work.
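To illustrate the first two points (equivalence and closure capture) with a small, self-contained sketch:

localvar = 2.5

f1 = x -> localvar * x        # arrow syntax
f2 = function (x)             # multi-line anonymous function, same thing
    localvar * x
end

f1(3) == f2(3)                # true; both capture localvar in a closure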
This also runs fine for me, even though “smallfunc” isn’t anonymous:
julia> function bigfunc(iterator)
           localvar = 2.5
           function smallfunc(elm)
               # hypothetical (expensive) operation with elm
               # (in my code this is solving a linear system)
               localvar * elm
           end
           pmap(smallfunc, iterator)
       end
bigfunc (generic function with 1 method)
julia> bigfunc(1:10)
10-element Array{Float64,1}:
2.5
5.0
7.5
10.0
12.5
15.0
17.5
20.0
22.5
25.0
There is definitely a lot to ask about the behavior of pmap. Below are three working versions for which the runtime is approximately the same:
using BenchmarkTools

function bigfunc1(iterator, arg)
    localvar = 2.5
    smallfunc = function (elm)
        localvar * elm + arg
    end
    wp = CachingPool(workers())
    pmap(wp, smallfunc, iterator)
end

function bigfunc2(iterator, arg)
    localvar = 2.5
    smallfunc = function (elm)
        localvar * elm + arg
    end
    pmap(smallfunc, iterator)
end

function bigfunc3(iterator, arg)
    localvar = 2.5
    function smallfunc(elm)
        localvar * elm + arg
    end
    pmap(smallfunc, iterator)
end

@benchmark bigfunc1(1:1000, 1.)
@benchmark bigfunc2(1:1000, 1.)
@benchmark bigfunc3(1:1000, 1.)
If you could explain what is happening in each case, step by step, that would be very helpful to future readers. Right now I have little idea of what data is being copied and what is not.
I am trying to decide on one version of this pattern, but before I commit to one of them in my package, I was hoping someone with more understanding of the internals could comment on the pros and cons.
How many workers do you have, @Elrod? I am trying to set up a cluster to test the code, but haven’t had the time yet. My laptop only has 2 cores, so I basically cannot see any speedup even if I use all of them (1 master + 1 worker).
I have yet to see a case where the function is the expensive part and the speedup is large. I will be setting up the cluster soon to try this parallel pattern from my package on very large domains.
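For comparing setups, this is the quick way I check the process and worker counts (standard functions):

addprocs(2)   # add two worker processes (adjust to your machine)
nprocs()      # total number of processes, including the master: 3
nworkers()    # number of worker processes: 2
workers()     # worker ids, e.g. [2, 3]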
Well, multiprocessing does have quite an overhead. I use it all the time for Monte Carlo simulations where each trajectory takes minutes, and the speedup is almost exactly Nx for N the number of cores. When it’s smaller, you need to play around with the batch_size. And yes, CachingPool makes it send the data enclosed in the anonymous function exactly once. If you’re enclosing a lot of variables, this makes a huge difference. If you’re not enclosing a lot of data, then it doesn’t matter all that much.
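As a rough sketch of the kind of case where it matters (the names and sizes are made up for illustration): the closure captures a large array, and wrapping the workers in a CachingPool ships that array to each worker only once instead of with every call:

addprocs(4)

function run_trajectories(n)
    bigdata = randn(10^6)              # large array captured by the closure
    simulate = function (i)
        sum(bigdata) + i               # stand-in for an expensive simulation
    end
    wp = CachingPool(workers())        # bigdata is sent to each worker only once
    pmap(wp, simulate, 1:n)
end

run_trajectories(100)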
One thing to keep in mind is that you may not be close to optimal, in which case you may want to overload processors. Here’s a quick explanation and example (from a long time ago, but it’s still read-worthy):
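Roughly, overloading just means launching more worker processes than physical cores, so that waiting or uneven task lengths don’t leave cores idle. A purely illustrative sketch (the numbers are not a recommendation):

addprocs(8)   # e.g. more workers than your physical core count

# pmap can also batch elements per remote call to amortize the overhead:
pmap(x -> x^2, 1:10_000; batch_size = 100)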
Thanks @ChrisRackauckas. The issue is that I don’t have any data to move around; that is what is bothering me the most. I am only passing the estimator object, which holds all the data in its internal state, but the loop is blind to those internal arrays.
I will come back with more concrete numbers after I get to the cluster.