Suppose I have the following problem:
- I need to call a function `f` many times in an estimation procedure.
- Computing `f` is very expensive. I usually use 6 workers to compute this in parallel, and it takes about 2 minutes. The workers use a set `S` of `SharedArray`s to do this job.
- Suppose I had access to several large machines. Is it possible to distribute this computational model across machines? My main doubt is how to make sure that the `SharedArray`s of group one, say `S1`, are allocated on machine one, the `SharedArray`s of group two on machine two, and so on. This is important, since they are large in memory. (A sketch of what I have in mind follows this list.)
- Just to be clear: ideally I'd be using `Base.Threads` in step 2. I have been advised not to go down that route at this moment, since the feature is not mature enough yet. I think what I have in mind is sometimes called hybrid distributed computation: using several CPUs on one machine, and connecting multiple machines.
- Thanks!
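To make the `SharedArray` part concrete, here is an untested sketch of what I am hoping is possible, assuming `addprocs` over SSH and the `pids` keyword of the `SharedArray` constructor behave the way I understand them (machine names are placeholders):

# untested sketch: pin each group of SharedArrays to one machine's workers
using Distributed, SharedArrays

# 3 workers on each machine via the built-in SSH cluster manager
pids1 = addprocs([("machine1", 3)])
pids2 = addprocs([("machine2", 3)])

# a SharedArray lives in shared memory on a single host, so restricting
# `pids` to one machine's workers should allocate it on that machine
S1 = SharedArray{Float64}((10_000, 10_000); pids = pids1)  # on machine 1
S2 = SharedArray{Float64}((10_000, 10_000); pids = pids2)  # on machine 2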
I am currently dealing with a similar situation (if I understood your problem correctly), running MCMC models in Julia. What I went for is a "poor man's mapreduce", using the filesystem for interprocess communication: doing the calculation on each server/core, saving it (`serialize`), then merging. All steps involve separate processes. I am sure you will get more elegant solutions, but this proved to be very robust and easy to develop: I just made it work on one process, then did the rest using scp/ssh. Maybe there is a library that would organize this?
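To give a rough idea, the per-node part and the merge look like this (the function names below are made up for illustration, not my actual model code):

# skeleton of the "poor man's mapreduce" (illustrative names only)
using Serialization

# on each node: compute a chunk and dump the result to a file
function run_chunk(chunk_id, params)
    result = expensive_calculation(params)      # stand-in for the MCMC step
    serialize("result_$(chunk_id).jls", result) # goes to the filesystem
end

# on the master, after collecting the files (scp): load and merge
merge_results(n) = mapreduce(i -> deserialize("result_$(i).jls"), vcat, 1:n)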
I think you understood my problem perfectly well there.
I'm not sure, on the other hand, that I understand your application. When you say

> doing the calculation on each server/core, save it (serialize), then merge

do you have to communicate the results of each node back to the master process during computation, or is `merge` part of post-computation analysis in your case? I can easily see how I would do this in the latter case.
If that is not your case (i.e. you're dealing with the former), or otherwise, do you have a suggestion for how I could implement your strategy? Just a comment on what you think, down below, would be very helpful - thanks!
Here is the outline of my situation.
Problem
Say I have 3 computers, each with multiple CPUs. Call computer 1 the `master`.
- `master` controls the computation of `f(x)` on each of computers 2 and 3. In particular, it sends `x2` to 2 and `x3` to 3, and reads their respective results `y2` and `y3`. Think of `f` as a loss function.
- Based on `y2 = f(x2)` and `y3 = f(x3)`, the master comes up with a new candidate for each computer, `x2'` and `x3'`.
- In MCMC terminology (I think), the `kernel` resides on the master, and it comes up with a new candidate parameter for each participating computer.
- All computers have access to the same file system, e.g. all can read/write `~/`.
I would not know how to implement the fact that the master has to wait for new results `y` to appear. I read about Julia `Task`s and co., which seem to be just what I need, but those are only for actual Julia `worker`s. (A task-based variant of the waiting step is sketched after the pseudocode below.)
Proposition
I basically need to know how to do the following, although it may be quite time-consuming in my case to initialize the model from scratch each time.
# pseudo code ;-)
# imagine that compute_f.jl is a script to compute f
# the script takes a command-line arg `param`, which becomes `x` for `f(x)`
# this function runs on computer 1 for the entire duration of the procedure
function master_process()
    converged = false
    minf = Inf
    x = initial_x()
    # `~` is not expanded by Julia's file functions, so build explicit paths
    result2 = joinpath(homedir(), "result2.dat")
    result3 = joinpath(homedir(), "result3.dat")
    while !converged
        x2 = nextx(x)  # random innovation to x
        x3 = nextx(x)
        # launch both remote jobs without blocking the master
        run(`ssh computer2 julia compute_f.jl --param $x2`; wait=false)  # computes f(x2), saves y2 as ~/result2.dat
        run(`ssh computer3 julia compute_f.jl --param $x3`; wait=false)  # saves y3 as ~/result3.dat
        # wait for both result files to appear...
        while !(isfile(result2) && isfile(result3))
            sleep(1)
            println(".")
        end
        # as soon as both files exist, read the results back in
        y2 = parse(Float64, read(result2, String))  # however compute_f.jl writes its output
        y3 = parse(Float64, read(result3, String))
        # keep the best value and the candidate that produced it
        y, x = y2 < y3 ? (y2, x2) : (y3, x3)
        # test convergence
        converged = conv(y, minf)  # test convergence somehow... suppose it's `false` for now :-)
        minf = min(y, minf)        # track the best value seen so far
        # delete the files so the next iteration's wait works
        rm(result2)
        rm(result3)
    end # while
end
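Alternatively, if `Task`s work the way I think (a guess on my part, untested), the polling loop inside the `while` could be replaced by wrapping each blocking `ssh` call in `@async` and waiting on both tasks:

# untested alternative to polling: `run` blocks until the remote
# script exits, so waiting on the tasks waits for both results
t2 = @async run(`ssh computer2 julia compute_f.jl --param $x2`)
t3 = @async run(`ssh computer3 julia compute_f.jl --param $x3`)
wait(t2); wait(t3)   # both ~/result*.dat files exist after this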
My task was very simple:
- master has the data and the code,
- `rsync`s it to the nodes,
- executes a Julia process via `ssh`, which dumps the results in a file (`serialize`),
- master collects the results via `rsync`.
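In outline, the driver on the master looked roughly like this (hostnames and paths invented for illustration):

# rough shape of the one-shot driver (invented hostnames/paths)
@sync for node in ["node1", "node2", "node3"]
    @async begin
        run(`rsync -a project/ $node:project/`)                     # ship code + data
        run(`ssh $node julia project/worker.jl`)                    # worker serialize()s its result
        run(`rsync -a $node:project/result.jls results/$node.jls`)  # collect
    end
end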
This runs only once. Signalling when it was finished on the nodes was indeed tricky; I would just check for the presence of the result file.
So basically like your control flow above. Now this is indeed quite inelegant, but very robust: if processes failed (which happened quite a bit during the development phase, because of a bug that produced a `NaN` which propagated and at some point led to an error), I could just rerun them manually.
The right thing to do would probably be using cluster managers. I have to admit that I had a pressing deadline and doing it the low-tech way seemed like a viable option, so lacking a worked example I postponed learning about this part of Julia.
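For completeness, my (again untested) understanding is that the cluster-manager route would look something like this, with `candidates` standing in for whatever inputs the model needs:

# untested: Distributed's built-in SSH manager instead of manual ssh/rsync
using Distributed
addprocs([("node1", 4), ("node2", 4)])  # 4 workers per node, over SSH
results = pmap(x -> f(x), candidates)   # f as in the thread above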
Hopefully someone more knowledgeable here can give you more help; I only replied because I saw that the topic was without an answer and I did solve a similar problem. But learning about this from me is like learning carpentry from someone who once cut a table in two with an angle grinder because they had to dispose of it.