I’m working with Distributed Julia across multiple hosts and want to write code that does distributed training of a machine learning model over distributed data. I don’t want to use MPI. I have the ClusterManager working correctly and can call `addproc_mysystem(n)` without any issues.
The traditional way to do distributed training is to put an identical copy of the model on each worker, then partition the training data and give each worker one partition. Each worker computes a gradient with respect to its portion of the training data, and the per-worker gradients are combined with `AllReduce(+, gradients) / num_workers`, i.e. averaged. The averaged gradients from the AllReduce are what’s used to update the model.
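To make the shape of the loop concrete, here is the per-worker step I have in mind. All helper names (`gradient`, `apply_update!`, `allreduce_avg`) are hypothetical placeholders; `allreduce_avg` is exactly the primitive Question 1 below is about.

```julia
using Distributed

# Sketch of one synchronous training step run on every worker
# (all helper names are placeholders, not real APIs):
@everywhere function training_step!(model, local_data)
    g     = gradient(model, local_data)   # gradient on this worker's data partition
    g_avg = allreduce_avg(g)              # average of the gradients from all workers
    apply_update!(model, g_avg)           # every replica applies the same update
    return model
end
```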
Question 1: what’s the best strategy to implement an AllReduce with Julia’s Distributed module?
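For reference, the naive baseline I know how to write funnels everything through the master process; `local_gradient` is again a placeholder assumed to be defined `@everywhere` and to return an array.

```julia
using Distributed

# Naive centralized "all-reduce": gather every worker's gradient on the
# master, average elementwise, and return the result. Correct but not
# scalable, since process 1 touches all of the data.
function averaged_gradients()
    grads = asyncmap(w -> remotecall_fetch(local_gradient, w), workers())
    return sum(grads) / length(grads)
end
```

What I’d like to know is whether there’s a better-scaling pattern than this centralized reduce, e.g. something tree- or ring-structured built out of RemoteChannels.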
Question 2, @everywhere Z=1
will let me declare the variable Z
in each of the workers Main namespace. But the value of Z is not shared between works (right?). if I do @everwhere ch3 = RemoteChannel(()->Channel{Int}(10), 3)
how does Julia know that ch3
in each of the workers namespace references the same underlying RemoteChannel?
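Concretely, this is the experiment I’ve been running to convince myself (assuming workers 2 and 3 exist); I’m not certain it actually distinguishes “same channel” from “a copy that happens to behave the same”.

```julia
using Distributed

# Channel is hosted on process 3, but the binding `ch3` is created on
# every process because @everywhere evaluates the expression everywhere.
@everywhere ch3 = RemoteChannel(() -> Channel{Int}(10), 3)

fetch(@spawnat 2 put!(ch3, 7))       # put! through worker 2's binding
@show fetch(@spawnat 3 take!(ch3))   # take! through worker 3's binding
```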
As a more complicated example, in order to implement a RingAllReduce, I’m doing `@everywhere channelTable = [(RemoteChannel(()->Channel{Int}(10), w_i), RemoteChannel(()->Channel{Int}(10), w_i)) for w_i in workers()]`. `channelTable[2][1]` does indeed seem to reference the same RemoteChannel no matter where `@spawnat 2 put!(channelTable[2][1], 7)` and `@spawnat 3 take!(channelTable[2][1])` are run. Is this really doing what I think it is, that is, passing a value from worker 2 to worker 3?
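For context, this is roughly how I plan to drive a single ring hop from the master, using the same `@spawnat` pattern as the test above; the integer payloads are placeholders for gradient chunks.

```julia
using Distributed

ws = workers()

# Each worker (by its index i in `ws`) pushes a placeholder chunk into the
# first channel of its successor's pair:
hops = map(enumerate(ws)) do (i, w)
    nxt = mod1(i + 1, length(ws))
    @spawnat w put!(channelTable[nxt][1], i)
end
foreach(wait, hops)

# Each worker then drains its own first channel, which should now hold the
# chunk its predecessor sent:
received = [fetch(@spawnat w take!(channelTable[i][1])) for (i, w) in enumerate(ws)]
@show received
```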