Distributed, AllReduce, and Distributed Training

Continuing the discussion from Julia Distributed, AllReduce, and Distributed Training:

I’m working with Distributed Julia across multiple hosts and want to write code that does distributed training of a machine learning model on partitioned data. I don’t want to use MPI. I have the ClusterManager working correctly and can call addproc_mysystem(n) without any issues.

The traditional way to do distributed training is to put an identical copy of the model on each worker, partition the training data, and give each worker one partition. Each worker computes a gradient with respect to its portion of the training data, then calls AllReduce(+, gradients) / num_workers. The averaged gradients from the AllReduce are what’s used to update the model, and since every worker applies the same update, the replicas stay identical.
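In pseudocode, the per-worker step I have in mind looks roughly like this; every name below except nworkers() is a placeholder for something I’d still have to write (allreduce in particular is the thing Question 1 is about):

```julia
using Distributed

# Rough shape of one training step on each worker. `gradient`, `model`,
# `shard`, `update!`, and `allreduce` are all hypothetical placeholders.
@everywhere function train_step!()
    g = gradient(model, shard)          # gradient on this worker's data shard
    g = allreduce(+, g) ./ nworkers()   # average the gradients across workers
    update!(model, g)                   # same update everywhere keeps replicas in sync
end
```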

Question 1: what’s the best strategy to implement an AllReduce with Julia’s Distributed module?
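The naive baseline I can think of is gather–reduce–broadcast through the master. A minimal sketch, assuming each worker stores its gradient in a global grads in its Main:

```julia
using Distributed

# Gather every worker's `grads` to the master, reduce there, then push the
# result back into each worker's Main. Simple, but the master is a bottleneck
# and the bandwidth cost is far from a ring AllReduce.
function naive_allreduce(op)
    parts = [remotecall_fetch(() -> Main.grads, p) for p in workers()]  # gather
    total = reduce(op, parts)                                           # reduce on master
    for p in workers()
        remotecall_wait(Core.eval, p, Main, :(grads = $total))          # broadcast back
    end
    return total
end
```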

Question 2: @everywhere Z = 1 lets me declare the variable Z in each worker’s Main namespace, but the value of Z is not shared between workers (right?). If I do @everywhere ch3 = RemoteChannel(()->Channel{Int}(10), 3), how does Julia know that the ch3 in each worker’s namespace references the same underlying RemoteChannel?
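Related to this: one way I’ve found to be absolutely sure every worker holds the same handle is to construct the RemoteChannel once on the master and interpolate it into @everywhere, rather than re-running the constructor on each worker. A sketch, assuming workers with pids 2 and 3 exist:

```julia
using Distributed

ch3 = RemoteChannel(() -> Channel{Int}(10), 3)  # constructed once, backed by worker 3
@everywhere ch3 = $ch3                          # ship the SAME handle to every worker

# sanity check: a value put! on worker 2 can be take!n on worker 3
@spawnat 2 put!(ch3, 7)
fetch(@spawnat 3 take!(ch3))  # == 7
```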
As a more complicated example, in order to implement a ring AllReduce, I’m doing

@everywhere channelTable = [(RemoteChannel(()->Channel{Int}(10), w_i), RemoteChannel(()->Channel{Int}(10), w_i)) for w_i in workers()]

channelTable[2][1] does seem to reference the same RemoteChannel no matter where @spawnat 2 put!(channelTable[2][1], 7) and @spawnat 3 take!(channelTable[2][1]) are run. Is this really doing what I think it is, i.e. passing a value from worker 2 to worker 3?
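For what it’s worth, here is the scalar version of the ring pass I’m aiming for, written with a channel table built once on the master and shipped to the workers (so the identity question above doesn’t bite). A sketch, not tested beyond small cases:

```julia
using Distributed

nw = nworkers()
# inbox[i] is backed by the i-th worker and acts as that worker's receive queue
inbox = [RemoteChannel(() -> Channel{Int}(10), w) for w in workers()]
@everywhere inbox = $inbox

@everywhere function ring_allreduce(x::Int, i::Int, nw::Int)
    right = mod1(i + 1, nw)        # ring neighbor to send to
    acc = x                        # running sum
    msg = x                        # value being forwarded this round
    for _ in 1:nw-1
        put!(inbox[right], msg)    # pass my current message to the right
        msg = take!(inbox[i])      # receive the message coming from the left
        acc += msg
    end
    return acc
end

# every worker contributes its pid; all of them should end up with sum(workers())
futs = [@spawnat w ring_allreduce(myid(), i, nw) for (i, w) in enumerate(workers())]
fetch.(futs)
```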

Edit: I don’t know how to move or link a post between categories. If someone lets me know the preferred way of doing this, I’ll be happy to do it.

I’m currently trying to implement an equivalent of the MPI_Reduce operation in Julia. You may want to check out this similar thread: How to sum the chunks of a distributed array using a binary reduction tree?
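The rough shape I’ve been experimenting with for the tree version is a master-driven pairing pass, sketched below; localgrad is a hypothetical per-worker global, and op must be associative. Note this reduces to a single result (MPI_Reduce-style), not an AllReduce:

```julia
using Distributed

# Combine worker-held partial results pairwise, halving the number of
# outstanding Futures each level, until one value remains.
function tree_reduce(op, pids::Vector{Int})
    refs = [@spawnat p Main.localgrad for p in pids]   # one Future per worker
    while length(refs) > 1
        next = Future[]
        for i in 1:2:length(refs)-1
            a, b = refs[i], refs[i+1]
            # combine each pair on the worker already holding `a`
            push!(next, @spawnat a.where op(fetch(a), fetch(b)))
        end
        isodd(length(refs)) && push!(next, refs[end])  # odd one out rides along
        refs = next
    end
    return fetch(refs[1])
end
```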

I’m not sure either whether this is the correct section to post in…