Data-parallel training with conv nets in Julia

jstrube · July 20, 2018, 4:41am

I have a problem that lends itself well to data-parallel training (see, e.g., "Scaling studies for deep learning in LArTPC event classification", CHEP 2018 Conference, Sofia, Bulgaria, 9-13 July 2018, on Indico). I’ve been using a library that sits on top of TensorFlow (MaTEx, the Machine Learning Toolkit for Extreme Scale, matex-org/matex on GitHub), but I’m facing some issues with TensorFlow on a new machine, and I’m wondering whether Flux or one of the other deep learning libraries in Julia could do the job as well.
Does anybody have experience with training on multiple nodes in Julia? Can I just use DistributedArrays for my data sets somehow and everything magically works?
Any pointers would be appreciated.

Tomas_Pevny · July 20, 2018, 5:04am

I have done that with Flux, running a separate copy of the model on each thread and then averaging the results. It was relatively easy and in the end came down to a couple of lines of code.

I did it that way because I need multiplication with large sparse matrices. In the end, I was about three times faster on CPU than TensorFlow on GPU.

jstrube · July 20, 2018, 5:12am

Cool! Do you have an example that you could share?

Tomas_Pevny · July 20, 2018, 6:21am

I can share a snippet; it went roughly like this.

Say model is your model. I did:

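using Flux
using Base.Threads: @threads, nthreads

# Compute the loss on one data chunk, check that it is finite, and backpropagate.
function _back!(model, loss, ds)
  l = loss(model, ds)
  isinf(l.tracker.data) && error("inf in the model")
  isnan(l.tracker.data) && error("nan in the model")
  Flux.Tracker.back!(l)
  l.tracker.data
end

# One data-parallel step: sync the copies with `pars`, run each copy on its
# own chunk in a separate thread, then average the results back into `pars`.
# `copy!` and `mean!` on parameter collections may need to be implemented.
function Flux.Tracker.back!(pars, models, parss, dss, loss)
  foreach(s -> copy!(s, pars), parss)
  l = zeros(length(dss))  # one loss slot per data chunk
  @threads for i in 1:length(dss)
    l[i] = _back!(models[i], loss, dss[i])
  end
  mean!(pars, parss...)
  l
end

models = [deepcopy(model) for i in 1:nthreads()]
parss = map(params, models)
pars = params(model)
Flux.Tracker.back!(pars, models, parss, dss, loss)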

dss was a vector holding the data for each thread, and loss was the loss function.
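
For readers wondering how a per-thread data vector like dss might be built, here is a minimal sketch. It assumes the observations are stored as columns of a matrix; `split_batches` is a hypothetical helper name, not something from Flux or from the post above.

```julia
# Hypothetical helper: split the columns (observations) of X into n
# contiguous chunks whose sizes differ by at most one.
function split_batches(X::AbstractMatrix, n::Integer)
    N = size(X, 2)
    # Integer chunk boundaries 0 = b[1] <= b[2] <= ... <= b[n+1] = N
    bounds = [div(i * N, n) for i in 0:n]
    return [X[:, bounds[i]+1:bounds[i+1]] for i in 1:n]
end

X = reshape(collect(1.0:20.0), 2, 10)   # 10 observations with 2 features each
chunks = split_batches(X, 3)            # in practice n would be Threads.nthreads()
```

Each chunk could then play the role of one entry of dss. Labels would have to be split with the same boundaries so that inputs and targets stay aligned.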

You might need to implement some of the missing functions yourself, such as averaging the parameters or copying them, but this should give you the idea of how I did it.

Hope this helps.
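
As an illustration of what such missing helpers could look like, here is a minimal sketch on plain arrays. The names `copy_params!` and `mean_params!` are hypothetical, and the sketch assumes each model's parameters are available as a vector of ordinary arrays in the same order; with Flux's tracked parameters you would apply the same idea to the underlying data arrays.

```julia
# Hypothetical helper: copy every array in `src` into the matching array in `dst`.
function copy_params!(dst, src)
    for (d, s) in zip(dst, src)
        copyto!(d, s)
    end
    return dst
end

# Hypothetical helper: overwrite every array in `dst` with the elementwise
# mean of the matching arrays across all `replicas`.
function mean_params!(dst, replicas)
    for (k, d) in enumerate(dst)
        fill!(d, 0)
        for r in replicas
            d .+= r[k]
        end
        d ./= length(replicas)
    end
    return dst
end

# Toy example: two replicas, each with a 2x2 weight matrix and a bias vector.
master   = [zeros(2, 2), zeros(2)]
replicas = [[fill(1.0, 2, 2), fill(2.0, 2)],
            [fill(3.0, 2, 2), fill(4.0, 2)]]

mean_params!(master, replicas)       # master now holds the averaged parameters
copy_params!(replicas[1], master)    # push the average back to one replica
```

Averaging after every step keeps the copies in sync at the cost of one pass over all parameters per step. Averaging less often is a possible variation, but that is a design choice the post does not cover.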

jstrube · July 20, 2018, 6:28am

Thanks! I’m sure I’ll take a while to get up to speed (total Flux newbie), but I appreciate your help.
