Horovod and Flux / Julia

Tomas_Pevny · January 18, 2021, 5:28pm

Hi all,

I wonder if anyone has tried using Horovod with Flux (or with Julia in general).

Thanks for any answers.
Best wishes,
Tomas

ToucheSir · January 30, 2021, 7:28pm

I had a look through and Horovod is a C++ library under the hood, no? Unless somebody has wrapped it with a C API, I imagine the only straightforward way to use it would be via PyCall (if indeed that jibes with all the Distributed machinery). AFAICT the only collective communication library that does expose a C interface would be NCCL. It would be nice to have something like this though; I'd rather not bother with getting CUDA-aware MPI working on my local machines.
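For anyone unfamiliar with what these collective communication libraries provide: the core operation in data-parallel training is an allreduce, which averages gradients across workers so that every worker ends up with the same values. Below is a minimal single-process sketch in plain Python. The function name `allreduce_mean` is made up for illustration and is not the Horovod or NCCL API; real libraries do this across processes and devices.

```python
from typing import List


def allreduce_mean(worker_grads: List[List[float]]) -> List[List[float]]:
    """Simulate an averaging allreduce across workers inside one process.

    `worker_grads[r]` is the gradient vector computed by worker `r`.
    Hypothetical helper for illustration only.
    """
    n_workers = len(worker_grads)
    n_params = len(worker_grads[0])
    # Reduce: element-wise sum over all workers, then divide to get the mean.
    mean = [sum(g[i] for g in worker_grads) / n_workers for i in range(n_params)]
    # Broadcast: every worker receives its own copy of the averaged gradient.
    return [list(mean) for _ in range(n_workers)]


grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # three simulated workers
synced = allreduce_mean(grads)
```

After the call, each of the three simulated workers holds the same averaged gradient, which is what keeps model replicas in sync between steps.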

Tomas_Pevny · January 31, 2021, 5:49am

Thanks,

Wrapping C++ in C seems doable, and PyCall should be easy.

Palli · August 13, 2021, 4:49pm

Why Horovod? It seems to me DeepSpeed (and the DeeperSpeed fork of it) serves a similar purpose, and I think it might be better. I might be wrong, but at least it has intriguing 1-bit LAMB (and, before that, 1-bit Adam) breakthroughs:

https://www.microsoft.com/en-us/research/blog/deepspeed-accelerating-large-scale-model-inference-and-training-via-system-optimizations-and-compression/

I just think that if anyone is planning to support these tools, they should start with the best one. Are there other similar tools?

DeepSpeed addresses the underlying performance difficulties and improves the speed and scale of the training with only a few lines of code change to the PyTorch model.

Tomas_Pevny · August 13, 2021, 7:02pm

Agreed, one should go for the best.

I was asking at the time because a former colleague was promoting the use of Horovod, but it never went beyond talk; we never moved to action. By now, many other things have piled up and I would not have time to do this.

ToucheSir · August 15, 2021, 4:14pm

DeepSpeed is a C++ extension for a Python frontend. Horovod is a C++ library with a Python wrapper. Neither is terribly relevant for distributed training in Julia at the moment. What would be relevant is trying to port over some of the techniques used in ZeRO-{1,2,3,offload}, either in https://github.com/DhairyaLGandhi/DaggerFlux.jl or some alternative library.
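To give a rough idea of the simplest of those techniques: ZeRO stage 1 partitions the optimizer state across workers, so each worker stores and updates the state only for the parameters it owns. Here is a single-process sketch in plain Python using SGD with momentum. The names `shard_bounds` and `zero1_step` are hypothetical and this is not DeepSpeed's implementation; a real system would pair this with reduce-scatter and allgather collectives.

```python
from typing import List


def shard_bounds(n_params: int, n_workers: int, rank: int) -> range:
    """Contiguous range of parameter indices owned by worker `rank`."""
    per_worker = (n_params + n_workers - 1) // n_workers  # ceiling division
    start = min(rank * per_worker, n_params)
    return range(start, min(start + per_worker, n_params))


def zero1_step(
    params: List[float],
    avg_grads: List[float],
    momentum_shards: List[List[float]],
    lr: float = 0.1,
    beta: float = 0.9,
) -> List[float]:
    """One SGD-with-momentum step with the momentum buffer sharded by worker.

    `momentum_shards[r]` holds momentum only for the parameters owned by
    worker `r`, so no worker stores the full optimizer state.
    """
    n_workers = len(momentum_shards)
    new_params = list(params)
    for rank in range(n_workers):
        buf = momentum_shards[rank]
        for local_i, i in enumerate(shard_bounds(len(params), n_workers, rank)):
            buf[local_i] = beta * buf[local_i] + avg_grads[i]
            new_params[i] = params[i] - lr * buf[local_i]
    # In a real multi-process setup an allgather would share the updated
    # shards; here the single `new_params` list plays that role.
    return new_params


params = [1.0, 2.0, 3.0, 4.0, 5.0]
grads = [0.5] * 5  # assumed to be already averaged across workers
n_workers = 2
shards = [[0.0] * len(shard_bounds(len(params), n_workers, r)) for r in range(n_workers)]
params = zero1_step(params, grads, shards)
```

With two workers and five parameters, the first worker keeps momentum for three parameters and the second for two, so the per-worker optimizer memory shrinks roughly in proportion to the number of workers.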

Juan · August 15, 2021, 4:23pm

Why not FastAI?
https://github.com/FluxML/FastAI.jl

ToucheSir · August 15, 2021, 4:46pm

The high-level abstractions in FastAI.jl are almost completely orthogonal to how distributed training is conducted. What will most likely happen is that FastAI uses the distributed functionality when it stabilizes.
