Can Flux handle multiple GPUs?

I am playing around with ML in Julia and stumbled on a source of cheap NVIDIA Quadro K2000s. Far from modern, I know, but I can’t afford modern hardware for side projects. If I stuck 4 of them in 1 tower, would Flux/CUDA.jl take advantage of all 4?

Currently my models are training on a single core of a 12-core machine because Flux.jl doesn’t do CPU parallelism. I’m sure any CUDA.jl-compatible GPU would help a lot; would 4 of them help 4 times as much?

Many thanks! As always, I love Julia and everything it makes possible for me :smiley:

see also Multi-GPU single host training example - #2 by j_u


Looking up the Quadro K2000, it is a 9-year-old card with 2 GB of VRAM. You can find it in the table “CUDA-Enabled NVIDIA Quadro and NVIDIA RTX” on this page, where it is classified as having “compute capability 3.0”. This means that a lot of deep learning frameworks won’t support it, and even if they do, it will be very slow. Having 4 of these will probably also introduce additional overhead from moving data between the cards and accumulating gradients.

But the good news is that Flux is capable of using all physical cores :) Perhaps you need to set the JULIA_NUM_THREADS environment variable. What output do you get when running Threads.nthreads() in the REPL?
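For reference, here is a minimal sketch of that check using only Base Julia. The thread count is fixed when Julia starts (through `julia --threads N` or JULIA_NUM_THREADS), so it cannot be changed from inside a running session; the printed labels are my own.

```julia
# Number of threads this Julia session was started with.
# It is fixed at startup via `julia --threads N` or JULIA_NUM_THREADS.
println("Julia threads:     ", Threads.nthreads())

# Logical CPU threads visible to Julia, for comparison.
println("Logical threads:   ", Sys.CPU_THREADS)

# The environment variable as seen by this process; "unset" if it was never set.
println("JULIA_NUM_THREADS: ", get(ENV, "JULIA_NUM_THREADS", "unset"))
```

If the first number is 1, restarting Julia (or the notebook kernel) with the variable set is the usual next step.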

(12 cores w/hyperthreading)

Julia CPU usage never rose above 5%. I never saw any use of >1 core while the model was training. I tried installing Octavian, then restarting the notebook and reloading packages. Nothing I did made any difference.

Clearly I am missing something. What is it?

Julia doesn’t do automatic threading; it’s all up to Flux, I guess.

I would also take a closer look at the specifications and software requirements of older GPU cards. For training on CPUs, my experience suggests that Julia’s Distributed capabilities could be more useful than multithreading. I once attempted to train a fairly demanding deep learning model on CPUs only. Here is some info: link. Disclaimer: this was almost my very first contact with Julia, so not everything may be correct, but in general it should not be that bad. As I recall, the first, major part of the training took about the same time on high-end CPUs (2 x Platinum 8358) as on a high-end GPU (V100). [This is a very simplified comparison: there were some GPU inefficiencies, and the cost of the CPUs, board and RAM probably exceeded the cost of a high-end GPU.]

I have never tried FluxMPI.jl, and my attempts with bare MPI.jl did not go well, though that was probably down to the model I used and my basic knowledge of MPI. There was a very interesting presentation at the last JuliaCon, “Scaling up Training of Any Flux.jl Model Made Easy | Dhairya Gandhi | JuliaCon 2022”. As for DaggerFlux.jl and Dagger.jl, I believe that if you pursue this road you will receive excellent support, especially on the public forums. On directed acyclic graphs, you may also want to look at the very interesting “Introduction to Graph Computing | JuliaCon 2022 | Yadong Li”.

I believe CPU training is a valid idea, especially for large models, cloud environments and discounted pricing, and it should become even more realistic with the next CPU generation. In general, though, this is a contrarian and probably somewhat futuristic view. I have seen several threads on this topic recently and have tried to discuss it several times myself, but as I wrote, ML/DL on CPUs seems to be a niche topic and in most cases is probably not the favored option at present.
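To make the Distributed idea a bit more concrete, here is a rough sketch of process-based data parallelism using only the Distributed standard library. It is not taken from the linked code and does not use Flux: the linear model, the helper names `shard_gradient`, `averaged_step` and `train`, the two workers, and the learning rate are all assumptions chosen for illustration.

```julia
using Distributed

# Start two local worker processes if none are running yet (2 is an arbitrary choice).
nprocs() == 1 && addprocs(2)

# Hypothetical helper, defined on every process:
# gradient of the mean squared error of a linear model y ≈ X * w.
@everywhere function shard_gradient(w, X, y)
    residual = X * w .- y
    return 2 .* (X' * residual) ./ length(y)
end

# Toy data generated from known weights, without noise.
X = randn(200, 3)
w_true = [1.0, -2.0, 0.5]
y = X * w_true

# Two equally sized shards of rows, one per worker.
shards = [(X[idx, :], y[idx]) for idx in Iterators.partition(1:200, 100)]

# One gradient-descent step: each shard's gradient is computed on a worker,
# then the gradients are averaged on the main process.
function averaged_step(shards, w, lr)
    grads = pmap(s -> shard_gradient(w, s[1], s[2]), shards)
    return w .- lr .* (sum(grads) ./ length(grads))
end

function train(shards, w0; steps = 200, lr = 0.1)
    w = w0
    for _ in 1:steps
        w = averaged_step(shards, w, lr)
    end
    return w
end

w = train(shards, zeros(3))
println("recovered weights: ", round.(w; digits = 3))
```

For a model this small the communication cost dominates, so this only shows the shape of the approach; whether it pays off depends on how expensive each shard's gradient is.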

So, how do I get Flux to use multiple CPU threads while training? Apparently there is a way, but it’s not shown in any docs that I have found. Do I have to write some kind of custom training loop and decorate it with @parallel? Is there a way to tell Flux.train!() that I want it to be multithreaded?

I am super new to the Julia ML ecosystem so still feeling my way forward in the dark lol

That is strange. The kernels Flux uses for matmul and conv should use multiple threads by default. And that is the behaviour I observe. Here is a screenshot of CPU utilization (htop) on my 8-core laptop while training the conv_mnist example from the model zoo. I saw similar CPU utilization when running the mnist_mlp example.

Please take a look at the links I provided above. There are some code snippets related to Distributed training of AlphaZero (in this case it’s more about processes than threads). I hope you find something useful there. And again, I would especially recommend the videos I mentioned; to quote The Mandalorian, “This is the way”. (I mean that directed acyclic graphs could become the predominant method in the future.) :-) EDIT: I would like to add that on Julia below 1.8 there is a caveat related to BLAS settings: Julia starts by default with a maximum of 8 BLAS threads, and almost whatever you do, you cannot go above 32. With 1.8 this restriction was removed.

That model from the model zoo started off like this:

I am guessing this was some kind of compilation step.

Now it looks more like this:

I am going to have to dig through the script and see what I did wrong in my code.

I will do that, and probably come back with lots of new questions :smiley:

That looks like it is using 8 threads :slight_smile:
Dense layers mostly rely on matrix multiplication which is done using BLAS kernels.

The LinearAlgebra package has the functions BLAS.get_num_threads and BLAS.set_num_threads, which you can use to change how many threads BLAS uses.

You are always welcome. I will be delighted to learn more about ML/DL. At the same time, please be advised that I am not a long-time Julia user, nor is coding my area of expertise. However, I would like to reiterate that for Dagger and DaggerFlux, I am pretty sure you will receive top-notch support on the public forums directly from the maintainers.
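A minimal sketch of those two calls (BLAS.get_num_threads needs Julia 1.6 or later, as far as I know). Requesting one BLAS thread per logical CPU thread is just an example value, and the BLAS library may cap the request at its own limit:

```julia
using LinearAlgebra

println("BLAS threads before: ", BLAS.get_num_threads())

# Request one BLAS thread per logical CPU thread (an example value, not a recommendation).
BLAS.set_num_threads(Sys.CPU_THREADS)

# Read the value back, since the library may not grant the full request.
println("BLAS threads after:  ", BLAS.get_num_threads())
```

Note that this setting is separate from Threads.nthreads(): BLAS manages its own thread pool.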

Exactly. 8 is progress! I seem to remember that BLAS is limited to 8 threads below Julia 1.8, right?

I don’t think so. I am pretty sure I tried using 16 threads on a 16-core machine a while ago on either Julia 1.5 or Julia 1.6. But I don’t think runtime decreased much for me even though all threads were busy.

It’s hard to comment without an exact example; however, regarding the 8 BLAS threads, please see: BLAS performance testing for Julia 1.8. Not sure if this is of interest, but thanks to libblastrampoline one can also easily change the BLAS backend and use another library, e.g. MKL, instead of the default OpenBLAS. Depending on the matrix size and the hardware, this can sometimes provide significant benefits. BLAS performance can be checked with, e.g., BLASBenchmarksCPU.jl or BLASBenchmarksGPU.jl. Hope it helps.
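As a quick sanity check before reaching for the benchmark packages, something like the following sketch shows which backend is loaded and how one matrix multiplication scales with the BLAS thread count. It assumes Julia 1.7 or later for BLAS.get_config; the 1000x1000 size and the thread counts tried are arbitrary. With MKL.jl installed, my understanding is that running `using MKL` first would switch the backend, but that line is left out so the snippet needs only the standard library.

```julia
using LinearAlgebra

# Libraries currently loaded through libblastrampoline (OpenBLAS by default).
println(BLAS.get_config())

A = randn(1000, 1000)
B = randn(1000, 1000)
A * B  # warm-up call so compilation is not included in the timings

original = BLAS.get_num_threads()
timings = Dict{Int,Float64}()

for n in unique((1, 2, 4, original))
    BLAS.set_num_threads(n)
    actual = BLAS.get_num_threads()  # the library may cap the requested value
    timings[actual] = @elapsed A * B
    println(actual, " BLAS thread(s): ", round(timings[actual]; digits = 4), " s")
end

# Restore the setting that was active before the experiment.
BLAS.set_num_threads(original)
```

A single @elapsed per setting is noisy, so treat the numbers as a rough indication only.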

You can expect a minimum of 10s of compilation latency when taking the first gradient of Flux models (this 10s comes from the AD, not Flux) and possibly more depending on model complexity (>1min is not unheard of for complex vision models, some of this does come from Flux). One way to evaluate only runtime is to call gradient with your loss, model and some dummy data first and then do your perf measurements.
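A sketch of that warm-up pattern, assuming a recent Flux version (the `Dense(10 => 32)` pair syntax and taking the gradient with respect to the model itself may not work on old releases). The toy model, the data sizes and the helper name `grad_step` are made up for illustration:

```julia
using Flux

# Toy model and dummy data; the sizes are arbitrary.
model = Chain(Dense(10 => 32, relu), Dense(32 => 1))
x = randn(Float32, 10, 64)
y = randn(Float32, 1, 64)

loss(m, x, y) = Flux.mse(m(x), y)

# Wrapping the call in a named function means both calls below reuse the same compiled code.
grad_step(m, x, y) = gradient(m_ -> loss(m_, x, y), m)

# First call: includes compilation of the AD and model code.
t_first = @elapsed grad_step(model, x, y)

# Second call: compilation is done, so this is mostly runtime.
t_second = @elapsed grad_step(model, x, y)

println("first gradient:  ", round(t_first; digits = 3), " s")
println("second gradient: ", round(t_second; digits = 5), " s")
```

The dummy data should have the same element type and array type as the real data, otherwise the real run may trigger another round of compilation.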