How can I make Flux use all my CPUs?

Hi all,

I have already set export JULIA_NUM_THREADS=12, and in the REPL I confirmed that Threads.nthreads() returns 12. But when I ran the training, I saw Julia (Flux) using only 8 CPUs, and none of them were at full load.

How can I make Flux use all my CPUs and run them at full load?

Thanks!

Not sure if this is relevant to your specific case, but this happens to me if my batch_size/sample_count aren’t tuned correctly, usually because the CPU is spending all of its time moving data around rather than actually computing.

Thank you. Do you have any guidance on how to tune them correctly?

Try increasing the number of threads that OpenBLAS uses (OPENBLAS_NUM_THREADS), since at the moment Flux does not have any parallelization support of its own.
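
In case it is useful, the BLAS thread count can also be inspected and changed from inside a running session. This is only a minimal sketch: it assumes Julia 1.6 or later (for BLAS.get_num_threads), and BLAS is free to cap or ignore the requested value.

```julia
using LinearAlgebra

# Number of hardware threads Julia sees on this machine.
ncpu = Sys.CPU_THREADS

# Ask the BLAS library to use that many threads; it may cap the request.
BLAS.set_num_threads(ncpu)

# Julia-level threads (JULIA_NUM_THREADS) are a separate pool from BLAS threads.
println("Julia threads: ", Threads.nthreads())
println("BLAS threads:  ", BLAS.get_num_threads())
```

If the second number comes back lower than what you asked for, the limit is on the BLAS side rather than in your settings.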

This is actually a point on which I don’t know if there is a ‘correct’ answer. Choosing batch and sample sizes is usually a bit of trial and error for me. I do have a heuristic that I try to stick to and it has served me well for my purposes. I do tend to use a GPU for my work, so take this with a grain of salt in that regard.

\text{weight size (bits)} \cdot \text{number of weights in your model} \cdot \text{nsamples} \cdot \text{batchsize} \approx \text{size of memory}

With Flux, it seems that most of the defaults are Float32 for ease of computing on GPUs, meaning your weight size is 32 bits.
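
To make the heuristic concrete, here is a rough worked example that solves it for the batch size. Everything in it is an assumption for illustration: the layer sizes, the 8 GiB memory figure, the nsamples value, and the helper name dense_params are all made up.

```julia
# Heuristic: bits_per_weight * n_weights * nsamples * batchsize ≈ memory size (bits).
# All names and numbers below are illustrative assumptions.

# Parameters in a dense layer: weight matrix plus bias vector.
dense_params(nin, nout) = nin * nout + nout

# Hypothetical small model: 6 -> 256 -> 256 -> 1 dense layers.
n_weights = dense_params(6, 256) + dense_params(256, 256) + dense_params(256, 1)

bits_per_weight = 32          # Float32
memory_bits = 8 * 8 * 2^30    # assume 8 GiB of memory, expressed in bits
nsamples = 10                 # assumed value

# Solve the heuristic for the batch size that would roughly fill the memory.
batchsize = div(memory_bits, bits_per_weight * n_weights * nsamples)

println("weights: ", n_weights)
println("heuristic batch size: ", batchsize)
```

Treat the result as a starting point for trial and error rather than a hard limit.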

If anyone has a better system, I’d be really curious.

Also @Tomas_Pevny is almost certainly correct with the solution to your initial problem but I didn’t want to leave you hanging on your followup.

Thank you both, @Tomas_Pevny and @uadjet. But I am afraid that Flux does not support multi-CPU parallelization.
I set both

JULIA_NUM_THREADS=12
OPENBLAS_NUM_THREADS=12

and I changed the batch size.
Flux still does not use all 12 CPUs and, what’s worse, sometimes uses only a single CPU.

Correct, Flux does not itself support multi-CPU parallelism (yet). However, because it uses OpenBLAS for its BLAS operations (like the rest of Julia), and because OpenBLAS sometimes does multithreading, you will sometimes see multiple CPUs being used. Whether all 12 of them get used depends on which BLAS routines are called and on what OpenBLAS thinks is efficient (adding more CPUs does not always compute something faster, and can even slow things down).

So the bottom line is that you shouldn’t always expect Flux to use 100% of all your available CPUs; for that to be the case requires the right set of circumstances and the “right” calls into BLAS.
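
A quick way to see this on your own machine is to time a plain matrix multiplication at a few BLAS thread counts. This is just a sketch: the matrix size and the thread counts tried are arbitrary choices, and BLAS.get_num_threads assumes Julia 1.6 or later.

```julia
using LinearAlgebra

A = rand(Float32, 1024, 1024)
B = rand(Float32, 1024, 1024)
A * B  # warm-up so compilation time is not measured

original = BLAS.get_num_threads()
for n in (1, 2, 4)
    BLAS.set_num_threads(n)
    t = @elapsed A * B
    println(n, " BLAS thread(s): ", round(t * 1000, digits = 2), " ms")
end
BLAS.set_num_threads(original)  # restore the previous setting
```

For small matrices the timings often barely change with more threads, which is the situation described above.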

@jpsamaroo The problem is that I want to run my program on a workstation, and Flux does not use all of the CPU power. When I ran Flux on a 64-core machine, it used only 8 of the cores, which made training very slow.
I can also only use Flux for very small data sets.

Any chance you can either share your code here, or run your code under profiling (Profile.@profile)? That way we can start to investigate where your bottlenecks are and determine if there’s anything that can be done to improve the CPU utilization.
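
For reference, a profiling run could look roughly like the sketch below. The work function is a hypothetical stand-in for your training call; you would put your own Flux.train! call inside @profile instead.

```julia
using Profile

# Stand-in workload (hypothetical); replace with your own training call.
function work(n)
    A = rand(Float32, 256, 256)
    s = 0.0f0
    for _ in 1:n
        s += sum(A * A)
    end
    return s
end

work(1)            # run once so compilation is not profiled
Profile.clear()    # drop any samples from earlier runs
@profile work(200)

# Flat listing, sorted by how many samples landed in each function.
Profile.print(format = :flat, sortedby = :count)
```

The lines with the highest counts are where the time goes, which should show whether it is spent inside BLAS calls or elsewhere.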

@jpsamaroo Sure. That would be a great help. It’s a very simple regression problem:
y=f(x_1,x_2,x_3,x_4,x_5,x_6)
The code is simple too:

using Flux
using Flux: throttle
using Base.Iterators: repeated
using Pkg
using DelimitedFiles

X_train = readdlm("X_train.txt", ' ', Float64)
y_train = readdlm("y_train.txt", ' ', Float64)
X_valid = readdlm("X_valid.txt", ' ', Float64)
y_valid = readdlm("y_valid.txt", ' ', Float64)

X_train = transpose(X_train)
y_train = transpose(y_train)

X_valid = transpose(X_valid)
y_valid = transpose(y_valid)

dataset = repeated((X_train, y_train), 5000)

m = Chain(Dense(6, 256, relu),
          BatchNorm(256, relu),
          Dense(256, 256, relu),
          BatchNorm(256, relu),
          Dense(256, 1, relu))
println(m)
loss(x, y) = Flux.mse(m(x), y)
evalcb = () -> @show(loss(X_valid, y_valid))
opt = ADAM(0.02)
Flux.train!(loss, params(m), dataset, opt, cb = throttle(evalcb, 10))

I didn’t know this before, but apparently Julia’s OpenBLAS is hard-capped to 16 threads (which may show up as 800% CPU on your system); see the thread “Julia slower than R”.