Why do some Flux models train in parallel but not others?

Why does training this model use as many CPUs as I ask it to:

Chain(
    Conv((5, 5), imgsize[end]=>6, relu),
    MaxPool((2, 2)),
    Conv((5, 5), 6=>16, relu),
    MaxPool((2, 2)),
    flatten,
    Dense(prod(out_conv_size), 120, relu),
    Dense(120, 84, relu),
    Dense(84, nclasses)
)

And training this model uses only one?

Chain(
    # 28x28 to 14x14
    Conv((5, 5), 1=>8, relu, pad=2, stride=2),
    # 14x14 to 7x7
    Conv((3, 3), 8=>16, relu, pad=1, stride=2),
    # 7x7 to 4x4
    Conv((3, 3), 16=>32, relu, pad=1, stride=2),

    # Average pooling over each width x height feature map
    GlobalMeanPool(),
    Flux.flatten,
    Dense(32, 10),
    softmax
)

I pasted both into the MNIST example from the Flux model zoo, so everything else should be equal.

I’m sure there is a reason; I’m just not sure what it is.

I guess this is because Dense layers hit BLAS’s matmul, which is multi-threaded, whereas Conv is likely single-threaded.

Unless you are getting warnings about incompatible types from NNlib, convs should absolutely be multi-threaded. What does run single-threaded is (most of) the pooling operations, which the first model uses extensively and the second does not.

I’m not getting any warnings. What am I doing wrong that Conv layers are single-threaded?

Again, the conv layers are not. They likely just run fast enough that you only see the single-threaded max-pooling layers (and/or other single-threaded parts of the training loop, like data loading) in whatever monitoring tool you’re using. You can confirm for yourself that the conv layers are indeed using multiple threads by removing said pooling layers and benchmarking on a fixed dummy input to remove any data loading overhead.
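For instance, a minimal benchmark along those lines might look like the following sketch (the conv layer shapes are copied from the second model; the batch size and the use of BenchmarkTools.jl are my own assumptions):

```julia
using Flux, BenchmarkTools

# Just the conv layers from the second model, with the pooling removed
convs = Chain(
    Conv((5, 5), 1 => 8, relu, pad = 2, stride = 2),
    Conv((3, 3), 8 => 16, relu, pad = 1, stride = 2),
    Conv((3, 3), 16 => 32, relu, pad = 1, stride = 2),
)

# Fixed dummy input: a batch of 256 MNIST-sized images, so data loading
# contributes nothing to the measurement
x = rand(Float32, 28, 28, 1, 256)

@btime $convs($x)
```

Watching htop (or a similar monitor) while `@btime` loops should show directly whether more than one core is busy during the conv forward pass.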

So if pooling is single-threaded and thereby masks the multi-threaded nature of the Conv layers, why does the model with more pooling layers use 8 cores most of the time and the one with only one pooling layer use 1 core almost all of the time?

After actually running these models locally, it turns out it was the simplest answer and I was barking up the wrong tree 🙂

By default, Julia allocates but a single thread to the default thread pool (you can check with Threads.nthreads()). Because Flux conv layers use this thread pool, they end up running (mostly, more on that below) single-threaded. To make Julia use multiple threads, either pass -t [nthreads] or -t auto at startup. If you’re using VS Code, this is also exposed via the “Julia: Num Threads” option.
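To make that concrete, the checks and startup flags look roughly like this (the script name is a placeholder):

```shell
# Check from a running session:
#   julia> Threads.nthreads()
#   1

# Start Julia with an explicit thread count, or let it match your cores:
julia -t 8 train.jl
julia -t auto train.jl

# Alternatively, set it via the environment:
JULIA_NUM_THREADS=8 julia train.jl
```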

Now if conv layers are running single-threaded, why does model1 appear to use multiple threads? That’s because the matrix multiplication calls in Dense layers use a separate BLAS thread pool, which has >1 thread by default (you can check this with using LinearAlgebra; BLAS.get_num_threads()). Because model1 has two large dense layers to model2’s one small one, it spends a lot more time there and thus a lot more time in multi-threaded code. Conv layers also use matmuls under the hood, but these are generally smaller and need the aforementioned default thread pool for any significant parallelism.
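A quick sanity check of the two separate pools, as a sketch:

```julia
using LinearAlgebra

# The BLAS pool used by Dense-layer matmuls; typically > 1 out of the box
BLAS.get_num_threads()

# The default Julia pool used by NNlib's conv kernels; 1 unless Julia was
# started with -t / JULIA_NUM_THREADS
Threads.nthreads()

# To isolate the effect of the Julia pool, you can pin BLAS to one thread:
BLAS.set_num_threads(1)
```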


Now I can shave a whole 8% off training time by running 8 threads.

Training the Conv layers appears to be pretty memory-bound, so having lots of threads seems pretty pointless. Empirically, it mostly makes training slower.

Hi, out of curiosity, am I right that you have 12 cores on your machine? Are they real or hyper-threaded? One socket or two? And are you happy with this 8% improvement, if I may ask? I also wanted to mention the ThreadPinning.jl package, as well as STREAMBenchmark.jl and BandwidthBenchmark.jl. I believe they give additional insights that are particularly useful for ML applications.
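For reference, a minimal ThreadPinning.jl sketch (assuming its `pinthreads` / `threadinfo` API), which pins Julia threads to physical cores so hyperthread siblings and cross-socket migration don’t muddy measurements:

```julia
using ThreadPinning

pinthreads(:cores)   # one Julia thread per physical core, in order
threadinfo()         # visualize the resulting thread-to-core mapping
```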

I should have been more specific. There are 12 real cores with hyperthreading, so 24 virtual cores.

I have never found much performance gain using hyperthreading in memory-bound applications, so I seldom exceed the number of real cores.

My point here was rather related to dual-socket systems and the way threads are affinitized by the OS or Julia. You are not providing any details about the architecture or how Julia was set up, and I have not run those examples myself, so it’s quite hard to comment. As for HT, AFAIK you are probably right; in the case of dense computations it’s particularly visible due to the limited number of CPU vector units.

I’m fine with the 8% gain for now; I will likely want more in the future. But for now I have the main thing I wanted, which is the ability to train with more than 1 thread if I want to.

If anyone cares, the architecture is an old Dell PowerEdge R610 with dual Intel Xeon E5-2630s, which have 6 cores/CPU plus hyperthreading. It has been a while since I checked, but I think the motherboard has 4 RAM channels per CPU, hence my choice of 8 threads.

And yes, I know this hardware is old and slow by modern standards. I don’t make money with my machine learning / scientific computing / data analysis projects, so fancy new hardware is hard to justify over $32 servers.

I use different computers as well, my main one being released around the same time as yours. In general, I was referring to the architecture, the Julia setup, and some packages I found interesting. I believe the setup may in some cases provide additional benefits / speedups. I tried to share some of my own experiences with Julia, BLAS, and ML/AI models.

SimpleChains.jl might be useful if you are limited to CPU. The authors show some benchmarks in this blog post.


Interesting article about SimpleChains.jl. I was not aware of it. Thanks. I am wondering: a) Do you think such or similar techniques could be used for networks like AlphaZero.jl? And b) as for the PyTorch comparison, was it with IPEX / oneDNN, or is that not applicable at all in this case?

NNlib is definitely oversubscribing threads. I’m not sure how much it has an impact in your case because the default im2col algorithm is memory-intensive and GC isn’t great at keeping up with heavily allocating multi-threaded code, but https://github.com/FluxML/NNlib.jl/pull/395 suggests some impact.

As noted in that PR, the biggest blocker for toying around with threading in NNlib is proper benchmarking code + infrastructure. If anyone is interested in that, please reach out! Until then, I second the suggestion to check out SimpleChains if your model and inputs are sufficiently small (e.g. MNIST-sized).

I did some additional reading. Sorry about my previous questions … too focused on one area.