Julia distributed and multithreaded

In my code I leverage multithreading through things like Threads.@threads and by starting Julia with julia --threads=auto. Alongside my code I now wish to use Juniper.jl, which relies on distributed computing. The docs suggest launching Julia with:

julia -p n provides n worker processes on the local machine. Generally it makes sense for n to equal the number of CPU threads (logical cores) on the machine.

I’m running this on an M1 Max, which has 10 cores (although only 8 are “high performance”); note that I believe ARM CPUs only have one thread per core. If I launch Julia with julia -p 8 --threads=8, is this suitable? Or am I somehow saying I have 8 processes and each process has 8 threads?

Note that on my machine julia --threads=auto gives me 8 threads. I presume it only chooses the “high performance” ones.

Eh, no: just use 8 threads OR 8 processes.

8 processes × 8 threads means 64 threads; you absolutely don’t want that.
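As a sketch of the arithmetic (assuming each worker process inherits the 8-thread setting; the variable names are purely illustrative):

```julia
# Bookkeeping for `julia -p 8 --threads=8` on an 8-performance-core machine.
# Assumption: each of the 8 worker processes also gets 8 threads.
worker_processes    = 8
threads_per_process = 8
total_threads = worker_processes * threads_per_process
println(total_threads)        # 64 threads...
println(total_threads ÷ 8)    # ...roughly 8× more than the cores can serve
```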


Are you sure? That would mean my multithreaded code is bound to a single thread.

➜  tmp julia -p 8  --threads=1                                                      [13/Oct/22 | 9:47]
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.8.2 (2022-09-29)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

julia> Threads.@threads for ii in 1:8
           @show Threads.threadid() myid()
       end
Threads.threadid() = 1
myid() = 1
Threads.threadid() = 1
myid() = 1
Threads.threadid() = 1
myid() = 1
Threads.threadid() = 1
myid() = 1
Threads.threadid() = 1
myid() = 1
Threads.threadid() = 1
myid() = 1
Threads.threadid() = 1
myid() = 1
Threads.threadid() = 1
myid() = 1

-p will literally spawn multiple entire copies of Julia. This is multiprocessing, in which each process has its own memory, etc. Distributed code uses communication to move work between the different processes, which do not share memory. Multithreading is more lightweight, and all threads can access the same memory. On a single machine, multithreading is almost always the better choice.

Sometimes it can be useful to combine multiprocessing and multithreading, but that is more relevant when you are scaling across an entire cluster with many machines. Even then, you should only use as many threads as you have cores (or hardware threads, for SMT processors), or they will be oversubscribed; as stated earlier, you would have 64 total threads, which is far too many.
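To make the distinction concrete, here is a minimal sketch: threads write directly into shared memory, while pmap ships work to worker processes as messages (falling back to the local process when no workers have been added):

```julia
using Distributed

# Multithreading: every thread sees the same `results` array.
results = zeros(Int, 8)
Threads.@threads for i in 1:8
    results[i] = i^2      # safe here: each iteration writes its own slot
end

# Multiprocessing: no shared memory; pmap serializes the input, runs the
# function on a worker process, and sends the result back.
squares = pmap(x -> x^2, 1:8)
```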


Thank you, I had some awareness of this, which is why my own work uses multithreading. My question is what to do if my code is multithreaded and one of my dependencies leverages multiprocessing. Here is some pseudocode of what I’m trying to run:

a = 3
b = my_multi_threading_function(a)
c = junipers_multi_processing_function(b)
@show c

The mentioned oversubscription was my initial concern.

If you are sure that the Juniper code will not use multiple threads, then this is probably fine. But maybe consider opening an issue on their GitHub page requesting an option to use multithreading instead of forcing a distributed model.

I see, so you’re suggesting julia -p 8 --threads=8 will be fine because:

  • My code only uses multi-threading
  • And, Juniper only uses multi-processing

I’m sure they paid attention to this design decision; I’ll try to find the reasoning behind it. I know that issues have arisen in the past with multithreading and one of their dependencies (Ipopt).


Yes, it’s easier to mess something up with multithreading, due to shared state, race conditions, global variables, etc., so I wouldn’t be surprised if they ran into issues.

And yes, I don’t see why you would run into problems, and if you did, it would probably just be performance-related rather than an actual error.


But doesn’t your code run inside Juniper’s optimization routine? If so, you’re still oversubscribing by a factor of 8.

No, it’s more like the code snippet above: I run my code, then pass its results to Juniper. There’s probably a separate issue here in that the solvers Juniper calls will multithread outside of my Julia instance (e.g. regardless of Threads.nthreads(), Gurobi always picks up 10 cores on my laptop).
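If a solver’s appetite for cores becomes a problem, it can usually be capped through solver parameters rather than Julia flags. A sketch, assuming JuMP with Gurobi (whose Threads parameter limits its internal thread pool):

```julia
using JuMP, Gurobi   # assumes both packages and a Gurobi license are available

model = Model(Gurobi.Optimizer)
# Gurobi manages its own thread pool, independent of Julia's
# Threads.nthreads(); the "Threads" parameter caps how many cores it grabs.
set_optimizer_attribute(model, "Threads", 1)
```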

Mostly yes (and I believe still all of Apple’s cores), but not all:

Arm’s first-ever simultaneous multithreaded CPU core. […]

This is significant because Arm has resisted simultaneous multithreading (SMT), instead opting to lash together lots of cores in its big.LITTLE arrangement: a cluster of small cores running apps, and a cluster of larger cores powering up to take on bursts of intensive work. […]

Arm has toyed with SMT, mulling adding it to its blueprints on and off publicly since around 2010, though it always discarded the idea and settled on multiple single-threaded cores instead. It produced a paper in 2013 [PDF] setting out why it wasn’t happy with SMT: for mobile apps, it doesn’t make sense in terms of performance gain and power usage, although it noted other settings could benefit from it.

You see, not all applications are boosted by SMT, and while some gain performance increases from running multiple threads through each available core, some programs do not benefit at all or are penalized by it.
[…]
Amusingly, just as Arm is embracing SMT, not only is Intel cooking up its own version of big.LITTLE for its future x86-64 chips, but some folks recommend disabling Intel’s Hyper-Threading feature for security reasons – particularly if your software doesn’t benefit from it.

It’s not clear to me if that one (seemingly more capable, though announced earlier) is also multithreaded, or if this has been done for more ARM CPUs (besides automotive):

a safety feature normally reserved for real-time CPUs into their highest-end application processor core, in a bid to lure system-on-chip designers and automakers to use the technology to literally steer future self-driving cars.

Specifically, Arm will today announce it has added its Split-Lock feature, found in its Cortex-R 32-bit cores used in real-time and safety-critical systems, to the 64-bit Cortex-A76. The result is the Cortex-A76AE. The AE stands for “automotive enhanced,” indicating it’s aimed at running code controlling self-driving road vehicles.


Or some combination n × m = 8 (here for an 8-core CPU), e.g. julia -t 4 -p 2? I think that if you can use threads, that’s better, and (I’m not sure about this) it may then be best to max out on threads, though your code might not be that scalable. I’m not sure -p will help you then. [If you have many CPUs, as in HPC, then you can max out on -t and still usefully add -p.]

I did wonder about splitting it, although this would be a little annoying as both workloads can be long-running. I’m currently doing some work to get things running on our HPC; once that is complete, things should make sense.

Thanks for the ARM SMT links, they were interesting.

I think in your case, as the workloads are completely separate (one after the other), there’s no reason you can’t have 8 threads and 8 processes at the same time, especially as it’s unlikely that your external library will use multithreading and multiprocessing at the same time.

Also, you don’t have to start the processes from the command line:

julia -t 8

Then:

using Distributed
addprocs(8; exeflags="--threads=1")  # each worker gets a single thread

This makes the additional processes have only a single thread each. At any one time you will likely be using only multiprocessing or only multithreading, not both at once, so you can benefit from parallelism in both your code and the library code.
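As a quick sanity check (a sketch, assuming the addprocs call above has run in a session started with julia -t 8), you can ask every process how many threads it actually got:

```julia
using Distributed

# Master (process 1) keeps its 8 threads; each added worker should report 1.
for p in procs()
    n = remotecall_fetch(Threads.nthreads, p)
    println("process $p has $n thread(s)")
end
```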


I like that idea of configuring the worker processes to prevent oversubscription; I’ll use that.

Cheers.
