Thanks a lot. If I have a single processor with 8 logical cores, can I parallelize with multiple processes? I thought that I was forced to do multithreading when there is only one processor with shared memory. Is that right?
Your operating system will happily run multiple processes, each with their own memory address space, on a shared-memory machine. (Even with a single core, for that matter!)
Thanks. I think I am understanding. In a general setting, what is the benefit of multithreading over distributed parallelism on a shared-memory node? And in the case of using PyCall?
There are many books and other resources on the different forms of parallelism and their pros and cons. You should do a bit of reading on the basic principles before trying anything (I like this book). Shared memory threading can be easier to implement because you don’t need to decide which process stores what data, or pass data between processes explicitly. Distributed memory parallelism scales better to large compute resources, though.
In the case of PyCall or PythonCall, the key point is that CPython is not threadsafe, so only one thread can access the Python interpreter at a time. Whereas multi-process parallelism is running multiple copies of the Python interpreter in parallel, which works just fine.
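A minimal sketch of that difference in Python itself (the `square` worker is just a placeholder): with process-based parallelism, each process runs its own CPython interpreter, so there is no shared interpreter state to protect.

```python
from multiprocessing import Pool

def square(n):
    # CPU-bound work: in CPython, multiple threads running this would
    # serialize on the interpreter, but separate processes each have
    # their own independent Python interpreter.
    return n * n

if __name__ == "__main__":
    with Pool(processes=4) as pool:            # 4 independent interpreters
        results = pool.map(square, range(8))   # work split across processes
    print(results)  # [0, 1, 4, 9, 16, 25, 36, 49]
```

This is the same reason multi-process parallelism with PyCall/PythonCall works fine: each worker process holds its own copy of the Python interpreter.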
Thank you very much!
I have looked at the book. I haven't been able to work out whether, in the case of, for example, a CPU with 4 cores with 2 threads each, multithreading really lets me run 8 independent tasks (single code, multiple data) in parallel, or only up to 4 tasks. I ask this because I have read that multithreading is not really parallel. I do not know if I have understood well.
And if I understood correctly, with distributed parallelism I would be able to run up to 4 tasks truly in parallel?
I think you’re mixing the general concept of multithreading with hardware SMT or “hyper-threading”, where the latter is a way to rapidly switch between multiple threads on a single CPU in order to improve efficiency by masking latency.
There are four distinct concepts here:
- How many threads you run, which is independent of the number of cores. You can spawn 1000 threads on a single core if you want, and the operating system will happily interleave their execution (give them each little slices of CPU time). (Same for processes: the difference compared with threads is whether they share an address space.)
- How many CPU cores you have: this limits the number of computations that you can actually do simultaneously.
- SMT, which allows a single CPU core to be rapidly switched between two or more threads to mask latency (while one thread is waiting on memory, the other thread can be executing). This can give greater efficiency (reducing the idle time where the CPU is doing nothing), but doesn’t actually perform multiple computations simultaneously. How much it actually helps you depends on the application.
- Parallel speedup: how much performance improvement you get from parallelism. This is typically less than the number of cores or threads.
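The first point (thread count is independent of core count) can be seen in a small sketch, shown in Python for concreteness; the OS time-slices all the threads onto however many cores exist, so they all complete regardless of the hardware:

```python
import threading

counter = 0
lock = threading.Lock()

def work():
    global counter
    with lock:           # threads share one address space,
        counter += 1     # so they must coordinate access to shared data

# Far more threads than any machine has cores; the OS interleaves them.
threads = [threading.Thread(target=work) for _ in range(1000)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 1000: every thread ran, however many cores you have
```

Note that spawning 1000 threads here gives no parallel speedup for compute-bound work; it only demonstrates that the OS will schedule them all.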
If you have 4 cores, where each core has 2-way SMT, you might typically use either 4 or 8 threads (or processes … depends on whether you want them to share an address space). Probably you will get < 4x speedup in any case. 8 threads definitely won’t give you 8x speedup, but might give a small performance boost over 4 threads in problems where SMT helps.
(In some problems involving lots of I/O, like a web server, you may use vastly more threads than you have cores, because lots of threads spend lots of time doing nothing while they wait for I/O. Such cases often use cooperative multithreading or “green threads”, which Julia supports too, rather than (or in addition to) “true” hardware threads. There are lots of kinds of threads!)
Ok, thanks. Now it is more clear. I have a single-code, multiple-data problem where each of 1000 tasks is completely independent of the others. I could use bash to run them in parallel, but I prefer to do it with Julia on a single node of a cluster, since afterwards I do a serial analysis of all the runs and save all the information together in a single data frame. Should I use multithreading or distributed?
I understand that if I decided to use two or more nodes (without shared memory) then I would be forced to use the distributed scheme, discarding multithreading.
Going back to my example with 4 physical cores with 2 threads each (8 logical cores): how many independent tasks can really be processed in parallel, 4 or 8, if I use the distributed scheme?
Up to you. I mentioned some pros and cons above.
A relevant question is: How much are you allocating in each task? (And are the allocations necessary?)
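For the 1000-independent-tasks workflow above, a process-pool sketch looks like this (Python for illustration; `simulate` is a placeholder for the real per-task computation, and in Julia the analogous tool would be `Distributed.pmap` over your task list):

```python
from multiprocessing import Pool

def simulate(task_id):
    # Placeholder for one independent run; replace with the real work.
    return {"task": task_id, "result": task_id ** 2}

if __name__ == "__main__":
    # Each task runs in its own process. With 4 physical cores (8 logical),
    # roughly 4 compute-bound tasks make progress truly in parallel;
    # SMT may add a modest boost on top of that.
    with Pool() as pool:
        rows = pool.map(simulate, range(1000))
    # rows come back in task order and can be collected serially,
    # e.g. into a single data frame, after the parallel phase.
    print(len(rows))  # 1000
```

The pool keeps all workers busy by handing out tasks as workers free up, so you don't need to match the number of tasks to the number of cores.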