I applied multithreading to a for loop, which gives a runtime speedup. My system has 12 cores, and when I adjust the number of threads (e.g. 4, 8, and 12) I observe an improvement in runtime. However, the improvement is not linear. For example, with 4 threads the run is not 4 times faster than with a single thread, and with 8 threads it is not twice as fast as with 4. Is there a way to find out how the threads are assigned to the tasks, and how can I manage them to get a better speedup?
Not an answer to your question, but there’s some discussion of (lack of) speedup here: Again on reaching optimal parallel scaling - #5 by carstenbauer
Unfortunately, without further information it is hardly possible to help you here. What does your loop kernel look like? Do you use @spawn? And, most importantly, do you have a good reason to expect linear scaling in the first place? Your kernel could easily be memory bound (quite common), in which case using more threads might not help you much. So you shouldn’t expect linear scaling (not even theoretically) for an arbitrary loop computation.
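To get a feeling for whether a loop is memory bound, one option is to time two toy kernels at different thread counts. The following is only a rough sketch: the function names axpy_threaded! and heavy_threaded!, the array size, and the inner iteration count are made up for illustration and are not taken from the original loop.

```julia
using Base.Threads

# Hypothetical memory-bound kernel: very little arithmetic per element loaded and stored.
function axpy_threaded!(y, a, x)
    @threads for i in eachindex(x, y)
        @inbounds y[i] = a * x[i] + y[i]
    end
    return y
end

# Hypothetical compute-bound kernel: many floating-point operations per element.
function heavy_threaded!(y, x)
    @threads for i in eachindex(x, y)
        s = @inbounds x[i]
        for _ in 1:20
            s = sin(s) + 1.0
        end
        @inbounds y[i] = s
    end
    return y
end

n = 2_000_000          # assumed problem size, adjust to your machine
x = rand(n)
y = zeros(n)

# Warm-up calls so that compilation time is not measured.
axpy_threaded!(y, 2.0, x)
heavy_threaded!(y, x)

t_mem = @elapsed axpy_threaded!(y, 2.0, x)
t_cpu = @elapsed heavy_threaded!(y, x)

println("threads = ", nthreads(),
        "  memory-bound kernel: ", t_mem, " s",
        "  compute-bound kernel: ", t_cpu, " s")
```

Running the script with julia -t 1, julia -t 4, julia -t 8, and so on, and comparing the two timings, should show how differently the two kernels scale on your machine. If the first kernel stops improving early while the second keeps getting faster, that pattern is at least consistent with a memory-bound loop.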
Regarding your question, note that with @spawn, tasks might not be sticky; that is, they can, in principle, be moved between different Julia threads (@threads gives you sticky behavior). In any case, you can use threadid() to see which thread a task is running on. (Perhaps https://nbviewer.org/github/carstenbauer/JuliaNRWSS21/blob/main/backup/load_balancing.ipynb might be an instructive example.)
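As a small illustration of the threadid() suggestion, here is a sketch that records which thread handled each iteration, once with @threads and once with @spawn. The loop length and the array names ids_threads and ids_spawn are arbitrary choices for this example. One assumption to flag: I request the :static schedule explicitly, because on newer Julia versions (1.8 and later) the default schedule of @threads is :dynamic, where tasks are not guaranteed to stay on one thread.

```julia
using Base.Threads

n = 16   # arbitrary number of iterations for the demonstration

# @threads with the :static schedule: iterations are split into contiguous
# chunks, and each chunk is pinned to one thread.
ids_threads = zeros(Int, n)
@threads :static for i in 1:n
    ids_threads[i] = threadid()
end

# @spawn: one task per iteration; the scheduler decides where each task runs.
# Note that threadid() only reports the thread at the moment it is called.
ids_spawn = zeros(Int, n)
@sync for i in 1:n
    @spawn begin
        ids_spawn[i] = threadid()
    end
end

println("nthreads()       = ", nthreads())
println("@threads :static : ", ids_threads)
println("@spawn           : ", ids_spawn)
```

Started with, say, julia -t 4, the @threads line should show equal thread ids in contiguous blocks, while the @spawn line can look different from run to run. If the work per iteration is uneven, this kind of printout may help to spot iterations piling up on one thread.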