Is 4 the best number of threads for parallel computing with @distributed?

I am doing a numerical calculation in Julia, and I recently tested the speed of my code with parallel computing by applying @distributed to the summation:
(This is just part of my code. The attached part is enclosed in another loop, so Sm is a local variable and can be used in the \eta loop directly. I am sure my code is correct and gives correct results.)
Then I collected the running times for different numbers of threads (1, 4, 8, 16 and 32):
(9:52, 7:14, 6:54, 6:50 and 7:34 in “HH:MM”, respectively)
It looks to me that going from 1 thread to 4 threads gives the biggest improvement. As the number of threads increases further, the performance doesn’t improve much, and with 32 threads it even took longer than with 4 threads.
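For reference, the speedups implied by these timings can be worked out with a short sketch. It assumes the times are hours:minutes as stated above; the names `times` and `to_minutes` are made up for this illustration and are not part of my actual code.

```julia
# Reported wall times as (hours, minutes), keyed by the requested thread count.
times = Dict(1 => (9, 52), 4 => (7, 14), 8 => (6, 54), 16 => (6, 50), 32 => (7, 34))

# Convert an (hours, minutes) tuple to total minutes.
to_minutes((h, m)) = 60h + m

baseline = to_minutes(times[1])

# Speedup relative to the 1-thread run; ideal linear scaling would equal n.
for n in sort(collect(keys(times)))
    speedup = baseline / to_minutes(times[n])
    println(n, " threads: speedup ≈ ", round(speedup, digits = 2))
end
```

This gives roughly 1.36x at 4 threads and never more than about 1.44x, far from linear scaling.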
I am sure that I allocated enough threads when running this code on my cluster. I used “export JULIA_NUM_THREADS=Value” (Value = 1, 4, 8, 16, 32) in my scripts to ensure that Julia starts with that many threads. I also used:
N_t = nthreads()
println("Number of Threads = ", N_t)
in my code to check and confirm that I did allocate that many threads.
So my questions are:

  1. Has anyone else tested or found that the speedup from parallel computing with @distributed in Julia is not linear in the number of threads? Is my case special or common?
  2. If my case is common, is there any explanation for why the running time is not 1/4 with 4 times as many threads, and why 4 threads seems to give the best improvement?
  3. I am very new to Julia and even to programming. Do other languages show similar performance in parallel computing?

Any answer is welcome. Thanks for reading and replying.

Linear scaling is usually regarded as the best possible outcome (strictly speaking it is not, since superlinear scaling is possible), and it is rarely observed. The speedup you will see depends on your architecture and the nature of your program. There are not really any universal, clear-cut truths in parallel programming that I know of.


Thank you tbeason! I used to think that linear scaling is universal.

@distributed is for multiprocessing, whereas the JULIA_NUM_THREADS environment variable is for multithreading.
For multiprocessing with @distributed, use addprocs() to add worker processes.
Alternatively, use Threads.@threads or Threads.@spawn for multithreading.
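As a minimal sketch of the multiprocessing route, the snippet below adds worker processes and then runs a reduction with @distributed. The summand `term` and the worker count of 4 are made up for illustration; they stand in for the real summation, which is not shown in the thread.

```julia
using Distributed

# Add worker processes only if none exist yet.
# The count of 4 is an arbitrary choice for this sketch.
if nprocs() == 1
    addprocs(4)
end

# Hypothetical stand-in for the expensive summand.
# @everywhere defines it on the master and on every worker process.
@everywhere term(k) = 1.0 / k^2

# With a (+) reducer, @distributed splits the range across the workers,
# sums each piece locally, and combines the partial sums on the master.
total = @distributed (+) for k in 1:1_000_000
    term(k)
end

println("workers = ", nworkers(), ", total = ", total)
```

Note that JULIA_NUM_THREADS plays no role here: the parallelism comes from the worker processes that addprocs() creates (or that are requested at startup with `julia -p N`).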


Thank you lungben! I rewrote my code to use addprocs() to add worker processes, and it became much faster. I just noticed that previously I was running my code on only one process.
Also, I realized that there are two ways of doing parallel computing in Julia. One is Threads.@threads or Threads.@spawn, used with $env:JULIA_NUM_THREADS = <nthreads>, which works with shared memory. The other is Distributed.@distributed with addprocs() for multiprocessing, which distributes the job across several processes, each with its own memory.
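For comparison, here is a minimal sketch of the shared-memory route using Threads.@spawn. The summand `term` and the helper `threaded_sum` are hypothetical names chosen for this illustration, not my actual code. It only runs in parallel if Julia was started with more than one thread (for example through JULIA_NUM_THREADS).

```julia
using Base.Threads

# Hypothetical stand-in for the expensive summand.
term(k) = 1.0 / k^2

# Split 1:n into one chunk per task, sum each chunk in its own task,
# then combine the partial sums. All tasks share the same memory.
function threaded_sum(n; ntasks = nthreads())
    chunks = Iterators.partition(1:n, cld(n, ntasks))
    tasks = [Threads.@spawn sum(term, chunk) for chunk in chunks]
    return sum(fetch, tasks)
end

println("threads = ", nthreads(), ", total = ", threaded_sum(1_000_000))
```

Each task returns its own partial sum, so no locking is needed; the only shared state is the read-only range.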