I am doing a numerical calculation in Julia, and I recently tested the speed of my code after parallelizing a summation with @distributed:
(This is just part of my code. The attached snippet is enclosed in another loop, so Sm is a local variable and can be used directly in the \eta loop. I am confident the code is correct and gives correct results.)
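For readers without the attachment, the summation is structured roughly like this (the range and the term being summed are illustrative placeholders, not my actual variables):

```julia
using Distributed
addprocs(4)  # start 4 worker processes before the loop

# @distributed with a (+) reducer splits the range across the available
# workers and sums the partial results into a single value.
Sm = @distributed (+) for k in 1:1_000_000
    1.0 / k^2   # each worker sums its own chunk of terms
end

println(Sm)  # should be close to pi^2/6
```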
Then I collected the running time for different numbers of threads (“HH:MM”):
- 1 thread: 9:52
- 4 threads: 7:14
- 8 threads: 6:54
- 16 threads: 6:50
- 32 threads: 7:34
It looks to me like the biggest improvement comes from going from 1 thread to 4. As the thread count increases further, performance barely improves, and with 32 threads the run actually took longer than with 4.
I am sure that I allocated enough threads when running this code on my cluster. I used “export JULIA_NUM_THREADS=Value” (Value = 1, 4, 8, 16, 32) in my scripts to ensure that Julia starts with that many threads. I also used:
"N_t = nthreads()
println("Number of Threads = ", N_t)"
in my code to check and confirm that I did allocate that many threads.
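For completeness, my startup check looks roughly like this (names are illustrative). I also print the worker count, since I am not sure whether @distributed looks at threads or at worker processes:

```julia
using Distributed
using Base.Threads

# Thread count, controlled by the JULIA_NUM_THREADS environment variable.
N_t = nthreads()
println("Number of Threads = ", N_t)

# @distributed runs on worker processes (added with addprocs or julia -p),
# so the worker count may differ from the thread count.
println("Number of Workers = ", nworkers())
```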
So my questions are:
- Has anyone else observed that parallel computing with @distributed in Julia does not scale linearly with the number of threads? Is my case special or common?
- If my case is common, is there a possible explanation for why the running time is not 1/4 of the original with 4 times as many threads, and why 4 threads seems to give the best improvement?
- I am very new to Julia and to programming in general. Do other languages show similar performance in parallel computing?
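To give a concrete sense of what I mean by “not linear”: if some fraction of the work cannot be parallelized, Amdahl’s law predicts diminishing returns like the ones I am seeing. A quick sketch (the serial fraction s = 0.25 here is a made-up illustration, not measured from my code):

```julia
# Amdahl's law: speedup(n) = 1 / (s + (1 - s)/n), where s is the serial
# fraction of the program and n is the number of parallel units.
# With s = 0.25 the speedup can never exceed 1/s = 4, no matter how
# many threads or workers are added.
amdahl(n; s = 0.25) = 1 / (s + (1 - s) / n)

for n in (1, 4, 8, 16, 32)
    println("n = ", n, "  predicted speedup = ", round(amdahl(n), digits = 2))
end
```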
Any answer is welcome. Thanks for reading and replying!