I want to know the correct way to run a Julia script (part of which requires parallelization) on a remote HPC cluster.
My PBS script looks like:
#PBS -l walltime=01:00:00,nodes=1:ppn=24,mem=62gb
and test.jl looks like:
@everywhere function sqr(x)
y = pmap(x->sqr(x), 1:1e5)
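The test.jl snippet above appears cut off (the body of sqr and its closing end are missing). A minimal runnable sketch of what such a script could look like follows; the x^2 body, the addprocs(2) call and the shortened range are assumptions for illustration, not the original code.

```julia
using Distributed

# Assumption: 2 workers so the sketch runs anywhere; on the 24-core node
# this could be addprocs(24), or the script could be started with `julia -p 24`.
addprocs(2)

# Assumed body: the original definition is not shown in full.
@everywhere function sqr(x)
    return x^2
end

# Range shortened from 1:1e5 to keep the sketch quick.
y = pmap(x -> sqr(x), 1:100)
```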
I am using 1 node with 24 cores, as specified in the PBS script above.
My question is: do I need to declare export JULIA_NUM_THREADS=24 in the PBS script (since the default number of threads is 1, as reported by Threads.nthreads()), or is the specification ppn=24 sufficient?
I tried with and without export JULIA_NUM_THREADS=24 and didn’t notice any difference in the execution time of my actual code. I am not sure whether my code is being properly parallelized.
Any suggestions on the correct way to specify threads and processes in the PBS and/or Julia script when using pmap for parallelization?
I am using Julia 1.5.0.
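One way to check whether work is actually being spread out is to print the thread and worker counts and ask pmap which process handled each item. A small sketch follows; the worker count of 2 is only an assumption so that it runs on any machine.

```julia
using Distributed

# Assumption: 2 workers for illustration; a 24-core node could use more.
addprocs(2)

# Threads and worker processes are counted independently of each other.
println("threads in this process: ", Threads.nthreads())
println("worker processes:        ", nworkers())

# Each item returns the id of the process that handled it.
ids = pmap(_ -> myid(), 1:20)
println("process ids used by pmap: ", sort(unique(ids)))
```

If the printed ids are all worker ids (not 1, the master process), pmap is dispatching to the workers regardless of the JULIA_NUM_THREADS setting.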
You’re using distributed memory parallelism, so you do not need threads.
pmap will automatically run the function on free workers. However, for such a small function you won’t see any advantage from using more workers, as communication time will swamp the gains from parallel execution.
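One knob that can reduce the communication overhead for cheap functions is pmap’s batch_size keyword, which sends several elements per remote call. A rough sketch, assuming 2 workers and a squaring function as a stand-in for the real work:

```julia
using Distributed

addprocs(2)                 # assumption: small worker count for illustration

@everywhere sqr(x) = x^2    # assumed stand-in for a very cheap function

xs = 1:2_000

# Default: one remote call per element, so messaging dominates the run time.
y_single = pmap(sqr, xs)

# Batched: 500 elements per remote call, so far fewer messages are sent.
y_batched = pmap(sqr, xs; batch_size=500)
```

Both calls return the same values in the same order; only the number of messages differs.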
By the way, since you’re using PBS, consider looking into ClusterManagers.jl if you want to use more than one node.
You might actually want to use 24 threads on each node, but then you’ll have to call some threading operations… For example, within your real function you might use @threads to parallelize across threads.
Thank you @jishnub. Yes, for now I simply request 1 node and a certain number of cores on that node through the PBS script. Then I add worker processes using addprocs in my Julia code. This seems to be working well for now. Also, yes, for this small function sqr this won’t help much, but in my actual code I have another function that does a large computation.
Regarding multiple nodes: I do plan to use more than one node, but since I am not sure how pmap actually behaves when the cores are on different nodes, I have been avoiding it so far. As you said, I will probably look into ClusterManagers.jl for this.
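As an alternative to ClusterManagers.jl, the built-in SSH form of addprocs can start workers on every node of a PBS job; pmap only sees a pool of workers, so the pmap call itself stays the same. The sketch below is an illustration under assumptions: the helper name machines_from_nodefile is made up, the node file is assumed to list one hostname per allocated core (Torque style), and passwordless SSH plus the same Julia install on every node are assumed.

```julia
using Distributed

# Hypothetical helper: turn the lines of a PBS node file into
# (hostname, count) pairs, the format accepted by addprocs(machines).
function machines_from_nodefile(path)
    counts = Dict{String,Int}()
    for line in eachline(path)
        host = String(strip(line))
        isempty(host) && continue
        counts[host] = get(counts, host, 0) + 1
    end
    return [(host, n) for (host, n) in counts]
end

# Only runs inside a PBS job, where PBS_NODEFILE is set.
# Assumption: passwordless SSH between the allocated nodes.
if haskey(ENV, "PBS_NODEFILE")
    addprocs(machines_from_nodefile(ENV["PBS_NODEFILE"]))
end
```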
I guess what you have mentioned is a different method of parallelization, using the @threads macro. This falls under multi-threading and not distributed computing, as far as I know (please correct me if I am wrong). And to use the @threads macro, I do need to set JULIA_NUM_THREADS in my PBS script, as pointed out by @jishnub. Right?
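For reference, a minimal @threads sketch. It only uses more than one thread if Julia was started with JULIA_NUM_THREADS set (or, from Julia 1.5 on, with the -t / --threads flag). The function name and workload here are made up for illustration.

```julia
using Base.Threads

# Hypothetical workload: fill an array with squares, one slot per iteration.
function squares_threaded(n)
    out = Vector{Int}(undef, n)
    @threads for i in 1:n
        out[i] = i^2    # each iteration writes its own slot, so no data race
    end
    return out
end

println("threads available: ", nthreads())
out = squares_threaded(10)
```

The result is the same for any thread count; only the run time of a heavier loop body would change.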
Additionally, I have tried using the @threads macro but for some reason couldn’t get parallelization as good as I do with pmap, so I have been sticking to pmap for now.
Is multi-threading preferred over distributed processing (or the other way round) in certain scenarios?
pmap will map your problem across worker processes, which can sit on multiple machines… Within each machine you can then map your problem across multiple threads. They complement each other.
Okay. I get what you’re saying, but I am not sure how I would achieve this in actual code. For example, say I requested 2 nodes with 24 cores each, set 48 threads using JULIA_NUM_THREADS=48, and also added some worker processes using addprocs. Now, how do I use pmap and @threads together? Should I just write them both in front of the function that needs to be parallelized? Can you please give a simple example of doing so?
First, since your nodes have 24 CPUs each, you should set JULIA_NUM_THREADS close to 24, at most maybe 26 (sometimes a little oversubscription helps).
Then, how to use them together?
pmap will map across machines… so write the function that you pmap so that it uses @threads internally, and then each machine will use multiple threads (this assumes one worker process per machine).
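A small sketch of that pattern, under assumptions: a single local worker with 2 threads stands in for one node with 24 threads, and heavy is a made-up stand-in for the real function. The exeflags keyword of addprocs passes the --threads flag (available from Julia 1.5) to the worker.

```julia
using Distributed

# Assumption: 1 local worker with 2 threads, standing in for one worker
# per node with about 24 threads each on the real cluster.
addprocs(1; exeflags="--threads=2")

@everywhere function heavy(n)
    # Hypothetical expensive function: the inner loop is split across threads.
    partial = zeros(Float64, n)
    Threads.@threads for i in 1:n
        partial[i] = sum(sqrt(k) for k in 1:i)   # each i writes its own slot
    end
    return sum(partial)
end

# pmap distributes the inputs across workers; each call then uses the
# threads of the worker it lands on.
results = pmap(heavy, [100, 200, 300])

# Check how many threads the worker actually has.
threads_per_worker = remotecall_fetch(Threads.nthreads, first(workers()))
```

So pmap and @threads are not written side by side: pmap goes at the call site, and @threads goes inside the function being mapped.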
Okay. I will probably try this out on some of my code and see how it works. Thank you!