Correct way of parallelizing on a HPC remote cluster machine

I want to know the correct way of running a Julia script (part of which requires parallelization) on a remote HPC cluster.
My PBS script looks like:

#!/bin/bash -l
#PBS -l walltime=01:00:00,nodes=1:ppn=24,mem=62gb
export JULIA_NUM_THREADS=24
julia test.jl

and test.jl looks like:

using Distributed
addprocs(24)

@everywhere function sqr(x)
    return x^2
end

y = pmap(x->sqr(x), 1:1e5)

I am using 1 node with 24 cores, as shown in the PBS script above.
My question is: do I need to declare export JULIA_NUM_THREADS=24 in the PBS script (since the default number of threads is 1, as reported by Threads.nthreads()), or is the specification ppn=24 sufficient?
I tried with and without export JULIA_NUM_THREADS=24 and didn't notice any difference in the execution time of my actual code, so I am not sure whether my code is getting properly parallelized.
Any suggestions on the correct way to specify threads and processes in the PBS and/or Julia script when using pmap for parallelization would be appreciated.

I am using Julia 1.5.0.

You’re using distributed memory parallelism, so you do not need threads. pmap will automatically run the function on free workers. However, for such a small function you won’t see any advantage from using more workers, as communication time will swamp the gains from parallel execution.
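To make the difference visible, here is a minimal sketch with a deliberately slowed-down function standing in for a genuinely expensive computation (slow_sqr and the sleep call are just illustrative stand-ins), plus a quick way to confirm that workers were actually added:

using Distributed
addprocs(4)                      # a few local workers, just for illustration

@everywhere function slow_sqr(x)
    sleep(0.1)                   # stand-in for an expensive computation
    return x^2
end

println("workers: ", nworkers()) # quick sanity check that processes were added
@time map(slow_sqr, 1:40)        # serial: roughly 40 * 0.1 s
@time pmap(slow_sqr, 1:40)       # parallel: roughly 40 * 0.1 s / 4 workers, plus overhead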

By the way, since you’re using PBS, consider looking into ClusterManagers.jl if you want to use more than one node.
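For reference, a minimal sketch of that route; the exact ClusterManagers.jl call and its keyword arguments may differ between versions and clusters, so treat addprocs_pbs here as an assumption and check the package README:

using Distributed, ClusterManagers

addprocs_pbs(48)     # assumed API: request 48 workers through the PBS queue

@everywhere heavy(x) = sum(abs2, rand(1_000)) + x   # placeholder for the real computation
y = pmap(heavy, 1:10_000)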


You might actually want to use 24 threads on each node, but then you’ll have to call some threading operations yourself…

For example, within your real function you might use @threads to parallelize the loop across threads.
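Something like this minimal sketch (sqr_threaded is just an illustrative name; the loop body is where your real computation would go):

using Base.Threads

function sqr_threaded(xs)
    out = zeros(length(xs))
    @threads for i in eachindex(xs)    # splits the loop over Threads.nthreads() threads
        out[i] = xs[i]^2
    end
    return out
end

y = sqr_threaded(1:100_000)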

Thank you @jishnub. Yes, for now I simply request 1 node and a certain number of cores on that node through the PBS script, and then add processes using addprocs in my Julia code. This seems to be working well so far. Also, yes, for this small function sqr parallelization will not help much, but my actual code has another function that does a large computation.

Regarding multiple nodes, I do plan to use more than one, but since I am not sure how pmap actually behaves when the cores are spread across different nodes, I have been avoiding it so far. As you said, I will probably look into ClusterManagers.jl for this.

Hi Daniel,

I guess what you have mentioned is a different method of parallelization using the @threads macro. This falls under multi-threading rather than distributed computing as far as I know (please correct me if I am wrong). And to use the @threads macro, I do need to specify JULIA_NUM_THREADS in my PBS script before launching Julia, as pointed out by @jishnub. Right?
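As a quick check on my end, I suppose I can print the thread count at the top of test.jl to confirm that the export in the PBS script actually took effect:

@show Threads.nthreads()   # should report 24 if JULIA_NUM_THREADS=24 was exported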

Additionally, I have tried using the @threads macro but for some reason couldn’t get parallelization as good as with pmap, so I have been sticking to pmap for now.
Is multi-threading preferred over distributed processing (or the other way round) in certain scenarios?

pmap will map your problem across multiple machines. Within each machine you can then map your problem across multiple threads. They complement each other.

Okay. I get what you’re saying, but I am not sure how I would achieve this in actual code. For example, say I requested 2 nodes with 24 cores each, set 48 threads using JULIA_NUM_THREADS=48, and added some processes using addprocs. How would I use pmap and @threads together? Should I just write them both in front of the function that needs to be parallelized? Can you please give a simple example of doing so?

First, since your nodes have 24 CPUs each, you should set JULIA_NUM_THREADS close to 24 per node (not 48), at most maybe 26, as sometimes a little oversubscription can help.

Then, how to use them together?

pmap will map across machines, so write the function that you pass to pmap so that it uses @threads internally; then each machine will use multiple threads.
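A rough sketch of how the two levels fit together (not tested on a real cluster; the node discovery from $PBS_NODEFILE, the --threads=24 worker flag, and expensive_kernel are assumptions you would adapt to your setup):

using Distributed

# One worker per node listed in $PBS_NODEFILE, each started with 24 threads
# (the --threads/-t option exists from Julia 1.5 onwards).
nodes = unique(readlines(ENV["PBS_NODEFILE"]))
addprocs([(n, 1) for n in nodes]; exeflags=`--threads=24`)

@everywhere using Base.Threads
@everywhere expensive_kernel(x) = sum(sin, 1:10_000) * x   # placeholder for the real work

# Each chunk goes to one worker (one node); @threads spreads it over that node's threads.
@everywhere function work_on_chunk(chunk)
    out = zeros(length(chunk))
    @threads for i in eachindex(chunk)
        out[i] = expensive_kernel(chunk[i])
    end
    return out
end

chunks = [i:min(i + 24_999, 100_000) for i in 1:25_000:100_000]
results = pmap(work_on_chunk, chunks)   # pmap across nodes, @threads within each node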

Okay. I will probably try this out on some of my code and see how it works. Thank you!