Hello everyone. As the title of this post says, I have been trying to parallelize a for loop in one of the programs I'm developing for my PhD thesis. The code was originally written in Python, but at some point I decided to rewrite it in Julia. Everything works OK except for one thing: the Julia version of my code is a lot slower than the parallelized Python version, so I imagine I'm doing something really wrong. To show the problem in detail, I have prepared a minimal working example.
Since I have to use the HPC facility of my university (not a lot of degrees of freedom for me, I'm afraid), I need to use a PBS script like this one:
#PBS -N test_julia
#PBS -S /bin/bash
#PBS -j oe
#PBS -l nodes=1:ppn=4
#PBS -l walltime=1:00:00
# Cluster related stuff that I can't change
export CurrDir=$PBS_O_WORKDIR
source settmpdir
export LANG LC_ALL
export MODEL_NAME=JuliaFit
export NPROCS=`cat $PBS_NODEFILE | wc -l`
export NHOSTS=`cat $PBS_NODEFILE | uniq | wc -l`
cd $CurrDir || exit 2
rsync -avP * $TMPDIR/
# Run calculation
julia -p $NPROCS jpara.jl
rsync -avP * $CurrDir/
# Clean up
cd $CurrDir && rm -rf $TMPDIR
This script submits the job that runs the minimal working example, jpara.jl:
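#!/usr/bin/env julia
println("Test calculation using $(nworkers()) processors")
# Parallel version: the iterations are distributed over the workers
function fpar()
    @sync @distributed for i = 1:10
        ext(i)
    end
end
# Serial version: the same loop on the master process
function fnopar()
    for i = 1:10
        ext(i)
    end
end
# Runs MOPAC on the i-th input file; defined on every process
@everywhere function ext(i::Int64)
    callmop = `/home/ramon/bin/MOPACMINE/MOPAC2016.exe ./inp_semp/geo_$(i).mop`
    run(callmop)
end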
val, t_par, bytes, gctime, memallocs = @timed fpar()
val, t_nopar, bytes, gctime, memallocs = @timed fnopar()
println("That took $(t_par) seconds for the parallel execution, and $(t_nopar) for the serial one")
In the example above, MOPAC2016.exe is a computational chemistry package that takes input files of the form ./inp_semp/geo_$(i).mop. As a result of this test, I got this:
Test calculation using 4 processors
That took 2.829207343 seconds for the parallel execution, and 1.232857263 for the serial one
So the parallel execution was slower! At this point, I have no idea what I'm doing wrong. Would you be so kind as to give me a hint? Thank you very much in advance.
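One possible factor is that a single timed call in Julia also measures compilation of the function, and for the distributed loop the cost of sending the loop body to the workers. A minimal sketch of how the two could be separated is below. The names fake_ext, fpar_demo and fnopar_demo are hypothetical, sleep(0.1) is only a stand-in for the MOPAC call, and the two local workers are an assumption replacing the cores granted by PBS.

```julia
using Distributed

# Assumption: two local workers stand in for the cores granted by PBS.
nprocs() == 1 && addprocs(2)

# Hypothetical stand-in for ext(i): sleeps instead of launching MOPAC.
@everywhere function fake_ext(i::Int)
    sleep(0.1)
    return i
end

# Same loop structure as fpar(), with the number of iterations as an argument.
function fpar_demo(n)
    @sync @distributed for i = 1:n
        fake_ext(i)
    end
end

# Same loop structure as fnopar().
function fnopar_demo(n)
    for i = 1:n
        fake_ext(i)
    end
end

# Warm-up calls: the first run of each function pays for compilation.
# Using nworkers() iterations gives every worker a chunk to compile.
fpar_demo(nworkers())
fnopar_demo(1)

# Timed calls: these should now mostly reflect the work itself.
t_par_demo = @elapsed fpar_demo(10)
t_nopar_demo = @elapsed fnopar_demo(10)
println("parallel: $(t_par_demo) s, serial: $(t_nopar_demo) s")
```

If the gap between the first and second call is large, the numbers above would be dominated by start-up overhead rather than by the loop. This does not by itself explain the difference with Python.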
Edit: I have also profiled this in Python, using a similar PBS script to launch the job and the multiprocessing module (I can show the code if needed). For the same operations, i.e. calling a function including the loop, it took 0.161532