Distributed for loop slower than serial?

Hello everyone. As the title of this post says, I have been trying to parallelize a for loop in one of the programs I’m developing for my PhD thesis. The code was originally written in Python, but at some point I decided to rewrite it in Julia. Everything works OK except for one thing: the Julia version of my code is a lot slower than the parallelized Python version, so I imagine that I’m doing something really wrong. To show you in detail, I have prepared a minimal working example of my problem.

Since I have to use the HPC facility of my university (not a lot of degrees of freedom for me I’m afraid) I need to use a PBS script like this one:

#!/bin/bash
#PBS -N test_julia
#PBS -S /bin/bash
#PBS -j oe
#PBS -l nodes=1:ppn=4
#PBS -l walltime=1:00:00

# Cluster related stuff that I can't change

export CurrDir=$PBS_O_WORKDIR
source settmpdir

LANG=C
LC_ALL=C
export LANG LC_ALL

export MODEL_NAME=JuliaFit
export NPROCS=`cat $PBS_NODEFILE | wc -l`
export NHOSTS=`cat $PBS_NODEFILE | uniq | wc -l`

cd $CurrDir || exit 2
rsync -avP * $TMPDIR/

# Run calculation

cd $TMPDIR

julia -p $NPROCS jpara.jl

rsync -avP * $CurrDir/

# Clean up

cd $CurrDir && rm -rf $TMPDIR

That script submits the job, which runs this minimal working example:

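#!/usr/bin/env julia

# Explicit import; `julia -p` loads Distributed implicitly, but this keeps the script self-contained
using Distributed

println("Test calculation using $(nworkers()) processors")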

function fpar()
   @sync @distributed for i = 1:10
      ext(i)
   end
end

function fnopar()
   for i = 1:10
      ext(i)
   end
end

@everywhere function ext(i::Int64)
   callmop = `/home/ramon/bin/MOPACMINE/MOPAC2016.exe ./inp_semp/geo_$(i).mop`
   run(callmop)
end

val, t_par, bytes, gctime, memallocs = @timed fpar()
val, t_nopar, bytes, gctime, memallocs = @timed fnopar()

println("That took $(t_par) seconds for the parallel execution, and $(t_nopar) for the serial one")

In the example above, MOPAC2016.exe is a computational chemistry package that takes input files of the form ./inp_semp/geo_$(i).mop. As a result of this test, I got this:

Test calculation using 4 processors
That took 2.829207343 seconds for the parallel execution, and 1.232857263 for the serial one

So the parallel execution was slower! At this point, I don’t have any idea what I’m doing wrong. Would you be so kind as to give me a hint? Thank you very much in advance.

Edit: I have also profiled the Python version, using a similar PBS script to launch the job and the multiprocessing module (I can show the code if needed). For the same operation, i.e. calling a function that contains the loop, it took 0.161532 seconds.

You are running MOPAC2016.exe on a series of different input files?
I would advise using a PBS array job here https://arc-ts.umich.edu/software/torque/job-arrays/
Don’t use Julia for this.

So simply submit with qsub -t 1-100 (Torque syntax).
In the script:
/home/ramon/bin/MOPACMINE/MOPAC2016.exe ./inp_semp/geo_${PBS_ARRAYID}.mop

One other small point: you are reading the executable file from your home directory. It really is site dependent, but if your home directory is on an NFS share this might be inefficient. Ask your systems guys whether there is fast or scratch storage that should be used instead.

Thanks for your reply @johnh

The thing is that after this for loop, which actually runs MOPAC2016.exe on a series of different input files, I parse all the output files and then compute an RMS value. The whole process is wrapped in a function that I optimize using NLopt. Since the input files get modified in each iteration of the optimization, I don’t see how to apply the solution you proposed. Thanks for the second piece of advice, I will check it.
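To make that workflow concrete, here is a rough sketch of the shape of such an objective function. It is only an illustration under assumptions: `computed_value` is a hypothetical stand-in for "run MOPAC on input i and parse one number from its output", `reference` is made-up data, and `pmap` is used here simply because it returns the per-input results in order. The NLopt part is left out.

```julia
using Distributed
using Statistics: mean

# Hypothetical stand-in for running MOPAC on input i and parsing its output
@everywhere computed_value(i::Int) = 2.0 * i

# Made-up reference values, one per input file
reference = [2.0 * i + 0.5 for i in 1:10]

function rms_objective(reference)
    # pmap distributes the indices over the available processes
    # and returns the results in the original order
    computed = pmap(computed_value, 1:length(reference))
    # Root mean square of the deviations from the reference
    return sqrt(mean(abs2, computed .- reference))
end

println(rms_objective(reference))
```

In a real run the body of `computed_value` would call the external program and read its output file, and `rms_objective` would be what the optimizer evaluates at each iteration.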

Aha. You are doing something which I think is termed ‘computational steering’.
My bad for not realising there was further processing after those mopac runs.

The home directory comment was simply an aside. When a code is launched on many compute nodes at the same time, all of them reading the executables and libraries at once can be a bottleneck.

@Panadestein Please run that loop a SECOND time in the same job script.
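The reason for asking: the first call of a Julia function includes its compilation, on the master and on each worker, so a single timing of fpar() may be measuring mostly that. A minimal sketch of what I mean, assuming a hypothetical stand-in `work` in place of your `ext` (it just sleeps instead of calling MOPAC):

```julia
using Distributed

# Hypothetical stand-in for the external MOPAC call
@everywhere work(i::Int) = (sleep(0.05); i)

function fpar(n)
    @sync @distributed for i = 1:n
        work(i)
    end
end

# First call: the measured time includes compilation
t_first = @elapsed fpar(10)
# Second call: compiled code is reused, so this is closer to the real run time
t_second = @elapsed fpar(10)

println("first call: $(t_first) s, second call: $(t_second) s")
```

If the second number is much smaller than the first, the comparison in your test was dominated by one-off startup cost rather than by the loop itself.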

Even better, use BenchmarkTools.jl https://github.com/JuliaCI/BenchmarkTools.jl
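A short sketch of how that might look, assuming BenchmarkTools is installed and again using a hypothetical stand-in `work` instead of the real MOPAC call. `@belapsed` runs the expression several times and returns the minimum time in seconds; the `samples` and `evals` values here are arbitrary choices to keep the run short.

```julia
using Distributed
using BenchmarkTools  # third-party package, assumed to be installed

# Hypothetical stand-in for the external MOPAC call
@everywhere work(i::Int) = (sleep(0.01); i)

function fpar(n)
    @sync @distributed for i = 1:n
        work(i)
    end
end

# Minimum elapsed time over a few samples, one evaluation per sample
t_min = @belapsed fpar(10) samples=5 evals=1

println("minimum over samples: $(t_min) s")
```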