Distributed for loop slower than serial?

Panadestein · August 19, 2018, 6:02pm

Hello everyone. As the title says, I have been trying to parallelize a for loop that is part of one of the programs I’m developing for my PhD thesis. The code was originally written in Python, but at some point I decided to rewrite it in Julia. Everything works OK except for one thing: the Julia version of my code is a lot slower than the parallelized Python version, so I imagine I’m doing something really wrong. To show you in detail, I have prepared a minimal working example of my problem.

Since I have to use the HPC facility of my university (not a lot of degrees of freedom for me, I’m afraid), I need to use a PBS script like this one:

#!/bin/bash
#PBS -N test_julia
#PBS -S /bin/bash
#PBS -j oe
#PBS -l nodes=1:ppn=4
#PBS -l walltime=1:00:00

# Cluster related stuff that I can't change

export CurrDir=$PBS_O_WORKDIR
source settmpdir

LANG=C
LC_ALL=C
export LANG LC_ALL

export MODEL_NAME=JuliaFit
export NPROCS=`cat $PBS_NODEFILE | wc -l`
export NHOSTS=`cat $PBS_NODEFILE | uniq | wc -l`

cd $CurrDir || exit 2
rsync -avP * $TMPDIR/

# Run calculation

cd $TMPDIR

julia -p $NPROCS jpara.jl

rsync -avP * $CurrDir/

# Clean up

cd $CurrDir && rm -rf $TMPDIR

The job runs this minimal working example (jpara.jl):

#!/usr/bin/env julia

println("Test calculation using $(nworkers()) processors")

function fpar()
   @sync @distributed for i = 1:10
      ext(i)
   end
end

function fnopar()
   for i = 1:10
      ext(i)
   end
end

@everywhere function ext(i::Int64)
   callmop = `/home/ramon/bin/MOPACMINE/MOPAC2016.exe ./inp_semp/geo_$(i).mop`
   run(callmop)
end

val, t_par, bytes, gctime, memallocs = @timed fpar()
val, t_nopar, bytes, gctime, memallocs = @timed fnopar()

println("That took $(t_par) seconds for the parallel execution, and $(t_nopar) for the serial one")

In the example above, MOPAC2016.exe is a computational chemistry package that takes input files of the form ./inp_semp/geo_$(i).mop. This test gave me the following output:

Test calculation using 4 processors
That took 2.829207343 seconds for the parallel execution, and 1.232857263 for the serial one

So the parallel execution was slower! At this point, I don’t have any idea what I’m doing wrong. Would you be so kind as to give me a hint? Thank you very much in advance.

Edit: I have also profiled in Python, using a similar PBS script to launch the job, and the multiprocessing module (I can show the code if needed). For the same operations, i.e. calling a function including the loop, it took 0.161532 seconds.

johnh · August 19, 2018, 7:06pm

You are running MOPAC2016.exe on a series of different input files?
I would advise using a PBS array job here: https://arc-ts.umich.edu/software/torque/job-arrays/
Don’t use Julia for this.

So simply: qsub -t 1-100
In the script:
/home/ramon/bin/MOPACMINE/MOPAC2016.exe ./inp_semp/geo_${PBS_ARRAYID}.mop

One other small point - you are reading the executable file from your home directory. It really is site dependent; however, if your home directory is on an NFS share, this might be inefficient. Ask your systems guys if there is fast or scratch storage which should be used.

Panadestein · August 19, 2018, 7:32pm

Thanks for your reply @johnh

The thing is that after this for loop, which actually runs MOPAC2016.exe on a series of different input files, I parse all the output files obtained and then compute an RMS value. The whole process is included in a function that I optimize using NLopt. Since the input files get modified in each iteration of the optimization, I don’t see how to apply the solution you proposed. Thanks for the second piece of advice, I will check it.
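To make that structure concrete, here is a rough sketch of the shape of such an objective function. It is only an illustration under assumptions: `run_and_parse`, `rms`, and `objective` are made-up names, the body of `run_and_parse` is a cheap stand-in for writing the input file, calling MOPAC, and parsing its output, and `pmap` is swapped in for the `@distributed` loop because it returns the per-file results directly.

```julia
using Distributed

# Hypothetical stand-in: the real version would write the input file for
# geometry i from `params`, run MOPAC on it, and parse one number from the output.
@everywhere run_and_parse(i::Int, params::Vector{Float64}) = params[1] * i + params[2]

# Root-mean-square deviation between computed and reference values.
rms(computed, reference) = sqrt(sum(abs2, computed .- reference) / length(reference))

# Objective seen by the optimizer: one batch of independent runs, then one RMS value.
# The batch has to be repeated at every iteration because `params` changes.
function objective(params::Vector{Float64}, reference::Vector{Float64})
    computed = pmap(i -> run_and_parse(i, params), 1:length(reference))
    return rms(computed, reference)
end

reference = [3.0, 5.0, 7.0]
# With this stand-in model, params = [2.0, 1.0] reproduces the reference exactly.
println(objective([2.0, 1.0], reference))
```

The point is only that the parallel loop sits inside a function that the optimizer calls many times, so it cannot be submitted once as a fixed batch of jobs.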

johnh · August 19, 2018, 7:58pm

Aha. You are doing something which I think is termed ‘computational steering’.
My bad for not realising there was further processing after those MOPAC runs.

The home directory comment was simply an aside. When running a code on many compute nodes in parallel, if every node launches the code at the same moment, then reading the executables and libraries all at once can be a bottleneck.

johnh · August 20, 2018, 7:28am

@Panadestein Please run that loop a SECOND time in the same job script.
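The reason for asking: the first call of a Julia function includes JIT compilation, and for a `@distributed` loop it may also include one-time setup on the workers, so a single `@timed` call can overstate the cost. A minimal sketch of timing each version twice, assuming a pure-Julia stand-in `work` in place of the real `ext` (the name and its body are hypothetical):

```julia
using Distributed

# Hypothetical stand-in for ext(i); swap in the real MOPAC call.
@everywhere work(i::Int) = sum(sqrt, 1:(i * 10_000))

function fpar(n)
    @sync @distributed for i = 1:n
        work(i)
    end
end

function fnopar(n)
    for i = 1:n
        work(i)
    end
end

# First calls: these timings include compilation of fpar, fnopar and work.
t_par_first   = @elapsed fpar(10)
t_nopar_first = @elapsed fnopar(10)

# Second calls: the compiled code is reused, so these are closer to the steady-state cost.
t_par_second   = @elapsed fpar(10)
t_nopar_second = @elapsed fnopar(10)

println("parallel: first $(t_par_first) s, second $(t_par_second) s")
println("serial:   first $(t_nopar_first) s, second $(t_nopar_second) s")
```

If the second parallel timing drops well below the first, most of the reported time was probably startup overhead rather than the loop itself.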

Even better, use BenchmarkTools.jl (JuliaCI/BenchmarkTools.jl on GitHub, a benchmarking framework for the Julia language).
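A minimal sketch of what that could look like, assuming BenchmarkTools is installed and using a hypothetical pure-Julia stand-in `work` for the loop body:

```julia
using BenchmarkTools

# Hypothetical stand-in for the loop body; swap in the real work.
work(i::Int) = sum(sqrt, 1:(i * 10_000))

function fnopar(n)
    total = 0.0
    for i = 1:n
        total += work(i)
    end
    return total
end

# @belapsed runs the expression repeatedly and returns the minimum time in seconds,
# so the one-time compilation cost does not end up in the reported number.
t_serial = @belapsed fnopar(10)
println("serial loop: $(t_serial) s (minimum over samples)")
```

For a loop that launches an external program, the default number of repetitions may be too many; passing something like `samples=3 evals=1` after the expression limits how often it runs.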
