Julia Distributed, redundant iterations appearing

I ran

mpiexec -n $nprocs julia --project myfile.jl 

on a cluster, where myfile.jl has the following form

using Distributed; using Dates; using JLD2; using LaTeXStrings
@everywhere begin
    using SharedArrays; using QuantumOptics; using LinearAlgebra; using Plots; using Statistics; using DifferentialEquations; using StaticArrays
    # defining some other functions and SharedArrays to be used later, e.g.
    MySharedArray = SharedArray{SVector{Nt,Float64}}(Np, Np)
end
@sync @distributed for pp in 1:Np^2
    for jj in 1:Nj
        # do some stuff with local variables
        for tt in 1:Nt
            # do some stuff with local variables
        end
    end
    MySharedArray[pp] = ... # using linear indexing
    println("$pp finished")
end

timestr=Dates.format(Dates.now(), "yyyy-mm-dd-HH:MM:SS")
filename="MyName"*timestr

@save filename*".jld2"

#later on, some other small stuff like making and saving a figure. (This does give an error, "no method matching heatmap_edges(::Surface{Array{Float64,2}}, ::Symbol)", but I think that is a technicality in Plots, so not very related to the bigger issue here.)

However, when looking at the output, there are a few issues that make me conclude that something is wrong:

  • The “$pp finished” output is repeated many times for each value of pp; the number of repetitions appears to equal 32, i.e. $nprocs.
  • Even though the code has not finished, “MyName” files are already being generated. There should be one, but I get a dozen of them, each with a different timestr component.

Two more things that I can add:

  • Earlier I wrote that the walltime was exceeded, but this turns out not to be true. The .o file ends with “BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES … EXIT CODE: 9”, shortly after the last output file.
  • The output of the different “MyName” files is not identical, but this is expected since random numbers are used in the inner loops. There are 28 of them, a number I don’t immediately recognize, except that it is again close to the 32 of $nprocs.

$nprocs is obtained in the pbs script through

#PBS -l select=1:ncpus=32:mpiprocs=32
nprocs=$(cat $PBS_NODEFILE | wc -l)

(also posted as https://stackoverflow.com/q/66739636/2283226)

I just thought of a possible explanation.

I thought that all variables used in the loops had to be declared under the @everywhere macro, so that’s what I did, including the initialization of the SharedArrays.

In this example (differential equations - writing into shared arrays within a distributed for loop in JULIA - Stack Overflow), they only declare functions and packages with @everywhere instead.

Could this be my problem?
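(For reference, a minimal sketch of that pattern, with Np and the loop body as placeholders: the SharedArray is created once on the master, while only packages and functions go under @everywhere.)

```julia
using Distributed
addprocs(4)                     # local workers; they share memory with the master

@everywhere using SharedArrays  # packages/functions are what need @everywhere

# Create the SharedArray ONCE, on the master; all workers see the same memory.
Np = 8
MySharedArray = SharedArray{Float64}(Np, Np)

@sync @distributed for pp in 1:Np^2
    MySharedArray[pp] = rand()  # placeholder for the real per-element work
end
```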

You shouldn’t be mixing mpiexec and Distributed. As your script is written now, every MPI rank starts its own Julia process and executes every iteration of your for loop serially, which explains the repeated “$pp finished” lines and the multiple output files. Instead (assuming that you want to use Distributed), you need to invoke your script like normal and call addprocs to tell Julia about the other cores/nodes.
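For a single 32-core node, a sketch of what that could look like at the top of myfile.jl (the 32 here mirrors the ncpus in your PBS line; launch with plain `julia --project myfile.jl`, no mpiexec):

```julia
using Distributed
addprocs(32)                 # one worker process per core on the node

@everywhere begin
    using SharedArrays       # code the workers need goes here
end

println(nprocs())            # 33: the master process plus 32 workers
```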

ClusterManagers.jl might be useful.
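For a PBS cluster specifically, something along these lines may work; `addprocs_pbs` comes from ClusterManagers.jl, and the worker count is a placeholder:

```julia
using Distributed, ClusterManagers

# Ask the PBS scheduler for the workers, instead of launching Julia under mpiexec.
addprocs_pbs(32)
```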


Hi, thanks for your answer!

I’m only using cores on a single node; does that make any difference? I remember trying to start Julia with the -p flag, as would work on a desktop, and it gave some errors that I don’t remember.
I will give this one a try.

First starting Julia with a single process and then adding more workers from within the session worked. Thanks!
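(For anyone finding this later, a minimal sketch of that workflow from inside a single-process session; with a proper worker pool, each iteration runs exactly once:)

```julia
using Distributed
addprocs(3)          # grow the worker pool from within the session

@sync @distributed for pp in 1:4
    println("$pp finished on worker $(myid())")   # each pp appears once now
end
```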