Julia Distributed, redundant iterations appearing

I ran

mpiexec -n $nprocs julia --project myfile.jl 

on a cluster, where myfile.jl has the following form

using Distributed; using Dates; using JLD2; using LaTeXStrings
@everywhere begin
    using SharedArrays; using QuantumOptics; using LinearAlgebra; using Plots; using Statistics; using DifferentialEquations; using StaticArrays
    # defining some other functions and SharedArrays to be used later, e.g.
    MySharedArray = SharedArray{SVector{Nt,Float64}}(Np, Np)
end
@sync @distributed for pp in 1:Np^2
    for jj in 1:Nj
        # do some stuff with local variables
        for tt in 1:Nt
            # do some stuff with local variables
        end
    end
    MySharedArray[pp] = ... # using linear indexing
    println("$pp finished")
end

timestr=Dates.format(Dates.now(), "yyyy-mm-dd-HH:MM:SS")
filename="MyName"*timestr

@save filename*".jld2"

#later on, some other small stuff like making and saving a figure. (This does give an error, "no method matching heatmap_edges(::Surface{Array{Float64,2}}, ::Symbol)", but I think that is a technicality in Plots, so not very related to the bigger issue here.)

However, when looking at the output, there are a few issues that make me conclude that something is wrong:

  • The “$pp finished” output is repeated many times for each value of pp; the number of repetitions appears to equal 32, i.e. $nprocs.
  • Even though the code has not finished, “MyName” files are already being generated. There should be one, but I get a dozen of them, each with a different timestr component.

Two more things that I can add:

  • Earlier I wrote that the walltime was exceeded, but this turns out not to be true. The .o file ends with “BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES … EXIT CODE: 9”, shortly after the last output file.
  • The output of the different “MyName” files is not identical, but this is expected since random numbers are used in the inner loops. There are 28 of them, a number I don’t immediately recognize, except that it is again close to the 32 of $nprocs.

$nprocs is obtained in the pbs script through

#PBS -l select=1:ncpus=32:mpiprocs=32
nprocs=$(cat $PBS_NODEFILE | wc -l)

(also posted as https://stackoverflow.com/q/66739636/2283226)

I just thought of a possible explanation.

I thought that all variables used in the loops had to be declared under the @everywhere macro, so that’s what I did, including the initialization of the SharedArrays.

In this example (differential equations - writing into shared arrays within a distributed for loop in JULIA - Stack Overflow), they only declare functions and packages with @everywhere instead.

Could this be my problem?
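(For reference, a minimal sketch of that pattern, with Np and the loop body as placeholders: the SharedArray is created once on the master, while only packages and functions go under @everywhere.)

```julia
using Distributed
addprocs(4)                     # local workers; they share memory with the master

@everywhere using SharedArrays  # packages/functions are what need @everywhere

# Create the SharedArray ONCE, on the master; all workers see the same memory.
Np = 8
MySharedArray = SharedArray{Float64}(Np, Np)

@sync @distributed for pp in 1:Np^2
    MySharedArray[pp] = rand()  # placeholder for the real per-element work
end
```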

You shouldn’t be mixing mpiexec and Distributed. As your script is written now, every MPI rank starts its own Julia process and executes every iteration of your for loop serially, which explains the repeated “$pp finished” lines and the multiple output files. Instead (assuming that you want to use Distributed), you need to invoke your script like normal and call addprocs to tell Julia about the other cores/nodes.
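For a single 32-core node, a sketch of what that could look like at the top of myfile.jl (the 32 here mirrors the ncpus in your PBS line; launch with plain `julia --project myfile.jl`, no mpiexec):

```julia
using Distributed
addprocs(32)                 # one worker process per core on the node

@everywhere begin
    using SharedArrays       # code the workers need goes here
end

println(nprocs())            # 33: the master process plus 32 workers
```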

ClusterManagers.jl might be useful.
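For a PBS cluster specifically, something along these lines may work; `addprocs_pbs` comes from ClusterManagers.jl, and the worker count is a placeholder:

```julia
using Distributed, ClusterManagers

# Ask the PBS scheduler for the workers, instead of launching Julia under mpiexec.
addprocs_pbs(32)
```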


Hi, thanks for your answer!

I’m only using cores on a single node; does that make any difference? I remember trying to start Julia with the -p flag, as would work on a desktop, and it gave some errors that I don’t remember.
I will give this one a try.

First starting Julia with a single process and then adding more workers from within the session worked. Thanks!
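(For anyone finding this later, a minimal sketch of that workflow from inside a single-process session; with a proper worker pool, each iteration runs exactly once:)

```julia
using Distributed
addprocs(3)          # grow the worker pool from within the session

@sync @distributed for pp in 1:4
    println("$pp finished on worker $(myid())")   # each pp appears once now
end
```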