I am preparing my code to run on a cluster using SLURM, exploiting multiple cores via the Distributed package. The setup is a bit involved, and I haven't been able to reproduce the error in an MWE.
The goal is to run a number of simulations with `simulate`, each from a different initial state (randomly sampled in this part of the project). I then want to stack the DataFrames the simulations produce into a single one (hence the `(vcat)` reducer and the assignment `df = ...`).
This fails with an error saying that `randomInitialState()` is not defined on worker 2.
However, if I run the for loop inline (in Juno), the code runs as expected.
Also, if I remove `@sync`, the `(vcat)` reducer, and the assignment of the result to `df`, the code runs.
I find this behavior surprising, and I haven't found anyone else whose code breaks when adding `@sync`. I wonder if anyone has suggestions as to what might be going on.
A sketch of the program:

```julia
pids = addprocs(2)

@everywhere begin
    using DataFrames
    include("loadsDataAndCreates_randomPath_below.jl")
    parameters = 1.0
end

df = @sync @distributed (vcat) for i in eachindex(1:num_simulations)
    state0_i = randomInitialState()
    df_i = simulate(state0_i, parameters)  # This returns a DataFrame
end

rmprocs(pids)
```
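For comparison, here is a sketch of the variant that does run, with `@sync`, the `(vcat)` reducer, and the assignment to `df` removed as described above. Note that in this form the per-simulation DataFrames are never collected, and `@distributed` without a reducer returns immediately without waiting for the workers:

```julia
pids = addprocs(2)

@everywhere begin
    using DataFrames
    include("loadsDataAndCreates_randomPath_below.jl")
    parameters = 1.0
end

# Runs without the UndefVarError, but discards each df_i
@distributed for i in eachindex(1:num_simulations)
    state0_i = randomInitialState()
    df_i = simulate(state0_i, parameters)
end

rmprocs(pids)
```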
The error is long, but starts with

```
nested task error: On worker 2:
UndefVarError: #randomInitialState#229 not defined
```
I am at a loss as to how to even Google how serialization (a new topic for me) might be interfering with my code.