Problem with using Distributed for a Monte Carlo simulation

I’m using the Distributed package to parallelize a for loop in a Monte Carlo simulation. The loop runs over several initial conditions and writes a CSV file of L/2 × N entries, where L is the size of the system I’m considering and N is the number of different initial conditions.
For L = 64 and N = 40 the code works perfectly well. However, as soon as I increase L (for example to L = 128), each worker throws an error of the form “Worker # terminated. Unhandled Task ERROR: EOFError: read end of file” or “Worker # terminated. Unhandled Task ERROR: IOError: read: connection reset by peer (ECONNRESET)”, until finally “LoadError: TaskFailedException nested task error: ProcessExitedException(2)” is displayed.

I really have no idea what could be going wrong. It looks like a memory overflow, but in principle the cluster nodes I’m working on should handle the quantities I’m using (on average they have 32 GB of RAM and CPU clock speeds of 2200–4500 MHz).
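Schematically, the loop looks like the sketch below (run_single_trajectory, the pmap call, and the file name are placeholders rather than my actual code):

using Distributed
addprocs(4)                        # spawn worker processes

@everywhere using DelimitedFiles

# Placeholder per-trajectory routine: returns a column of length L ÷ 2
@everywhere function run_single_trajectory(L, seed)
    # ... Monte Carlo sweeps over a system of size L ...
    return rand(L ÷ 2)
end

L, N = 128, 40
results = pmap(seed -> run_single_trajectory(L, seed), 1:N)
writedlm("results.csv", hcat(results...), ',')   # L/2 × N entries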


In my experience, those kinds of error messages (without an explicit segfault stacktrace from the dead worker process) are indeed due to OOM.
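If you want to confirm it, a quick sketch (adapt it to your setup) is to ask each worker how much physical memory its node has left while the job runs:

using Distributed

# Report free vs. total physical memory (in GiB) as seen by each worker.
for p in workers()
    free, total = remotecall_fetch(p) do
        Sys.free_memory() / 2^30, Sys.total_memory() / 2^30
    end
    @info "worker $p" free_GiB = round(free, digits=2) total_GiB = round(total, digits=2)
end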

In the end it seems that an array of ones is consuming a lot of memory: it is a 1048576 × 1048576 array of ones, which is huge, yet every entry in it is just a 1. Is there a way to save memory with an alternative to storing this array explicitly?
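For scale, a dense 1048576 × 1048576 Float64 array would by itself take about 8 TiB, far more than the 32 GB available per node:

julia> 1048576 * 1048576 * 8 / 2^40   # 8 bytes per Float64 entry, result in TiB
8.0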

yeah, why do you have a giant array of 1s?

I think FillArrays.jl does this?

However, when I use Ones (from FillArrays), I get that the maximum size it can have is 100:

using FillArrays
a = Ones(1024, 1024)
size(a, 1)   # outputs 100

Hmmm, I don’t see the same:

julia> using FillArrays

julia> a = Ones(1024, 1024)
1024×1024 Ones{Float64}

julia> a[1]
1.0

julia> a[1, 1]
1.0

julia> a[1, :]
1024-element Ones{Float64}

julia> size(a, 1)
1024
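
As a rough illustration of the savings (a sketch; exact byte counts vary by Julia version and machine), a lazy Ones only stores its axes, so its footprint stays tiny no matter how large the nominal dimensions are:

using FillArrays

# Lazy: only the dimensions are stored, a few dozen bytes at most.
Base.summarysize(Ones(1048576, 1048576))

# Dense: ones(1048576, 1048576) would need 1048576^2 * 8 bytes ≈ 8 TiB,
# so it cannot even be allocated; at a feasible size the cost is already
# the full 8 bytes per entry:
Base.summarysize(ones(1024, 1024))   # ≈ 8 MiB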