Problem with using Distributed for a Monte Carlo simulation

I’m using the Distributed package to parallelize a for loop in a Monte Carlo simulation. The loop runs over several initial conditions and writes a CSV file of L/2 × N entries, where L is the size of the system I’m considering and N is the number of different initial conditions.
For L = 64 and N = 40 the code works perfectly well. However, as soon as I increase L (for example to L = 128), each worker throws an error of the form “Worker # terminated. Unhandled Task ERROR: EOFError: read end of file” or “Worker # terminated. Unhandled Task ERROR: IOError: read: connection reset by peer (ECONNRESET)”, until finally “LoadError: TaskFailedException nested task error: ProcessExitedException(2)” is displayed.

I really have no idea what could be going wrong. It looks like a memory overflow, but in principle the cluster nodes I’m working on should handle the quantities I’m using (on average they have 32 GB of RAM and CPU clock speeds of 2200–4500 MHz).
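Schematically, the loop looks like the sketch below (run_single_trajectory, the pmap call, and the file name are placeholders rather than my actual code):

using Distributed
addprocs(4)                        # spawn worker processes

@everywhere using DelimitedFiles

# Placeholder per-trajectory routine: returns a column of length L ÷ 2
@everywhere function run_single_trajectory(L, seed)
    # ... Monte Carlo sweeps over a system of size L ...
    return rand(L ÷ 2)
end

L, N = 128, 40
results = pmap(seed -> run_single_trajectory(L, seed), 1:N)
writedlm("results.csv", hcat(results...), ',')   # L/2 × N entries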


In my experience, those kinds of error messages (without an explicit segfault stacktrace from the dead worker process) are indeed due to OOM.
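If you want to confirm it, a quick sketch (adapt it to your setup) is to ask each worker how much physical memory its node has left while the job runs:

using Distributed

# Report free vs. total physical memory (in GiB) as seen by each worker.
for p in workers()
    free, total = remotecall_fetch(p) do
        Sys.free_memory() / 2^30, Sys.total_memory() / 2^30
    end
    @info "worker $p" free_GiB = round(free, digits=2) total_GiB = round(total, digits=2)
end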

In the end it seems that an array of ones is consuming a lot of memory: it is a 1048576 × 1048576 array of ones, which is huge, yet every entry in it is just a 1. Is there a way to save memory with an alternative to storing this array explicitly?
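For scale, a dense 1048576 × 1048576 Float64 array would by itself take about 8 TiB, far more than the 32 GB available per node:

julia> 1048576 * 1048576 * 8 / 2^40   # 8 bytes per Float64 entry, result in TiB
8.0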

yeah, why do you have a giant array of 1s?

I think FillArrays.jl does this?

However, when I use Ones (from FillArrays), I get that the maximum size it can have is 100:

using FillArrays
a = Ones(1024, 1024)
size(a, 1)   # outputs 100

Hmmm, I don’t see the same:

julia> using FillArrays

julia> a = Ones(1024, 1024)
1024×1024 Ones{Float64}

julia> a[1]
1.0

julia> a[1, 1]
1.0

julia> a[1, :]
1024-element Ones{Float64}

julia> size(a, 1)
1024
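
As a rough illustration of the savings (a sketch; exact byte counts vary by Julia version and machine), a lazy Ones only stores its axes, so its footprint stays tiny no matter how large the nominal dimensions are:

using FillArrays

# Lazy: only the dimensions are stored, a few dozen bytes at most.
Base.summarysize(Ones(1048576, 1048576))

# Dense: ones(1048576, 1048576) would need 1048576^2 * 8 bytes ≈ 8 TiB,
# so it cannot even be allocated; at a feasible size the cost is already
# the full 8 bytes per entry:
Base.summarysize(ones(1024, 1024))   # ≈ 8 MiB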