Attaching workers to cores


I am trying to run a program using Distributed. The workflow is to have
each remote worker execute the same function (with different arguments),
called via remotecall.
The problem is that initially all the workers seem to run on different cores,
but at some point all the execution switches to a single core. Could you please
help me keep the different workers on different logical cores?

Here is what I am doing:

using Distributed
@everywhere include("rijke_tangent_state_estimation.jl")
f = Array{Future,1}(undef, nworkers())
for i = 1:nworkers()
        p = workers()[i]  # worker pids start at 2; pid 1 is the master
        filename = string("../data/rijke_exp3_", string(i), ".jld")
        put!(RemoteChannel(p), filename)
        # remotecall is already asynchronous, so @async is not needed
        f[i] = remotecall(assimilate_parameter_and_trajectory, p, filename)
        println("sent remote call to ", p)
end
for i = 1:nworkers()
        wait(f[i])  # block until worker i's call has finished
        println(i, " is done..")
end

How quickly does your function run? Would using pmap instead make more sense?

It’s an expensive function: each evaluation takes about 3 hours on a single core of a 1.8 GHz i7 CPU.
The really pressing question is why the above code switches to single-core evaluation, given that it started out using as many cores as there are workers.
Thanks for letting me know about pmap! I can try it and let you know.
Does pmap have a different behavior for parallelism?

I don’t know what the internals are like for pmap, but the idea is that it automatically takes care of the scheduling for the long running functions.
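To illustrate the scheduling idea (a minimal sketch with a made-up `slow_square` function standing in for a long-running computation): pmap hands one argument to each free worker and dispatches the next argument as soon as a worker finishes, so slow and fast evaluations are balanced automatically.

```julia
using Distributed
addprocs(4)  # spawn 4 worker processes

# Stand-in for an expensive function; sleep mimics a long computation.
@everywhere function slow_square(x)
    sleep(0.1)
    return x^2
end

# pmap distributes the 8 inputs over the 4 workers dynamically and
# returns the results in input order.
results = pmap(slow_square, 1:8)
```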

Thanks, I just tried pmap and it ends up doing the same thing.
For the first few seconds, 16 cores are being used, and afterward, only one is being used.

It could be that your functions are returning more quickly than you thought. Are you sure the function takes about 3 hours? Can you post more code and also show how you called pmap? Thanks

Thanks for responding!

I am sure because I tried the function serially and it works perfectly fine.
It takes as long as I mentioned.
The problem is definitely in how I am parallelizing. Here is how I use pmap:

using Distributed
@everywhere include("rijke_tangent_state_estimation.jl")
filenames = Array{String,1}(undef, nworkers())
for i = 1:nworkers()
        filenames[i] = string("../data/rijke_exp3_", string(i), ".jld")
end
pmap(assimilate_parameter_and_trajectory, filenames)

OK, I noticed that with pmap, or using remotecall as in my original post, the code stops with a segmentation fault after a while. This is due to an out-of-memory error, which should not occur because the function allocates less memory than the per-core memory of my cluster node.
However, if all 16 workers run on the same core, which is what was happening, I would not be surprised by a seg fault due to an out-of-memory error.

Yes, that was my next question. Make sure there is enough memory for each Julia worker process running the code. Furthermore, all the results from pmap are returned to the head node (or the main process it was run on), so you must make sure there is enough memory available there as well to collect the results.

For example, my head node where I launch Julia and submit my pmap code only has 32 GB of memory. My compute nodes have 128 GB of memory. So I have to make sure that when my workers ship the objects back to pmap, the total is less than 32 GB in memory.
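One way to stay under that limit (a sketch; `summarize` is a hypothetical stand-in for the real function, and the numbers are made up) is to return only a small summary of each result instead of the full arrays, so very little data is shipped back to the head node:

```julia
using Distributed
addprocs(2)

# Hypothetical stand-in for the real computation: it builds a large
# intermediate array but returns only a small named tuple.
@everywhere function summarize(filename)
    data = rand(10^6)                      # large intermediate result
    return (file = filename, mean = sum(data) / length(data))
end

summaries = pmap(summarize, ["run_1.jld", "run_2.jld"])
```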


Thank you so much @affans!
This was super useful:

I removed the return values of the function (I just wrote the arrays to disk),
and this seemed to fix everything. I think returning all the arrays at once made the main process run out of RAM. Thanks for your help! :slight_smile:
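For reference, that write-to-disk pattern can be sketched like this (file names and the computation are placeholders; in practice the worker would serialize its result arrays, e.g. to a .jld file):

```julia
using Distributed
addprocs(2)

@everywhere function process_and_save(outfile)
    result = rand(100, 100)        # placeholder for the real computation
    # Write the large array on the worker and return nothing, so no big
    # object is shipped back to the head node.
    open(outfile, "w") do io
        write(io, result)
    end
    return nothing
end

outfiles = [joinpath(tempdir(), "rijke_out_$(i).bin") for i in 1:2]
pmap(process_and_save, outfiles)
```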

In HPC the normal method to pin processes is to use the numactl tools (Linux-specific).
The Slurm scheduler can do this too.
This actually leads me to ask: how is process pinning accomplished in Julia? You do ask a good question!

As you say, it is important. Also, if processes switch between cores, it can cause cache invalidation.
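As far as I know Julia has no built-in API for pinning Distributed workers, but on Linux one could, in principle, pin each worker after launch by asking it for its OS pid and calling taskset on it. A rough sketch, assuming taskset is on the PATH:

```julia
using Distributed
addprocs(2)

# Linux-only sketch: pin each Distributed worker to its own logical core.
# getpid is executed *on the worker* to obtain its OS-level process id.
if Sys.islinux() && Sys.which("taskset") !== nothing
    for (core, p) in enumerate(workers())
        ospid = remotecall_fetch(getpid, p)
        run(`taskset -cp $(core - 1) $ospid`)  # pin to logical core core-1
    end
end
```

With Slurm, the equivalent is usually handled for you via its CPU-binding options, so manual pinning is mostly relevant for ad hoc runs.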
