Adding data to worker processes via @everywhere

Hi all,

I know next to nothing about parallel processing, so apologies if this question is obvious. I start julia with julia -p 2. Here is a broken simplified version of what my code does:

d = read(some_filepath) # d is really big and takes up most of my RAM
some_vector = read(some_other_filepath)
@everywhere n = 1
while n <= N 
    @everywhere current_d_small = some_function_available_everywhere(d, n)
    @everywhere anonymous_function = (x -> another_function_available_everywhere(x, current_d_small))
    y = pmap(anonymous_function, some_vector)
    n += 1
end

This code obviously doesn’t work, because of the line:

 @everywhere current_d_small = some_function_available_everywhere(d, n)

This line errors because d is not available everywhere. But I can’t make d available everywhere, since a single copy of it takes up most of my RAM.

For the actual work done in pmap I only need current_d_small to be available to all workers, which is much smaller than d. I’ve read the docs, but can’t seem to work out how, on each outer iteration, to get a copy of current_d_small to each of the workers so I can make the pmap call.

Any help would be much appreciated. Also, bear in mind, I know very little about parallel processing, so maybe I should be using something completely different to pmap here.

Thanks,

Colin

If all workers are located on the same machine, you might want to consider threading instead of the functionality in Distributed. Threading uses shared memory and is in general more efficient due to less overhead, but it is also slightly trickier since you must ensure that all operations are thread safe. For example, IO is not currently thread safe. See the @threads macro to parallelize a for-loop. Also see the manual on threading.

Your example code does not make sense since you are not using y anywhere, but using threads might look something like

Threads.@threads for n = 1:N
    current_d_small = some_function_available_everywhere(d, n)
    anonymous_function = (x -> another_function_available_everywhere(x, current_d_small))
    y = map(anonymous_function, some_vector) # What to do with y?
end

There is also a threaded map available in ThreadedMap.jl and KissThreading.jl.
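If you'd rather not add a dependency, a hand-rolled threaded map is only a few lines. This is just a sketch (tmap is a made-up helper here, not the API of either package), reusing the names from your example:

function tmap(f, xs)                        # made-up helper, not ThreadedMap.jl's or KissThreading.jl's API
    out = Vector{Any}(undef, length(xs))    # eltype kept as Any for simplicity
    Threads.@threads for i in 1:length(xs)
        out[i] = f(xs[i])                   # each thread writes only its own slots, so no race on out
    end
    return out
end

y = tmap(anonymous_function, some_vector)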

You could also check out SharedArrays. Your functions some_function_available_everywhere as well as another_function_available_everywhere should be able to work on chunks of data, so there will be some manual work involved…
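For instance, something along these lines might work, assuming current_d_small is (or can be stored as) a numeric array whose size and element type are known up front (the length and eltype below are placeholders):

using Distributed, SharedArrays

# A SharedArray lives in shared memory, so workers on the same machine read it
# without each holding their own copy. Only works for bits types and local workers.
current_d_small = SharedArray{Float64}(1000)   # placeholder length and eltype

for n in 1:N
    current_d_small .= some_function_available_everywhere(d, n)   # fill on the master, where d lives
    y = pmap(x -> another_function_available_everywhere(x, current_d_small), some_vector)
end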

If the above approaches don’t work for you, I’d also recommend checking out ParallelDataTransfer.jl.
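For instance, sendto from ParallelDataTransfer.jl can push just the small object to the workers on every outer iteration. A rough sketch, reusing the names from the original post and assuming the two functions are already defined everywhere (the helper name g is made up):

using Distributed, ParallelDataTransfer

# The worker-side helper looks up the global current_d_small that sendto defines on each worker.
@everywhere g(x) = another_function_available_everywhere(x, current_d_small)

for n in 1:N
    current_d_small = some_function_available_everywhere(d, n)   # computed on the master, where d lives
    sendto(workers(), current_d_small = current_d_small)         # copy only the small object to each worker
    y = pmap(g, some_vector)
end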

Try @everywhere current_d_small = some_function_available_everywhere($d, n)

Thanks for responding. I’ll look into threading. I’d deliberately avoided it up to this point because it was still considered experimental. It sounds like this is only the case for IO now, which doesn’t apply to my current situation.

Regarding y, I store it in a pre-allocated array defined outside the outer loop. I omitted that part for simplicity, but perhaps it affects the ability to apply @threads?

Thanks again.

I was already starting to wonder if I should learn about these. Thanks for confirming that I need to :slight_smile:

Oh wow that seems to do exactly what I was asking for! Many thanks. Chris’s contributions pop up everywhere!

Thanks for responding. I’m afraid this doesn’t work since it just results in Julia checking for an interpolation of d on the other workers (and it isn’t there).

Unfortunately the loop I want to do in parallel includes a lot of random number generation at different points. The docs suggest a way around this when using @threads, but it doesn’t look that fun for a newbie to working in parallel. I might try one of the other solutions. Thanks again.
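For reference, the pattern the docs describe boils down to giving each thread its own RNG and indexing it by threadid(). A minimal sketch (the seeding scheme is only illustrative):

using Random

# one independent RNG per thread, seeded differently so the streams don't collide
rngs = [MersenneTwister(1234 + i) for i in 1:Threads.nthreads()]

Threads.@threads for n = 1:N
    rng = rngs[Threads.threadid()]   # each thread uses only its own RNG
    r = rand(rng)                    # all random draws in the loop body go through rng
    # ... rest of the loop body ...
end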

Unless you access the same part of the array simultaneously from different threads, there should be no problems with this pattern.
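In other words, if the preallocated output has one slot per value of n, each iteration writes only its own slot and the writes cannot conflict. Roughly, with the names from the original example:

Y = Vector{Any}(undef, N)   # preallocated output, one slot per outer iteration

Threads.@threads for n = 1:N
    current_d_small = some_function_available_everywhere(d, n)
    # each thread writes Y[n] only for its own values of n, so there is no race on Y
    Y[n] = map(x -> another_function_available_everywhere(x, current_d_small), some_vector)
end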

ParallelDataTransfer.jl is great, but beware that it does copy the sent variable into the memory of each worker, which could be a problem if I understand the OP correctly.

No, I think this will work fine, as long as I can send current_d_small without having to also send d. I only need current_d_small for the parallel routine and it is much smaller than d. Am looking into all the various options now :slight_smile: