Using pmap on Amazon EC2?

vvmisic · April 26, 2018, 9:57pm

Hi everyone, I have some questions about using pmap. I’m trying to use pmap to run some embarrassingly parallel code on Amazon EC2. I keep running into a problem where each worker seems to use more and more memory until one of the workers terminates, and the others then stop as well. I’d like to understand how this happens and what I can do to better manage the execution.

Here is an example. I start a t2.large instance (AMI: JuliaPro 0.6.2.1_mkl RHEL 7.4 (ami-01e5a57b)). I then run Julia:

$ JuliaPro-0.6.2.1/julia

Running top at this stage gives me the following:

10359 ec2-user  20   0  113128   1400   1204 S   0.0  0.0   0:00.00 julia 
10365 ec2-user  20   0  479876 146188  48328 S   0.0  1.8   0:00.40 julia

I create the workers and dummy functions for pmap:

julia> addprocs(4)
4-element Array{Int64,1}:
 2
 3
 4
 5

julia> @everywhere f = function(x)
               A = rand(x*10000,x*10000)
               return 0 
       end

julia> err_fn = function(e)
               @show e
               @show myid()
       end
(::#7) (generic function with 1 method)

julia> @show nprocs()
nprocs() = 5
5
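For scale, here is a rough sketch of how much memory each call to f asks for. It assumes only that rand returns a dense Float64 matrix (8 bytes per element); matrix_bytes is a helper name introduced here for illustration and is not part of the session above.

```julia
# Bytes needed for the (x*10000)-by-(x*10000) Float64 matrix that f allocates
matrix_bytes(x) = (x * 10000)^2 * sizeof(Float64)

# Print the request size in GB (10^9 bytes) for a few of the inputs in 1:15
for x in (1, 2, 3, 15)
    println("x = ", x, ": ", matrix_bytes(x) / 1e9, " GB")
end
```

x = 1 already needs 0.8 GB and x = 2 needs 3.2 GB, which is in line with the roughly 3 GB resident size one worker shows in the top output after the pmap run. From x = 3 (7.2 GB) upward, a single matrix would not fit on an instance with about 8 GB of RAM (the figure AWS lists for t2.large), no matter how many workers are running.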

Running top now shows the Julia processes of the workers:

PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND        
10359 ec2-user  20   0  113128   1400   1204 S   0.0  0.0   0:00.00 julia
10365 ec2-user  20   0  648556 172100  49648 S   0.0  2.1   0:01.72 julia                                      
10368 ec2-user  20   0  479380 145884  48312 S   0.0  1.8   0:00.44 julia
10369 ec2-user  20   0  578408 147696  48540 S   0.0  1.8   0:00.59 julia
10371 ec2-user  20   0  578312 150800  48512 S   0.0  1.9   0:00.59 julia
10373 ec2-user  20   0  578864 149972  48512 S   0.0  1.9   0:00.57 julia

When I now run pmap, some of the workers report OutOfMemoryError(), and one of the worker processes eventually exits:

julia> out = pmap(f, 1:15, on_error = err_fn)
	From worker 5:	e = OutOfMemoryError()
	From worker 5:	myid() = 5
	From worker 5:	e = OutOfMemoryError()
	From worker 5:	myid() = 5
	From worker 5:	e = OutOfMemoryError()
	From worker 5:	myid() = 5
	From worker 5:	e = OutOfMemoryError()
	From worker 5:	myid() = 5
	From worker 5:	e = OutOfMemoryError()
	From worker 5:	myid() = 5
	From worker 2:	e = OutOfMemoryError()
	From worker 2:	myid() = 2
	From worker 5:	e = OutOfMemoryError()
	From worker 5:	myid() = 5
	From worker 5:	e = OutOfMemoryError()
	From worker 5:	myid() = 5
	From worker 2:	e = OutOfMemoryError()
	From worker 2:	myid() = 2
	From worker 5:	e = OutOfMemoryError()
	From worker 5:	myid() = 5
	From worker 2:	e = OutOfMemoryError()
	From worker 2:	myid() = 2
	From worker 5:	e = OutOfMemoryError()
	From worker 5:	myid() = 5

Worker 4 terminated.
ERROR: ProcessExitedException()
ERROR (unhandled task failure): EOFError: read end of file

Stacktrace:
 [1] #571 at ./asyncmap.jl:178 [inlined]
 [2] foreach(::Base.##571#573, ::Array{Any,1}) at ./abstractarray.jl:1733
 [3] maptwice(::Function, ::Channel{Any}, ::Array{Any,1}, ::UnitRange{Int64}, ::Vararg{UnitRange{Int64},N} where N) at ./asyncmap.jl:178
 [4] wrap_n_exec_twice(::Channel{Any}, ::Array{Any,1}, ::Base.Distributed.##204#207{WorkerPool}, ::Function, ::UnitRange{Int64}, ::Vararg{UnitRange{Int64},N} where N) at ./asyncmap.jl:154
 [5] #async_usemap#556(::Function, ::Void, ::Function, ::Base.Distributed.##188#190, ::UnitRange{Int64}, ::Vararg{UnitRange{Int64},N} where N) at ./asyncmap.jl:103
 [6] (::Base.#kw##async_usemap)(::Array{Any,1}, ::Base.#async_usemap, ::Function, ::UnitRange{Int64}, ::Vararg{UnitRange{Int64},N} where N) at ./<missing>:0
 [7] (::Base.#kw##asyncmap)(::Array{Any,1}, ::Base.#asyncmap, ::Function, ::UnitRange{Int64}) at ./<missing>:0
 [8] #pmap#203(::Bool, ::Int64, ::Function, ::Array{Any,1}, ::Void, ::Function, ::WorkerPool, ::Function, ::UnitRange{Int64}) at ./distributed/pmap.jl:126
 [9] (::Base.Distributed.#kw##pmap)(::Array{Any,1}, ::Base.Distributed.#pmap, ::WorkerPool, ::Function, ::UnitRange{Int64}) at ./<missing>:0
 [10] #pmap#213(::Array{Any,1}, ::Function, ::Function, ::UnitRange{Int64}) at ./distributed/pmap.jl:156
 [11] (::Base.Distributed.#kw##pmap)(::Array{Any,1}, ::Base.Distributed.#pmap, ::Function, ::UnitRange{Int64}) at ./<missing>:0

julia> @show nprocs()
nprocs() = 4
4

Checking top now (the worker process marked in bold is holding about 3 GB of resident memory):

10359 ec2-user  20   0  113128    196      0 S   0.0  0.0   0:00.00 julia
10365 ec2-user  20   0  713632 170536  15616 S   0.0  2.1   0:03.91 julia
10368 ec2-user  20   0  612844 102656    760 S   0.0  1.3   0:01.33 julia
**10369 ec2-user  20   0 3770456 3.082g    828 S   0.0 40.4   0:02.05 julia**
10373 ec2-user  20   0  711352 106108    764 S   0.0  1.3   0:01.38 julia

When I run @everywhere gc(), memory usage drops back to roughly its earlier level:

10359 ec2-user  20   0  113128    196      0 S   0.0  0.0   0:00.00 julia
10365 ec2-user  20   0  713764 154368  15864 S   0.0  1.9   0:04.09 julia
10368 ec2-user  20   0  612844 113224   6304 S   0.0  1.4   0:01.44 julia
10369 ec2-user  20   0  645452 112880   6584 S   0.0  1.4   0:02.12 julia
10373 ec2-user  20   0  711352 111652   6308 S   0.0  1.4   0:01.49 julia
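One way to avoid relying on a manual @everywhere gc() afterwards is to have each task release its array and collect before it returns. The following is only a sketch under assumptions: it targets Julia 0.7 or later (on 0.6, as used above, drop the `using Distributed` line and write gc() instead of GC.gc()), the matrix size is scaled down so it runs anywhere, and f_with_gc is a hypothetical stand-in for f.

```julia
using Distributed   # Julia 0.7+; on 0.6 these functions live in Base

addprocs(2)

# Hypothetical variant of f: same allocation pattern (scaled down) that drops
# its reference to the matrix and forces a collection before returning.
@everywhere function f_with_gc(x)
    A = rand(x * 100, x * 100)
    total = sum(A)
    A = nothing    # drop the only reference to the matrix
    GC.gc()        # spelled gc() on Julia 0.6
    return total
end

out = pmap(f_with_gc, 1:6)
println(length(out))

rmprocs(workers())  # shut the workers down again
```

Forcing a collection on every call costs time, so this trades throughput for a lower memory ceiling per worker; whether the freed memory is actually returned to the operating system depends on the allocator.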

My questions are:

1. Is there a way to run pmap you would recommend that would be robust to these types of failures?
2. This is not shown in the example, but I have noticed that when I run my own code, the memory usage of each worker seems to keep increasing, despite trying to invoke the garbage collector by setting an array to 0 and invoking @everywhere gc(). I do not encounter this when trying to run the code locally. Is this normal? Is there a way to completely clear/reset a worker each time it executes the function given to pmap?
3. (More generally) Why are there two Julia processes when I start Julia on EC2?
4. (Related to 2) Does Julia tend to use more memory on EC2 than on “normal” machines?
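To make questions 1 and 2 concrete, here is a minimal sketch of the kind of setup being asked about, not a confirmed recommendation. It assumes Julia 0.7 or later (on 0.6, drop `using Distributed` and use Sys.CPU_CORES instead of Sys.CPU_THREADS). safe_worker_count, handle_error and work are hypothetical names, and the 50 MiB per-task estimate and the 0.5 headroom are made-up numbers. on_error and retry_delays are documented pmap keywords.

```julia
using Distributed   # Julia 0.7+; on 0.6 these functions live in Base

# Hypothetical helper: cap the worker count by both CPU count and memory.
# per_task_bytes is the caller's estimate of one task's peak allocation;
# headroom is the fraction of total RAM the workers may claim together.
function safe_worker_count(per_task_bytes; headroom = 0.5)
    by_memory = floor(Int, headroom * Sys.total_memory() / per_task_bytes)
    return max(1, min(Sys.CPU_THREADS, by_memory))
end

n = safe_worker_count(50 * 1024^2)   # assumption: about 50 MiB per task
addprocs(n)

# Scaled-down stand-in for the real work
@everywhere work(x) = sum(rand(x * 100, x * 100))

# Turn an OutOfMemoryError into a placeholder result; rethrow anything else
@everywhere handle_error(e) = isa(e, OutOfMemoryError) ? NaN : rethrow(e)

out = pmap(work, 1:8;
           on_error = handle_error,
           retry_delays = ExponentialBackOff(n = 3))

# Removing the workers and calling addprocs again gives the next batch
# fresh processes, at the cost of re-running the @everywhere definitions.
rmprocs(workers())
```

The idea is to keep the combined peak allocation of all workers below the instance's RAM so that no process is killed in the first place, and to treat removing and re-adding workers as the blunt way to "reset" them between batches.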

I apologize if these are stupid questions - thank you in advance for any suggestions or information you can provide.
